Using the simple fact described in the title, we prove the existence of a computational problem with implications for Machine Learning, Quantum Mechanics and Complexity Theory. We also prove P!=NP (the solution can be verified in time polynomial in the number of bits of the input and output, i.e. the problem is in NP, but the problem cannot be solved in time polynomial in the number of bits of the input and output, i.e. it is not in P), but this claim still needs to be reviewed by experts in Complexity Theory.
In this article we will be using the words “real” as in
In a standard probability space, we can always select events with some feature without rewriting the history of events. In fact, in a continuous probability space we can select events such that a random real variable
In this article we will show that this simple but non-trivial fact has profound implications not only for Complexity Theory[1][2][3][4] but also for Machine Learning[2] and Quantum Mechanics (independently of the implications for the P vs. NP problem). Almost all (in the sense we will define in this article) real functions of a real variable cannot be computed for all practical purposes, not even approximately. But a random selection allows polynomial-time computations involving the incomplete knowledge about a real function that cannot be computed in polynomial-time complexity. This is a fundamental reason why we cannot exclude a random time-evolution: a deterministic time-evolution may exist, but it has so much complexity that it cannot be calculated for all practical purposes, not even approximately (since
Note that whenever we deal with a non-separable space, there are issues with computability
Using the simple fact mentioned above, we will also prove the existence of a computational problem (defined by a continuous probability space) whose solution can be verified in time polynomial in the number of bits of the input and output (NP) but cannot be solved in time polynomial in the number of bits of the input and output (P), when using only a deterministic Turing machine[7][8]. That is, P!=NP.
The goal of this article is to define the specific problem unambiguously, and the level of mathematical detail will be adjusted accordingly: too much mathematical detail would shift the focus away from the specific problem. Less detail does not always imply less mathematical rigor. Since the present author is an expert in Physics but not an expert in Complexity Theory, we will also try to prove the P vs. NP claim as far as possible, but only as a secondary goal, knowing that much work by experts in Complexity Theory is still required because it is likely that: 1) something went wrong in the relation described here between the specific problem and the P vs. NP problem; and/or 2) the specific problem indeed can be used to solve the P vs. NP problem, but the proof presented here is incomplete.
Complexity Theory can be studied in the context of probability theory[1][2][3][4], because many real-world problems require approximations and uncertainties not only due to the limitations of any computer (already accounted for by Complexity Theory when using only finite numbers of bits, although there is also room for improvement here) but also due to the limitations of the measuring devices of physical phenomena and limitations of the mathematical models used to approximate the real-world, for instance when dealing with real variables. Most uncertainties are not related to the computer used and do not get smaller when increasing the number of bits in the computer. In fact, Physics as a science tries to be independent of Computer Science and vice-versa, as much as possible. Probability theory is a language (or interface) that allows us to transfer a problem between two sciences (these two or others).
There are two possible approaches to errors or uncertainties[1][2][3][4]: the average error (with respect to a probability measure) and the maximal error (up to sets of null measure with respect to a probability measure). It turns out that both approaches can be defined using Hilbert spaces: the average error (that is,
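One common way to write the two notions explicitly (the notation here is ours, not the article's), for an approximation $\tilde f$ of $f$ on a probability space $(\Omega,\mu)$, is

$$ e_{\mathrm{avg}}(f,\tilde f)=\Big(\int_\Omega |f-\tilde f|^2\,d\mu\Big)^{1/2}=\|f-\tilde f\|_{L^2(\mu)}, \qquad e_{\mathrm{max}}(f,\tilde f)=\operatorname*{ess\,sup}_{\Omega} |f-\tilde f|, $$

the first being the norm of the Hilbert space $L^2(\mu)$ and the second the maximal error once sets of null $\mu$-measure are excluded.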
The average error is relevant because, even if P!=NP, for many practical purposes this could still make no difference with respect to the scenario P=NP, if every NP problem with a reasonable probability distribution on its inputs could be solved in polynomial time-complexity on average on a deterministic Turing machine[9][10].
Moreover, since some functions are real constant functions, the definition of a real function must be consistent with the definition of a real number. Certainly, a natural definition of a real number in the context of Complexity Theory uses a standard probability space. Non-standard probability spaces are rarely (or never) used in Experimental Physics, so it is not obvious how useful a “real number” (or “real function”) defined in a non-standard probability space could be in real-world applications. There are only countable or continuous measures (or mixed) in a standard probability space, so we can define exactly only a countable number of real numbers (usually the rationals, but not necessarily); the remaining real numbers can only be constrained to lie inside an interval of finite width, possibly very small but never zero. We should also define all real functions using a standard probability space, unless we find a fundamental reason not to do it (we will not find one in this article).
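As a concrete illustration of this last point, here is a minimal sketch (our own, with an arbitrary choice of interval and bits, not taken from the article) of a "real number" that is never known exactly, only constrained to an interval of finite, nonzero width that narrows as more bits become available:

```python
from fractions import Fraction

# Minimal sketch (assumption: the unknown real lies in [0, 1]). A "real number"
# is represented only by an interval of rationals of finite, nonzero width; it
# can be narrowed by adding bits but never collapses to an exact point (except
# for the countable set of numbers we can write down exactly).
class IntervalReal:
    def __init__(self, low: Fraction, high: Fraction):
        assert low < high, "the width is finite but never zero"
        self.low, self.high = low, high

    def width(self) -> Fraction:
        return self.high - self.low

    def refine(self, keep_upper_half: bool) -> "IntervalReal":
        """Halve the interval: one extra bit of information about the number."""
        mid = (self.low + self.high) / 2
        return IntervalReal(mid, self.high) if keep_upper_half else IntervalReal(self.low, mid)

# Example: constrain an unknown real in [0, 1] with 10 bits of information.
x = IntervalReal(Fraction(0), Fraction(1))
for bit in [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]:
    x = x.refine(bool(bit))
print(x.low, x.high, x.width())   # width = 2**-10, small but not zero
```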
We require a continuous standard probability space, since we can always define a regular conditional probability density which implements a selection of events in such a probability space[11]. For instance, this is what we do when we neglect the intrinsic computation error (due to cosmic rays and many other reasons): we define a deterministic function by selecting only certain events from a complete history of random events. It has been well known for many decades that in an infinite-dimensional sphere of radius
The discrete nature of the Turing machine is certainly compatible with a continuous probability space: the number of bits of the input or output can be arbitrarily large, and it is proportional to the logarithm of the resolution of the partition of the interval
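A minimal way to make the proportionality explicit (assuming, for illustration only, a partition of the interval into $N$ cells of equal width $\varepsilon = 1/N$):

$$ n_{\mathrm{bits}} = \lceil \log_2 N \rceil = \lceil \log_2 (1/\varepsilon) \rceil , $$

so refining the resolution by a factor of two costs only one additional bit, and the partition can be refined arbitrarily while the input and output remain finite strings of bits.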
Note that there are
Consider three measure spaces
In a standard measure space it is always possible to define regular conditional probabilities[11] and to choose the probability density
The following results are valid for a random input in the interval
However, a (continuous cumulative) probability distribution does not contain enough information to unambiguously define a function. On the other hand, a real wave-function whose square is the joint probability distribution allows the definition of a unitary operator on a separable Hilbert space. A unitary operator is a random generalization of a deterministic symmetry transformation of a (countable or continuous) sample space. Any unitary operator defined by a wave-function of two continuous variables cannot be a deterministic symmetry transformation (for reasons similar to why a continuous probability distribution cannot unambiguously define a function).
Since
Given an input in
From the cumulative marginal probability distribution, such that
Note that the resulting deterministic function is not necessarily invertible, because the collapse of the wave-function is irreversible (unless the symmetry transformation were deterministic, which is excluded in this case because we have a continuous probability space of functions).
The Turing machine can be equivalently defined as the set of general recursive functions, which are partial functions from non-negative integers to non-negative integers[19]. But the set of all functions from non-negative integers to non-negative integers is not suitable to define a measure, since these functions form an uncountable set, in a context where the continuum is not defined. Moreover, the general recursive functions are based on the notion of computability (that the Turing machine halts in a finite time), but computability does not hold in the limit of an infinite number of input bits; thus, to study such a limit we need to define uncomputable functions somehow (we will use complete spaces, where Cauchy sequences always converge to an element inside the space).
On the other hand, it is widely believed (and we will show in the following) that any computational problem that can be solved by a classical computer can also be solved by a quantum computer and vice-versa; that is, quantum computers obey the Church–Turing thesis. Note that it is well known that some circuits (classical hardware) provide exponential speedups over other circuits for some functions, because the input bits can be reparametrized. This is why the time complexity of a function has an upper bound while it is not known how to establish a lower bound, and it is also consistent with the undecidability of the halting problem (given an arbitrary function from integers to integers and an arbitrary input, we cannot determine whether the output of such a function is computable or not). Thus, the fact that a classical Turing machine can be defined as a quantum computer is compatible with the fact that quantum computers provide exponential speedups over some classical computers for some functions.
We start by noticing that the domain of a general recursive function can be defined by a dense countable basis of a particular Hilbert space which is the (Guichardet)
Note that the input of a general recursive function is a finite number of integers, but its output is only one integer. However, any function which outputs several integers is a direct sum of functions which output one integer. The other way around is also true, once we define a vacuum state (included in the Fock-Guichardet-space), that is, a function which outputs one integer is a particular case of a function that outputs several integers where all outputs except one correspond to the vacuum state. Thus, we can consider only unitary automorphisms of the Fock-Guichardet-space.
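For reference, a minimal sketch of the decomposition we assume is meant by the Fock-Guichardet space (the standard symmetric Fock space over a one-particle space $H$, which Guichardet identifies with square-integrable functions over the finite subsets of the base measure space):

$$ \Gamma(H) \;=\; \bigoplus_{n=0}^{\infty} H^{\odot n} \;=\; \mathbb{C}\,\Omega \,\oplus\, H \,\oplus\, H^{\odot 2} \,\oplus\, \cdots , $$

where $\Omega$ is the vacuum state and $H^{\odot n}$ is the symmetric $n$-fold tensor product; in this reading, a single-output function sits in the $n=1$ sector and a multi-output function in a higher sector, with the unused outputs placed in the vacuum.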
To be able to define a measure, we make the integers correspond to the (countable) step functions with rational endpoints in the interval
Then, the limit of infinitesimal intervals is well-defined, and it is defined by an element of
In a standard probability space, there are only continuous and/or countable measures. However, these may be mixed in an arbitrary way. For a theory of Physics we could choose the best case prior measure (as we did in the previous sections), since we just want to find a prior which is consistent with the experimental data, without many concerns about alternative priors. However, in Cryptography we need guarantees that our limits are robust under arbitrary choices, so we need to assume the worst case prior measure.
The previous sections could also be made compatible with the worst case prior measure, if we had a computer capable of comparing real numbers which are not rational. That would be acceptable for a theory of Physics, but it would make it difficult to obtain guarantees for Cryptography.
It is also difficult to guarantee true randomness in real-world applications of Cryptography. Since probabilities only mean incomplete information, we can use Probability Theory in the context of radical determinism (where nothing is random). For Cryptography, we need the worst case prior measure, rational functions and radical determinism.
So we start by eliminating a non-standard probability measure: any probability theory is a universal language (like English or mathematical logic) to define abstract models of the objects we want to study. A standard probability theory is universal and irreducible, meaning that it has the minimal content to be considered a probability theory (in agreement with Quantum Mechanics and Experimental Physics, for instance). If the non-standard probability theory is also irreducible, then the corresponding models are equivalent, and we can use the standard version without loss of generality. This allows us to transfer models between different sciences. But often the non-standard probability theory is reducible; this means that the boundary between model and probability theory is not where it would be in the standard case, and there are properties that we are attributing to the probability theory which in fact belong to the model.
Thus, we should assume a standard probability theory and leave some flexibility in the definition of the computer model, and not the other way around, as often happens in Complexity Theory, where there are strict axioms for different computer models while asymptotic limits are taken without defining the probability space; this is a recipe for ending up with mathematical results and questions which are hard to transfer to Experimental Physics and many other sciences.
In the context of radical determinism, the history of events is a non-random countable sequence of events. Thus, some events with null measure might happen, due to radical determinism, but only a countable number of such events. That means that a continuous measure is only truly continuous up to a countable number of points; this possibility is already accounted for by the worst case prior measure.
Consider now a Boolean function of a countably infinite number of bits. The time complexity of almost all Boolean functions on n variables for a Turing machine is at least (and at most, for all Boolean functions)
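A standard counting sketch (not specific to this article) of why almost all Boolean functions are hard: there are $2^{2^n}$ Boolean functions of $n$ bits, while there are fewer than $2^{m+1}$ binary descriptions (programs, circuits, Turing machine encodings) of length at most $m$, each computing at most one such function. Hence the fraction of $n$-bit Boolean functions admitting any description of length at most $m$ is bounded by

$$ \frac{2^{m+1}}{2^{2^n}} \;\longrightarrow\; 0 \qquad (n \to \infty,\ m = o(2^n)) , $$

so, for large $n$, only a vanishing fraction of the Boolean functions of $n$ bits can be computed by any machine whose description fits in $\mathrm{poly}(n)$ bits; a fixed polynomial-time Turing machine has a description of constant length, so almost all families of Boolean functions (one function per input length) are out of its reach.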
What we can say is that the countable (or mixed) measure allows defining functions that may have polynomial time complexity (that is, not necessarily non-polynomial time complexity). This contrasts with the necessarily non-polynomial time complexity of the functions defined by the continuous measure. In the following we will show that, given any mixed prior measure, we can always redefine the problem to have a continuous prior measure such that its results cannot be reproduced by a mixed prior measure, effectively converting the worst case prior measure into the best case prior measure.
The prior measure must be mixed or continuous, to allow for the limit of an infinite number of bits of all computable functions. But we must prove that
It is not obvious if we can prove
Equivalently, a standard probability space is isomorphic (up to sets with null measure) to the interval
Thus, for the worst case prior measure if we include all subsets of the input random sample, then using two integers (which is countable) we include a countable set of deterministic Turing machines
Then, we can only prove
Given any mixed prior measure, there is an interval of the input random sample where it is continuous. We rescale to
Given any other measure which is mixed in
Then the indicator function of a
Note that a continuous prior measure admits a regular conditional probability density, which allows us to define a selection (verification) of a candidate output. The verification of a constant output is in
When using a computer to solve the problem defined in the previous section, any disjoint set of the partition is an interval. This gives us two options and two options only: either the selection of events is fully deterministic, or it is only approximately deterministic:
The selection of events is fully deterministic. Then we impose a condition on the output
The selection of events has some randomness, as small as we want. Then we do not impose a condition on the output (possibly we impose a condition on the input
Note that a complete history of events needs to be countable, so that we can convert it into a single event (mapping complete histories of events one-to-one to the real numbers in the interval
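A minimal illustration of the conversion (assuming, hypothetically, that the interval in question is [0, 1] and that each event of the history contributes one bit): the binary expansion maps a countable history to a single real number, and any finite prefix of the history pins that number down only to an interval of width $2^{-n}$.

```python
from fractions import Fraction
from itertools import islice
from typing import Iterable, Iterator

# Minimal sketch (assumption: the interval is [0, 1] and the countable history
# of events is encoded as a sequence of bits). The binary expansion maps the
# whole history to a single real number; a finite prefix of n events only
# constrains that number to an interval of width 2**-n.
def history_to_real(bits: Iterable[int], n_bits: int) -> Fraction:
    """Partial sum of the binary expansion 0.b1 b2 b3 ... using n_bits events."""
    return sum(Fraction(b, 2 ** (k + 1)) for k, b in enumerate(islice(bits, n_bits)))

def some_history() -> Iterator[int]:
    """A stand-in deterministic history of events: 1, 0, 1, 0, ..."""
    k = 0
    while True:
        yield 1 - (k % 2)
        k += 1

x16 = history_to_real(some_history(), 16)
print(float(x16))  # approaches 2/3 = 0.101010... (binary) as more events are included
```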
This proof is dependent on the fact that the prior measure is continuous. If it is in part continuous and in part countable, then we can choose just the continuous part of the sample space (see the previous section). While we can use a countable part of the sample space to approximately solve a continuous problem, and a continuous part of the sample space to solve a countable problem, we cannot change the prior measure from continuous to countable or vice-versa (by the Radon-Nikodym theorem), because there is no Radon-Nikodym derivative between the two measures, since the sets of null measure are disjoint between the two measures. The prior measure defines the physical world where the computer exists, thus it cannot be removed from any complete computer model related to a physical computer.
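A one-line justification of the Radon-Nikodym step: if $\mu$ is continuous (atomless) and $\nu$ is countable (purely atomic, concentrated on a countable set $S$), then

$$ \mu(S)=0 \qquad \text{and} \qquad \nu(\Omega\setminus S)=0 , $$

so $\mu$ and $\nu$ are mutually singular; neither is absolutely continuous with respect to the other, and no Radon-Nikodym derivative $d\mu/d\nu$ or $d\nu/d\mu$ exists.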
Note that in the first paragraph of the official statement of the P vs. NP problem[7], it is stated:
To define the problem precisely it is necessary to give a formal model of a computer. The standard computer model in computability theory is the Turing machine, introduced by Alan Turing in 1936. Although the model was introduced before physical computers were built, it nevertheless continues to be accepted as the proper computer model for the purpose of defining the notion of computable function.
As for any other proof, this proof is only as good as the axioms (that is, the assumptions) used. The computer model used for a solid proof of the P vs. NP problem should be widely accepted as a good approximation to a physical computer for the purpose of defining the notion of computable function. We believe our computer model is accepted by most experts in Physics (as argued in the previous section). We claim that our computer model makes no more assumptions than those required by the official statement[7] (including the deterministic Turing machine), and it is as close to a physical computer as possible by today's standards. Assuming a countable prior measure is not realistic (as argued in the previous sections; for instance, it would exclude an ensemble of fair coins).
However, we believe that allowing a random selection of events is even more realistic (as discussed in the previous sections, also with implications for Machine Learning and Quantum Mechanics). In the next section, we will define a selection of events which has some randomness (as small as we want) and prove that even in that case we still have P!=NP.
A selection of events which is only approximately deterministic can be approximated by a step function (step functions are dense in
The first sample from the uniform distribution defines directly
Since the wave-function is a non-constant polynomial, the corresponding cumulative probability distribution minus the second sample is strictly increasing (except in sets of null measure). Thus, when we define the corresponding deterministic function, we can choose the second sample which produces an output as far from zero as we want (in the interval
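A minimal numerical sketch of this construction, with hypothetical choices that are not taken from the article: a non-constant polynomial wave-function on [0, 1], its square taken as the probability density, and the output defined as the unique point where the cumulative distribution equals the second sample (found by bisection, precisely because the cumulative distribution minus the sample is strictly increasing).

```python
import random

# Minimal sketch (hypothetical wave-function, not the article's): psi(y) ~ 1 + y,
# a non-constant polynomial on [0, 1]; its square, normalized, is the density.
def cdf(y: float) -> float:
    """Cumulative distribution of the normalized density proportional to (1 + y)**2."""
    unnorm = lambda t: ((1.0 + t) ** 3 - 1.0) / 3.0   # integral of (1 + t)**2 from 0 to y
    return unnorm(y) / unnorm(1.0)

def deterministic_output(second_sample: float, tol: float = 1e-12) -> float:
    """Return y in [0, 1] with cdf(y) = second_sample, by bisection.

    cdf(y) - second_sample is strictly increasing, so the root is unique."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < second_sample:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

random.seed(0)
u2 = random.random()          # the "second sample" from the uniform distribution on [0, 1]
y = deterministic_output(u2)
print(u2, y, cdf(y))          # cdf(y) reproduces the sample up to tol
```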
The two random samples from a uniform distribution in the interval
Moreover, it would be better if the generation of random samples had linear time complexity, since then we could do a constant rate of experiments over time to validate the probability distribution of the random selection; otherwise it would be impractical to generate an infinite sequence of experiments.
We cannot prove this mathematically (since we would need more axioms). However, the implications of this article for Quantum Mechanics help to clarify the source of randomness of Quantum Mechanics (and thus of the random samples). It is relevant to verify empirically that the generation of random numbers with linear time complexity is possible, for all practical purposes. We can visually check on the ANU QRNG website that the number of bits of the random sample grows linearly in time and that any complete history of events converges to a uniform probability distribution.
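For illustration, a minimal sketch of the two empirical checks, using the operating system's entropy source as a stand-in for a quantum random number generator such as the ANU QRNG (the source and the sample size are our assumptions, not the article's):

```python
import os
import time
from collections import Counter

# Minimal sketch of the two empirical checks described above, using the operating
# system's entropy source (os.urandom) as a stand-in for a quantum random number
# generator such as the ANU QRNG. The source and the sample size are assumptions
# made here for illustration only.
N_BYTES = 1_000_000

# Check 1: the number of random bits obtained grows (roughly) linearly in time,
# i.e. the source delivers bits at an approximately constant rate.
start = time.perf_counter()
sample = os.urandom(N_BYTES)
rate = 8 * N_BYTES / (time.perf_counter() - start)
print(f"~{rate:.3e} bits/s")

# Check 2: the empirical distribution of byte values approaches the uniform one.
counts = Counter(sample)
expected = N_BYTES / 256
worst = max(abs(c - expected) / expected for c in counts.values())
print(f"largest relative deviation from uniform byte frequency: {worst:.3%}")
```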
Moreover, the entropy is maximal, in the sense that the deterministic function needed to correlate the bits is not computable for all practical purposes, not even approximately (since
In the introduction we discussed the implications (of the results of this article) which are common to Machine Learning and Quantum Mechanics. But Machine Learning (for instance, Deep Neural Networks) is not firmly based on probability theory, unlike Quantum Mechanics, so there are further consequences.
In Machine Learning, methods inspired by probability theory are used often[2], but the formalism is based on approximations to deterministic functions, guided by a distance (or, equivalently, an optimization problem) and not by a measure. In fact, two of the main open problems are the alignment of models and the incorporation of prior knowledge[23], both of which could be well addressed by a prior measure, if any measure were defined at all.
Our results imply that, under reasonable assumptions, almost all functions are not computable, not even approximately. Thus, Machine Learning works because the functions we are approximating are in fact probability distributions (possibly after some reparametrization[24]). This shouldn't be surprising, since Classical Information Theory shows (under reasonable assumptions) that probability is unavoidable when we are dealing with representations of knowledge/information[1][2]. But in Machine Learning the probability measure is not consistently defined (despite the fact that many methods are inspired by probability theory): the probability measure emerges from the approximation[24], and often in an inconsistent way. The inconsistency is not due to a lack of computational power, since modern neural networks can fit very complex deterministic functions and yet fail badly[25][26] on relatively simple probability distributions (e.g. catastrophic forgetting, or the need for calibration to obtain some probabilistic guarantees[26]).
This unavoidable emergence of a probability measure should be investigated as a potential source of inefficiency, inconsistency and even danger. If the emergence of a probability measure is unavoidable, why don't we just define a probability measure in the formalism, consistently? Many people say “it is how our brain works”, implying that mathematics should step aside when there is empirical evidence.
But the empirical evidence is: oversized deep neural networks still generalize well, apparently because the learning process often converges to a local maximum (of the optimization problem) near the point where the learning began[27]. This implies that if we repeat the learning process with a random initialization (as we do when we consider ensembles of neural networks[25][28]), then we do not expect the new parameters to be near any particular value, regardless of the result of the first learning process. This expectation is justified by the fact that every three layers of a wide enough neural network form a universal approximator of a function[29], so any deviation introduced by three layers can be fully corrected in the next three layers, when composing dozens or hundreds of layers as we do in a deep neural network. Then the correlation between the parameters corresponding to different local maximums converges to zero as the number of layers increases.
Thus, there is empirical evidence that oversized deep neural networks still generalize well precisely because a prior measure emerges: deep learning does not converge to the global maximum but instead to one of the local maximums, chosen randomly, effectively sampling from a prior measure on the sample space defined by all local maximums. This is consistent with the good results achieved by ensembles of neural networks[25][28], which mimic many samples. However, it is a prior measure which we cannot easily modify or even understand, because the measure space is the set of all local maximums of the optimization problem. But, since we expect the parameters to be fully uncorrelated between different local maximums, many other prior measures (which we can modify and understand, such as the uniform measure) should achieve the same level of generalization.
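A minimal sketch of this expectation (our own illustrative experiment, not the article's): train the same small network on the same data from two different random initializations and measure the correlation between the resulting parameter vectors; both runs fit the data, yet the parameter vectors are essentially uncorrelated, as if each run were a sample from a prior measure over the local maximums.

```python
import numpy as np

# Minimal sketch (an illustrative experiment, not the article's): train the same
# small one-hidden-layer network on the same data from two different random
# initializations and compare the resulting parameter vectors. Architecture,
# data and learning rate are arbitrary choices made here for brevity.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(256, 1))
Y = np.sin(3.0 * X)                                   # a stand-in target function

def train(seed: int, hidden: int = 64, steps: int = 2000, lr: float = 0.1) -> np.ndarray:
    """Plain gradient descent on the mean squared error; returns the flattened parameters."""
    r = np.random.default_rng(seed)
    W1 = r.normal(0.0, 1.0, size=(1, hidden))
    b1 = np.zeros(hidden)
    W2 = r.normal(0.0, 1.0 / np.sqrt(hidden), size=(hidden, 1))
    b2 = np.zeros(1)
    n = X.shape[0]
    for _ in range(steps):
        H = np.tanh(X @ W1 + b1)                      # forward pass
        pred = H @ W2 + b2
        d_pred = 2.0 * (pred - Y) / n                 # backward pass (d loss / d pred)
        dW2 = H.T @ d_pred
        db2 = d_pred.sum(axis=0)
        dH = (d_pred @ W2.T) * (1.0 - H ** 2)
        dW1 = X.T @ dH
        db1 = dH.sum(axis=0)
        W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2
    return np.concatenate([W1.ravel(), b1, W2.ravel(), b2])

p1, p2 = train(seed=1), train(seed=2)
# Both runs reach a good fit of the data, yet their parameter vectors are essentially
# uncorrelated, consistent with reading each run as a random sample from a prior
# measure over the local maximums of the optimization problem.
print("correlation between the two parameter vectors:", np.corrcoef(p1, p2)[0, 1])
```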
This is not a surprise, since oversized statistical models that still generalize well were already found many decades ago by many people[12]: a standard probability space with a uniform probability measure can be infinite-dimensional (the sphere[12] studied in this article, for instance).
More empirical evidence: no one looks at a blurred photo of a gorilla and says with certainty that it is not a man in a gorilla suit. We all have many doubts; when we are not sure about a subject, we usually express doubts through the absence of an action (not just us, but also many animals): for instance, we do not write a book about a subject we do not know about.
There is no empirical evidence that our brain tries to create content which is a short distance from content (books, conversations, etc.) created under exceptional circumstances (when doubts are minimal). When we are driving, and we do not know what is in front of us, we usually just slow down or stop the car. But what content defines “not knowing”? Is there empirical evidence about the unknown? The unknown can only be an abstract concept, expressed through probability theory or a logical equivalent. Is there empirical evidence that probabilities are reducible, that there is a simpler logical equivalent? No, quite the opposite.
The only trade-off seems to be between costs (time complexity, etc.) and understanding/control. A prior measure which we understand and/or control may mean much higher costs than an emergent (thus inconsistent and uncontrollable) prior measure which just minimizes some distance. But this trade-off is not new, and it is already present in all industries which deal with some safety risk (which is essentially all industries). Distances are efficient for proofs of concept (pilot projects), when the goal is to show that we are a short distance from where we want to be. But safety (like most features) is not being a short distance from being safe. “We were at a short distance from avoiding nuclear annihilation” is completely different from “we avoided nuclear annihilation”. To avoid nuclear annihilation we need (probability) measures, not only distances.