Thursday, November 7, 2013

Part I: Many worlds...

The many worlds of probability, reality and cognition

By Paul Conant

Where we are going
Controversy is a common thread running through the history of studies of probability assessments and statistical inference. It is my contention that so many opinions and characterizations exist because the concept roughly known as "probability" touches on the enigma of existence, with its attendant ambiguities (1).

A large and interesting literature has arisen concerning these controversies. I have sampled some of it but an exhaustive survey this is not. Nor is any formal program presented here. Rather, the idea is to try to come to grips with the assumptions on which rest the various forms of probabilistic thinking (2).

The extent of the disagreement and the many attempts to plug the holes make obvious that there is no consensus on the meaning of "probability," though students are generally taught some sort of synthesis, which may give an impression of consensus. And it is true that in general working scientists and statisticians adopt a pragmatic attitude, satisfying themselves with the vague thought that the axioms of Richard Von Mises [see below; use Control f with mises ax] and Andrey Kolmogorov and the work of Bruno De Finetti or of Emile Borel cover any bothersome meta-questions, which are seen as essentially trivial and irrelevant to the work at hand. That is to say, they tend to foster a shared assumption that science is solely based on testable ideas. However, a major tool of their work, probabilistic statistics, rests upon untestable assumptions.

Kolmogorov's axioms

De Finetti's views on probability

On Borel's contributions

I intend not only to talk about these assumptions but also to enter the no-go zone of "metaphysics." Though practicing scientists may prefer to avoid the "forbidden fruit" of ontology and epistemology found in this zone, they certainly will lack an important understanding of what they are doing if they decline to enter. The existence of the raging controversies tends to underscore the point that there is something more that needs understanding than is found in typical probability and statistics books.

Further, I intend to argue that the statistical conception of the world of appearances is only valid under certain conditions and that an unseen "noumenal" world is of great significance and implies a nonlinearity, in the asymmetric n-body sense, that current probability models cannot account for. My notion of a noumenal world is closer to Kant's view, or possibly Spencer's, than to that of the ancient Greeks -- with the proviso that modern logic and physics provide ways to discern by inference aspects of this hidden world.

In addition, I suggest that though a number of exhaustive formalizations of "probability theory" have been proffered, people tend to pilfer a few attractive concepts but otherwise don't take such formalizations very seriously -- though perhaps that assessment does not apply to pure logicians. Similarly, I wonder whether talk of such things as "the topology of a field" adds much to an understanding of probability and its role in science (3). Certainly, few scientists bother with such background considerations.

In the end, we find that the value of a probabilistic method is itself probabilistic. If one is satisfied that the success rate accords with experience, one tends to accept the method. The more so if a group corroborates that assessment.

The usual axioms of probability found in standard statistics textbooks are axioms for a reason: There is no assurance that reality will in fact operate "probabilistically," which is to say we cannot be sure that the definition of randomness we use won't somehow be undermined.

Standard axioms

This is not a trivial matter. How, for example, do we propose to use probability to cope with "backward running time" scenarios that occur in modern physics? Yes, we may have at hand a means of assigning, say, probability amplitudes, but if the cosmos doesn't always work according to our standard assumptions, then we have to question whether what some call a "universe of chance" is sufficient as a model not only of the cosmos at large, but of the "near reality" of our everyday existence (4).

And, as is so often the case in such discussions, a number of definitions are entangled, and hence sometimes we simply have to get the gist (5) of a discussion until certain terms are clarified, assuming they are.

Though we will discuss the normal curve and touch lightly on other distributions, the reader needn't worry that he or she will be subjected to much in the way of intricacies of mathematical statistics. All methods of inferential statistics rest on assumptions concerning probability and randomness, and they will be our main areas of concern.

Types of probability
Rudolf Carnap (6), in an attempt to resolve the controversy between Keynesian subjectivists and Neyman-Pearson frequentists, offered two types of probability: probability1, giving degrees of confidence or "weight of evidence," and probability2, giving "relative frequency in the long run." In my view, Carnap's two forms are insufficient.

In my classification, we have:

Probability1: Classical, as in proportion of black to white balls in an urn.

Probability2: Frequentist, as in trials of coin flips.

Probability3: Bayesian (sometimes called the probability of causes), as in determining the probability that an event happened, given an initial probability of some other event.

Probability4: Degree of confidence, as in expert opinion. This category is often subsumed under Probability3.

Probability5: "Objective" Bayesian degree of confidence, in which an expert opinion goes hand in hand with relevant frequency ratios -- whether the relative frequency forms part of the initial estimate or whether it arrives in the form of new information.

Probability6: "Subjective" Bayesian degree of confidence, as espoused by De Finetti later in life, whereby not only does probability, in some physical sense, not exist, but degree of belief is essentially a matter of individual perception.

Probability7: Ordinary a priori probability, often termed propensity, associated, for example, with gambling systems. The biases are built into the game based on ordinary frequency logic and, possibly, based on the advance testing of equipment.

Probability8: The propensity of Karl Popper, which he saw as a fundamental physical property that has as much right to exist as, but is distinct from, a force.

Probability9: Standard quantum propensity, in which experimentally determined a priori frequencies for particle detection have been bolstered by the highly accurate quantum formalism.

Probability10: Information theory probability, which is "ordinary" probability; however, subtleties enter into the elucidation of information entropy as distinguished from physical entropy. In terms of ordinary propensity, information theory accounts for the structural constraints, which might be termed advance information. These constraints reduce the new information, sometimes called surprisal value: I' = I - Ic, where I' is the new information, I the total information and Ic the structural information.

We will review these concepts to a greater or lesser degree as we proceed. Others have come up with different categorizations. From Alan Hajek

Hajek on interpretations of probability

we have three main concepts in probability:

1. Quasi-logical: "meant to measure evidential support relations." As in: "In light of the relevant seismological and geological data, it is probable that California will experience a major earthquake this decade."

2. Degree of confidence: As in: "It is probable that it will rain in Canberra."

3. An objective concept: As in: "A particular radium atom will probably decay within 10,000 years."

Ian Hacking wrote that chance began to be tamed in the 19th century when a lot of empirical data were published, primarily by government agencies.

"The published facts about deviancies [variation], and the consequent development of the social sciences, led to the erosion of determinism, so that by the end of the century C.S. Peirce could say we live in a universe of chance."

Hacking saw probability as having two aspects. "It is connected with the degrees of belief warranted by the evidence, and it is connected with the tendencies, displayed by some chance devices, to produce stable relative frequencies" (7).

Another type of probability might be called "nonlinear probability," but I decline to include this as a specific type of probability, as the concept essentially falls under the rubric of conditional probability.

By "nonlinear" probability I mean a chain of conditional probabilities that includes a feedback process. If we look at any feedback control system, we see that the output is partly dependent on itself. Many such systems, though not all, are expressed by nonlinear differential equations.

So the probability of a molecule being at point X is influenced by the probabilities of all other molecules. Now, by assumption of randomness of many force vectors, the probabilities in a flowing stream tend to cancel, leaving a constant. However, in a negative feedback system the constants must be different for the main output and the backward-flowing control stream. So we see that in some sense probabilities are "influencing themselves." In a positive feedback loop, the "self-referencing" spirals toward some physical upper bound and again we see that probabilities, in a manner of speaking, are self-conditional.

The feedback system under consideration is the human mind's interaction with the noumenal world, this interaction producing the phenomenal world. For more detail, see sections on the noumenal world (Part VI, see sidebar) and my paper Toward (link in sidebar).

Types of randomness
When discussing probability, we need to think about the complementary concept of randomness, which is an assumption necessary for independence of events.

My categories of randomness:
Randomness1: Insufficient computing power leaves much unpredictable. This is seen in nonlinear differential equations, chaos theory and in cases where small truncation differences yield widely diverging trajectories (Lorenz's butterfly effect). In computer lingo, such calculations are known as "hard" computations whose computational work increases exponentially.

Randomness2: Kolmogorov/Chaitin randomness, which is closely related to randomness1. The complexity of an output string is measured by how close to 1 is the ratio of the combined algorithm and input information to the output information. (If we reduce algorithms to Turing machines, then the algorithmic and the input information are strung together in a single binary string.)

Chaitin-Kolmogorov complexity
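
Kolmogorov-Chaitin complexity itself is not computable, but the length of a compressed file gives a rough, computable upper-bound stand-in for it. The sketch below is only an illustration under that assumption: the function name compression_ratio and the choice of zlib as the compressor are mine, not part of the theory.

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed length divided by original length.

    This is a crude proxy only: a general-purpose compressor can miss
    regularities that a cleverer program would exploit.
    """
    return len(zlib.compress(data, 9)) / len(data)

# A highly patterned string: a very short program could print it.
patterned = b"01" * 50_000

# Bytes from the operating system's randomness source: no short
# description is expected to exist.
noisy = os.urandom(100_000)

print(compression_ratio(patterned))  # far below 1
print(compression_ratio(noisy))      # close to 1
```

A ratio near 1 means the compressor found essentially no shorter description, which is the informal sense in which the string "looks random."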

Randomness3: Randomness associated with probability9 seen in quantum effects. Randomness3 is what most would regard as intrinsic randomness. Within the constraints of the Heisenberg Uncertainty Principle one cannot, in principle, predict exactly the occurrence of two properties of a quantum detection event.

Randomness4: Randomness associated with Probability8, the propensity of Popper. It appears to be a mix of randomness3 and randomness1.

Randomness5: The imposition of "willful ignorance" in order to guard against observer bias in a frequency-based experiment.

Randomness6: Most real numbers are not computable. They are inferred in ZFC set theory, inhabiting their own noumenal world. They are so wild that one must consider them to be utterly random. One notional way to write the digit string of such a number would be to tie the selection of each successive digit to a quantum detector. One could never be sure that such a string would not actually randomly find itself among the computables, though it can be claimed that there is a probability 1 (virtual certainty) that it would in fact be among the non-computables. Such a binary string would also, with probability 1, contain infinitely many copies of every possible finite substring. These sorts of probability claims are open to question as to whether they represent true knowledge, though for Platonically inclined mathematicians, they are quite acceptable. See my post
A fractal that orders the reals

Benoit Mandelbrot advocated three states of randomness: mild, wild and slow.

By mild, he meant randomness that accords easily with the bell-curved normal distribution. By wild, he meant curves that fluctuate sharply at aperiodic intervals. By slow, he meant a curve that looks relatively smooth (and accords well with the normal distribution) but steadily progresses toward a crisis point that he describes as equivalent to a physical phase shift, which then goes into the wild state.

One can see, for example, that in the iterated logistic equation, as the control parameter increases toward 4, we go from simple periodicity to intermittent finite intervals of chaos alternating with periodicity, but 4 is the crisis point: for all higher values of the parameter, orbits escape the logistic graph. The Feigenbaum constant is a measure of the tendency toward chaos (trapped aperiodic orbits) and itself might be viewed as a crisis point.
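
The behavior of the logistic equation can be checked numerically. The following is a minimal sketch, assuming the usual form x -> r*x*(1 - x), with r as the growth parameter; the function names, starting values and iteration counts are illustrative choices, not taken from any source.

```python
def logistic_orbit(r, x0=0.2, burn_in=1000, keep=8):
    """Iterate x -> r*x*(1-x); return `keep` values after discarding `burn_in`."""
    x = x0
    for _ in range(burn_in):
        x = r * x * (1.0 - x)
    orbit = []
    for _ in range(keep):
        x = r * x * (1.0 - x)
        orbit.append(x)
    return orbit

def escapes(r, x0=0.5, steps=50):
    """True if the orbit leaves the unit interval within `steps` iterations."""
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
        if x < 0.0 or x > 1.0:
            return True
    return False

print(logistic_orbit(2.8))  # settles on a single value (fixed point)
print(logistic_orbit(3.2))  # alternates between two values (period 2)
print(logistic_orbit(3.9))  # no visible pattern (chaotic regime)
print(escapes(3.9), escapes(4.1))  # False True
```

For r = 4.1 the starting value 0.5 maps at once to 1.025, outside the unit interval, which is the escape past the crisis point.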

Another way to think of this is in terms of shot noise. Shot noise may increase as we change variables. So the graph of the information stream will show disjoint spikes with amplitudes that indicate the spikes can't be part of the intended message; the spikes may gradually increase in number until we get to a crisis point, beyond which there is more noise than message. We also have the transition from laminar flow to turbulence under various constraints. The transition can be of "short" or "long" duration, where we have a mixture of turbulent vortices with essentially laminar flow.

Mandelbrot wished to express his concepts in terms of fractals, which is another way of saying power laws. Logarithmic and exponential curves generally have points near origin which, when subjectively considered, seem to mark a distinction between routine change and bizarre change. Depending on what is being measured, that distinction might occur at 2^10 or 2^10.871, or in other words arbitrarily. Or some objective measure can mark the crisis point, such as when noise equals message in terms of bits.

Wikipedia refines Mandelbrot's grading and gives seven states of randomness:

Proper mild randomness: short-run portioning is even for N=2, e.g. the normal distribution
Borderline mild randomness: short-run portioning is concentrated for N=2, but eventually becomes even as N grows, e.g. the exponential distribution with λ=1
Slow randomness with finite delocalized moments: scale factor increases faster than q but no faster than q^(1/w), w<1
Slow randomness with finite and localized moments: scale factor increases faster than any power of q, but remains finite, e.g. the lognormal distribution
Pre-wild randomness: scale factor becomes infinite for q>2 , e.g. the Pareto distribution with α=2.5
Wild randomness: infinite second moment, but finite moment of some positive order, e.g. the Pareto distribution with α=1.5
Extreme randomness: all moments are infinite, e.g. the Pareto distribution with α=1
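
The Pareto examples in the list can be checked against the standard formula for the raw moments of a Pareto distribution: the moment of order k equals α·x_min^k/(α − k) when k < α and is infinite otherwise. The sketch below is illustrative only; the function name and the choice x_min = 1 are assumptions, and it considers positive-integer orders only.

```python
import math

def pareto_raw_moment(alpha, k, x_min=1.0):
    """E[X^k] for a Pareto(alpha, x_min) variable; infinite when k >= alpha."""
    if k >= alpha:
        return math.inf
    return alpha * x_min**k / (alpha - k)

for alpha in (2.5, 1.5, 1.0):
    moments = [pareto_raw_moment(alpha, k) for k in (1, 2, 3)]
    print(alpha, moments)
```

With α=2.5 the first two moments are finite and the third is infinite; with α=1.5 only the mean is finite; with α=1 no moment of positive-integer order is finite.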

I have made no attempt to make the numbering of the categories for probability correspond with that for randomness. The types of probability I present do not carry one and only one type of randomness. How they relate to randomness is a supple issue to be discussed as we go along.

In this respect, we shall also examine the issues of ignorance of a deterministic output versus ignorance of an indeterministic (quantum) output.

Types of ignorance
Is your ignorance of what outcome will occur in an experiment utterly subjective or are there physical causes for the ignorance, as in the propensity notion? Each part of that question assumes a strict demarcation between mind and external environment in an experiment, a simplifying assumption in which feedback is neglected (but can it be, really?).

Much of the difficulty in discerning the "meaning of probability" arose with the development of quantum mechanics, which, as Jan von Plato notes, "posed an old question anew, but made it more difficult than ever before to dismiss it: is probability not also to be viewed as ontic, i.e., as a feature of reality, rather than exclusively as epistemic, i.e., as a feature characterizing our state of knowledge?" (8)

The scenario below gives a glimpse of some of the issues which will be followed up further along in this essay.

Insufficient reason
A modern "insufficient reason" scenario:

Alice tells you that she will forward an email from Bob, Christine or Dan within the next five minutes. In terms of prediction, your knowledge can then be summarized as p(X) = 1/3, where X = B, C, D. Whether Alice has randomized the order is unknown to you and so to you any potential permutation is just as good as any other. The state of your knowledge is encapsulated by the number 1/3. In effect, you are assuming that Alice has used a randomization procedure that rules out "common" permutations, such as BCD, though you can argue that you are assuming nothing. This holds on the 0th trial.
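
The number 1/3 can be recovered by brute enumeration. This is a small sketch, assuming every ordering of the three senders is treated as equally credible; the variable and function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

senders = ["Bob", "Christine", "Dan"]

# All 3! = 6 possible orders in which Alice might forward the emails.
orders = list(permutations(senders))

def chance_first(name):
    """Share of equally weighted orderings in which `name` comes first."""
    favorable = sum(1 for order in orders if order[0] == name)
    return Fraction(favorable, len(orders))

for name in senders:
    print(name, chance_first(name))  # each prints 1/3
```

Each sender heads exactly two of the six orderings, so indifference among permutations and indifference among first senders give the same figure.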

In this sort of scenario, it seems justifiable to employ the concept of equiprobability, which is a word reflecting minimum knowledge. We needn't worry about what Alice is or isn't doing when we aren't looking. We needn't worry about a hidden influence yielding a specific bias. All irrelevant in such a case (and here I am ignoring certain issues in physics that are addressed in the noumena sections (Part VI, see sidebar) and in Toward).

We have here done an exercise in classical probability and can see how the principle of insufficient reason (called by John Maynard Keynes "the principle of indifference") is taken for granted as a means of describing partial, and, one might say, subjective knowledge. We can say that in this scenario, randomness4 is operative.

However, once one trial occurs, we may wish to do the test again. It is then that "maximum entropy" or well-shuffling may be called for. Of course, before the second trial, you have no idea whether Alice has changed the order. If you have no idea whether she has changed the permutation, then you may wish to look for a pattern that discloses a "tendency" toward nonrandom shuffling. This is where Pierre-Simon Laplace steps in with his controversial rule of succession, which is intended as a means of determining whether an observed string is nonrandom; there are of course other, modern tests for nonrandomness.
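
In its usual textbook form, the rule of succession estimates the chance of a success on the next trial as (s + 1)/(n + 2) after s successes in n trials. The sketch below uses that form plus the common generalization to k equally weighted categories, (s + 1)/(n + k); the function name and the `categories` parameter are illustrative assumptions.

```python
from fractions import Fraction

def rule_of_succession(successes, trials, categories=2):
    """Laplace's estimate for the next trial: (s + 1) / (n + k)."""
    return Fraction(successes + 1, trials + categories)

# Before any trial, three possible first senders: the estimate is 1/3.
print(rule_of_succession(0, 0, categories=3))  # 1/3

# Bob's email has come first on 3 of 3 trials.
print(rule_of_succession(3, 3, categories=3))  # 2/3
print(rule_of_succession(3, 3))                # 4/5 (Bob versus not-Bob)
```

With no data the estimate reduces to the equiprobable value of the 0th trial, and it drifts away from that value as a run of identical outcomes lengthens.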

If, however, Alice tells you that each trial is to be independent, then you are accepting her word that an effective randomization procedure is at work. We now enter the realm of the frequentist. Nonparametric tests or perhaps the rule of succession can test the credibility of her assertion. It is here -- in the frequentist doctrine -- where the "law of large numbers" enters the picture. So here we suggest randomness1 and perhaps randomness2; randomness3 concerning quantum effects also applies, but is generally neglected.

We should add that there is some possibility that, despite Alice's choice of permutation, the emails will arrive in an order different from her choice. There are also the possibilities that either the sender's address and name are garbled or that no message arrives at all. Now these last possibilities concern physical actions over which neither sender nor receiver has much control. So one might argue that the subjective "principle of insufficient reason" doesn't apply here. On the other hand, in the main scenario, we can agree that not only is there insufficient reason to assign anything but 1/3 to any outcome, but that also our knowledge is so limited that we don't even know whether there is a bias toward a "common" permutation, such as BCD.

Thus, application of the principle of insufficient reason in a frequentist scenario requires some care. In fact, it has been vigorously argued that this principle and its associated subjectivism entails flawed reasoning, and that only the frequentist doctrine is correct for science.

We can think of the classical probability idea as the 0th trial of a frequency scenario, in which no degree of expertise is required to obtain the number 1/3 as reflecting the chance that you will guess right as to the first email.

Purpose of probability numbers
Though it is possible, and has been done, to sever probability conceptions from their roots in the decision-making process, most of us have little patience with such abstractions, though perhaps a logician might find such an effort worthwhile. However, we start from the position that the purpose of a probability assignment is to determine a preferred course of action. So there may be two courses of action, and we wish to know which, on the available evidence, is the better. Hence what is wanted is an equality or an inequality. We estimate that we are better off following Course A (such as crediting some statement as plausible) than we are if we follow course B (perhaps we have little faith in a second statement's plausibility). We are thus ordering or ranking the two proposed courses of action, and plan to make a decision based on the ranking.

This ranking of proposed actions is often termed the probability of a particular outcome, such as success or failure. The ranking may be made by an expert, giving her degrees of confidence, or it may be made by recourse to the proportions of classical probability, or the frequency ratios of repeated trials of an "equivalent" experiment. However, probability is some process of ranking, or prioritizing, potential courses of action. (We even have meta-analyses, in which the degrees of confidence of several experts are averaged.) Even in textbook drills, this purpose of ranking is implied. Many think that it is "obvious" that the frequency method has a built-in objectivity and that observer bias occurs when insufficient care has been taken to screen it out. Hence, as it is seemingly possible to screen out observer bias, what remains of the experiment must be objective. And yet that claim is open to criticism, not least of which is how a frequency ratio is defined and established.

In this respect, we need be aware of the concept of risk. Gerd Gigerenzer in his Calculated Risks (9) presents a number of cases in which medical professionals make bad decisions based on what Gigerenzer sees as misleading statements of risk.

He cites a 1995 Scotland coronary prevention study press release, which claimed that:

  • "People with high cholesterol can rapidly reduce" their risk of death by 22% by taking a specific drug.
  • Of 1,000 people with high cholesterol who took the drug over a five-year period, 32 died.
  • Of 1,000 people who took a placebo over the five-year period, 41 died.
There are three ways, says Gigerenzer, to present the benefit:

1. Absolute risk reduction.
(Number who die without treatment - number who die with treatment) / 1000 = (41 - 32)/1000 = 0.9%

2. Relative risk reduction.
(Absolute risk reduction) / (Number who die who haven't been treated) = 9/41 = 22%, with both quantities counted per 1,000 people

3. Number needed to treat (NNT). Number of people who must participate in a treatment in order to save one life. In this case: 9/1000 =~ 1/111, meaning about 111 people must be treated to save one life.
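
The three figures follow from the same two death counts, as this short sketch shows. The function name risk_summary is illustrative; the inputs are the press-release numbers quoted above.

```python
from fractions import Fraction

def risk_summary(deaths_treated, deaths_control, group_size):
    """Return (absolute risk reduction, relative risk reduction, NNT)."""
    arr = Fraction(deaths_control - deaths_treated, group_size)
    rrr = arr / Fraction(deaths_control, group_size)
    nnt = 1 / arr
    return arr, rrr, nnt

arr, rrr, nnt = risk_summary(32, 41, 1000)
print(float(arr))  # 0.009, i.e. 0.9%
print(float(rrr))  # about 0.22, i.e. 22%
print(float(nnt))  # about 111
```

The same trial thus yields "22%" or "0.9%" depending solely on the choice of denominator, which is the nub of Gigerenzer's complaint.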

When such reasoning was used to encourage younger women to get breast X-rays, the result was an imposition of excessive anxiety, along with radiation risk, on women without a sufficient reason, Gigerenzer writes.

John Allen Paulos on the mammogram controversy

New knowledge may affect one's estimate of a probability. But how is this new knowledge rated? Suppose you and another are to flip a coin over some benefit. The other provides a coin, and you initially estimate your chance of winning at 1/2, but then another person blurts out: "The coin is loaded."

You may now have doubts about the validity of your estimate, as well as doubt about whether, if the claim is true, the coin is mildly or strongly biased. So, whatever you do, you will use some process of estimation which may be quite appropriate but which might not be easily quantifiable.

And so one may say that the purpose of a probability ranking is to provide a "subjective" means of deciding on a course of action in "objective reality."

Eminent minds have favored the subjectivist viewpoint. For example, Frank P. Ramsey (10) proposed that probability theory represent a "logic of personal beliefs" and notes: "The degree of a belief is just like a time interval [in relativity theory]; it has no precise meaning unless we specify more exactly how it is to be measured."

In addressing this problem, Ramsey cites Mohs' scale of hardness, in which 10 is arbitrarily assigned to a diamond, etc. Using a psychological perspective, Ramsey rates degrees of belief by scale of intensity of feeling, while granting that no one feels strongly about things he takes for granted [unless challenged, we add]. And, though critical of Keynes's A Treatise on Probability, Ramsey observed that all logicians -- Keynes included -- supported the degrees of belief viewpoint whereas statisticians in his time generally supported a frequency theory outlook.

Yet, as Popper points out, "Probability in physics cannot be merely an indication of 'degrees of belief,' for some degrees lead to physically wrong results, whereas others do not" (11).

Overview of key developments on probability
We begin with the track of probability as related to decisions to be made in wagering during finite games of chance. Classical probability says we calculate an estimate based on proportions. The assumption is that the urn's content is "well mixed" or the card deck "well shuffled."

What the classical probabilist is saying is that we can do something with this information, even though the information of the system is incomplete. The classical probabilist had to assume maximum information entropy with proper shuffling, even though this concept had not been developed. Similarly, the "law of large numbers" is implicit. Otherwise, why keep playing a gambling game?

We show in the discussion on entropy below that maximum entropy means loss of all memory or information that would show how to find an output value in, say, fewer than n steps, where n is the number of steps (or, better, bits) in the shuffling algorithm. We might say this maximum Kolmogorov-Chaitin entropy amounts to de facto deterministic irreversibility.

That is, the memory loss or maximum entropy is equivalent to effective "randomization" of card order. Memory loss is implicit, though not specified in Shannon information theory -- though one can of course assert that digital memory systems are subject to the Second Law of Thermodynamics.

But even if the deck is highly ordered when presented to an observer, as long as the observer doesn't know that, his initial probability estimate, if he is to make a bet, must assume well-shuffling, as he has no reason to suspect a specific permutation.

Despite the fact that the "law of large numbers" is implicit in classical thinking, there is no explicit statement of it prior to Jacob Bernoulli and some may well claim that classical probability is not a "frequentist" theory.

Yet it is only a short leap from the classical to the frequentist conception. Consider the urn model, with say 3 white balls and 2 black balls. One may have repeated draws, with replacement, from one urn. Or one may have one simultaneous draw from five urns that each have either a black or white ball. In the first case, we have a serial, or frequentist, outlook. In the second, we have a simple proportion as described by the classical outlook.

The probability of two heads in a row is 1/4, as shown by the table.

First flip    Second flip
H             H
H             T
T             H
T             T

Now suppose we have an urn in which we place two balls and specify that there may be 0 to 2 black balls and 0 to 2 white balls. This is the same as having 4 urns, with these contents:

Urn 1: black, black
Urn 2: black, white
Urn 3: white, black
Urn 4: white, white

One then is presented with an urn and asked the probability it holds 2 black balls. The answer is 1/4.
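
Both 1/4 figures come from the same enumeration, as the sketch below illustrates. It assumes, as the count of four urns implies, that black-white and white-black are treated as distinct cases; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Serial view: two flips of one coin.
flips = list(product("HT", repeat=2))
p_two_heads = Fraction(flips.count(("H", "H")), len(flips))

# Simultaneous view: four urns, each holding an ordered pair of balls.
urns = list(product("BW", repeat=2))
p_two_black = Fraction(urns.count(("B", "B")), len(urns))

print(flips, p_two_heads)  # four outcomes, 1/4
print(urns, p_two_black)   # four urns, 1/4
```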

Though that result is trivial, the comparison underscores how classical and frequentist probability are intertwined.

So one might argue that the law of large numbers is the mapping of classical probabilities onto a time interval. But if so, then classical probability sees the possibility of maximum entropy as axiomatic (not that early probabilists necessarily thought in those terms). In classical terms, one can see that maximum entropy is equivalent to the principle of insufficient reason, which to most of us seems quite plausible. That is, if I hide the various combinations of balls in four urns and don't let you see me doing the mixing, then your knowledge is incomplete, but sufficient to know that you have one chance in four of being right.

But, one quickly adds, what does that information mean? What can you do with that sort of information as you get along in life? It is here that, if you believe in blind chance, you turn to the "law" of large numbers. You are confident that if you are given the opportunity to choose over many trials, your guess will turn out to have been right about 25% of the time.
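
That confidence can be illustrated, though not proved, by simulation. In the sketch below a pseudorandom generator stands in for "blind chance," which is itself an assumption of the kind under discussion; the function name, the seed and the trial counts are arbitrary choices.

```python
import random

def fraction_correct(trials, seed=2013):
    """Guess 'two black balls' every time; return the share of correct guesses."""
    rng = random.Random(seed)
    urns = ["BB", "BW", "WB", "WW"]
    hits = sum(1 for _ in range(trials) if rng.choice(urns) == "BB")
    return hits / trials

for n in (10, 1000, 100_000):
    print(n, fraction_correct(n))
```

The short runs can stray well away from 0.25, while the long run settles close to it.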

So one can say that, at least intuitively, it seems reasonable that many trials, with replacement, will tend to verify the classically derived probability, which becomes the asymptotic limiting value associated with the law of large numbers. Inherent in the intuition behind this "law" is the notion that hidden influences are random -- if we mean by random that over many trials, these influences tend to cancel each other out, analogous to the fact that Earth is essentially neutrally charged because atoms are, in the large, randomly oriented with respect to one another, meaning that the ionic charges virtually cancel out. Notice the circularity problem in that description.

Nevertheless, one could say that we may decide on, say 10 trials of flipping a coin, and assume we'd be right on about 50% of guesses -- as we have as yet insufficient reason to believe the coin is biased. Now consider 10 urns, each of which contains a coin that is either head up or tail up. So your knowledge is encapsulated by the maximum ignorance in this scenario. If asked to bet on the first draw, the best you can do is say you have a 50% chance of being right (as discussed a bit more below). The notion that, on average, hidden force vectors in a probabilistic scenario cancel out might seem valid if one holds to conservation laws, which imply symmetries, but I am uncertain on this point; it is possible that Noether's theorem applies.

So it ought to be apparent that Bernoulli's frequency ideas were based on the principle of insufficient reason, or "subjective" ignorance. The purpose of the early probabilists was to extract information from a deterministic system, some of whose determinants were unknown. In that period, their work was frowned upon because of the belief that the drawing of lots should be reserved for specific moral purposes, such as the settling of an argument or discernment of God's will on a specific course of action. Gambling, though a popular pastime, was considered sinful because God's will was being ignored by the gamers.

Such a view reflects the Mosaic edict that Israel take no census, which means this: To take a census is to count one's military strength, which implies an estimate of the probability of winning a battle or a war. But the Israelites were to trust wholly in their God for victory, and not go into battle without his authorization. Once given the word to proceed, they were to entertain no doubt. This view is contrary to the custom of modern man, who, despite religious beliefs, assesses outcomes probabilistically, though usually without precise quantification.

To modern ears, the idea that probabilities are based on a philosophy that ejects divine providence from certain situations sounds quite strange. And yet, if there is a god, should we expect that such a being leaves certain things to chance? Does blind chance exist, or is that one of the many illusions to which we humans are prone? (I caution that I am not attempting to prove the existence of a deity or to give a probability to that supposition.)

In classical and early frequentist approaches, the "maximum entropy" or well-mixing concept was implicitly assumed. And yet, as Ludwig Boltzmann and Claude Shannon showed, one can think of degrees of entropy that are amenable to calculation.
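
As a minimal sketch of what a calculable "degree of entropy" can look like, the following Python snippet computes Shannon's entropy for a two-outcome system such as a coin. The function name shannon_entropy and the sample distributions are illustrative assumptions chosen here, not anything taken from Boltzmann's or Shannon's own presentations.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits; outcomes with probability 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

fair = shannon_entropy([0.5, 0.5])      # the "well-mixed" two-outcome case
biased = shannon_entropy([0.9, 0.1])    # a hypothetical biased coin
certain = shannon_entropy([1.0, 0.0])   # outcome known in advance

print(fair, biased, certain)
```

The fair coin attains the two-outcome maximum of 1 bit, which corresponds to the maximum-entropy case that the early probabilists assumed implicitly; any bias lowers the figure, and certainty reduces it to zero.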

Von Mises has been called the inventor of modern frequentism, which he tried to put on a firm footing by making axioms of the law of large numbers and of the existence of randomness, by which he meant that, over time, no one could "beat the odds" in a properly arranged gambling system.

The Von Mises axioms

1. The axiom of convergence: "As a sequence of trials is extended, the proportion of favorable outcomes tends toward a definite mathematical limit."

2. The axiom of randomness: "The limiting value of the relative frequency must be the same for all possible infinite subsequences of trials chosen solely by a rule of place selection within the sequence (i.e., the outcomes must be randomly distributed among the trials)."
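
The two axioms can be illustrated, though certainly not verified, by a finite simulation. The sketch below assumes that Python's pseudorandom generator is an acceptable stand-in for a "random" sequence, which is itself one of the assumptions under discussion; the seed, the sequence length and the two place-selection rules are arbitrary choices made for illustration.

```python
import random

def relative_frequency(outcomes):
    """Proportion of 1s (heads) in a list of 0/1 outcomes."""
    return sum(outcomes) / len(outcomes)

rng = random.Random(2013)   # fixed seed, chosen arbitrarily for repeatability
trials = [rng.randint(0, 1) for _ in range(100_000)]

# Axiom 1 (illustration only): the running proportion settles as n grows.
for n in (100, 10_000, 100_000):
    print(n, relative_frequency(trials[:n]))

# Axiom 2 (illustration only): two place-selection rules. Each picks a trial
# by its position or by earlier outcomes, never by the trial's own outcome.
every_third = trials[::3]
after_a_head = [trials[i] for i in range(1, len(trials)) if trials[i - 1] == 1]

freq_all = relative_frequency(trials)
freq_third = relative_frequency(every_third)
freq_after_head = relative_frequency(after_a_head)
print(freq_all, freq_third, freq_after_head)
```

All three frequencies should land near 0.5, but a finite run can only suggest the behavior; the axioms speak of limits over infinite sequences, which no simulation reaches.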

Alonzo Church on the random sequences of Von Mises

Of course, the immediate objection is that declaring axioms does not necessarily mean that reality agrees. Our collective experience is that reality does often seem to be in accord with Von Mises's axioms. And yet, one cannot say that science rests on a testable foundation, even if nearly all scientists accept these axioms. In fact, it is possible that these axioms are not fully in accord with reality and only work within limited spheres. A case in point is Euclid's parallel postulate, which may not hold at the cosmic scale. In fact, the counterintuitive possibilities for Riemann space demonstrate that axioms agreed to by "sensible" scientists of the Newtonian mold are not necessarily the "concrete fact" they were held to be.

Consider the old syllogism:

1. All men are mortal.
2. Socrates is a man.
3. Hence, Socrates is mortal.

It is assumed that all men are mortal. But suppose in fact 92% of men are mortal. Then conclusion 3 is also not certain, but only rather probable.

Following in David Hume's track, we must concede to having no way to prove statement 1, as there might be some exception that we don't know about. When we say that "all men are mortal," we are relying on our overwhelming shared experience, with scientists proceeding on the assumption that statement 1 is self-evidently true.

If we bring the study of biology into the analysis, we might say that the probability that statement 1 holds is buttressed by both observational and theoretical work. So we would assign the system of members of the human species a propensity of virtually 1 as to mortality. That is, we take into account the systemic and observational evidence in assigning an a priori probability.

And yet, though a frequency model is implicit in statement 1, we cannot altogether rule out an exception, not having the power of perfect prediction. Thus, we are compelled to accept a degree of confidence or degree of belief. How is this to be arrived at? A rough approximation might be to posit a frequency of less than 1 in 7 billion, but that would say that the destiny of everyone alive on Earth today is known. We might match a week's worth of death records over some wide population against a week's worth of birth records in order to justify a statement about the probability of mortality. But that isn't much of a gain. We might as well simply say the probability of universal mortality is held to be so close to 1 as to be accepted as 1.

The difficulties with the relative frequency notion of probability are well summarized by Hermann Weyl (13). Weyl noted that the earlier parts of Jacob Bernoulli's Ars Conjectandi were sprinkled with words connoting subjective ideas, such as "hope" and "expectation." In the fourth part of that book, however, Bernoulli introduces the seemingly objective "law of large numbers," which he established with a mathematical proof. Yet, says Weyl, the logical basis for that law has remained murky ever since.

Wikipedia article on 'Ars Conjectandi'

True, says Weyl, Laplace emphasized the aspect of probability expressed by the classical quantitative definition: the quotient of the number of favorable cases over the number of all possible cases. "Yet this definition presupposes explicitly that the different cases are equally possible. Thus, it contains as an aprioristic basis a quantitative comparison of possibilities."

The conundrum of objectivity is underscored by the successful use of inferential statistics in the hard and soft sciences, in the insurance business and in industry in general, Weyl points out.

Yet, if probability theory only concerns relative frequencies, we run into a major problem, Weyl argues. Should we not base this frequency interpretation directly on trial series inherent in the law of large numbers? We might say the limiting value is reached as the number of trials increases "indefinitely." But even so it is hard to avoid the fact that we are introducing "the impossible fiction of an infinity of trials having actually been conducted." He adds, "Moreover, one thereby transcends the content of the probability statement. Inasmuch as agreement between relative frequency and probability p is predicted for such a trial series with 'a probability approaching certainty indefinitely,' it is asserted that every series of trials conducted under the same conditions will lead to the same frequency value."

The problem, as Weyl sees it, is that if one favors "strict causality," then the methods of statistical inference must find a "proper foundation in the reduction to strict law," but this ideal seems to run into the limit of partial acausality at the quantum level. Weyl thought that perhaps physics could put statistical inference on a firm footing, giving the physical example of equidistribution of gas molecules, based on the notion that forces among molecules are safe to ignore in that they tend to cancel out. But here the assumption behind this specimen of the law of large numbers has a physical basis -- namely Newtonian physics, which, in our terms, provides the propensity information that favors the equiprobabilities inherent in equidistribution.

However, I do not concede that this example proves anything. Really, the kinetic gas theories of Maxwell, Boltzmann and Gibbs tend to assert the Newtonian mechanics theory, but are based on the rough and ready relative-frequency empirico-inductive perception apparatus used by human beings and other mammals.

How does one talk about frequencies for infinite sets? In classical mechanics, a molecule's net force vector might point in any direction and so the probability of any specific direction equals zero, leading Weyl to remark that in such continuous cases one can understand why measure theory was developed.

On measure theory

Two remarks:

1. In fact, the force vector's possible directions are limited by Planck's constant, meaning we have a large population of discrete probabilities which can very often be treated as an infinite set.

2. Philosophically, one may agree with Newton and construe an infinitesimal as a discrete unit that exists in a different realm than that of the reals. We see a strong echo of this viewpoint in Cantor's cardinal numbers representing different orders of infinity.

An important development around the turn of the 19th century was the emergence of probabilistic methods of dealing with error in observation and measurement. How does one construct a "good fit" curve from observations which contain seemingly random errors? By "random" an observer means that he has insufficient information to pinpoint the source of the error, or that its source isn't worth the bother of determining. (The word "random" need not be used only in this sense.)

The binomial probability formula is simply a way of expressing possible proportions using combinatorial methods; it is a logical tool for both classical and frequentist calculations.

Now this formula (function) can be mapped onto a Cartesian grid. What it is saying is that finite probabilities are highest for sets with the highest finite numbers of elements, or permutations. As a simple example, consider a coin-toss experiment. Five flips yield j heads and k tails, where j and k each range from 0 to 5 and j + k = 5.

This gives the binomial result:

5C5 = 1, 5C4 = 5, 5C3 = 10, 5C2 = 10, 5C1 = 5, 5C0 = 1.

You can, obviously, visualize a symmetrical graph with two center bars 10 units high flanked on both sides by bars of diminishing height.

Now if p = 1/2 and q = 1/2 (the probability of occurrence and the probability of non-occurrence), we get the symmetrical set of probabilities:

1/32, 5/32, 10/32, 10/32, 5/32, 1/32

We see here that there are 10 permutations with 3 heads and 2 tails and 10 with 3 tails and 2 heads in which chance of success or failure is equal. So, if you are asked to bet on the number of heads turning up in 5 tosses, you should -- assuming some form of randomness -- choose either 3 or 2.
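
The five-toss figures can be reproduced with a short Python sketch. It assumes a fair coin (p = 1/2) and uses exact fractions so that nothing is lost to rounding; the variable names are illustrative only.

```python
from fractions import Fraction
from math import comb

n = 5
p = Fraction(1, 2)   # assumed fair coin

# Number of arrangements with k heads, for k = 0..5
counts = [comb(n, k) for k in range(n + 1)]
print(counts)   # [1, 5, 10, 10, 5, 1]

# Binomial probabilities nCk * p^k * q^(n-k), with q = 1 - p
probs = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
print([f"{c}/{2**n}" for c in counts])   # ['1/32', '5/32', '10/32', '10/32', '5/32', '1/32']

# The most probable head counts, i.e. the sensible bets
best = [k for k in range(n + 1) if probs[k] == max(probs)]
print(best)   # [2, 3]
```

The probabilities sum to 1, and the two central values tie for the maximum, which is why a bet on either 2 or 3 heads is the best available under the stated assumption.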

Clearly, sticking with the binomial case, there is no reason not to let the number of notional tosses go to infinity, in which case every specific probability reduces to zero. Letting the binomial graph go to infinity gives us the Gaussian normal curve. The normal curve is useful because calculational methods have been worked out that make it more convenient than binomial (or multinomial) probability calculation. And, it turns out that as n increases in the binomial case, probabilities that arise from situations where replacement is logically required are nevertheless well approximated by probabilities arising with the no-replacement assumption.

So binomial probabilities are quite well represented by the Gaussian curve when n is large enough. Note that, implicitly, we are assuming "well mixing" or maximum entropy.
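
How well the Gaussian curve tracks the binomial can be checked numerically. The sketch below assumes p = 1/2 and compares the exact binomial probabilities with a normal density having the same mean np and variance npq; the helper names binomial_pmf, normal_pdf and max_gap are hypothetical, introduced only for this illustration.

```python
import math

def binomial_pmf(n, k, p=0.5):
    """Exact binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x, mean, variance):
    """Gaussian density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def max_gap(n, p=0.5):
    """Largest absolute difference between the two curves over k = 0..n."""
    mean, variance = n * p, n * p * (1 - p)
    return max(abs(binomial_pmf(n, k, p) - normal_pdf(k, mean, variance))
               for k in range(n + 1))

for n in (10, 100, 400):
    print(n, max_gap(n))
```

The largest gap shrinks as n grows, which is the sense in which the normal curve becomes a convenient substitute. The comparison says nothing about whether either curve describes a real process; that still rests on the well-mixing assumption.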

So the difference between the mean and the next unit shrinks with n.

50C25/51C25 = 26/51 = 0.509803922

and if we let n run to infinity, that ratio tends to 0.5 in the limit.

So it made sense, as a useful calculational tool, to use the Gaussian curve, where n runs to infinity.
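
The ratio of central coefficients can be computed exactly. Since nCk/(n+1)Ck simplifies to (n+1-k)/(n+1), taking k = n/2 gives (n/2 + 1)/(n + 1), which approaches 1/2 as n grows. The following Python sketch checks this with exact fractions; the function name central_ratio is an illustrative assumption.

```python
from fractions import Fraction
from math import comb

def central_ratio(n):
    """Exact ratio nC(n/2) / (n+1)C(n/2) for even n."""
    k = n // 2
    return Fraction(comb(n, k), comb(n + 1, k))

for n in (10, 50, 1000, 10_000):
    r = central_ratio(n)
    print(n, r, float(r))
```

For n = 50 the exact value is 26/51, and the later rows show the ratio creeping toward 0.5 without reaching it for any finite n.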

Yet one should beware of carelessly assuming that such a distribution is some form of "objective" representation of reality. As long as no one is able to fully define the word "random" in whatever aspect, no one can say that the normal curve serves as a viable approximate representation of some arena of reality. Obviously, however, that distribution has proved to be immensely productive in certain areas of science and industry -- though one should not fail to appreciate its history of misuse. At any rate, a great advantage of the normal curve is that it so well represents the binomial distribution.

Certainly, we have here an elegant simplification, based on the assumption of well-mixing or maximum entropy. As long as we use the normal distribution to approximate the possibilities for a finite experiment, that simplification will be accepted by many as reasonable. But if the bell curve is meant to represent some urn containing an infinitude of potential events, then the concept of normal distribution becomes problematic. That is, any finite cluster of, say, heads, can come up an infinity of times. We can say our probability of witnessing such a cluster is low, but how do we ensure well mixing to make sure that that belief holds? If we return to the urn model, how could we ensure maximally entropic shuffling of an infinitude of black and possibly white balls? We have no recourse but to appeal to an unverifiable Platonic ideal or perhaps to say that the principle of insufficient reason is, from the observer's perspective, tantamount to well mixing. (Curiously, the axiom of choice of Zermelo-Fraenkel set theory enters the picture here, whereby one axiomatically is able to obtain certain subsets of an infinitude.)

Keynes takes aim at the principle of indifference (or, in our terms, zero propensity information) in this passage:

"If, to take an example, we have no information whatever as to the area or population of the countries of the world, a man is as likely to be an inhabitant of Great Britain as of France, there being no reason to prefer one alternative to the other.

"He is also as likely to be an inhabitant of Ireland as of France. And on the same principle he is as likely to be an inhabitant of the British Isles as of France. And yet these conclusions are plainly inconsistent. For our first two propositions together yield the conclusion that he is twice as likely to be an inhabitant of the British Isles as of France. Unless we argue, as I do not think we can, that the knowledge that the British Isles are composed of Great Britain and Ireland is a ground for supposing that a man is more likely to inhabit them than France, there is no way out of the contradiction. It is not plausible to maintain, when we are considering the relative populations of different areas, that the number of names of subdivisions which are within our knowledge, is, in the absence of any evidence as to their size, a piece of relevant evidence.

"At any rate, many other similar examples could be invented, which would require a special explanation in each case; for the above is an instance of a perfectly general difficulty. The possible alternatives may be a, b, c, and d, and there may be no means of discriminating between them; but equally there may be no means of discriminating between (a or b), c, and d" (14).

Modern probability texts avoid this difficulty by appeal to set theory. One must properly define sets before probabilities can be assigned.
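
Keynes's difficulty can be made concrete with a small Python sketch. It assumes, purely for illustration, a world restricted to the regions he names and applies the principle of indifference to two different ways of listing the alternatives; the function name indifference is hypothetical.

```python
from fractions import Fraction

def indifference(alternatives):
    """Give every listed alternative the same probability."""
    return {name: Fraction(1, len(alternatives)) for name in alternatives}

coarse = indifference(["British Isles", "France"])
fine = indifference(["Great Britain", "Ireland", "France"])

# Under the finer listing, the British Isles are Great Britain plus Ireland.
british_isles_fine = fine["Great Britain"] + fine["Ireland"]

ratio_coarse = coarse["British Isles"] / coarse["France"]
ratio_fine = british_isles_fine / fine["France"]
print(ratio_coarse, ratio_fine)   # 1 2
```

The same man is equally likely to live in the British Isles as in France under one listing and twice as likely under the other. Fixing the sets in advance removes the contradiction only by choosing one listing, which is the step the principle of indifference cannot justify on its own.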

Two points:
1. For most purposes, no one would gain knowledge via applying probability rankings to Keynes's scenario. However, that doesn't mean no situation will ever arise in which it is worthwhile to apply probabilistic methods, though of course the vagueness of the sets makes probability estimates equally vague.

2. If we apply set theory, we are either using naive set theory, where assumptions are unstated, or axiomatic set theory, which rests on unprovable assertions. In the case of standard ZFC set theory, Goedel's incompleteness theorem means that the formalism is either incomplete or inconsistent. Further, if ZFC is consistent, its consistency cannot be proved within ZFC itself.

Randomness4 arises when willful ignorance is imposed as a means of obtaining a classical form of probability, or of having insufficient reason to regard events as other than equiprobable.

Consider those exit polls that include late sampling, which are the only exit polls where it can be assumed that the sample set yields a quantity close to the ratio for the entire number of votes cast.

This is so, it is generally believed, because if the pollsters are doing their jobs properly, the pollster's selection of every nth person leaving a polling station screens out any tendency to select people who are "my kind."

In fact, the exit poll issue underscores an existential conundrum: suppose the exit poll ratio for candidate A is within some specified margin of error for a count of the entire vote. That is to say, with a fairly high number of ballots there are very likely to be occasional ballot-count errors, which, if random, will tend to cancel. But the level of confidence in the accuracy of the count may be only, say, 95%. If the exit poll has a low margin of error in the counting of votes -- perhaps the pollsters write down responses with 99% accuracy -- then one may find that the exit poll's accuracy is better than the accuracy of the entire ballot count.

A recount may only slightly improve the accuracy of the entire ballot count. Or it may not provably increase its accuracy at all, if the race is especially tight and the difference is within the margin of error for ballot counting.

A better idea might be to have several different exit polls conducted simultaneously with an average of results taken (the averaging might be weighted if some exit pollsters have a less reliable track record than others).

So as we see -- even without the theorems of Kurt Goedel and Alan Turing and without appeal to quantum phenomena -- some statements may be undecidable. It may be impossible to get definitive proof that candidate A won the election, though in most cases recounts and enough attention to sources of bias would possibly drastically alter the error probabilities. But even then, one can't be certain that in a very tight race the outcome wasn't rigged.

It is important to understand that when randomness4 is deployed, the assumption is that influences that would favor a bias tend to cancel out. In the case of an exit poll, it is assumed that voters tend to arrive at and leave the polling station randomly (or at least pseudorandomly). The minor forces affecting their order of exiting tend to cancel, it is believed, permitting confidence in a sample based on every nth voter's disclosure of her vote.
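
The every-nth-voter procedure can be sketched as a simulation. Everything numerical here is hypothetical: the 52% share for candidate A, the 200,000 voters, the sampling step of 100 and the seed are assumptions for illustration. The simulation also builds in the very premise at issue, namely that voters leave the polling station in an order unrelated to how they voted.

```python
import math
import random

rng = random.Random(7)      # fixed seed, chosen arbitrarily
true_share = 0.52           # hypothetical share for candidate A

# 1 = voted for A, 0 = voted otherwise; exit order assumed unrelated to the vote
voters = [1 if rng.random() < true_share else 0 for _ in range(200_000)]

step = 100                  # interview every 100th voter leaving the station
sample = voters[step - 1::step]
estimate = sum(sample) / len(sample)

# Conventional 95% margin of error for a sampled proportion
margin = 1.96 * math.sqrt(estimate * (1 - estimate) / len(sample))
full_count = sum(voters) / len(voters)

print(len(sample), estimate, margin, full_count)
```

With 2,000 interviews the margin of error comes to roughly two percentage points, so in this hypothetical race a 52% result cannot be cleanly separated from a tie. If the exit order were correlated with the vote, the cancellation assumed here would fail and the margin would understate the real error.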

In another important development in probability thinking, Karl Popper in the mid-20th century proposed the propensity idea as a means of overcoming the issue of "subjectivity," especially in the arena of quantum mechanics. This idea says that physical systems have elementary propensities or tendencies to yield a particular proposition about some property. In his thinking, propensity is no more "occult" a notion than the notion of force. The propensity can be deduced because it is an a priori (a term Popper disdains) probability that is fundamental to the system. The propensity is as elementary a property as is the spin of an electron; it can't be further reduced or described in terms of undetected vibrations (though he didn't quite say that).

The propensity probability can be approximated via repeated trials, but applies immediately on the first trial. By this, Popper avoids the area of hidden variables and in effect quantizes probability, though he doesn't admit to having done so. What he meant to do was minimize quantum weirdness so as to save appearances, that is, "external" reality.

Popper wasn't always clear as to what he meant by realism, but it is safe to assume he wanted the laws of physics to hold whether or not he was sleeping. Even so, he was forced to concede that it might be necessary to put up with David Bohm's interpretation of quantum behaviors, which purports to save realism only by sacrificing locality and accepting "spooky action at a distance."

The notion of propensity may sometimes merge with standard ideas of statistical inference. Consider this passage from J.D. Stranathan's history of experimental physics:

"In the case of the Abraham theory the difference between the observed and calculated values of m/mo are all positive; and these differences grow rather large at the higher velocities. On the Lorentz theory the differences are about as often positive as negative; the sum of the positive errors is nearly equal to the sum of negative ones. Furthermore, there is no indication that the error increases at the higher velocities. These facts indicate that the errors are not inherent in the theory; the Lorentz theory describes accurately the observed variation" (15).

Hendrik A. Lorentz was first to propose relativistic mass change for the electron only, an idea generalized by Einstein to apply to any sort of mass. Max Abraham clung to a non-relativistic ether theory.


1. I do not regard myself as an expert statistician. When I wish to obtain some statistical result, I consult my textbooks. Neither do I rate myself as extremely facile with probability calculations, though I sometimes enjoy challenging questions.
2. At times we will use the word "probability" as short for "probabilistic thinking."
3. That is not to say that topologies can have no effect on probability calculations. For example, topologies associated with general relativity certainly may be relevant in probability calculations, or may not be relevant when it comes to such concepts as the four-dimensional spacetime block.
4. A term attributed to the 19th century logician and philosopher C.S. Peirce.
5. The convenient word "gist" has an echo of some noumenal realm, as it stems from the word "geist," as in "spirit" or "ghost."
6. Logical Foundations of Probability by Rudolf Carnap (University of Chicago, 1950).
7. From the 2006 introduction to The Emergence of Probability: A Philosophical Study of Early Ideas about Probability, Induction and Statistical Inference by Ian Hacking (Cambridge, 1975, 2006).
8. Jan von Plato in The Probability Revolution vol 2: ideas in the sciences; Kruger, Gigerenzer, and Morgan, editors (MIT Press, 1987).
9. Calculated Risks: How to know when numbers deceive you by Gerd Gigerenzer (Simon and Schuster, 2002).
10. Truth and Probability (1926) by Frank P. Ramsey, appearing in The Foundations of Mathematics and Other Logical Essays.
11. The Logic of Scientific Discovery by Karl Popper. Published as Logik der Forschung in 1935; English version published by Hutchinson in 1959.
Kluwer Academic Publishers 2001.
13. The Philosophy of Mathematics and Science by Hermann Weyl (Princeton University Press, 1949, 2009).
14. A Treatise on Probability by J.M. Keynes (Macmillan, 1921).
15. The "Particles" of Modern Physics by J.D. Stranathan (Blakiston, 1942).
