Law of large numbers
In my view, a basic idea behind the "law of large numbers" is that minor influences tend to cancel one another out as the number of trials grows without bound. We might consider these influences to be small force vectors that can have a butterfly effect on which path is taken. If we map these small vectors onto a sine wave graph, we can see heuristically how the little bumps above the axis tend to be canceled by the little bumps below the axis, giving partially destructive interference. We can also see how small forces so mapped occasionally superpose constructively and, if the amplitude is sufficient, a "tipping point" is reached and the coin falls head side up.
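As a heuristic illustration of this cancellation picture, here is a minimal sketch of my own (not from the text), in which the "small force vectors" are modeled crudely as zero-mean random perturbations whose sign-sum decides the face of the coin; the observed frequency of heads drifts toward 1/2 as the number of tosses grows.

import random

def coin_flip(n_influences=100):
    # Sum many small, zero-mean "force vectors"; the sign of the total
    # decides which face lands up (a crude tipping-point model).
    total = sum(random.uniform(-1, 1) for _ in range(n_influences))
    return 1 if total > 0 else 0  # 1 = heads, 0 = tails

for trials in (100, 5_000, 50_000):
    heads = sum(coin_flip() for _ in range(trials))
    print(trials, heads / trials)  # the frequency creeps toward 1/2 as trials grow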
In fact, two forms of this law -- the weak and the strong -- have been elucidated. This distinction, however, doesn't address the fundamental issues raised here.
On the law of large numbers
http://www.mhhe.com/engcs/electrical/papoulis/graphics/ppt/lectr13a.pdf
The strong law
http://mathworld.wolfram.com/StrongLawofLargeNumbers.html
The weak law
http://mathworld.wolfram.com/WeakLawofLargeNumbers.html
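For reference, the two laws are commonly stated as follows (a standard textbook formulation, given here for convenience rather than drawn from the linked notes): for independent, identically distributed random variables X_1, X_2, ... with finite mean mu and sample mean \bar{X}_n,

\[
\text{Weak law:}\quad \lim_{n\to\infty} P\bigl(\,|\bar{X}_n - \mu| > \varepsilon\,\bigr) = 0 \quad \text{for every } \varepsilon > 0,
\]
\[
\text{Strong law:}\quad P\bigl(\lim_{n\to\infty} \bar{X}_n = \mu\bigr) = 1.
\]

The strong law asserts almost-sure convergence, the weak law only convergence in probability; as noted above, the distinction does not by itself settle the interpretive questions at issue here.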
Keynes raises, from a logician's perspective, strong objections to the law of large numbers, though he considers them minor from the physicist's point of view. His solution is to eschew explaining reproducible regularities in terms of accumulations of accidents. "...for I do not assert the identity of the physical and the mathematical concept of probability at all; on the contrary, I deny it" (15).
This amounts to tossing out the logical objections to the "law," and accepting that "law" on an ad hoc or axiomatic basis. However, he makes an attempt at a formalistic resolution.
Neither is that law always valid, he says. "The rule that extreme probabilities have to be neglected ... agrees with the demand for scientific objectivity." That is, there is the "obvious objection" that even an enormous improbability always remains a probability, however small, and that consequently even the most "impossible" processes -- i.e., those which we propose to neglect -- will someday happen. And that someday could be today.
Keynes, to make his point, cites some extraordinarily improbable distributions of gas molecules in the Maxwell's demon thought experiment. "Even if a physicist happened to observe such a process, he would be quite unable to reproduce it, and therefore would never be able to decide what had really happened in this case, and whether he had made an observational mistake" (16).
Citing Arthur Eddington's statement to the effect that some things in nature are impossible, while other things don't happen because of their remote probability, Keynes says that he prefers to avoid non-testable assertions about whether extremely improbable things in fact occur. Yet, he observes that Eddington's assertion agrees well with how the physicist applies probability theory (17).
I note that if a probability is so remote as to be untestable via experiment, then, as E.T. Jaynes says, a frequentist model is not necessarily hard and fast. It can only be assumed that the probability assignments are adequate guides for some sort of decision-making. Testing is out of the question for extreme cases.
So, I suggest that Keynes here is saying that the scientific basis for probability theory is intuition.
A problem with Keynes's skepticism regarding highly improbable events is that without them, the notion of randomness loses some of its power.
The mathematics of chaos and catastrophe theories makes this clear. In the case of a "catastrophe" model of a continuously evolving dynamical system, sudden discrete jumps to a new state are inevitable, though it may not be so easy to say when such a transition will occur.
Concerning catastrophe theory
http://www.physics.drexel.edu/~bob/PHYS750_NLD/Catastrophe_theory.pdf
Nonlinear dynamics and chaos theory
http://www2.bren.ucsb.edu/~kendall/pubs_old/2001ELS.pdf
We also must beware of applying the "urn of nature" scenario. An urn holds one of a set of possible ratios of white to black balls, but a nonlinear dynamical system is a poor candidate for urn modeling. Probabilities apply well to uniform -- which is to say, for practical purposes, periodic -- systems. One might possibly justify Laplace's rule of succession on this basis. However, quasi-periodic systems may well give a false sense of security, perhaps masking sudden jolts into atypical, possibly chaotic, behavior. Wasn't everyone carrying on as usual when, in 2004, a tsunami killed 230,000 people in 14 countries bordering the Indian Ocean?
So we must be very cautious about how we use probabilities concerning emergence of high-information systems. Here is why: A sufficiently rich mix of chemical compounds may well form a negative feedback dynamical system. It would then be tempting to apply a normal probability distribution to such a system, and that distribution very well may yield reasonable results for a while. But, if the dynamical system is nonlinear -- which most are -- the system could reach a threshold, akin to a chaos point, at which it crosses over into a positive feedback system or into a substantially different negative feedback system.
The closer the system draws to that tipping point, the less the normal distribution applies. In chaotic systems, normal probabilities, if applicable, must be applied with great finesse. Hence to say that thus and such an outcome is highly improbable based on the previous state of the system is to misunderstand how nonlinearities can work. In other words, a Markov process (see below) is often inappropriate for predicting "highly improbable" events, though it may do as a good enough approximation in many nonlinear scenarios.
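A minimal sketch of such a threshold effect, using the familiar logistic map as a stand-in for a nonlinear feedback system (an illustrative choice of mine, not an example from the text): below roughly r = 3.57 the iterates settle into a periodic pattern, while slightly above that value they wander chaotically, so statistics gathered in the stable regime say little about behavior past the transition.

def logistic_orbit(r, x0=0.2, warmup=500, keep=8):
    # Iterate x -> r * x * (1 - x), discard transients, return a few values.
    x = x0
    for _ in range(warmup):
        x = r * x * (1 - x)
    out = []
    for _ in range(keep):
        x = r * x * (1 - x)
        out.append(round(x, 4))
    return out

for r in (3.2, 3.5, 3.57, 3.8):
    print(r, logistic_orbit(r))  # periodic cycles give way to chaotic wandering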
It is noteworthy that Keynes thought that the work of Pafnuty Chebyshev and Andrey Markov should replace Laplace's rule, implying that he thought a Markov process adequate for most probabilistic systems (18). Certainly he could not have known much of what came to be known as chaos theory and nonlinear dynamics.
Another issue is the fact that an emergent property may not be obvious until it emerges (echoes of David Bohm's "implicate order"). Consider the Moebius band. Locally, the surface is two-sided, such that a vector orthogonal to the surface has a mirror vector pointing in the opposite direction. Yet, at the global scale, the surface is one-sided and a mirror vector is actually pointing out from the same surface as its partner is.
If a paper model of a Moebius strip were partially shown through a small window and an observer were asked if she thought the paper was two-sided, she would reply: "Of course." Yet, at least at a certain scale at which thickness is ignored, the paper strip has one side.
What we have in the case of catastrophe and chaos events is often called pseudorandomness, or effectively incalculable randomness. In the Moebius band case, we have a difficulty on the part of the observer of conceptualizing emergent properties, an effect also found in Ramsey order.
We can suggest the notion of unfoldment of information, thus: We have a relation R representing some algorithm.
Let us suppose an equivalence relation such that
(i) aR_ref a <--> aR_ref a (reflexivity).
(ii) aR_sym b <--> bR_sym a (symmetry).
(iii) aR_trn b and bR_trn c --> aR_trn c (transitivity).
The redundancy, or structural information, is associated with R. So aRa corresponds to 0 Shannon information in the output. The reflexivity condition is part of the structural information for R, but this redundancy is irrelevant for R_ref. The structural information is relevant in the latter two cases. In those cases, if we do not know the structure or redundancy in R, we say the information is enfolded. Once we have discovered some algorithm for R, then we say the information has been revealed and is close to zero, but not quite zero, as we may not have advance knowledge concerning the variables.
Some would argue that what scientists mean by order is well summarized by an order relation aRb, such that A X B is symmetric and transitive but not reflexive. However, I have yet to convince myself on this point.
Ramsey order
John Allen Paulos points out an important result of network theory that guarantees that some sort of order will emerge. Ramsey proved a "strange theorem," stating that if one has a sufficiently large set of geometric points and every pair of them is connected by either a red line or a green line (but not by both), then no matter how one paints the lines, there will always be a large subset of the original set with a special property. Either every pair of the subset's members will be connected by a red line or every pair of the subset's members will be connected by a green line.
"If, for example, you want to be certain of having at least three points all connected by red lines or at least three points all connected by green lines, you will need at least six points," says Paulos.
"For you to be certain that you will have four points, every pair of which is connected by a red line, or four points, every pair of which is connected by a green line, you will need 18 points, and for you to be certain that there will be five points with this property, you will need -- it's not known exactly -- between 43 and 55. With enough points, you will inevitably find unicolored islands of order as big as you want, no matter how you color the lines," he notes.
Paulos on emergent order
http://abcnews.go.com/Technology/WhosCounting/story?id=4357170&page=1
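Paulos's first figure -- that six points suffice to force a one-color triangle -- is small enough to verify exhaustively. The sketch below (my own illustration, not from the column) checks every 2-coloring of the 15 edges of the complete graph on six points and confirms that each one contains a monochromatic triangle; the same brute-force check on five points finds colorings with no such triangle, which is why six is the threshold.

from itertools import combinations, product

points = range(6)
edges = list(combinations(points, 2))          # the 15 edges of K6
triangles = list(combinations(points, 3))      # the 20 possible triangles

def has_mono_triangle(coloring):
    # coloring maps each edge to 0 (red) or 1 (green)
    for a, b, c in triangles:
        if coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]:
            return True
    return False

all_forced = all(
    has_mono_triangle(dict(zip(edges, colors)))
    for colors in product((0, 1), repeat=len(edges))   # 2**15 = 32768 colorings
)
print(all_forced)  # True: every coloring of K6 contains a one-color triangle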
In other words, no matter what type or level of randomness is at work, "order" must emerge from such networking. Hence one might run across a counterintuitive subset and think its existence highly improbable, that the subsystem can't have fallen together randomly. So again, we must beware the idea that "highly improbable" events are effectively nonexistent. Yes, if one is applying probabilities at a near-zero propensity, and using some Bayesian insufficient reason rationale, then such an emergent event would be counted as virtually impossible. But, with more knowledge of the system dynamics, we must parse our probabilistic questions more finely.
On the other hand, intrinsic fundamental randomness is often considered doubtful except in the arena of quantum mechanics -- although quantum weirdness does indeed scale up into the "macro" world (see noumena sections in Part VI, link in sidebar). Keynes of course knew nothing about quantum issues at the time he wrote Treatise.
Kolmogorov used his axioms to try to avoid Keynesian difficulties concerning highly improbable events.
Kolmogorov's 1933 book (19) gives these two conditions:
A. One can be practically certain that if C is repeated a large number of times, the relative frequency of E will differ very little from the probability of E. [He axiomatizes the law of large numbers.]
B. If P(E) is very small, one can be practically certain that when C is carried out only once, the event E will not occur at all.
But, if faint chances of occurrences are ruled out beyond some limit, doesn't this really go to the heart of the meaning of randomness?
Kolmogorov's 'Foundations' in English
http://www.mathematik.com/Kolmogorov/index.html
And, if as Keynes believed, randomness is not all that random, we lose the basic idea of independence of like events, and we bump into the issue of what is meant by a "regularity" (discussed elsewhere).
Statisticians of the 19th century, of course, brought the concept of regularity into relief. Their empirical methods disclosed various recurrent patterns, which then became fodder for the methods of statistical inference. In those years, scientists such as William Stanley Jevons began to introduce probabilistic methods. It has been argued that Jevons used probability in terms of determining whether events result from certain causes as opposed to simple coincidences, and via the method of least squares. The first approach, notes the Stanford Encyclopedia of Philosophy, "entails the application of the 'inverse method' in induction: if many observations suggest regularity, then it becomes highly improbable that these result from mere coincidence."
Encyclopedia entry on Jevons
http://plato.stanford.edu/entries/william-jevons/
Jevons also employed the method of least squares to try to detect regularities in price fluctuations, the encyclopedia says.
Statistical regularities, in my way of thinking, are a reflection of how the human mind organizes the perceived world, or world of phenomena. The brain is programmed to find regularities (patterns) and to rank them -- for the most part in an autonomic fashion -- as an empirico-inductivist-frequentist mechanism for coping.
Yet, don't statistical regularities imply an objective randomness, which in turn implies a reality larger than the self? My take is that the concept of intrinsic randomness is an idealization. It serves our desire for a mathematical, formalistic representation of the phenomenal world and, in particular, our desire to predict properties of macro-states from the partial, or averaged, information we have about micro-states -- as when the macro-state threshold line for species extinction stands in for the not-very-accessible information about the many micro-states of survival and mortality of individuals.
Averaging however does not imply intrinsic randomness. On the other hand, the typical physical assumption that events are distinct and do not interact without recourse to known physical laws implies independence of events, which in turn implies effectively random influences. In my estimation, this sort of randomness is a corollary of the reductionist, isolationist and simplificationist method of typical science, an approach that can be highly productive, as when Claude Shannon ignored the philosophical ramifications of the meaning of information.
The noted probability theorist Mark Kac gives an interesting example of the relationship of a deterministic algorithm and randomness [35].
Consider the consecutive positive integers {1, 2, ..., n} -- say with n = 10^4 -- and, for each integer m in this range, consider the number f(m) of the integer's distinct prime factors.
Hence, f(1) = 0, f(2) = f(3) = f(5) = 1, f(4) = f(2^2) = 1, f(6) = f(2*3) = 2, f(60) = f(2^2 * 3 * 5) = 3, and so forth.
Kac assigns these outputs to a histogram of the number of prime divisors, using ln (ln n) and adjusting suitably the size of the base interval. He obtains an excellent approximation -- which improves as n rises -- to the normal curve. The statistics of the number of prime factors is, Kac wrote, indistinguishable from the statistics of the sizes of peas or the statistics of displacement in Brownian motion. And yet, the algorithm is fully deterministic, meaning from his perspective that there is neither chance nor randomness.
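A rough check of Kac's observation can be run directly. This is a minimal sketch under the same setup, counting distinct prime factors of each m up to n = 10^4 with a simple sieve and comparing the mean and spread of the counts to ln ln n; the exact normalization Kac uses is described only loosely above, so the comparison here is indicative rather than precise.

import math
from collections import Counter

n = 10**4
omega = [0] * (n + 1)            # omega[m] = number of distinct prime factors of m
for p in range(2, n + 1):
    if omega[p] == 0:            # p is prime: no smaller prime divided it
        for multiple in range(p, n + 1, p):
            omega[multiple] += 1

values = omega[2:]
mean = sum(values) / len(values)
var = sum((v - mean) ** 2 for v in values) / len(values)
print("mean", round(mean, 3), "vs ln ln n", round(math.log(math.log(n)), 3))
print("std ", round(var ** 0.5, 3))
print(Counter(values))           # the counts form a roughly bell-shaped bulge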
We note that in classical dynamical systems, there is also no intrinsic randomness, and that probabilities are purportedly determined by activities below the threshold of convenient observation. And yet the fact that prime factors follow the normal curve is remarkable and deserving of further attention. There should, one would think, be a relationship between this fact and Riemann's conjecture.
Interestingly, primes fall in the holes of the sieve of Eratosthenes, implying that they do not follow any algebraic formula (which they do not, except in a very special case that does not apply to the general run of algebraic formulas). Is it so surprising that primes occur "erratically" when one sees that they are "anti-algebraic"? In general, non-algebraic algorithms produce outputs that are difficult to pin down exactly for some future state. Hence, probabilistic methods are called for. In that case, one hopes that some probability distribution/density will fit well enough.
But, the fact that the normal curve is the correct distribution is noteworthy, as is the fact that the samples of prime factors follow the central limit theorem.
My take is that the primes do not fall in an equally probable pattern, a fact that is quite noticeable for low n. However, as n increases, the dependence tends to weaken. So at n = 10^4 the dependence among prime factors is barely detectable, making their detections effectively independent events. In other words, the deterministic linkages among primes tend to cancel or smear out, in a manner similar to sub-threshold physical variables tending to cancel.
In a discussion of the Buffon's needle problem and Bertrand's paradox, Kac wishes to show that if probability theory is made sufficiently rigorous, the layperson's concerns about its underlying value can be answered. He believes that sufficient rigor will rid us of the "plague of paradoxes" entailed by the different possible answers to the so-called paradoxes.
However, 50 years after Kac's article, Bertrand's paradox still stimulates controversy. The problem is often thought to be resolved by specification of the proper method of setting up an experiment. That is, the conceptual probability is not divorced from the mechanics of an actual experiment, at least in this case.
And because actual Buffon needle trials can be used to arrive at acceptable values of pi, we have evidence that the usual method of computing the Buffon probabilities is correct, and further that the notion of equal probabilities is for this problem a valid assumption, though Kac argued that only a firm mathematical foundation would validate that assumption.
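For instance, a Monte Carlo version of Buffon's needle recovers pi tolerably well, which is the empirical point made above. This is a quick sketch of my own, assuming a needle whose length equals the line spacing and using the standard crossing condition for that case.

import math, random

def buffon_estimate(trials=500_000, length=1.0, spacing=1.0):
    # Drop a needle of given length on lines 'spacing' apart; it crosses a line
    # when the distance from its center to the nearest line is small enough.
    hits = 0
    for _ in range(trials):
        center = random.uniform(0, spacing / 2)      # distance to nearest line
        angle = random.uniform(0, math.pi / 2)       # acute angle with the lines
        if center <= (length / 2) * math.sin(angle):
            hits += 1
    # For length <= spacing, P(cross) = 2*length / (pi*spacing),
    # so pi is estimated as 2*length*trials / (spacing*hits).
    return 2 * length * trials / (spacing * hits)

print(buffon_estimate())   # typically prints something near 3.14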
A useful short discussion on Bertrand's paradox is found here:
Wikipedia article on Bertrand's paradox
https://en.wikipedia.org/wiki/Bertrand_paradox_%28probability%29
At any rate, whether the universe (= both the phenomenal and noumenal worlds) follows the randomness assumptions above is open to debate. Note that the ancients had a law of gravity (though not articulated as such). Their empirical observations told them that an object falls to the ground if not supported by other objects. The frequency ratio was so high that any exception would have been regarded as supernatural. These inductive observations led to the algorithmic assessments of Galileo and Newton. These algorithmic representations are very successful, in limited cases, at prediction. They are deductive systems: plug in the numbers, compute, and, in many cases, out come the predictive answers. And yet the highly successful systems of Newton and Einstein cannot be used, logically, as a means of excluding physical counterexamples. Induction supports the deductive systems, and cannot be dispensed with.
A statement such as "The next throw of a die showing 5 dots has a probability of 1/6" is somewhat inadequate because probabilities, says Popper, cannot be ascribed to a single occurrence of an event, but only to infinite sequences of occurrences (i.e., back to the law of large numbers). His reasoning is that any bias in the die can only be logically ruled out by an infinity of trials (20). Contrast that with Weyl's belief (21) that symmetry can provide the basis for an expectation of zero bias (see Part IV) and with my suggestion that below a certain threshold, background vibrations may make no difference.
One can see the harbinger of the "law of large numbers" in the urn model's classical probability: If an urn contains, say, 3 white balls and 2 black, then our ability to predict the outcome is given by the numbers 3/5 and 2/5. But why is that so? Answer: it is assumed that if one conducts enough experiments, with replacement, guesses for white or black will asymptotically approach the ratios 3/5 and 2/5. Yet, why do we consider that assumption reasonable in "real life" and without bothering with formalities? We are accepting the notion that the huge aggregate set of minor force vectors, or "causes," tends to be neutral. There are two things to say about this:
1. This sort of randomness excludes the operation of a God or superior being. At one time, the study of probabilities with respect to games of chance was frowned upon on grounds that it was blasphemous to ignore God's influence or to assume that that influence does not exist (Bernoulli was prudently circumspect on this issue). We understand that at this point, many react: "Aha! Now you are bringing in religion!" But the point here is that the conjecture that there is no divine influence is an article of faith among some scientifically minded persons. This idea of course gained tremendous momentum from Darwin's work.
2. Results of modern science profoundly challenge what might be called a "linear perspective" that permits "regularities" and the "cancelation of minor causes." As we show in the noumena sections (Part VI, see sidebar), strange results of both relativity theory and quantum mechanics make the concept of time very peculiar indeed, meaning that causality is stood on its head.
Keynes tells his readers that Siméon Denis Poisson brought forth the concept of the "law of large numbers" that had been used by Bernoulli and other early probabilists. "It is not clear how far Poisson's result [the law of large numbers as he extended it] is due to a priori reasoning, and how far it is a natural law based on experience; but it is represented as displaying a certain harmony between natural law and the a priori reasoning of probabilities."
The Belgian statistician Adolphe Quetelet, says Keynes, did a great deal to explain the use of statistical methods. Quetelet "belongs to the long line of brilliant writers, not yet extinct, who have prevented probability from becoming, in the scientific salon, perfectly respectable. There is still about it for scientists a smack of astrology, of alchemy" (21a). It is difficult to exorcise this suspicion because, in essence, the law of large numbers rests on an unprovable assumption, though one that tends to accord with experience.
This is not to say that various people have not proved the weak and strong forms once assumptions are granted, as in the case of Borel, who was a major contributor to measure theory, which he and others have used in their work on probability. Yet, we do not accept that because a topological framework exists that encompasses probability ideas, it follows that the critical issues have gone away.
On Adolph Quetelet
http://mnstats.morris.umn.edu/introstat/history/w98/Quetelet.html
Keynes makes a good point about Poisson's apparent idea that if one does enough sampling and analysis, "regularities" will appear in various sets. However, notes Keynes, one should beware the idea that "because the statistics are numerous, the observed degree of frequency is therefore stable."
Keynes's insight can be appreciated with respect to iterative feedback functions. Those which tend to stability (where the iterations are finitely periodic) may be thought of in engineering terms as displaying negative feedback. Those that are chaotic (or pre-chaotic with spurts of instability followed by spurts of stability) are analogous to positive feedback systems. So, here we can see that if a "large" sample is drawn from a pre-chaotic system's spurt of stability, a wrong conclusion will be drawn about the system's regularity. And again we see that zero or near-zero propensity information, coupled with the assumption that samples represent the population (which is not to say that samples are not normally distributed), can yield results that are way off base.
Probability distributions
If we don't have at hand a set of potential ratios, how do we find the probability of a probability? If we assume that the success-failure model is binomial, then of course we can apply the normal distribution of probabilities. With an infinite distribution we don't get the probability of a probability, of course, though we would if we used the more precise binomial distribution with n finite. But we see that, in practice, the "correct" probability distribution is often arrived at inductively, after sufficient observations. The Poisson distribution is suited to rare events; the exponential distribution to radioactive decay. In the latter case, it might be argued that along with induction comes the deductive method associated with the rules of quantum mechanics.
Clearly, there is an infinitude of probability distributions. But in the physical world we tend to use a very few: among them, the uniform, the normal and the exponential. So a non-trivial question is: what is the distribution of these distributions, if any? That is, can one rationally assign a probability that a particular element of that set is reflective of reality? Some would argue that here is the point of the Bayesians. Their methods, they say, give the best ranking of initial probabilities, which, by implication suggest the most suitable distribution.
R.A. Fisher devised the maximum likelihood method for determining the probability distribution that best fits the data, a method he saw as superior to the inverse methods of Bayesianism (see below). But, in Harold Jeffreys's view, the maximum likelihood is a measure of the sample alone; to make an inference concerning the whole class, we combine the likelihood with an assessment of prior belief using Bayes's theorem (22).
Jeffreys took maximum likelihood to be a variation of inverse probability with the assumption of uniform priors.
In Jae Myung on maximum likelihood
http://people.physics.anu.edu.au/~tas110/Teaching/Lectures/L3/Material/Myung03.pdf
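Jeffreys's point -- that maximum likelihood coincides with inverse probability under a uniform prior -- can be seen in a toy binomial case. The sketch below is my own, with illustrative numbers; it finds the maximum-likelihood estimate by a crude grid search and notes that a uniform (Beta(1,1)) prior gives a posterior whose mode is the same value.

from math import comb

k, n = 7, 20  # illustrative data: 7 successes in 20 trials

def likelihood(p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Maximum likelihood by grid search; analytically the maximum sits at k/n.
grid = [i / 1000 for i in range(1, 1000)]
p_ml = max(grid, key=likelihood)

# With a uniform Beta(1,1) prior, the posterior is Beta(k+1, n-k+1):
# its mode is again k/n, so the ML estimate doubles as the posterior mode,
# while the posterior mean is (k+1)/(n+2).
post_mean = (k + 1) / (n + 2)

print("ML estimate (grid):", p_ml)                     # 0.35 = k/n
print("posterior mean, uniform prior:", round(post_mean, 3))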
For many sorts of data, there is the phenomenon known as Benford's law, in which digit probabilities are not distributed normally but logarithmically. Not all data sets conform to this distribution. For example, if one takes data from plants that manufacture beer in liters and then converts those data to gallons, one wouldn't expect that the distribution of digits remains the same in both cases. True, but there is a surprise.
In 1996, Theodore Hill, upon offering a proof of Benford's law, said that if distributions are taken at random and random samples are taken from each of these distributions, the significant digit frequencies of the combined samples would converge to the logarithmic distribution, such that probabilities favor the lower digits in a base 10 system. Hill refers to this effect as "random samples from random distributions." As Julian Havil observed, "In a sense, Benford's Law is the distribution of distributions!" (23).
MathWorld: Benford's law
http://mathworld.wolfram.com/BenfordsLaw.html
Hill's derivation of Benford's law
http://www.gatsby.ucl.ac.uk/~turner/TeaTalks/BenfordsLaw/stat-der.pdf
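Hill's "random samples from random distributions" picture is easy to mimic numerically. In this rough sketch of mine the family of distributions is chosen arbitrarily (lognormals with randomly drawn location and scale); the pooled leading digits should land near the Benford frequencies log10(1 + 1/d).

import math, random
from collections import Counter

def leading_digit(x):
    s = f"{abs(x):e}"            # scientific notation, e.g. '3.270000e+05'
    return int(s[0])             # first significant digit

counts = Counter()
for _ in range(20_000):
    # Pick a distribution "at random": here a lognormal with random parameters.
    sigma = random.uniform(0.5, 5.0)
    sample = random.lognormvariate(random.uniform(0, 10), sigma)
    counts[leading_digit(sample)] += 1

total = sum(counts.values())
for d in range(1, 10):
    benford = math.log10(1 + 1 / d)
    print(d, round(counts[d] / total, 3), round(benford, 3))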
Though this effect is quite interesting, it is not evident to me how one would go about applying it in order to discover a distribution beyond the logarithmic. Nevertheless, the logarithmic distribution does seem to be what emerges from the general set of finite data. Even so, Hill's proof appears to show that low-digit bias is an objective artifact of the "world of data" that we commonly access. The value of this distribution is shown by its use as an efficient computer coding tool.
It is not excessive to say that Benford's law, and its proof, encapsulates very well the whole of the statistical inference mode of reasoning. And yet plainly Benford's law does not mean that "fluke" events don't occur. And who knows what brings about flukes? As I argue in Toward, the mind of the observer can have a significant impact on the outcome.
Another point to take into consideration is the fact that all forms of probability logic bring us to the self-referencing conundrums of Bertrand Russell and Kurt Goedel. These are often dismissed as trivial. And yet, if a sufficiently rich system cannot be both complete and consistent, then we know that there is an enforced gap in knowledge. So we may think we have found a sublime truth in Benford's law, and yet we must face the fact that this law, and probabilistic and mathematical reasoning in general, cannot account for all things dreamt of, or undreamt of, in one's philosophy.
Concerning Bayesianism
The purpose of this paper is not to rehash the many convolutions of Bayesian controversies, but rather to spotlight a few issues that may cause the reader to re-evaluate her conception of a "probabilistic universe." (The topic will recur beyond this section.)
"Bayesianism" is a term that has come to cover a lot of ground. Bayesian statistical methods these days employ strong computational power to achieve results barely dreamt of in the pre-cyber era.
However, two concepts run through the heart of Bayesianism: Bayes's formula for conditional probability and the principle of insufficient reason or some equivalent. Arguments concern whether "reasonable" initial probabilities are a good basis for calculation and whether expert opinion is a valid basis for an initial probability. Other arguments concern whether we are only measuring a mental state or whether probabilities have some inherent physical basis external to the mind. Further, there has been disagreement over whether Bayesian statistical inference for testing hypotheses is well-grounded in logic and whether the calculated results are meaningful.
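For reference, the conditional-probability formula at the center of these disputes, in a standard form, is

\[
P(H \mid E) \;=\; \frac{P(E \mid H)\,P(H)}{P(E \mid H)\,P(H) + P(E \mid \neg H)\,P(\neg H)},
\]

where the contested ingredient is the prior P(H), which the principle of insufficient reason, or an expert's judgment, is asked to supply.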
The clash is important because Bayesian methods tend to be employed by economists and epidemiologists and so affect broad government policies.
"The personal element is recognized by all statisticians," observes David Howie. "For Bayesians, it is declared on the choice of prior probabilities; for Fisherians in the construction of statistical model; for the Neyman-Pearson school in the selection of competing hypotheses. The social science texts, however, portrayed statistics as a purely impersonal and objective method for the design of experiments and the representation of knowledge" (24).
Similarly, Gerd Gigerenzer argues that a conspiracy by those who control social science journals has brought about the "illusion of a mechanized inference process." Statistics textbooks for social science students have, under publisher pressure, tended to omit or play down not only the personality conflicts among pioneering statisticians but also the differences in reasoning, Gigerenzer says. Such textbooks presented a hybrid of the methods of R.A. Fisher and of Jerzy Neyman and Egon Pearson, without alerting students as to the schisms among the trailblazers, or even, in most cases, mentioning their names. The result, says Gigerenzer, is that: "Statistics is treated as abstract truth, the monolithic logic of inductive inference."
Gigerenzer on 'mindless statistics'
http://library.mpib-berlin.mpg.de/ft/gg/GG_Mindless_2004.pdf
In the last decade, a chapter on Bayesian methods has become de rigueur for statistics texts. However, it remains true that students are given the impression that statistical inferences are pretty much cut and dried, though authors often do stress the importance of asking the right questions when setting up a method of attack on a problem.
A useful explanation of modern Bayesian reasoning is given by Michael Lavine:
What is Bayesian statistics and why everything else is wrong
http://www.math.umass.edu/~lavine/whatisbayes.pdf
The German tank problem gives an interesting example of a Bayesian analysis.
The German tank problem
http://en.wikipedia.org/wiki/German_tank_problem
In the late 19th century, Charles S. Peirce denounced the Bayesian view and sought to establish frequency ratios as the basis of scientific probability.
C.S. Peirce on probability
http://plato.stanford.edu/entries/peirce/#prob
This view, as espoused by Von Mises, was later carried forward by Popper (25), who eventually replaced it with his propensity theory (26), which is also anti-Bayesian in character.
Expert opinion
One justification of Bayesian methods is the use of a "reasonable" initial probability arrived at by the opinion of an expert or experts. Nate Silver points out for example that scouts did better at predicting who would make a strong ball player than did his strictly statistical method, prompting him to advocate a combination of the subjective expert opinion along with standard methods of statistical inference.
"If prospect A is hitting .300 with twenty home runs and works at a soup kitchen during his off days, and prospect B is hitting .300 with twenty home runs but hits up night clubs during his free time, there is probably no way to quantify this distribution," Silver writes. "But you'd sure as hell want to take it into account."
Silver notes that the arithmetic mean of several experts tends to yield more accurate predictions than the predictions of any single expert (27).
Obviously, quantification of expert opinion is merely a convenience. Such an expert is essentially using probability inequalities, as in p(x) < p(y) < p(z) or p(x) < [1 - p(x)].
Sometimes when I go to a doctor, the nurse asks me to rate pain on a scale of 1 to 10. I am the expert, and yet I have difficulty with this question most of the time. But if I am shown a set of stick figure faces, with various expressions, I can always find the one that suits my subjective feeling. Though we are not specifically talking of probabilities, we are talking about the information inherent in inequalities and how that information need not always be quantified.
Similarly, I suggest that experts do not use fine-grained degrees of confidence, but generally stick with a simple ranking system, such as {1/100, 1/4, 1/3, 1/2, 2/3, 3/4, 99/100}. It is important to realize that a ranking system can be mapped onto a circle, thus giving a system of pseudo-percentages. This is the custom. But the numbers, not representing frequencies, cannot be said to represent percentages. An exception is the case of an expert who has a strong feel for the frequencies and uses her opinion as an adequate approximation of some actual frequency.
Often, what Bayesians do is to use an expert opinion for the initial probability and then apply the Bayesian formula to come up with frequency probabilities. Some of course argue that if we plug in a pseudo-frequency and use the Bayesian formula (including some integral forms) for an output, then all one has is a pseudo-frequency masquerading as a frequency. However, it is possible to think about this situation differently. One hazards a guess as to the initial frequency -- perhaps based on expert opinion -- and then looks at whether the output frequency ratio is reasonable. That is, a Bayesian might argue that he is testing various initial values to see which yields an output that accords with observed facts.
One needn't always use the Bayesian formula to use this sort of reasoning.
Consider the probability of the word "transpire" in an example of what some would take as Bayesian reasoning. I am fairly sure it is possible, with much labor, to come up with empirical frequencies of that word that could be easily applied. But, from experience, I feel very confident in saying that far fewer than 1 in 10 books of the type I ordinarily read have had that word appear in the past. I also feel confident that a typical reader of books will agree with that assessment. So in that case, it is perfectly reasonable to plug in the value 0.1 in doing a combinatorial probability calculation for a recently read string of books. If, of the last 15 books I have read, 10 have contained the word "transpire," we have 15C10 x (1/10)^10 x (9/10)^5 = 1.77 x 10^(-7). That is, the probability of such a string of books occurring by chance is well under 1 in a million.
This sort of "Bayesian" inference is especially useful when we wish to establish an upper bound of probability, which, as in the "transpire" case, may be all we need.
One may also argue for a "weight of evidence" model, which may or may not incorporate Bayes's theorem. Basically, the underlying idea is that new knowledge affects the probability of some outcome. Of course, this holds only if the knowledge is relevant, which requires "reasonableness" in specific cases, where a great deal of background information is necessary. But this doesn't mean that the investigator's experience won't be a reliable means of evaluating the information and arriving at a new probability, arguments of Fisher, Popper and others notwithstanding.
A "weight of evidence" approach of course is nothing but induction, and requires quite a bit of "subjective" expert opinion.
On this point, Keynes, in his Treatise, writes: "Take, for instance, the intricate network of arguments upon which the conclusions of The Origin of Species are founded: How impossible it would be to transform them into a shape in which they would be seen to rest upon statistical frequencies!" (28)
Mendelism and the statistical population genetics pioneered by J.B.S. Haldane, Sewall Wright and Fisher were still in the early stages when Keynes wrote this. And yet, Keynes's point is well taken. The expert opinion of Darwin the biologist was on the whole amply justified (29) when frequency-based methods based on discrete alleles became available (superseding much of the work of Francis Galton).
Three pioneers of the 'modern synthesis'
http://evolution.berkeley.edu/evolibrary/article/history_19
About Francis Galton
http://www.psych.utah.edu/gordon/Classes/Psy4905Docs/PsychHistory/Cards/Galton.html
Keynes notes that Darwin's lack of statistical or mathematical knowledge is notable and, in fact, a better use of frequencies would have helped him. Even so, Darwin did use frequencies informally. In fact, he was using his expert opinion as a student of biology to arrive at frequencies -- though not numerical ones, but rather rule-of-thumb inequalities of the type familiar to non-mathematical persons. From this empirico-inductive method, Darwin established various propositions, to which he gave informal credibility rankings. From these, he used standard logical implication, but again informally.
One must agree here with Popper's insight that the key idea comes first: Darwin's notion of natural selection was based on the template of artificial selection for traits in domestic animals, although he did not divine the driving force --eventually dubbed "survival of the fittest" -- behind natural selection until coming across a 1798 essay by Thomas Malthus.
Essay on the Principle of Population
http://www.ucmp.berkeley.edu/history/malthus.html
Keynes argues that the frequency of some observation and its probability should not be considered to be identical. (This led Carnap to define two forms of probability, though unlike Keynes, he was only interested in frequentist probability.) One may well agree that a frequency gives a number. Yet there must be some way of connecting it to degrees of belief that one ought to have. On the other hand, who actually has a degree of belief of 0.03791? Such a number is only valuable if it helps the observer to discriminate among inequalities, as in p(y) << p(x) < p(z).
One further point: The ordinary human mind-body system usually learns through an empirico-inductive frequency-payoff method, as I describe in Toward. So it makes sense that a true expert would have assimilated much knowledge into her autonomic systems, analogous to algorithms used in computing pattern detection and "auto-complete" systems. Hence one might argue that, at least in some cases, there is strong reason to view the "subjective" opinion as a good measuring rod. Of course, then we must ask, How reliable is the expert? And it would seem a frequency analysis of her predictions would be the way to go.
Studies of polygraph and fingerprint examiners have shown that in neither of those fields does there seem to be much in the way of corroboration that these forensic tools have any scientific value. At the very least, such studies show that the abilities of experts vary widely (30). This is an appropriate place to bring up the matter of the "prosecutor's fallacy," which I describe here:
The prosecutor's fallacy
http://kryptograff.blogspot.com/2007/07/probability-and-prosecutor-there-are.html
Here we run into the issue of false positives. A test can have a probability of accuracy of 99 percent, and yet the probability that that particular event is a match can have a very low probability. Take an example given by mathematician John Allen Paulos. Suppose a terrorist profile program is 99 percent accurate and let's say that 1 in a million Americans is a terrorist. That makes 300 terrorists. The program would be expected to catch 297 of those terrorists. However, the program has an error rate of 1 percent. One percent of 300 million Americans is 3 million people. So a data-mining operation would turn up some 3 million "suspects" who fit the terrorist profile but are innocent nonetheless. So the probability that a positive result identifies a real terrorist is 297 divided by 3 million, or about one in 30,000 -- a very low likelihood.
But data mining isn't the only issue. Consider biometric markers, such as a set of facial features, fingerprints or DNA patterns. The same rule applies. It may be that if a person was involved in a specific crime or other event, the biometric "print" will finger him or her with 99 percent accuracy. Yet context is all important. If that's all the cops have got, it isn't much. Without other information, the odds are still tens of thousands to one that the police or the Border Patrol have the wrong person.
The practicality of so-called Bayesian reasoning has been given by Enrico Fermi, who would ask his students to estimate how many piano tuners were working in Chicago. Certainly, one should be able to come up with plausible ballpark estimates based on subjective knowledge.
Conant on Enrico Fermi and a 9/11 plausibility test
http://znewz1.blogspot.com/2006/11/enrico-fermi-and-911-plausibility-test.html
I have also used the Poisson distribution for a Bayesian-style approach to the probability that wrongful executions have occurred in the United States.
Fatal flaws
http://znewz1.blogspot.com/2007/06/fatal-flaws.html
Some of my assumptions in those discussions are open to debate, of course.
More on Laplace's rule
The physicist Harold Jeffreys agrees with Keynes that the rule of succession isn't plausible without modification, that is via some initial probability. In fact the probability in the Laplacian result of (m+1)/(m+2) after one success is 2/3 that the next trial will succeed by this route -- which, for some experimental situations, Jeffreys regards as too low, rather than too high!
I find it interesting that economist Jevons's use of the Laplacian formula echoes the doomsday argument of Gott. Jevons observed that "if we suppose the sun to have risen demonstratively" one billion times, the probability that it will rise again, on the ground of this knowledge merely, is
The Belgian statistician Adolphe Quetelet, says Keynes, did a great deal to explain the use of statistical methods. Quetelet "belongs to the long line of brilliant writers, not yet extinct, who have prevented probability from becoming, in the scientific salon, perfectly respectable. There is still about it for scientists a smack of astrology, of alchemy" (21a). It is difficult to exorcise this suspicion because, in essence, the law of large numbers rests on an unprovable assumption, though one that tends to accord with experience.
This is not to say that various people have not proved the weak and strong forms once assumptions are granted, as in the case of Borel, who was a major contributor to measure theory, which he and others have used in their work on probability. Yet, we do not accept that because a topological framework exists that encompasses probability ideas, it follows that the critical issues have gone away.
On Adolph Quetelet
http://mnstats.morris.umn.edu/introstat/history/w98/Quetelet.html
Keynes makes a good point about Poisson's apparent idea that if one does enough sampling and analysis, "regularities" will appear in various sets. However, notes Keynes, one should beware the idea that "because the statistics are numerous, the observed degree of frequency is therefore stable."
Keynes's insight can be appreciated with respect to iterative feedback functions. Those which tend to stability (where the iterations are finitely periodic) may be thought of in engineering terms as displaying negative feedback. Those that are chaotic (or pre-chaotic with spurts of instability followed by spurts of stability) are analogous to positive feedback systems. So, here we can see that if a "large" sample is drawn from a pre-chaotic system's spurt of stability, a wrong conclusion will be drawn about the system's regularity. And again we see that zero or near-zero propensity information, coupled with the assumption that samples represent the population (which is not to say that samples are not normally distributed), can yield results that are way off base.
Probability distributions
If we don't have at hand a set of potential ratios, how does one find the probability of a probability? If we assume that the success-failure model is binomial, then of course we can apply the normal distribution of probabilities. With an infinite distribution, we don't get the probability of a probability, of course, though we would if we used the more precise binomial distribution with n finite. But, we see that in practice, the "correct" probability distribution is often arrived at inductively, after sufficient observations. The Poisson distribution is suited to rare events; the exponential distribution to radioactive decay. In the latter case, it might be argued that along with induction is the deductive method associated with the rules of quantum mechanics.
Clearly, there is an infinitude of probability distributions. But in the physical world we tend to use a very few: among them, the uniform, the normal and the exponential. So a non-trivial question is: what is the distribution of these distributions, if any? That is, can one rationally assign a probability that a particular element of that set is reflective of reality? Some would argue that this is precisely the point of the Bayesians. Their methods, they say, give the best ranking of initial probabilities, which, by implication, suggest the most suitable distribution.
R.A. Fisher devised the maximum likelihood method for determining the probability distribution that best fits the data, a method he saw as superior to the inverse methods of Bayesianism (see below). But, in Harold Jeffreys's view, the maximum likelihood is a measure of the sample alone; to make an inference concerning the whole class, we combine the likelihood with an assessment of prior belief using Bayes's theorem (22).
Jeffreys took maximum likelihood to be a variation of inverse probability with the assumption of uniform priors.
In Jae Myung on maximum likelihood
http://people.physics.anu.edu.au/~tas110/Teaching/Lectures/L3/Material/Myung03.pdf
For many sorts of data, there is the phenomenon known as Benford's law, in which leading digits are distributed not uniformly, as one might naively expect, but logarithmically. Not all data sets conform to this distribution. And one might suppose, for example, that if a brewery's production data are recorded in liters and then converted to gallons, the distribution of leading digits could not remain the same in both cases. True, but there is a surprise.
In 1996, Theodore Hill, upon offering a proof of Benford's law, said that if distributions are taken at random and random samples are taken from each of these distributions, the significant digit frequencies of the combined samples would converge to the logarithmic distribution, such that probabilities favor the lower digits in a base 10 system. Hill refers to this effect as "random samples from random distributions." As Julian Havil observed, "In a sense, Benford's Law is the distribution of distributions!" (23).
MathWorld: Benford's law
http://mathworld.wolfram.com/BenfordsLaw.html
Hill's derivation of Benford's law
http://www.gatsby.ucl.ac.uk/~turner/TeaTalks/BenfordsLaw/stat-der.pdf
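As a rough illustration of Hill's "random samples from random distributions," here is a Python sketch of my own; the menu of distributions and the parameter ranges are arbitrary choices, so the convergence to the logarithmic frequencies is only approximate.

import math
import random

def leading_digit(x):
    # Return the first significant digit of a positive number.
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

def benford_demo(n_distributions=300, samples_per_dist=300, seed=1):
    # Pick distributions "at random," sample each, and tally leading digits
    # of the combined sample.
    rng = random.Random(seed)
    counts = [0] * 10
    for _ in range(n_distributions):
        kind = rng.choice(["uniform", "exponential", "lognormal"])
        a, b = rng.uniform(1, 1e6), rng.uniform(0.5, 5)   # random parameters for this distribution
        for _ in range(samples_per_dist):
            if kind == "uniform":
                x = rng.uniform(0, a)
            elif kind == "exponential":
                x = rng.expovariate(1.0 / a)
            else:
                x = rng.lognormvariate(b, b)
            if x > 0:
                counts[leading_digit(x)] += 1
    total = sum(counts[1:])
    for d in range(1, 10):
        print(d, round(counts[d] / total, 3), "Benford:", round(math.log10(1 + 1 / d), 3))

benford_demo()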
Though this effect is quite interesting, it is not evident to me how one would go about applying it in order to discover a distribution beyond the logarithmic. Nevertheless, the logarithmic distribution does seem to be what emerges from the general run of finite data sets. Even so, Hill's proof appears to show that low-digit bias is an objective artifact of the "world of data" that we commonly access. The value of this distribution is shown by its use as an efficient computer coding tool.
It is not excessive to say that Benford's law, and its proof, encapsulates very well the whole of the statistical inference mode of reasoning. And yet plainly Benford's law does not mean that "fluke" events don't occur. And who knows what brings about flukes? As I argue in Toward, the mind of the observer can have a significant impact on the outcome.
Another point to take into consideration is the fact that all forms of probability logic bring us to the self-referencing conundrums of Bertrand Russell and Kurt Goedel. These are often dismissed as trivial. And yet, if a sufficiently rich system cannot be both complete and consistent, then we know that there is an enforced gap in knowledge. So we may think we have found a sublime truth in Benford's law, and yet we must face the fact that this law, and probabilistic and mathematical reasoning in general, cannot account for all things dreamt of, or undreamt of, in one's philosophy.
Concerning Bayesianism
The purpose of this paper is not to rehash the many convolutions of Bayesian controversies, but rather to spotlight a few issues that may cause the reader to re-evaluate her conception of a "probabilistic universe." (The topic will recur beyond this section.)
"Bayesianism" is a term that has come to cover a lot of ground. Bayesian statistical methods these days employ strong computational power to achieve results barely dreamt of in the pre-cyber era.
However, two concepts run through the heart of Bayesianism: Bayes's formula for conditional probability and the principle of insufficient reason or some equivalent. Arguments concern whether "reasonable" initial probabilities are a good basis for calculation and whether expert opinion is a valid basis for an initial probability. Other arguments concern whether we are only measuring a mental state or whether probabilities have some inherent physical basis external to the mind. Further, there has been disagreement over whether Bayesian statistical inference for testing hypotheses is well-grounded in logic and whether the calculated results are meaningful.
The clash is important because Bayesian methods tend to be employed by economists and epidemiologists and so affect broad government policies.
"The personal element is recognized by all statisticians," observes David Howie. "For Bayesians, it is declared on the choice of prior probabilities; for Fisherians in the construction of statistical model; for the Neyman-Pearson school in the selection of competing hypotheses. The social science texts, however, portrayed statistics as a purely impersonal and objective method for the design of experiments and the representation of knowledge" (24).
Similarly, Gerd Gigerenzer argues that a conspiracy by those who control social science journals has brought about the "illusion of a mechanized inference process." Statistics textbooks for social science students have, under publisher pressure, tended to omit or play down not only the personality conflicts among pioneering statisticians but also the differences in reasoning, Gigerenzer says. Such textbooks presented a hybrid of the methods of R.A. Fisher and of Jerzy Neyman and Egon Pearson, without alerting students as to the schisms among the trailblazers, or even, in most cases, mentioning their names. The result, says Gigerenzer, is that: "Statistics is treated as abstract truth, the monolithic logic of inductive inference."
Gigerenzer on 'mindless statistics'
http://library.mpib-berlin.mpg.de/ft/gg/GG_Mindless_2004.pdf
In the last decade, a chapter on Bayesian methods has become de rigueur for statistics texts. However, it remains true that students are given the impression that statistical inferences are pretty much cut and dried, though authors often do stress the importance of asking the right questions when setting up a method of attack on a problem.
A useful explanation of modern Bayesian reasoning is given by Michael Lavine:
What is Bayesian statistics and why everything else is wrong
http://www.math.umass.edu/~lavine/whatisbayes.pdf
The German tank problem gives an interesting example of a Bayesian analysis.
The German tank problem
http://en.wikipedia.org/wiki/German_tank_problem
In the late 19th century, Charles S. Peirce denounced the Bayesian view and argued that frequency ratios are the proper basis of scientific probability.
C.S. Peirce on probability
http://plato.stanford.edu/entries/peirce/#prob
This view, as espoused by Von Mises, was later carried forward by Popper (25), who eventually replaced it with his propensity theory (26), which is also anti-Bayesian in character.
Expert opinion
One justification of Bayesian methods is the use of a "reasonable" initial probability arrived at by the opinion of an expert or experts. Nate Silver points out, for example, that scouts did better at predicting who would make a strong ball player than did his strictly statistical method, prompting him to advocate combining subjective expert opinion with standard methods of statistical inference.
"If prospect A is hitting .300 with twenty home runs and works at a soup kitchen during his off days, and prospect B is hitting .300 with twenty home runs but hits up night clubs during his free time, there is probably no way to quantify this distribution," Silver writes. "But you'd sure as hell want to take it into account."
Silver notes that the arithmetic mean of several experts tends to yield more accurate predictions than the predictions of any single expert (27).
Obviously, quantification of expert opinion is merely a convenience. Such an expert is essentially using probability inequalities, as in p(x) < p(y) < p(z) or p(x) < [1 - p(x)].
Sometimes when I go to a doctor, the nurse asks me to rate pain on a scale of 1 to 10. I am the expert, and yet I have difficulty with this question most of the time. But if I am shown a set of stick figure faces, with various expressions, I can always find the one that suits my subjective feeling. Though we are not specifically talking of probabilities, we are talking about the information inherent in inequalities and how that information need not always be quantified.
Similarly, I suggest that experts do not use fine-grained degrees of confidence, but generally stick with a simple ranking system, such as {1/100, 1/4, 1/3, 1/2, 2/3, 3/4, 99/100}. It is important to realize that a ranking system can be mapped onto a circle, thus giving a system of pseudo-percentages. This is the custom. But the numbers, not representing frequencies, cannot be said to represent percentages. An exception is the case of an expert who has a strong feel for the frequencies and uses her opinion as an adequate approximation of some actual frequency.
Often, what Bayesians do is to use an expert opinion for the initial probability and then apply the Bayesian formula to come up with frequency probabilities. Some of course argue that if we plug in a pseudo-frequency and use the Bayesian formula (including some integral forms) for an output, then all one has is a pseudo-frequency masquerading as a frequency. However, it is possible to think about this situation differently. One hazards a guess as to the initial frequency -- perhaps based on expert opinion -- and then looks at whether the output frequency ratio is reasonable. That is, a Bayesian might argue that he is testing various initial values to see which yields an output that accords with observed facts.
One needn't always use the Bayesian formula to use this sort of reasoning.
Consider the probability of the word "transpire" in an example of what some would take as Bayesian reasoning. I am fairly sure it is possible, with much labor, to come up with empirical frequencies of that word that could be easily applied. But, from experience, I feel very confident in saying that far fewer than 1 in 10 books of the type I ordinarily read have had that word appear in the past. I also feel confident that a typical reader of books will agree with that assessment. So in that case, it is perfectly reasonable to plug in the value 0.1 in doing a combinatorial probability calculation for a recently read string of books. If, of the last 15 books I have read, 10 have contained the word "transpire," we have 15C10 x (1/10)^10 x (9/10)^5 = 1.77 x 10^-7. That is, the probability of such a string of books occurring by chance, under that assumed frequency, is less than 1 in 5 million.
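For readers who want to check the arithmetic, here is a minimal Python sketch; the 0.1 frequency is, as above, simply my assumed upper bound, not an empirical value.

from math import comb

p = 0.1          # assumed per-book chance that "transpire" appears
n, k = 15, 10    # 10 of the last 15 books contained the word

prob = comb(n, k) * p**k * (1 - p)**(n - k)
print(prob)      # about 1.77e-07, roughly 1 chance in 5.6 million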
This sort of "Bayesian" inference is especially useful when we wish to establish an upper bound of probability, which, as in the "transpire" case, may be all we need.
One may also argue for a "weight of evidence" model, which may or may not incorporate Bayes's theorem. Basically, the underlying idea is that new knowledge affects the probability of some outcome. Of course, this holds only if the knowledge is relevant, which requires "reasonableness" in specific cases, where a great deal of background information is necessary. But this doesn't mean that the investigator's experience won't be a reliable means of evaluating the information and arriving at a new probability, arguments of Fisher, Popper and others notwithstanding.
A "weight of evidence" approach of course is nothing but induction, and requires quite a bit of "subjective" expert opinion.
On this point, Keynes, in his Treatise, writes: "Take, for instance, the intricate network of arguments upon which the conclusions of The Origin of Species are founded: How impossible it would be to transform them into a shape in which they would be seen to rest upon statistical frequencies!" (28)
Mendelism and the statistical population genetics pioneered by J.B.S. Haldane, Sewall Wright and Fisher were still in the early stages when Keynes wrote this. And yet, Keynes's point is well taken. The expert opinion of Darwin the biologist was on the whole amply justified (29) when frequency-based methods based on discrete alleles became available (superseding much of the work of Francis Galton).
Three pioneers of the 'modern synthesis'
http://evolution.berkeley.edu/evolibrary/article/history_19
About Francis Galton
http://www.psych.utah.edu/gordon/Classes/Psy4905Docs/PsychHistory/Cards/Galton.html
Keynes remarks on Darwin's lack of statistical or mathematical knowledge; in fact, a better command of frequencies would have helped him. Even so, Darwin did use frequencies informally. He was using his expert opinion as a student of biology to arrive at frequencies -- though not numerical ones, but rather rule-of-thumb inequalities of the type familiar to non-mathematical persons. From this empirico-inductive method, Darwin established various propositions, to which he gave informal credibility rankings. From these, he reasoned by standard logical implication, but again informally.
One must agree here with Popper's insight that the key idea comes first: Darwin's notion of natural selection was based on the template of artificial selection for traits in domestic animals, although he did not divine the driving force --eventually dubbed "survival of the fittest" -- behind natural selection until coming across a 1798 essay by Thomas Malthus.
Essay on the Principle of Population
http://www.ucmp.berkeley.edu/history/malthus.html
Keynes argues that the frequency of some observation and its probability should not be considered to be identical. (This led Carnap to define two forms of probability, though unlike Keynes, he was only interested in frequentist probability.) One may well agree that a frequency gives a number. Yet there must be some way of connecting it to degrees of belief that one ought to have. On the other hand, who actually has a degree of belief of 0.03791? Such a number is only valuable if it helps the observer to discriminate among inequalities, as in p(y) << p(x) < p(z).
One further point: The ordinary human mind-body system usually learns through an empirico-inductive frequency-payoff method, as I describe in Toward. So it makes sense that a true expert would have assimilated much knowledge into her autonomic systems, analogous to algorithms used in computing pattern detection and "auto-complete" systems. Hence one might argue that, at least in some cases, there is strong reason to view the "subjective" opinion as a good measuring rod. Of course, then we must ask, How reliable is the expert? And it would seem a frequency analysis of her predictions would be the way to go.
Studies of polygraph and fingerprint examiners have shown that in neither of those fields does there seem to be much in the way of corroboration that these forensic tools have any scientific value. At the very least, such studies show that the abilities of experts vary widely (30). This is an appropriate place to bring up the matter of the "prosecutor's fallacy," which I describe here:
The prosecutor's fallacy
http://kryptograff.blogspot.com/2007/07/probability-and-prosecutor-there-are.html
Here we run into the issue of false positives. A test can be 99 percent accurate, and yet the probability that a particular positive result is a true match can be very low. Take an example given by mathematician John Allen Paulos. Suppose a terrorist profile program is 99 percent accurate and let's say that 1 in a million Americans is a terrorist. That makes 300 terrorists. The program would be expected to catch 297 of those terrorists. However, the program has an error rate of 1 percent. One percent of 300 million Americans is 3 million people. So a data-mining operation would turn up some 3 million "suspects" who fit the terrorist profile but are innocent nonetheless. So the probability that a positive result identifies a real terrorist is 297 divided by roughly 3 million, or about one in 10,000 -- a very low likelihood.
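The base-rate arithmetic can be laid out in a few lines; the numbers below are Paulos's round figures, not real data.

population = 300_000_000
base_rate = 1 / 1_000_000        # assumed fraction of the population that are terrorists
sensitivity = 0.99               # fraction of real terrorists the program flags
false_positive_rate = 0.01       # fraction of innocents wrongly flagged

terrorists = population * base_rate                                  # 300
true_positives = sensitivity * terrorists                            # 297
false_positives = false_positive_rate * (population - terrorists)    # about 3 million

print(true_positives / (true_positives + false_positives))           # about 0.0001, i.e. 1 in 10,000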
But data mining isn't the only issue. Consider biometric markers, such as a set of facial features, fingerprints or DNA patterns. The same rule applies. It may be that if a person was involved in a specific crime or other event, the biometric "print" will finger him or her with 99 percent accuracy. Yet context is all important. If that's all the cops have got, it isn't much. Without other information, the odds are still tens of thousands to one that the police or the Border Patrol have the wrong person.
The practicality of so-called Bayesian reasoning was illustrated by Enrico Fermi, who would ask his students to estimate how many piano tuners were working in Chicago. Certainly, one should be able to come up with plausible ballpark estimates based on subjective knowledge.
Conant on Enrico Fermi and a 9/11 plausibility test
http://znewz1.blogspot.com/2006/11/enrico-fermi-and-911-plausibility-test.html
I have also used the Poisson distribution for a Bayesian-style approach to the probability that wrongful executions have occurred in the United States.
Fatal flaws
http://znewz1.blogspot.com/2007/06/fatal-flaws.html
Some of my assumptions in those discussions are open to debate, of course.
More on Laplace's rule
The physicist Harold Jeffreys agrees with Keynes that the rule of succession isn't plausible without modification -- that is, without building in some initial probability. In fact, with m = 1 success, the Laplacian formula (m+1)/(m+2) gives a probability of 2/3 that the next trial will succeed -- which, for some experimental situations, Jeffreys regards as too low, rather than too high!
I find it interesting that economist Jevons's use of the Laplacian formula echoes the doomsday argument of Gott. Jevons observed that "if we suppose the sun to have risen demonstratively" one billion times, the probability that it will rise again, on the ground of this knowledge merely, is
(10^9 + 1) / (10^9 + 2).
However, notes Jevons, the probability that it will rise a billion years hence is
(10^9 + 1) / (2 x 10^9 + 2),
or very close to 1/2.
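Evaluated directly, the two Laplacean figures are easy to check:

m = 10**9                         # observed sunrises

next_rise = (m + 1) / (m + 2)     # Laplace's rule for one more success
long_run = (m + 1) / (2 * m + 2)  # Jevons's figure for the far future

print(next_rise)   # 0.999999999...
print(long_run)    # 0.5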
Though one might agree with Jevons that this formula is a logical outcome of the empirico-inductivist method in science, it is the logic of a system taken to an extreme where, I suggest, it loses value. That is, the magnification of our measuring device is too big. A question of that sort is outside the scope of the tool. Of course, Jevons and his peers knew nothing of the paradoxes of Cantor and Russell, or of Goedel's remarkable results. But if the tool of probability theory -- whichever theory we're talking about -- is of doubtful value in the extreme cases, then a red flag should go up not only cautioning us that beyond certain boundaries "there be dragons," but also warning us that the foundations of existence may not really be explicable in terms of so-called blind chance.
In fact, Jevons does echo Keynes's view that extreme cases yield worthless quantifications, saying: "Inferences pushed far beyond their data soon lose any considerable probability." Yet, we should note that the whole idea of the Laplacean rule is to arrive at probabilities when there is very little data available. I suggest that not only Jevons, but Keynes and other probability theorists, might have benefited from more awareness of set theory. That is, we have sets of primitive observations that are built up in the formation of the human social mind, and from there culture and science build sets of relations from these primitive sets.
So here we see the need to discriminate between a predictive algorithm based upon higher sets of relations (propensities of systems) and a predictive algorithm that emulates the human mind's process of assessing predictability based on repetition, at first with close to zero system information (the newborn). And a third scenario is the use of probabilistic assessment in imperfectly predictive higher-level algorithms.
"We ought to always be applying the inverse method of probabilities so as to take into account all additional information," argues Jevons. This may or may not be true. If a system's propensities are very well established, it may be that variations from the mean should be regarded as observational errors and not indicative of a system malfunction.
"Events when closely scrutinized will hardly ever prove to be quite independent, and the slightest preponderance one way or the other is some evidence of connexion, and in the absence of better evidence should be taken into account," Jevons says (31).
First of all, two events of the same type are often beyond close scrutiny. But, what I think Jevons is really driving at is that when little is known about a dynamical system, the updating of probabilities with new information is a means of arriving at the system's propensities (biases). In other words, we have a rough method of assigning a preliminary information value to that system (we are forming a set of primitives), which can be used as a stopgap until such time as a predictive algorithm based on higher sets is agreed upon, even if that algorithm also requires probabilities for predictive power. Presumably, the predictive power is superior because the propensities have now been well established.
So we can say that the inverse method, and the rule of succession, are in essence a mathematical systemization of an intuitive process, though a more finely gauged one. By extension, much of the "scientific method" follows such a process, where the job of investigators is to progressively screen out "mere correlation" as well as to improve system predictability.
That is, a set based on primitive observations is "mere correlation," and so, as Pearson argues, the edifice of science is built upon correlation, not cause. As Pearson points out, the notion of cause is very slippery, which is why he prefers the concept of correlation (32). However, he had very little engagement with set theory. I would say that what we often carelessly regard as "causes" are to be found in the mathematics of sets.
~(A ∩ B) may be thought of as the cause of ~A ∪ ~B.
Of course, I have left out the time elements, as I am only giving a simple example. What I mean is that sometimes the relations among higher-order sets correspond to "laws" and "causes."
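A toy check of that set identity (De Morgan's law), with complements taken relative to a small finite universe and, as noted, the time element left out:

U = set(range(10))        # a toy universe
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

lhs = U - (A & B)         # ~(A ∩ B), complement taken within U
rhs = (U - A) | (U - B)   # ~A ∪ ~B
print(lhs == rhs)         # True: the two descriptions pick out the same set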
On Markov chains
Conditional probability of course takes on various forms when it is applied. Consider a Markov chain, which is regarded as far more "legitimate" than Laplace's rule.
Grinstead and Snell give this example: The Land of Oz is a fine place, but the weather isn't very good. Ozmonians never have two nice days in a row. "If they have a nice day, they are just as likely to have snow as rain the next day. If they have snow or rain, they have an even chance of having the same the next day. If there is change from snow or rain, only half of the time is this a change to a nice day."
With this information, a Markov chain can be obtained and a matrix of "transition probabilities" written.
Grinstead and Snell give this theorem: Let P be the transition matrix of a Markov chain, and let u be the probability vector which represents the starting distribution. Then the probability that the chain is in state s_i after n steps is the ith entry in the vector u^(n) = uP^n (33).
Grinstead and Snell chapter on Markov chains
http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/Chapter11.pdf
Wolfram MathWorld on Markov chain
http://mathworld.wolfram.com/MarkovChain.html
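Here is a minimal sketch of the Land of Oz chain, using the transition matrix Grinstead and Snell derive from the description above (states ordered rain, nice, snow) and iterating u -> uP as in the theorem:

# States ordered: rain, nice, snow.
P = [
    [0.5,  0.25, 0.25],   # after rain
    [0.5,  0.0,  0.5 ],   # after a nice day (never two in a row)
    [0.25, 0.25, 0.5 ],   # after snow
]

def step(u, P):
    # One application of the transition matrix: u -> uP.
    return [sum(u[i] * P[i][j] for i in range(len(u))) for j in range(len(P[0]))]

u = [0.0, 1.0, 0.0]       # start on a nice day
for _ in range(10):
    u = step(u, P)
print(u)                  # near the stationary distribution (0.4, 0.2, 0.4)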
At least with a Markov process, the idea is to deploy non-zero propensity information, which is determined at some specified state of the system. Nevertheless, there is a question here as to what type of randomness is applicable. Where does one draw the line between subjective and objective in such a case? That depends on one's reality superstructure, as discussed later.
At any rate, it seems fair to say that what Bayesian algorithms, such as the rule of succession, tend to do is to justify via quantification our predisposition to "believe in" an event after multiple occurrences, a Darwinian trait we share with other mammals. Still, it should be understood that one is asserting one's psychological process in a way that "seems reasonable" but is at root faith-based and may be in error. More knowledge of physics may increase or decrease one's confidence, but intuitive assumptions remain faith-based.
It can be shown via logical methods that, as n rises, the number of opportunities for a Goldbach pair -- candidate decompositions of an even n into two primes -- rises roughly as n/2. So one might argue that the higher an arbitrary n, the less likely we are to find a counterexample. And computer checks verify this point.
Or one can use Laplace's rule of succession to argue that, after n confirmations, the probability that the proposition holds for the next case is given by (n+1)/(n+2). In both cases, at infinity, we have probability 1, or "virtual certainty," that Goldbach's conjecture is true, and yet it might not be, unless we mean that the proposition is practically true because it is assumed that an exception occurs only rarely. And yet, there remains the possibility that above some n, the behavior of the primes changes (there being so few). So we must beware the idea that such probabilities are even meaningful over infinity.
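A rough sketch of both kinds of "evidence": counting Goldbach decompositions for even n up to an arbitrary cutoff, then applying the (n+1)/(n+2) figure to the number of confirmations. The cutoff, and the very use of the rule here, are of course open to the objections just raised.

def is_prime(n):
    # Trial division; adequate for this small demonstration.
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def goldbach_pairs(n):
    # Count ways to write even n as p + q with p <= q, both prime.
    return sum(1 for p in range(2, n // 2 + 1) if is_prime(p) and is_prime(n - p))

confirmations = 0
for n in range(4, 2001, 2):
    if goldbach_pairs(n) == 0:
        print("counterexample at", n)   # never happens in this range
        break
    confirmations += 1

print(confirmations)                              # 999 even numbers checked
print((confirmations + 1) / (confirmations + 2))  # Laplacean "probability" for the next case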
At any rate, the confidence of mathematicians that the conjecture is true doesn't necessarily rise as n is pushed by ever more powerful computing. That's because no one has shown why no counterexample can occur. Now, one is entitled to act as though the conjecture is true. For example, one might include it in some practical software program.
A scientific method in the case of attacking public key cryptography is to use independent probabilities concerning primes as a way of escaping a great deal of factorization. One acts as though certain factorization conjectures are true, and that cuts the work involved. When such tests are applied several times, the probability of insufficient factorization drops considerably, meaning that a percentage of "uncrackable" factorizations will fall to this method.
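I read this as a reference to repeated, independent probabilistic tests involving primes, where each pass shrinks the chance of a wrong verdict. A hedged sketch using the Fermat primality test -- my choice of illustration, not necessarily the tests meant above; it is fooled by the rare Carmichael numbers, which is why practical systems use stronger variants:

import random

def fermat_test(n, rounds=20, seed=0):
    # Probabilistic primality check: each passing round lowers the chance
    # that a composite slips through (Carmichael numbers excepted).
    if n < 4:
        return n in (2, 3)
    rng = random.Random(seed)
    for _ in range(rounds):
        a = rng.randrange(2, n - 1)
        if pow(a, n - 1, n) != 1:
            return False     # certainly composite
    return True              # probably prime

print(fermat_test(2**61 - 1))   # a Mersenne prime: True
print(fermat_test(221))         # 221 = 13 x 17: almost surely False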
As Keynes shrewdly observed, a superstition may well be the result of the empirical method of assigning a non-numerical probability based on some correlations. For example, when iron plows were first introduced into Poland, that development was followed by a succession of bad harvests, whereupon many farmers revived the use of wooden plowshares. In other words, they acted on the basis of a hypothesis that at the time seemed reasonable.
They also had a different theory of cause and effect than we do today, though even today correlation is frequently taken for causation. This follows from the mammalian psychosoma program, which adopts the "survival oriented" theory that when an event is often followed by a positive or negative feeling, that event is taken to be the cause of the feeling.
Keynes notes the "common sense" that there is a "real existence of other people" may require an a priori assumption, an assumption that I would say implies the existence of a cognized, if denied, noumenal world. So the empirical, or inductive, notion that the real existence of a human being is "well established" we might say is circular.
Unlike many writers on the philosophy of science, Popper (34) rejected induction as a method of science. "And although I believe that in the history of science it is always the theory and not the experiment, always the idea and not the observation, which opens up the way to new knowledge, I also believe that it is the experiment which saves us from following a track that leads nowhere, which helps us out of the rut, and which challenges us to find a new way."
(Popper makes a good point that there are "diminishing returns of learning by induction," because lim [m,n --> ∞] (m/n) = 1. That is, as more evidence piles up, the value of each additional confirmation decreases.)
A note on complexity
Since it is, to me, inconceivable that a probabilistic scenario doesn't involve some dynamic system, it is evident that we construct a theory -- which in some disciplines is a mathematically based algorithm, or set of algorithms, for making predictions. The system with which we are working has initial value information and algorithmic program information. This information is non-zero and tends to yield propensities, or initial biases. However, the assumptions or primitive notions in the theory either derive from a subsidiary formalism or are found by empirical means; these primitives derive from experiential -- and hence unprovable -- frequency ratios.
I prefer to view the simplicity of a theory as a "small" statement (which may be nested inside a much larger statement). From the algorithmic perspective, we might say that the number of parameters is equivalent to the number of input values, or, better, that the simplicity corresponds to the information in the algorithm design and input. Simplicity and complexity may be regarded as two ends of a spectrum of binary string lengths.
Another way to view complexity is similar to the Chaitin algorithmic information ratio, but distinct. In this case, we look at the Shannon redundancy versus the Shannon total information.
So the complexity of a signal -- which could be the mathematical representation of a physical system -- would then not be found in the maximum information entailed by equiprobability of every symbol. The structure in the mathematical representation implies constraints -- or conditional probabilities for symbols. So then maximum structure is found when symbol A strictly implies symbol B in a binary system, which is tantamount to saying A = B, giving the uninteresting string: AA...A.
Maximum structure then violates our intuitive idea of complexity. So what do we mean by complexity in this sense?
A point that arises in such discussions concerns entropy (the tendency toward decrease of order) and the related idea of information, which is sometimes thought of as the surprisal value of a digit string. Sometimes a pattern such as AA...A is considered to have low information because we can easily calculate the nth value (assuming we are using some algorithm to obtain the string). So the Chaitin-Kolmogorov complexity is low, or that is, the information is low. On the other hand a string that by some measure is effectively random is considered here to be highly informative because the observer has almost no chance of knowing the string in detail in advance.
Leon Brillouin in Science and Information Theory gives a thorough and penetrating discussion of information theory and physical entropy. Physical entropy he regards as a special case under the heading of information theory (32aa).
Shannon's idea of maximum entropy for a bit string means that it has no redundancy, and so potentially carries the maximum amount of new information. This concept oddly ties together maximally random with maximally informative. It might help to think of the bit string as a carrier of information. Yet, because we screen out the consumer, there is no practical difference between the "actual" information value and the "potential" information value, which is why no one bothers with the "carrier" concept.
However, we can also take the opposite tack. Using runs testing, most digit strings (multi-value strings can often be transformed, for test purposes, to bi-value strings) are found under the bulge in the runs test bell curve and represent probable randomness. So it is unsurprising to encounter such a string. It is far more surprising to come across a string with far "too few" or far "too many" runs. These highly ordered strings would then, from this perspective, be considered to have high information value because they are possibly indicative of a non-random organizing principle.
This distinction may help address Stephen Wolfram's attempt to cope with "highly complex" automata (32a). By these, he means those with irregular, randomlike structures running through periodic "backgrounds" (sometimes called "ether"). If a sufficiently long runs test were done on such automata, we would obtain, I suggest, z scores in the high but not outlandish range. The z score would give a gauge of complexity.
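For concreteness, here is a sketch of the Wald-Wolfowitz runs test z score, which is the sort of figure I have in mind as a crude gauge:

import math

def runs_z_score(bits):
    # Wald-Wolfowitz runs test z score for a sequence of 0s and 1s.
    n1 = sum(bits)
    n2 = len(bits) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("need both symbols present")
    runs = 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    mu = 2 * n1 * n2 / (n1 + n2) + 1
    var = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
           / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return (runs - mu) / math.sqrt(var)

print(runs_z_score([0, 1] * 50))          # alternating: far too many runs, large positive z
print(runs_z_score([0] * 50 + [1] * 50))  # two solid blocks: far too few runs, large negative z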
We might distinguish complicatedness from complexity by saying that a random-like permutation of our grammatical symbols is merely complicated, but a grammatical permutation, taking into account transmission error, is complex.
In this respect, we might also construe complexity as a measure of coding efficiency.
So we know that "complexity" is a worthwhile concept, to be distinguished -- at times -- from "complicatedness." We would say that something that is maximally complicated has the quality of binomial "randomness;" it resides with the largest sets of combinations found in the 68% zone.
I suggest that we may as well define maximally complex to mean a constraint set yielding 50% redundancy in Shannon information. That is, I' = I - Ic, where I' is the new information, I is the maximum information that occurs when all symbols are equiprobable (zero structural or propensity information), and Ic is the information carried by the constraints -- the redundancy.
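A small sketch of the quantities involved, using first-order (single-symbol) frequencies only, so serial structure is ignored; the 50% figure is merely my suggested threshold.

import math
from collections import Counter

def entropy_per_symbol(s):
    # Empirical Shannon entropy in bits per symbol, from symbol frequencies alone.
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def redundancy(s):
    # Redundancy relative to the equiprobable maximum over the observed alphabet.
    # Assumes at least two distinct symbols are present.
    i_max = math.log2(len(set(s)))     # I: all symbols equiprobable
    i_actual = entropy_per_symbol(s)   # what the observed frequencies actually give
    return 1 - i_actual / i_max

print(redundancy("ABABABABAB"))    # 0.0 -- symbols equiprobable, so no first-order redundancy
print(redundancy("AAAAAAAAAB"))    # about 0.53 -- heavily biased toward A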
Consider two specific primes that are multiplied to form a composite. The names of the primes, together with the multiplication algorithm, may be given an advance information value Ic. Alice, who is doing the computation, has this "information," but doesn't know what the data stream will look like when the composite is computed. But she would be able to estimate the stream's approximate length and might know that certain substrings are very likely, or certain. That is, she has enough advance information to devise conditional probabilities for the characters.
Bob encounters the data string and wishes to decipher it. He lacks part of Ic: the names of the primes. So there is more information in the string for him than for Alice. He learns more once he deciphers it than does she, who needn't decipher.
In this respect we see that for him the characters are closer to equiprobability, or maximum Shannon entropy, than for Alice. For him, the amount of information is strongly correlated with the algorithmic work involved. After finding the square root, his best algorithm -- if he wants to be certain of obtaining the primes -- is the sieve of Eratosthenes. This is considered a "hard" computing problem, as the work increases exponentially with the number of digits.
On the other hand, if Alice wants to compute p_n x p_m, her work increases close to linearly.
A string with maximum Shannon entropy means that the work of decipherment is very close to k^n, where k is the base of the number system and n the string length.
We see then that algorithmic information and standard Shannon information are closely related by the concept of computing work.
Another way to view complexity is via autocorrelation. So an autocorrelation coefficient near 1 or -1 can be construed to imply high "order." As Wikipedia notes, autocorrelation, also known as serial correlation, is the cross-correlation of a signal with itself. Informally, it is the similarity between observations as a function of the time lag between them. It is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise, or identifying the missing fundamental frequency in a signal implied by its harmonic frequencies. It is often used in signal processing for analyzing functions or series of values, such as time domain signals.
Multidimensional autocorrelation can also be used as a gauge of complexity. However, it would seem that any multidimensional signal could be mapped onto a two-dimensional signal graph. (I concede I should look into this further at some point.) But, we see that the correlation coefficient, whether auto or no, handles randomness in a way that is closely related to the normal curve. Hence, the correlation coefficient for something highly complex would fall somewhere near 1 or -1, but not too close, because, in general, extreme order is rather uncomplicated.
One can see that the autocorrelation coefficient is a reflection of Shannon's redundancy quantity. (I daresay there is an expression equating or nearly equating the two.)
When checking the randomness of a signal, the autocorrelation lag time is usually put at 1, according to the National Institute of Standards and Technology, which relates the following:
Given measurements Y_1, Y_2, ..., Y_N at times X_1, X_2, ..., X_N, the lag-k autocorrelation function is defined as
r_k = Σ_{i=1}^{N-k} (Y_i - Y')(Y_{i+k} - Y') / Σ_{i=1}^{N} (Y_i - Y')^2
with Y' representing the mean of the Y values.
Although the time variable, X, is not used in the formula for autocorrelation, the assumption is that the observations are equi-spaced.
Wikipedia article on autocorrelation
http://en.wikipedia.org/wiki/Autocorrelation
NIST article on autocorrelation
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35c.htm
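A direct transcription of that formula, with two contrived inputs to show the contrast; the sine wave and the pseudo-random draws are my own examples, not NIST's.

import math
import random

def autocorrelation(y, k=1):
    # Lag-k autocorrelation, following the NIST formula quoted above.
    n = len(y)
    mean = sum(y) / n
    num = sum((y[i] - mean) * (y[i + k] - mean) for i in range(n - k))
    den = sum((v - mean) ** 2 for v in y)
    return num / den

rng = random.Random(0)
signal = [math.sin(0.3 * i) for i in range(500)]   # smooth, periodic
noise = [rng.random() for _ in range(500)]         # independent draws

print(autocorrelation(signal))   # close to +1: strong lag-1 structure
print(autocorrelation(noise))    # near 0: essentially no lag-1 structure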
In another vein, consider the Cartesian product A X B of phenomena in A related to phenomena in B, such that a is a member of A, b is a member of B, aRb means "a followed by b," and the equivalence relation applies -- that is, the relation is reflexive, symmetric and transitive.
One algorithm may obtain a smaller subset of A X B than does another. The superior algorithm fetches the larger subset, with the caveat that an "inferior" algorithm may be preferred because its degree of informational complexity is lower than that of the "superior" algorithm.
One might say that algorithm X has more "explanatory power" than algorithm Y if X obtains a larger subset of A X B than does Y and, depending on one's inclination, if X also entails "substantially less" work than does Y.
The method of science works much like the technique of bringing out a logic proof via several approximations. Insight can occur once an approximation is completed and the learner is then prepared for the next approximation or the final proof.
This is analogous to deciphering a lengthy message. One may have hard information, or be required to speculate, about a part of the cipher. One then progresses -- hopefully -- as the new information helps unravel the next stage. That is, the information in the structure (or, to use Shannon's term, in the redundancy) is crucial to the decipherment. Which is to say that a Bayesian style of thinking is operative. New information alters probabilities assigned certain substrings.
Decipherment of a coded or noisy message is a pretty good way of illustrating why a theory might be considered valid. Once part of the "message" has been analyzed as having a fair probability of meaning X, the scientist ("decoder") uses that provisional information, along with any external information at hand, to make progress in reading the message. Once a nearly complete message/theory is revealed, the scientist/decoder and her associates believe they have cracked the "code" based on the internal consistency of their findings (the message).
In the case of science in general, however, no one knows how long the message is, or what would assuredly constitute "noise" in the signal (perhaps, a priori wrong ideas?). So the process is much fuzzier than the code cracker's task.
Interestingly, Alan Turing and his colleagues used Bayesian conditional probabilities as part of their decipherment program, establishing that such methods, whatever the logical objections, work quite well in some situations. However, though the code-cracking analogy is quite useful, it seems doubtful that one could use some general method of assigning probabilities -- whether of the Turing or Shannon variety -- to scientific theories, other than possibly to toy models.
Scientists usually prefer to abstain from metaphysics, but their method clearly raises the question: "If the universe is the signal, what is the transmitter?" or "Can the message transmit itself?" Another fair question is: "If the universe is the message, can part of the message (we humans) read the message fully?"
We have a problem of representation when we pose the question: "Can science find an algorithm that, in principle, simulates the entire universe?" The answer is that no Turing machine can model the entire universe.
Conant on Hilbert's sixth problem
LINK TO PART III IN SIDEBAR
15. A Treatise on Probability by J.M. Keynes (Macmillan, 1921).
16. Treatise, Keynes.
17. Treatise, Keynes.
18. Treatise, Keynes.
19. Grundbegriffe der Wahrscheinlichkeitsrechnung (in German) by Andrey Kolmogorov (Julius Springer, 1933). Foundations of the Theory of Probability (in English) was published by Chelsea Publishing Co. in 1950.
20. Popper's propensity theory is discussed in his Postscript to The Logic of Scientific Discovery, which was published in three volumes as:
a. Realism and the Aim of Science, Postscript Volume I (Routledge, 1985; Hutchinson, 1983).
b. The Open Universe, Postscript Volume II (Routledge, 1988; Hutchinson, 1982).
c. Quantum Theory and the Schism in Physics, Postscript Volume III (Routledge, 1989; Hutchinson, 1982).
21. Symmetry by Hermann Weyl (Princeton, 1952).
21a. Treatise, Keynes.
22. Theory of Probability by Harold Jeffreys (Oxford/Clarendon Third edition 1961; originally published in 1939)
23. Gamma: Exploring Euler's Constant by Julian Havil (Princeton, 2003).
24. Interpreting Probability: controversies and developments of the early 20th century by David Howie (Cambridge 2002).
25. The Logic of Scientific Discovery by Karl Popper. Published as Logik der Forschung in 1935; English version published by Hutchinson in 1959.
26. Popper's propensity theory is discussed in his Postscript to The Logic of Scientific Discovery, which was published in three volumes as:
a. Realism and the Aim of Science, Postscript Volume I (Routledge, 1985; Hutchinson, 1983).
b. The Open Universe, Postscript Volume II (Routledge, 1988; Hutchinson, 1982).
c. Quantum Theory and the Schism in Physics, Postscript Volume III (Routledge, 1989; Hutchinson, 1982).
27. The Signal and the Noise: Why So Many Predictions Fail But Some Don't by Nate Silver (Penguin, 2012).
In his book, Silver passively accepts the official yarn about the events of 9/11, and makes an egregious error in the logic behind his statistical discussion of terrorist acts.
Amazing blunder drowns out 'Signal'
http://conantcensorshipissue.blogspot.com/2013/04/amazing-blunder-drowns-out-signal-on.html
For a kinder review, here is
John Allen Paulos on Silver's book
http://www.washingtonpost.com/opinions/the-signal-and-the-noise-why-so-many-predictions-fail--but-some-dont-by-nate-silver/2012/11/09/620bf2d0-0671-11e2-a10c-fa5a255a9258_story.html
28. A Treatise on Probability by J.M. Keynes (Macmillan, 1921).
29. Treatise, Keynes.
30. Calculated Risks: How to Know When Numbers Deceive You by Gerd Gigerenzer (Simon and Schuster, 2002).
31. The Principles of Science (Vol I) by William Stanley Jevons (Routledge/Thoemmes Press, 1996 reprint of 1874 ms).
32. The Grammar of Science by Karl Pearson (Meridian 1957 reprint of 1911 revised edition).
32aa. Science and Information Theory, Second Edition, by Leon Brillouin (Dover 2013 reprint of Academic Press 1962 edition; first edition, 1956).
32a. A New Kind of Science by Stephen Wolfram (Wolfram Media, 2002).
33. Charles M. Grinstead and Laurie J. Snell Introduction to Probability, Second Edition (American Mathematical Society 1997).
34. The Logic of Scientific Discovery by Karl Popper.
35. Probability by Mark Kac in "The Mathematical Sciences -- A collection of essays" (MIT 1969). The essay appeared originally in Scientific American, Vol. 211, No. 3, 1964.