Ended: March 24, 2018
On its face, Bayes’ rule is a simple, one-line theorem: by updating our initial belief about something with objective new information, we get a new and improved belief. To its adherents, this is an elegant statement about learning from experience. Generations of converts remember experiencing an almost religious epiphany as they fell under the spell of its inner logic. Opponents, meanwhile, regard Bayes’ rule as subjectivity run amok.
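For reference, the rule itself in its usual one-line form (my notation: H for the hypothesis or belief, D for the new data):

```latex
% Posterior belief = prior belief reweighted by how well it predicts the new data.
P(H \mid D) \;=\; \frac{P(D \mid H)\, P(H)}{P(D)}
```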
At its heart, Bayes runs counter to the deeply held conviction that modern science requires objectivity and precision. Bayes is a measure of belief. And it says that we can learn even from missing and inadequate data, from approximations, and from ignorance.
The Scottish philosopher David Hume published an essay attacking some of Christianity’s fundamental narratives. Hume believed that we can’t be absolutely certain about anything that is based only on traditional beliefs, testimony, habitual relationships, or cause and effect. In short, we can rely only on what we learn from experience. Because God was regarded as the First Cause of everything, Hume’s skepticism about cause-and-effect relationships was especially unsettling. Hume argued that certain objects are constantly associated with each other. But the fact that umbrellas and rain appear together does not mean that umbrellas cause rain. The fact that the sun has risen thousands of times does not guarantee that it will do so the next day.
Napoleon complained that “Newton spoke of God in his book. I have perused yours but failed to find His name even once. Why?” “Sire,” Laplace replied magisterially, “I have no need of that hypothesis.”
Awash in newly collected data, the revisionists preferred to judge the probability of an event according to how frequently it occurred among many observations. Eventually, adherents of this frequency-based probability became known as frequentists or sampling theorists.
When critics thought of Bayes’ rule, they thought of it as Laplace’s rule and focused their criticism on him and his followers. Arguing that probabilities should be measured by objective frequencies of events rather than by subjective degrees of belief, they treated the two approaches as opposites, although Laplace had considered them basically equivalent.
In a remarkable confluence of thinking, three men in three different countries independently came up with the same idea about Bayes: knowledge is indeed highly subjective, but we can quantify it with a bet. The amount we wager shows how much we believe in something.
Savage was confronting the thorniest objection to Bayesian methods: “If prior opinions can differ from one researcher to the next, what happens to scientific objectivity in data analysis?”17 Elaborating on Jeffreys, Savage answered as follows: as the amount of data increases, subjectivists move into agreement, the way scientists come to a consensus as evidence accumulates about, say, the greenhouse effect or about cigarettes being the leading cause of lung cancer. When they have little data, scientists disagree and are subjectivists; when they have piles of data, they agree and become objectivists. Lindley agreed: “That’s the way science is done.”
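A toy illustration of that convergence (my own example, not the book’s): two analysts start from very different priors about a coin’s bias, see the same thousand flips, and end up with nearly identical posteriors.

```python
import numpy as np
from scipy.stats import beta

# Two analysts with opposite priors about a coin's bias theta.
rng = np.random.default_rng(2)
flips = rng.binomial(1, 0.7, size=1000)          # the coin's "true" bias is 0.7
heads, tails = flips.sum(), len(flips) - flips.sum()

# Analyst A starts sceptical, Beta(1, 9); analyst B starts enthusiastic, Beta(9, 1).
post_a = beta(1 + heads, 9 + tails)
post_b = beta(9 + heads, 1 + tails)
print(post_a.mean(), post_b.mean())              # both close to 0.7, and to each other
```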
According to the likelihood principle, all the information in experimental data gets encapsulated in the likelihood portion of Bayes’ theorem, the part describing the probability of the objective new data; the prior plays no role in it. Practically speaking, the principle greatly streamlined analysis. Scientists could stop running an experiment when they were satisfied with the result or ran out of time, money, and patience; non-Bayesians had to continue until some frequency criterion was met. Bayesians could also concentrate on what happened, not on what could have happened according to Neyman-Pearson’s sampling plan.
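A small sketch of what that means in practice (a hypothetical coin-flip example of my own): nine successes and three failures give the same posterior whether the plan was “flip twelve times” or “flip until the third failure,” because the two likelihoods differ only by a constant that cancels in Bayes’ rule.

```python
import numpy as np
from scipy.stats import binom, nbinom

theta = np.linspace(0.001, 0.999, 999)          # grid of possible success probabilities
prior = np.ones_like(theta)                      # flat prior, for simplicity

# Same data (9 successes, 3 failures), two different stopping rules:
lik_fixed_n = binom.pmf(9, 12, theta)            # plan: flip exactly 12 times
lik_stop_at_3 = nbinom.pmf(9, 3, 1 - theta)      # plan: flip until the 3rd failure

def posterior(lik):
    unnorm = lik * prior
    return unnorm / unnorm.sum()                 # normalize over the grid

print(np.allclose(posterior(lik_fixed_n), posterior(lik_stop_at_3)))   # True
```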
As far as Savage was concerned, Bayes’ rule filled a need that other statistical procedures could not. Frequentism’s origin in genetics and biology meant it was involved with group phenomena, populations, and large aggregations of similar objects. As for using statistical methods in biology or physics, the Nobel Prize–winning physicist Erwin Schrödinger said, “The individual case [is] entirely devoid of interest.”22 Bayesians like Savage, though, could work with isolated one-time events, such as the probability that a chair weighs 20 pounds, that a plane would be late, or that the United States would be at war in five years.
Bayesians could also combine information from different sources, treat observables as random variables, and assign probabilities to all of them, whether they formed a bell-shaped curve or some other shape. Bayesians used all their available data because each fact could change the answer by a small amount. Frequency-based statisticians threw up their hands when Savage inquired whimsically, “Does whiskey do more harm than good in the treatment of snake bite?” Bayesians grinned and retorted, “Whiskey probably does more harm than good.”
A compromise between Bayesian and anti-Bayesian methods began to look attractive. The idea was to estimate the initial probabilities according to their relative frequency and then proceed with the rest of Bayes’ rule. Empirical Bayes, as it was called, seemed like a breakthrough.
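A minimal sketch of that recipe (made-up data, and a beta-binomial model assumed purely for illustration): fit the prior to the frequencies observed across many similar cases, then apply Bayes’ rule to each individual case.

```python
import numpy as np

# Made-up ensemble: success counts for 500 similar units (e.g., ad click-throughs).
rng = np.random.default_rng(0)
true_rates = rng.beta(2, 20, size=500)                 # unknown in a real problem
trials = rng.integers(50, 500, size=500)
successes = rng.binomial(trials, true_rates)

# Step 1 (the "empirical" part): estimate a Beta(a, b) prior from the observed
# relative frequencies by the method of moments.
rates = successes / trials
m, v = rates.mean(), rates.var()
common = m * (1 - m) / v - 1
a_hat, b_hat = m * common, (1 - m) * common

# Step 2: proceed with Bayes' rule as usual; each posterior mean shrinks the raw
# frequency toward the ensemble average.
posterior_mean = (successes + a_hat) / (trials + a_hat + b_hat)
print(rates[:3])
print(posterior_mean[:3])
```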
He accepted the validity of both kinds of probability: probability as degrees of belief and probability as relative frequency.
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” Tukey publicized how even slight deviations from the normal model could muddle the methods of Fisher, Neyman, and Egon Pearson.
When Smith spoke at a workshop in Quebec in June 1989, he showed that Markov chain Monte Carlo could be applied to almost any statistical problem. It was a revelation. Bayesians went into “shock induced by the sheer breadth of the method.”12 By replacing integration with Markov chains, they could finally, after 250 years, work with realistic priors and likelihood functions and carry out the difficult computations needed to get posterior probabilities.
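To make the mechanics concrete, here is a bare-bones random-walk Metropolis sampler (my own minimal sketch, not the setup from Smith’s talk): it draws from a posterior known only up to a constant, so the integral in the denominator of Bayes’ rule never has to be computed.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(3.0, 1.0, size=50)                 # observations with unknown mean mu

def log_unnorm_posterior(mu):
    log_prior = -0.5 * mu**2 / 10.0                  # Normal(0, variance 10) prior, unnormalized
    log_lik = -0.5 * np.sum((data - mu) ** 2)        # Normal(mu, 1) likelihood, unnormalized
    return log_prior + log_lik

samples, mu = [], 0.0
for _ in range(20_000):
    proposal = mu + rng.normal(0.0, 0.5)             # random-walk proposal
    # Metropolis acceptance: only the ratio of unnormalized posteriors matters.
    if np.log(rng.uniform()) < log_unnorm_posterior(proposal) - log_unnorm_posterior(mu):
        mu = proposal
    samples.append(mu)

print(np.mean(samples[5_000:]))                      # approximate posterior mean of mu, near 3
```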
In this ecumenical atmosphere, two longtime opponents—Bayes’ rule and Fisher’s likelihood approach—ended their cold war and, in a grand synthesis, supported a revolution in modeling. Many of the newer practical applications of statistical methods are the results of this truce. As a collection of computational and statistical machinery, Bayes is still driven by Bayes’ rule. The word “Bayes” still entails the idea, shared by de Finetti, Ramsey, Savage, and Lindley, that probability is a measure of belief and that it can, as Lindley put it, “escape from repetition to uniqueness.” That said, most modern Bayesians accept that the frequentism of Fisher, Neyman, and Egon Pearson is still effective for most statistical problems: for simple and standard analyses, for checking how well a hypothesis fits data, and as the foundation of many modern technologies in areas such as machine learning.
Harsanyi often used Bayes to study competitive situations where people have incomplete or uncertain information about each other or about the rules. Harsanyi also showed that Nash’s equilibrium for games with incomplete or imperfect information was a form of Bayes’ rule.
Psychologists Amos Tversky, who died before the Nobel Prize was awarded, and Daniel Kahneman showed that people do not make decisions according to rational Bayesian procedures. People answer survey questions depending on their phrasing, and physicians choose surgery or radiation for cancer patients depending on whether the treatments are described in terms of mortality or survival rates.
Naïve Bayes assumes simplistically that every variable is independent of the others; thus, a patient’s fever and elevated white blood cell counts are treated as if they had nothing to do with each other. According to Google’s research director, Peter Norvig, “There must have been dozens of times when a project started with naïve Bayes, just because it was easy to do and we expected to replace it with something more sophisticated later, but in the end the vast amount of data meant that a more complex technique was not needed.”
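A toy version of the idea (the vocabulary, counts, and labels are invented for illustration): each word is treated as independent given the class, the prior and per-word probabilities come straight from the training counts, and the posterior is just the prior times the product of per-word likelihoods.

```python
import numpy as np

# Tiny Bernoulli naive Bayes "spam filter" on made-up data.
vocab = ["free", "meeting", "winner", "report"]
X = np.array([[1, 0, 1, 0],   # spam
              [1, 0, 0, 0],   # spam
              [0, 1, 0, 1],   # ham
              [0, 1, 0, 0]])  # ham
y = np.array([1, 1, 0, 0])    # 1 = spam, 0 = ham

# Class priors and per-word probabilities, with Laplace (+1) smoothing.
prior = np.array([(y == c).mean() for c in (0, 1)])
word_prob = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2) for c in (0, 1)])

def predict(x):
    # Posterior ∝ prior × product of independent per-word likelihoods (in log space).
    log_post = np.log(prior) + (x * np.log(word_prob)
                                + (1 - x) * np.log(1 - word_prob)).sum(axis=1)
    return int(np.argmax(log_post))

print(predict(np.array([1, 0, 0, 0])))   # a message containing only "free" -> 1 (spam)
```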