Algorithms to Live By: The Computer Science of Human Decisions
Brian Christian and Tom Griffiths

Ended: May 9, 2016

(A mathematician might say you have access only to the ordinal numbers—the relative ranks of the applicants compared to each other—but not to the cardinal numbers, their ratings on some kind of general scale.)
He didn’t know how many women he could expect to meet in his lifetime, but there’s a certain flexibility in the 37% Rule: it can be applied to either the number of applicants or the time over which one is searching. Assuming that his search would run from ages eighteen to forty, the 37% Rule gave age 26.1 years as the point at which to switch from looking to leaping. A number that, as it happened, was exactly Trick’s age at the time. So when he found a woman who was a better match than all those he had dated so far, he knew exactly what to do. He leapt. “I didn’t know if she was Perfect (the assumptions of the model don’t allow me to determine that), but there was no doubt that she met the qualifications for this step of the algorithm. So I proposed,” he writes. “And she turned me down.”
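The age quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the 37% figure is the usual 1/e fraction of the search interval; the function name `leap_point` is hypothetical.

```python
import math

def leap_point(start, end):
    """Point in a search interval where the 37% Rule says to stop looking.

    Assumes the rule's threshold is 1/e (about 36.8%) of the interval.
    """
    return start + (end - start) / math.e

# A search running from age eighteen to age forty.
switch_age = leap_point(18, 40)
print(round(switch_age, 1))  # 26.1
```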
Both Kepler and Trick—in opposite ways—experienced firsthand some of the ways that the secretary problem oversimplifies the search for love. In the classical secretary problem, applicants always accept the position, preventing the rejection experienced by Trick. And they cannot be “recalled” once passed over, contrary to the strategy followed by Kepler.
The possibility of rejection, for instance, has a straightforward mathematical solution: propose early and often. If you have, say, a 50/50 chance of being rejected, then the same kind of mathematical analysis that yielded the 37% Rule says you should start making offers after just a quarter of your search.
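The "quarter of your search" figure can be checked against a closed form. The formula used here, q^(1/(1-q)) with q the chance that an offer is accepted, is not stated in the passage above; it is quoted as an assumption from the literature on secretary problems with rejection, and the function name is hypothetical.

```python
import math

def start_offers_fraction(accept_prob):
    """Fraction of the search to spend only looking before making offers.

    Assumed closed form: q ** (1 / (1 - q)), where q is the probability
    that any single offer is accepted (0 < q < 1).
    """
    if not 0 < accept_prob < 1:
        raise ValueError("accept_prob must be strictly between 0 and 1")
    return accept_prob ** (1 / (1 - accept_prob))

print(start_offers_fraction(0.5))  # 0.25

# As rejection becomes vanishingly rare, the threshold approaches 1/e,
# which is the ordinary 37% Rule.
print(start_offers_fraction(0.999999), 1 / math.e)
```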
The University of Wisconsin–Madison’s Laura Albert McLay, an optimization expert, recalls turning to her knowledge of optimal stopping problems when it came time to sell her own house. “The first offer we got was great,” she explains, “but it had this huge cost because they wanted us to move out a month before we were ready. There was another competitive offer…[but] we just kind of held out until we got the right one.” For many sellers, turning down a good offer or two can be a nerve-racking proposition, especially if the ones that immediately follow are no better. But McLay held her ground and stayed cool. “That would have been really, really hard,” she admits, “if I didn’t know the math was on my side.”
It’s something of a policy reminder to municipal governments: parking is not as simple as having a resource (spots) and maximizing its utilization (occupancy). Parking is also a process—an optimal stopping problem—and it’s one that consumes attention, time, and fuel, and generates both pollution and congestion.
but there are sequential decision-making problems for which there is no optimal stopping rule. A simple example is the game of “triple or nothing.” Imagine you have $1.00, and can play the following game as many times as you want: bet all your money, and have a 50% chance of receiving triple the amount and a 50% chance of losing your entire stake. How many times should you play? Despite its simplicity, there is no optimal stopping rule for this problem, since each time you play, your average gains are a little higher.
The math shows that you should always keep playing. But if you follow this strategy, you will eventually lose everything. Some problems are better avoided than solved.
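The tension in "triple or nothing" shows up in two numbers that move in opposite directions. The sketch below, with a hypothetical function name, computes both for a player who commits to a fixed number of rounds: the average bankroll grows by a factor of 1.5 per round, while the chance of still holding any money halves.

```python
def triple_or_nothing(rounds, stake=1.0):
    """Return (expected bankroll, probability of not being wiped out)."""
    # Each round: 50% chance of tripling, 50% chance of zero,
    # so the average multiplies by 0.5 * 3 + 0.5 * 0 = 1.5.
    expected = stake * 1.5 ** rounds
    # Surviving means winning every single round.
    p_survive = 0.5 ** rounds
    return expected, p_survive

for n in (1, 5, 10, 20):
    expected, p_survive = triple_or_nothing(n)
    print(f"{n:2d} rounds: average ${expected:,.2f}, survival chance {p_survive:.6f}")
```

Every extra round raises the average, which is why no stopping rule is optimal, yet the survival chance shrinks toward zero.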
I expect to pass through this world but once. Any good therefore that I can do, or any kindness that I can show to any fellow creature, let me do it now. Let me not defer or neglect it, for I shall not pass this way again. —STEPHEN GRELLET
Whether it involves secretaries, fiancé(e)s, or apartments, life is full of optimal stopping. So the irresistible question is whether—by evolution or education or intuition—we actually do follow the best strategies. At first glance, the answer is no. About a dozen studies have produced the same result: people tend to stop early, leaving better applicants unseen.
This type of cost offers a potential explanation for why people stop early when solving a secretary problem in the lab. Seale and Rapoport showed that if the cost of seeing each applicant is imagined to be, for instance, 1% of the value of finding the best secretary, then the optimal strategy would perfectly align with where people actually switched from looking to leaping in their experiment.
In English, the words “explore” and “exploit” come loaded with completely opposite connotations. But to a computer scientist, these words have much more specific and neutral meanings. Simply put, exploration is gathering information, and exploitation is using the information you have to get a known good result.
In computer science, the tension between exploration and exploitation takes its most concrete form in a scenario called the “multi-armed bandit problem.” The odd name comes from the colloquial term for a casino slot machine, the “one-armed bandit.” Imagine walking into a casino full of different slot machines, each one with its own odds of a payoff. The rub, of course, is that you aren’t told those odds in advance: until you start playing, you won’t have any idea which machines are the most lucrative (“loose,” as slot-machine aficionados call it) and which ones are just money sinks.
“Carpe diem,” urges Robin Williams in one of the most memorable scenes of the 1989 film Dead Poets Society. “Seize the day, boys. Make your lives extraordinary.” It’s incredibly important advice. It’s also somewhat self-contradictory. Seizing a day and seizing a lifetime are two entirely different endeavors. We have the expression “Eat, drink, and be merry, for tomorrow we die,” but perhaps we should also have its inverse: “Start learning a new language or an instrument, and make small talk with a stranger, because life is long, and who knows what joy could blossom over many years’ time.” When balancing favorite experiences and new ones, nothing matters as much as the interval over which we plan to enjoy them.
“I’m more likely to try a new restaurant when I move to a city than when I’m leaving it,” explains data scientist and blogger Chris Stucchio, a veteran of grappling with the explore/exploit tradeoff in both his work and his life. “I mostly go to restaurants I know and love now, because I know I’m going to be leaving New York fairly soon. Whereas a couple years ago I moved to Pune, India, and I just would eat friggin’ everywhere that didn’t look like it was gonna kill me.”
From a studio’s perspective, a sequel is a movie with a guaranteed fan base: a cash cow, a sure thing, an exploit. And an overload of sure things signals a short-termist approach, as with Stucchio on his way out of town. The sequels are more likely than brand-new movies to be hits this year, but where will the beloved franchises of the future come from? Such a sequel deluge is not only lamentable (certainly critics think so); it’s also somewhat poignant. By entering an almost purely exploit-focused phase, the film industry seems to be signaling a belief that it is near the end of its interval. A look into the economics of Hollywood confirms this hunch. Profits of the largest film studios declined by 40% between 2007 and 2011, and ticket sales have declined in seven of the past ten years. As the Economist puts it, “Squeezed between rising costs and falling revenues, the big studios have responded by trying to make more films they think will be hits: usually sequels, prequels, or anything featuring characters with name recognition.” In other words, they’re pulling the arms of the best machines they’ve got before the casino turns them out.
Gittins (albeit many years before the first episode of Deal or No Deal aired) realized that the multi-armed bandit problem is no different. For every slot machine we know little or nothing about, there is some guaranteed payout rate which, if offered to us in lieu of that machine, will make us quite content never to pull its handle again. This number—which Gittins called the “dynamic allocation index,” and which the world now knows as the Gittins index—suggests an obvious strategy on the casino floor: always play the arm with the highest index.
The Gittins index, then, provides a formal, rigorous justification for preferring the unknown, provided we have some opportunity to exploit the results of what we learn from exploring. The old adage tells us that “the grass is always greener on the other side of the fence,” but the math tells us why: the unknown has a chance of being better, even if we actually expect it to be no different, or if it’s just as likely to be worse. The untested rookie is worth more (early in the season, anyway) than the veteran of seemingly equal ability, precisely because we know less about him. Exploration in itself has value, since trying new things increases our chances of finding the best. So taking the future into account, rather than focusing just on the present, drives us toward novelty.
Regret can also be highly motivating. Before he decided to start, Jeff Bezos had a secure and well-paid position at the investment company D. E. Shaw & Co. in New York. Starting an online bookstore in Seattle was going to be a big leap—something that his boss (that’s D. E. Shaw) advised him to think about carefully. Says Bezos: The framework I found, which made the decision incredibly easy, was what I called—which only a nerd would call—a “regret minimization framework.” So I wanted to project myself forward to age 80 and say, “Okay, now I’m looking back on my life. I want to have minimized the number of regrets I have.” I knew that when I was 80 I was not going to regret having tried this. I was not going to regret trying to participate in this thing called the Internet that I thought was going to be a really big deal. I knew that if I failed I wouldn’t regret that, but I knew the one thing I might regret is not ever having tried. I knew that that would haunt me every day, and so, when I thought about it that way it was an incredibly easy decision.
Like the Gittins index, therefore, Upper Confidence Bound algorithms assign a single number to each arm of the multi-armed bandit. And that number is set to the highest value that the arm could reasonably have, based on the information available so far. So an Upper Confidence Bound algorithm doesn’t care which arm has performed best so far; instead, it chooses the arm that could reasonably perform best in the future. If you have never been to a restaurant before, for example, then for all you know it could be great. Even if you have gone there once or twice, and tried a couple of their dishes, you might not have enough information to rule out the possibility that it could yet prove better than your regular favorite. Like the Gittins index, the Upper Confidence Bound is always greater than the expected value, but by less and less as we gain more experience with a particular option. (A restaurant with a single mediocre review still retains a potential for greatness that’s absent in a restaurant with hundreds of such reviews.) The recommendations given by Upper Confidence Bound algorithms will be similar to those provided by the Gittins index, but they are significantly easier to compute, and they don’t require the assumption of geometric discounting.
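One concrete member of this family can be sketched in a few lines. The passage does not give a formula, so the bonus term below (the UCB1 rule, mean plus sqrt(2 ln t / n)) is an assumption chosen for illustration, as are the function names and the simulated payout odds.

```python
import math
import random

def ucb1_choose(counts, totals):
    """Pick the arm with the highest upper confidence bound.

    counts[a] = times arm a was played, totals[a] = reward it has paid.
    """
    # An arm never tried "could be great": play it before anything else.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    t = sum(counts)
    scores = [
        totals[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)

def simulate(payout_probs, pulls, seed=0):
    """Play a set of simulated slot machines; return pulls per arm."""
    rng = random.Random(seed)
    counts = [0] * len(payout_probs)
    totals = [0.0] * len(payout_probs)
    for _ in range(pulls):
        arm = ucb1_choose(counts, totals)
        reward = 1.0 if rng.random() < payout_probs[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts

# One disappointing visit versus a hundred decent ones: the barely
# known option still has the higher bound, so it gets another try.
print(ucb1_choose([1, 100], [0.0, 60.0]))  # 0
print(simulate([0.2, 0.8], pulls=500))
```

The bonus shrinks as an arm is played more, which matches the description above: the bound always sits above the observed average, but by less and less with experience.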
Upper Confidence Bound algorithms implement a principle that has been dubbed “optimism in the face of uncertainty.” Optimism, they show, can be perfectly rational. By focusing on the best that an option could be, given the evidence obtained so far, these algorithms give a boost to possibilities we know less about. As a consequence, they naturally inject a dose of exploration into the decision-making process, leaping at new options with enthusiasm because any one of them could be the next big thing. The same principle has been used, for instance, by MIT’s Leslie Kaelbling, who builds “optimistic robots” that explore the space around them by boosting the value of uncharted terrain. And it clearly has implications for human lives as well.
The success of Upper Confidence Bound algorithms offers a formal justification for the benefit of the doubt. Following the advice of these algorithms, you should be excited to meet new people and try new things—to assume the best about them, in the absence of evidence to the contrary. In the long run, optimism is the best prevention for regret.
If you’ve used the Internet basically at all over the past decade, then you’ve been a part of someone else’s explore/exploit problem. Companies want to discover the things that make them the most money while simultaneously making as much of it as they can—explore, exploit. Big tech firms such as Amazon and Google began carrying out live A/B tests on their users starting in about 2000, and over the following years the Internet has become the world’s largest controlled experiment. What are these companies exploring and exploiting? In a word, you: whatever it is that makes you move your mouse and open your wallet.
Within a decade or so after its first tentative use, A/B testing was no longer a secret weapon. It has become such a deeply embedded part of how business and politics are conducted online as to be effectively taken for granted. The next time you open your browser, you can be sure that the colors, images, text, perhaps even the prices you see—and certainly the ads—have come from an explore/exploit algorithm, tuning itself to your clicks. In this particular multi-armed bandit problem, you’re not the gambler; you’re the jackpot.
Between 1932 and 1972, several hundred African-American men with syphilis in Macon County, Alabama, went deliberately untreated by medical professionals, as part of a forty-year experiment by the US Public Health Service known as the Tuskegee Syphilis Study. In 1966, Public Health Service employee Peter Buxtun filed a protest. He filed a second protest in 1968. But it was not until he broke the story to the press—it appeared in the Washington Star on July 25, 1972, and was the front-page story in the New York Times the next day—that the US government finally halted the study. What followed the public outcry, and the subsequent congressional hearing, was an initiative to formalize the principles and standards of medical ethics. A commission held at the pastoral Belmont Conference Center in Maryland resulted in a 1979 document known as the Belmont Report. The Belmont Report lays out a foundation for the ethical practice of medical experiments, so that the Tuskegee experiment—an egregious, unambiguously inappropriate breach of the health profession’s duty to its patients—might never be repeated.
From 1982 to 1984, Bartlett and his colleagues at the University of Michigan performed a study on newborns with respiratory failure. The team was clear that they wanted to address, as they put it, “the ethical issue of withholding an unproven but potentially lifesaving treatment,” and were “reluctant to withhold a lifesaving treatment from alternate patients simply to meet conventional random assignment technique.” Hence they turned to Zelen’s algorithm. The strategy resulted in one infant being assigned the “conventional” treatment and dying, and eleven infants in a row being assigned the experimental ECMO treatment, all of them surviving. Between April and November of 1984, after the end of the official study, ten additional infants met the criteria for ECMO treatment. Eight were treated with ECMO, and all eight survived. Two were treated conventionally, and both died.
The widespread difficulty with accepting results from adaptive clinical trials might seem incomprehensible. But consider that part of what the advent of statistics did for medicine, at the start of the twentieth century, was to transform it from a field in which doctors had to persuade each other in ad hoc ways about every new treatment into one where they had clear guidelines about what sorts of evidence were and were not persuasive. Changes to accepted standard statistical practice have the potential to upset this balance, at least temporarily.
In general, it seems that people tend to over-explore—to favor the new disproportionately over the best.
The standard multi-armed bandit problem assumes that the probabilities with which the arms pay off are fixed over time. But that’s not necessarily true of airlines, restaurants, or other contexts in which people have to make repeated choices. If the probabilities of a payoff on the different arms change over time—what has been termed a “restless bandit”—the problem becomes much harder. (So much harder, in fact, that there’s no tractable algorithm for completely solving it, and it’s believed there never will be.) Part of this difficulty is that it is no longer simply a matter of exploring for a while and then exploiting: when the world can change, continuing to explore can be the right choice. It might be worth going back to that disappointing restaurant you haven’t visited for a few years, just in case it’s under new management.
Thinking about children as simply being at the transitory exploration stage of a lifelong algorithm might provide some solace for parents of preschoolers. (Tom has two highly exploratory preschool-age daughters, and hopes they are following an algorithm that has minimal regret.) But it also provides new insights about the rationality of children. Gopnik points out that “if you look at the history of the way that people have thought about children, they have typically argued that children are cognitively deficient in various ways—because if you look at their exploit capacities, they look terrible. They can’t tie their shoes, they’re not good at long-term planning, they’re not good at focused attention. Those are all things that kids are really awful at.” But pressing buttons at random, being very interested in new toys, and jumping quickly from one thing to another are all things that kids are really great at. And those are exactly what they should be doing if their goal is exploration. If you’re a baby, putting every object in the house into your mouth is like studiously pulling all the handles at the casino.
Being sensitive to how much time you have left is exactly what the computer science of the explore/exploit dilemma suggests. We think of the young as stereotypically fickle; the old, stereotypically set in their ways. In fact, both are behaving completely appropriately with respect to their intervals. The deliberate honing of a social network down to the most meaningful relationships is the rational response to having less time to enjoy them.
Perhaps the deepest insight that comes from thinking about later life as a chance to exploit knowledge acquired over decades is this: life should get better over time. What an explorer trades off for knowledge is pleasure. The Gittins index and the Upper Confidence Bound, as we’ve seen, inflate the appeal of lesser-known options beyond what we actually expect, since pleasant surprises can pay off many times over. But at the same time, this means that exploration necessarily leads to being let down on most occasions. Shifting the bulk of one’s attention to one’s favorite things should increase quality of life. And it seems like it does: Carstensen has found that older people are generally more satisfied with their social networks, and often report levels of emotional well-being that are higher than those of younger adults.
When then senator Obama visited Google in 2007, CEO Eric Schmidt jokingly began the Q&A like a job interview, asking him, “What’s the best way to sort a million thirty-two-bit integers?” Without missing a beat, Obama cracked a wry smile and replied, “I think the Bubble Sort would be the wrong way to go.” The crowd of Google engineers erupted in cheers. “He had me at Bubble Sort,” one later recalled. Obama was right to eschew Bubble Sort, an algorithm which has become something of a punching bag for computer science students: it’s simple, it’s intuitive, and it’s extremely inefficient.
The program that John von Neumann wrote in 1945 to demonstrate the power of the stored-program computer took the idea of collating to its beautiful and ultimate conclusion. Sorting two cards is simple: just put the smaller one on top. And given a pair of two-card stacks, both of them sorted, you can easily collate them into an ordered stack of four. Repeating this trick a few times, you’d build bigger and bigger stacks, each one of them already sorted. Soon enough, you could collate yourself a perfectly sorted full deck—with a final climactic merge, like a riffle shuffle’s order-creating twin, producing the desired result. This approach is known today as Mergesort, one of the legendary algorithms in computer science. As a 1997 paper put it, “Mergesort is as important in the history of sorting as sorting in the history of computing.”
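The collating procedure described above, starting from single cards and merging sorted stacks pairwise until one remains, can be sketched as a bottom-up Mergesort. This is an illustrative version, not von Neumann's program; the function names are hypothetical.

```python
def merge(left, right):
    """Collate two already-sorted stacks into one sorted stack."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One stack is exhausted; the rest of the other is already in order.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

def mergesort(cards):
    """Sort by repeatedly merging neighbouring stacks, smallest first."""
    stacks = [[card] for card in cards]  # every single card is sorted
    while len(stacks) > 1:
        merged = [
            merge(stacks[i], stacks[i + 1])
            for i in range(0, len(stacks) - 1, 2)
        ]
        if len(stacks) % 2:  # an odd stack out waits for the next pass
            merged.append(stacks[-1])
        stacks = merged
    return stacks[0] if stacks else []

print(mergesort([7, 2, 9, 4, 1]))  # [1, 2, 4, 7, 9]
```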
In an important sense, the O(n log n) linearithmic time offered by Mergesort is truly the best we can hope to achieve. It’s been proven that if we want to fully sort n items via a series of head-to-head comparisons, there’s just no way to compare them any fewer than O(n log n) times. It’s a fundamental law of the universe, and there are no two ways around it. But this doesn’t, strictly speaking, close the book on sorting. Because sometimes you don’t need a fully ordered set—and sometimes sorting can be done without any item-to-item comparisons at all. These two principles, taken together, allow for rough practical sorts in faster than linearithmic time. This is beautifully demonstrated by an algorithm known as Bucket Sort—of which the Preston Sort Center is a perfect illustration. In Bucket Sort, items are grouped together into a number of sorted categories, with no regard for finer, intracategory sorting; that can come later. (In computer science the term “bucket” simply refers to a chunk of unsorted data, but some of the most powerful real-world uses of Bucket Sort, as at the KCLS, take the name entirely literally.) Here’s the kicker: if you want to group n items into m buckets, the grouping can be done in O(nm) time—that is, the time is simply proportional to the number of items times the number of buckets. And as long as the number of buckets is relatively small compared to the number of items, Big-O notation will round that to O(n), or linear time.
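The grouping step can be sketched as follows. Each item is checked against the bucket limits in turn, which is where the "items times buckets" cost comes from; no item is ever compared with another item. The function name and the sample limits are hypothetical.

```python
def bucket_sort(items, limits):
    """Group items into ordered buckets without sorting inside them.

    limits: ascending upper bounds; bucket i holds items below limits[i]
    that did not fit an earlier bucket. A final bucket catches the rest.
    """
    buckets = [[] for _ in range(len(limits) + 1)]
    for item in items:                      # n items ...
        for i, limit in enumerate(limits):  # ... times at most m checks
            if item < limit:
                buckets[i].append(item)
                break
        else:
            buckets[-1].append(item)
    return buckets

print(bucket_sort([42, 7, 93, 15, 68, 3], limits=[10, 50]))
# [[7, 3], [42, 15], [93, 68]]
```

The buckets come out in the right order relative to each other, while the contents of each bucket stay unsorted until a finer pass is worth the effort.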
Knowing all these sorting algorithms should come in handy next time you decide to alphabetize your bookshelf. Like President Obama, you’ll know not to use Bubble Sort. Instead, a good strategy—ratified by human and machine librarians alike—is to Bucket Sort until you get down to small enough piles that Insertion Sort is reasonable, or to have a Mergesort pizza party.
Much as we bemoan the daily rat race, the fact that it’s a race rather than a fight is a key part of what sets us apart from the monkeys, the chickens—and, for that matter, the rats.
It’s interesting to note that NCAA’s March Madness tournament is consciously designed to mitigate this flaw in its algorithm. The biggest problem in Single Elimination, as we’ve said, would seem to be a scenario where the first team that gets eliminated by the winning team is actually the second-best team overall, yet lands in the (unsorted) bottom half. The NCAA works around this by seeding the teams, so that top-ranked teams cannot meet each other in the early rounds. The seeding process appears to be reliable at least in the most extreme case, as a sixteenth-seeded team has never defeated a first seed in the history of March Madness.
If the fundamental challenge of memory really is one of organization rather than storage, perhaps it should change how we think about the impact of aging on our mental abilities. Recent work by a team of psychologists and linguists led by Michael Ramscar at the University of Tübingen has suggested that what we call “cognitive decline”—lags and retrieval errors—may not be about the search process slowing or deteriorating, but (at least partly) an unavoidable consequence of the amount of information we have to navigate getting bigger and bigger. Regardless of whatever other challenges aging brings, older brains—which must manage a greater store of memories—are literally solving harder computational problems with every passing day. The old can mock the young for their speed: “It’s because you don’t know anything yet!” Ramscar’s
You can force your computer to show your electronic documents in a pile, as well. Computers’ default file-browsing interface makes you click through folders in alphabetical order—but the power of LRU suggests that you should override this, and display your files by “Last Opened” rather than “Name.” What you’re looking for will almost always be at or near the top.
How we spend our days is, of course, how we spend our lives. —ANNIE DILLARD
“We are what we repeatedly do,” you seem to recall Aristotle saying—whether it’s mop the floor, spend more time with family, file taxes on time, learn French.
Though we always manage to find some way to order the things we do in our days, as a rule we don’t consider ourselves particularly good at it—hence the perennial bestseller status of time-management guides. Unfortunately, the guidance we find in them is frequently divergent and inconsistent. Getting Things Done advocates a policy of immediately doing any task of two minutes or less as soon as it comes to mind. Rival bestseller Eat That Frog! advises beginning with the most difficult task and moving toward easier and easier things. The Now Habit suggests first scheduling one’s social engagements and leisure time and then filling the gaps with work—rather than the other way around, as we so often do. William James, the “father of American psychology,” asserts that “there’s nothing so fatiguing as the eternal hanging on of an uncompleted task,” but Frank Partnoy, in Wait, makes the case for deliberately not doing things right away.
Intuitively, Johnson’s algorithm works because regardless of how you sequence the loads, there’s going to be some time at the start when the washer is running but not the dryer, and some time at the end when the dryer is running but not the washer. By having the shortest washing times at the start, and the shortest drying times at the end, you maximize the amount of overlap—when the washer and dryer are running simultaneously. Thus you can keep the total amount of time spent doing laundry to the absolute minimum. Johnson’s analysis had yielded scheduling’s first optimal algorithm: start with the lightest wash, end with the smallest hamper.
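The rule above can be sketched as a sort in two halves: loads that wash faster than they dry go first, quickest wash leading; the rest go last, quickest dry at the very end. The load names, the times in minutes and the function names are made up for illustration.

```python
def johnson_order(loads):
    """Order (name, wash_minutes, dry_minutes) loads by Johnson's rule."""
    front = sorted((l for l in loads if l[1] <= l[2]), key=lambda l: l[1])
    back = sorted((l for l in loads if l[1] > l[2]),
                  key=lambda l: l[2], reverse=True)
    return front + back

def total_time(order):
    """Minutes until the last load leaves the dryer."""
    washer_free = dryer_free = 0
    for _, wash, dry in order:
        washer_free += wash  # the washer runs loads back to back
        # The dryer starts once it is free AND the load is washed.
        dryer_free = max(dryer_free, washer_free) + dry
    return dryer_free

loads = [("towels", 40, 60), ("delicates", 20, 30),
         ("jeans", 50, 25), ("sheets", 45, 35)]

best = johnson_order(loads)
print([name for name, _, _ in best])
# ['delicates', 'towels', 'sheets', 'jeans']
print(total_time(loads), total_time(best))  # 190 180
```

In this example 180 minutes cannot be beaten: the washer alone needs 155 minutes, and the last load must still spend at least 25 minutes in the dryer.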
single-machine scheduling literally before we even begin: make your goals explicit. We can’t declare some schedule a winner until we know how we’re keeping score. This is something of a theme in computer science: before you can have a plan, you must first choose a metric. And as it turns out, which metric we pick here will directly affect which scheduling approaches fare best.
Maybe instead we want to minimize the number of foods that spoil. Here a strategy called Moore’s Algorithm gives us our best plan. Moore’s Algorithm says that we start out just like with Earliest Due Date—by scheduling out our produce in order of spoilage date, earliest first, one item at a time. However, as soon as it looks like we won’t get to eating the next item in time, we pause, look back over the meals we’ve already planned, and throw out the biggest item (that is, the one that would take the most days to consume). For instance, that might mean forgoing the watermelon that would take a half dozen servings to eat; not even attempting it will mean getting to everything that follows a lot sooner. We then repeat this pattern, laying out the foods by spoilage date and tossing the largest already scheduled item any time we fall behind. Once everything that remains can be eaten in order of spoilage date without anything spoiling, we’ve got our plan.
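The procedure described above can be sketched with a heap that always knows the biggest item planned so far. The foods, their sizes in days and their spoilage days are invented for the example, and the function name is hypothetical.

```python
import heapq

def moore_plan(foods):
    """Plan (name, days_to_eat, spoils_on_day) items so the fewest spoil.

    Returns (names to eat, in spoilage order; names given up on).
    """
    planned = []  # max-heap on days_to_eat, stored as negative numbers
    elapsed = 0
    skipped = []
    for name, days, spoils in sorted(foods, key=lambda f: f[2]):
        heapq.heappush(planned, (-days, name, spoils))
        elapsed += days
        if elapsed > spoils:
            # Falling behind: throw out the biggest item planned so far.
            neg_days, dropped, _ = heapq.heappop(planned)
            elapsed += neg_days
            skipped.append(dropped)
    eaten = [name for _, name, _ in sorted(planned, key=lambda p: p[2])]
    return eaten, skipped

foods = [("watermelon", 6, 6), ("strawberries", 2, 7),
         ("bananas", 2, 9), ("yogurt", 1, 10)]
print(moore_plan(foods))
# (['strawberries', 'bananas', 'yogurt'], ['watermelon'])
```

Eating the watermelon first would make the strawberries spoil; giving up on it instead leaves room for everything else.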
Putting off work on a major project by attending instead to various trivial matters can likewise be seen as “the hastening of subgoal completion”—which is another way of saying that procrastinators are acting (optimally!) to reduce as quickly as possible the number of outstanding tasks on their minds. It’s not that they have a bad strategy for getting things done; they have a great strategy for the wrong metric.
In fact, the weighted version of Shortest Processing Time is a pretty good candidate for best general-purpose scheduling strategy in the face of uncertainty. It offers a simple prescription for time management: each time a new piece of work comes in, divide its importance by the amount of time it will take to complete. If that figure is higher than for the task you’re currently doing, switch to the new one; otherwise stick with the current task. This algorithm is the closest thing that scheduling theory has to a skeleton key or Swiss Army knife, the optimal strategy not just for one flavor of problem but for many. Under certain assumptions it minimizes not just the sum of weighted completion times, as we might expect, but also the sum of the weights of the late jobs and the sum of the weighted lateness of those jobs.
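The prescription amounts to comparing one ratio per task. A minimal sketch, with made-up tasks, importance scores and hour estimates; the function names are hypothetical.

```python
def density(task):
    """Importance per unit of time for a (name, importance, hours) task."""
    _, importance, hours = task
    return importance / hours

def should_switch(current, incoming):
    """Switch only if the new task has the higher importance-to-time ratio."""
    return density(incoming) > density(current)

def weighted_spt(tasks):
    """Full ordering: highest importance-to-time ratio first."""
    return sorted(tasks, key=density, reverse=True)

report = ("report", 9, 6.0)  # ratio 1.5
email = ("email", 2, 0.5)    # ratio 4.0
bug = ("bug", 10, 2.0)       # ratio 5.0

print(should_switch(current=report, incoming=email))  # True
print(should_switch(current=bug, incoming=email))     # False
print([name for name, _, _ in weighted_spt([report, email, bug])])
# ['bug', 'email', 'report']
```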
Psychologists have shown that for us, the effects of switching tasks can include both delays and errors—at the scale of minutes rather than microseconds. To put that figure in perspective, anyone you interrupt more than a few times an hour is in danger of doing no work at all.
As Peter Zijlstra, one of the head developers on the Linux operating system scheduler, puts it, “The caches are warm for the current workload, and when you context switch you pretty much invalidate all caches. And that hurts.” At the extreme, a program may run just long enough to swap its needed items into memory, before giving way to another program that runs just long enough to overwrite them in turn. This is thrashing: a system running full-tilt and accomplishing nothing at all. Denning first diagnosed this phenomenon in a memory-management context, but computer scientists now use the term “thrashing” to refer to pretty much any situation where the system grinds to a halt because it’s entirely preoccupied with metawork. A thrashing computer’s performance doesn’t bog down gradually. It falls off a cliff. “Real work” has dropped to effectively zero, which also means it’s going to be nearly impossible to get out.
Thrashing is a very recognizable human state. If you’ve ever had a moment where you wanted to stop doing everything just to have the chance to write down everything you were supposed to be doing, but couldn’t spare the time, you’ve thrashed.
In these cases there’s clearly no way to work any harder, but you can work…dumber. Along with considerations of memory, one of the biggest sources of metawork in switching contexts is the very act of choosing what to do next. This, too, can at times swamp the actual doing of the work. Faced with, say, an overflowing inbox of n messages, we know from sorting theory that repeatedly scanning it for the most important one to answer next will take O(n²) operations—n scans of n messages apiece. This means that waking up to an inbox that’s three times as full as usual could take you nine times as long to process. What’s more, scanning through those emails means swapping every message into your mind, one after another, before you respond to any of them: a surefire recipe for memory thrashing. In a thrashing state, you’re making essentially no progress, so even doing tasks in the wrong order is better than doing nothing at all. Instead of answering the most important emails first—which requires an assessment of the whole picture that may take longer than the work itself—maybe you should sidestep that quadratic-time quicksand by just answering the emails in random order, or in whatever order they happen to appear onscreen. Thinking along the same lines, the Linux core team, several years ago, replaced their scheduler with one that was less “smart” about calculating process priorities but more than made up for it by taking less time to calculate them.
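The "three times as full, nine times as long" claim can be checked by counting. The sketch below assumes each scan only covers the messages still unanswered, so the exact count is n(n+1)/2 rather than n times n; the growth is quadratic either way. The function name is hypothetical.

```python
def messages_examined(inbox_size):
    """Total messages looked at when always answering the top one next."""
    examined = 0
    remaining = inbox_size
    while remaining > 0:
        examined += remaining  # one full scan to find the most important
        remaining -= 1         # answer it and remove it from the inbox
    return examined

usual = messages_examined(100)
tripled = messages_examined(300)
print(usual, tripled, round(tripled / usual, 2))  # 5050 45150 8.94
```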
Part of what makes real-time scheduling so complex and interesting is that it is fundamentally a negotiation between two principles that aren’t fully compatible. These two principles are called responsiveness and throughput: how quickly you can respond to things, and how much you can get done overall. Anyone who’s ever worked in an office environment can readily appreciate the tension between these two metrics. It’s part of the reason there are people whose job it is to answer the phone: they are responsive so that others may have throughput.
This is a principle that can be transferred to human lives. The moral is that you should try to stay on a single task as long as possible without decreasing your responsiveness below the minimum acceptable limit. Decide how responsive you need to be—and then, if you want to get things done, be no more responsive than that.
If you find yourself doing a lot of context switching because you’re tackling a heterogeneous collection of short tasks, you can also employ another idea from computer science: “interrupt coalescing.” If you have five credit card bills, for instance, don’t pay them as they arrive; take care of them all in one go when the fifth bill comes. As long as your bills are never due less than thirty-one days after they arrive, you can designate, say, the first of each month as “bill-paying day,” and sit down at that point to process every bill on your desk, no matter whether it came three weeks or three hours ago. Likewise, if none of your email correspondents require you to respond in less than twenty-four hours, you can limit yourself to checking your messages once a day. Computers themselves do something like this: they wait until some fixed interval and check everything, instead of context-switching to handle separate, uncoordinated interrupts from their various subcomponents.
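A rough sketch of the bill-paying version of interrupt coalescing follows. The function name, the day numbers, and the thirty-day interval are made up for illustration. Each arrival is held until the next scheduled check, so five arrivals cost one context switch instead of five.

```python
def coalesce(arrival_times, interval):
    """Group arrivals by the first scheduled check strictly after each one.

    Returns a dict mapping check time -> list of arrivals handled at that check.
    """
    batches = {}
    for t in arrival_times:
        check = (t // interval + 1) * interval
        batches.setdefault(check, []).append(t)
    return batches


bills = [3, 9, 17, 24, 29]           # hypothetical arrival days
print(coalesce(bills, interval=30))  # {30: [3, 9, 17, 24, 29]}
```

The longest any bill waits here is twenty-seven days, which stays inside the thirty-one-day limit the passage assumes.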
In academia, holding office hours is a way of coalescing interruptions from students. And in the private sector, interrupt coalescing offers a redemptive view of one of the most maligned office rituals: the weekly meeting. Whatever their drawbacks, regularly scheduled meetings are one of our best defenses against the spontaneous interruption and the unplanned context switch.
The mathematical formula that describes this relationship, tying together our previously held ideas and the evidence before our eyes, has come to be known—ironically, as the real heavy lifting was done by Laplace—as Bayes’s Rule. And it gives a remarkably straightforward solution to the problem of how to combine preexisting beliefs with observed evidence: multiply their probabilities together.
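As a minimal sketch of "multiply their probabilities together": the coin example, the hypothesis names, and the numbers below are assumptions chosen for illustration. The one step the slogan leaves out is rescaling the products so they sum to one.

```python
def bayes_update(prior, likelihood):
    """Combine prior beliefs with evidence by multiplying, then rescaling.

    prior:      dict hypothesis -> probability believed beforehand
    likelihood: dict hypothesis -> probability of the evidence if it were true
    """
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


# Hypothetical case: a coin is probably fair, but might be two-headed.
prior = {"fair": 0.9, "two-headed": 0.1}
saw_heads = {"fair": 0.5, "two-headed": 1.0}
posterior = bayes_update(prior, saw_heads)
print(posterior)   # belief in "two-headed" rises from 10% to about 18%
```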
He made the assumption that the moment when he encountered the Berlin Wall wasn’t special—that it was equally likely to be any moment in the wall’s total lifetime. And if any moment was equally likely, then on average his arrival should have come precisely at the halfway point (since it was 50% likely to fall before halfway and 50% likely to fall after). More generally, unless we know better we can expect to have shown up precisely halfway into the duration of any given phenomenon. And if we assume that we’re arriving precisely halfway into something’s duration, the best guess we can make for how long it will last into the future becomes obvious: exactly as long as it’s lasted already. Gott saw the Berlin Wall eight years after it was built, so his best guess was that it would stand for eight years more. (It ended up being twenty.) This straightforward reasoning, which Gott named the Copernican Principle, results in a simple algorithm that can be used to make predictions about all sorts of topics. Without any preconceived expectations, we might use it to obtain predictions for the end of not only the Berlin Wall but any number of other short- and long-lived phenomena. The Copernican Principle predicts that the United States of America will last as a nation until approximately the year 2255, that Google will last until roughly 2032, and that the relationship your friend began a month ago will probably last about another month (maybe tell him not to RSVP to that wedding invitation just yet).
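The Copernican Principle is short enough to write down directly. In this sketch the function name is invented, and the "current year" of 2015 used for Google and the United States is an assumption inferred from the predictions quoted above. The start years are 1961 for the Berlin Wall, 1998 for Google, and 1776 for the United States.

```python
def predicted_end_year(start_year, observed_year):
    """Copernican guess: it will last as long again as it has lasted so far."""
    age_so_far = observed_year - start_year
    return observed_year + age_so_far


print(predicted_end_year(1961, 1969))  # 1977 (Berlin Wall, seen by Gott in 1969)
print(predicted_end_year(1998, 2015))  # 2032 (Google)
print(predicted_end_year(1776, 2015))  # 2254 (United States, close to the quoted 2255)
```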
Recognizing that the Copernican Principle is just Bayes’s Rule with an uninformative prior answers a lot of questions about its validity. The Copernican Principle seems reasonable exactly in those situations where we know nothing at all—such as looking at the Berlin Wall in 1969, when we’re not even sure what timescale is appropriate. And it feels completely wrong in those cases where we do know something about the subject matter. Predicting that a 90-year-old man will live to 180 years seems unreasonable precisely because we go into the problem already knowing a lot about human life spans—and so we can do better. The richer the prior information we bring to Bayes’s Rule, the more useful the predictions we can get out of it.
There are a number of things in the world that don’t look normally distributed, however—not by a long shot. The average population of a town in the United States, for instance, is 8,226. But if you were to make a graph of the number of towns by population, you wouldn’t see anything remotely like a bell curve. There would be way more towns smaller than 8,226 than larger. At the same time, the larger ones would be way bigger than the average. This kind of pattern typifies what are called “power-law distributions.” These are also known as “scale-free distributions” because they characterize quantities that can plausibly range over many scales: a town can have tens, hundreds, thousands, tens of thousands, hundreds of thousands, or millions of residents, so we can’t pin down a single value for how big a “normal” town should be. The power-law distribution characterizes a host of phenomena in everyday life that have the same basic quality as town populations: most things below the mean, and a few enormous ones above it. Movie box-office grosses, which can range from four to ten figures, are another example. Most movies don’t make much money at all, but the occasional Titanic makes…well, titanic amounts.
It’s often lamented that “the rich get richer,” and indeed the process of “preferential attachment” is one of the surest ways to produce a power-law distribution. The most popular websites are the most likely to get incoming links; the most followed online celebrities are the ones most likely to gain new fans; the most prestigious firms are the ones most likely to attract new clients; the biggest cities are the ones most likely to draw new residents. In every case, a power-law distribution will result.
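A toy simulation gives a feel for how "the rich get richer" produces lopsided sizes. Everything specific here is an assumption made for illustration: the two founding towns, the 10% chance that a newcomer founds a new town, and the function name. Picking a random existing resident and joining their town makes the chance of choosing a town proportional to its size.

```python
import random


def grow_towns(newcomers, new_town_chance=0.1, seed=0):
    """Simulate newcomers who mostly join towns in proportion to their size."""
    rng = random.Random(seed)
    sizes = [1, 1]     # two founding towns, one resident each
    homes = [0, 1]     # homes[i] is the town index of resident i
    for _ in range(newcomers):
        if rng.random() < new_town_chance:
            sizes.append(1)                  # found a brand-new town
            homes.append(len(sizes) - 1)
        else:
            town = rng.choice(homes)         # a random resident favors big towns
            sizes[town] += 1
            homes.append(town)
    return sizes


sizes = grow_towns(10_000)
mean = sum(sizes) / len(sizes)
median = sorted(sizes)[len(sizes) // 2]
print(len(sizes), mean, median, max(sizes))
```

In runs of this sketch the median town should sit well below the mean while the largest town towers over it, the same shape the passage describes for real town populations.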
Examining the Copernican Principle, we saw that when Bayes’s Rule is given an uninformative prior, it always predicts that the total life span of an object will be exactly double its current age. In fact, the uninformative prior, with its wildly varying possible scales—the wall that might last for months or for millennia—is a power-law distribution. And for any power-law distribution, Bayes’s Rule indicates that the appropriate prediction strategy is a Multiplicative Rule: multiply the quantity observed so far by some constant factor. For an uninformative prior, that constant factor happens to be 2, hence the Copernican prediction; in other power-law cases, the multiplier will depend on the exact distribution you’re working with. For the grosses of movies, for instance, it happens to be about 1.4. So if you hear a movie has made $6 million so far, you can guess it will make about $8.4 million overall; if it’s made $90 million, guess it will top out at $126 million.
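The Multiplicative Rule can be sketched in a few lines. The first function rests on two assumptions that the passage does not spell out: that the prior has the form t^(-gamma), and that the "best guess" is the posterior median. Under those assumptions the multiplier works out to 2^(1/gamma), which gives exactly 2 when gamma is 1. The 1.4 for movie grosses is taken from the text rather than derived.

```python
def power_law_multiplier(gamma):
    """Multiplier for a prior proportional to t**(-gamma), predicting the median."""
    return 2 ** (1 / gamma)


def predict_total(observed_so_far, multiplier):
    """Multiplicative Rule: scale what has been seen by a constant factor."""
    return observed_so_far * multiplier


print(power_law_multiplier(1))             # 2.0, the Copernican case
print(round(predict_total(6, 1.4), 1))     # 8.4 (million dollars)
print(round(predict_total(90, 1.4), 1))    # 126.0 (million dollars)
```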
When we apply Bayes’s Rule with a normal distribution as a prior, on the other hand, we obtain a very different kind of guidance. Instead of a multiplicative rule, we get an Average Rule: use the distribution’s “natural” average—its single, specific scale—as your guide. For instance, if somebody is younger than the average life span, then simply predict the average; as their age gets close to and then exceeds the average, predict that they’ll live a few years more. Following this rule gives reasonable predictions for the 90-year-old and the 6-year-old: 94 and 77, respectively. (The 6-year-old gets a tiny edge over the population average of 76 by virtue of having made it through infancy: we know he’s not in the distribution’s left tail.)
Between those two extremes, there’s actually a third category of things in life: those that are neither more nor less likely to end just because they’ve gone on for a while. Sometimes things are simply…invariant.
The Danish mathematician Agner Krarup Erlang, who studied such phenomena, formalized the spread of intervals between independent events into the function that now carries his name: the Erlang distribution. The shape of this curve differs from both the normal and the power-law: it has a wing-like contour, rising to a gentle hump, with a tail that falls off faster than a power-law but more slowly than a normal distribution. Erlang himself, working for the Copenhagen Telephone Company in the early twentieth century, used it to model how much time could be expected to pass between successive calls on a phone network. Since then, the Erlang distribution has also been used by urban planners and architects to model car and pedestrian traffic, and by networking engineers designing infrastructure for the Internet. There are a number of domains in the natural world, too, where events are completely independent from one another and the intervals between them thus fall on an Erlang curve. Radioactive decay is one example, which means that the Erlang distribution perfectly models when to expect the next ticks of a Geiger counter. It also turns out to do a pretty good job of describing certain human endeavors—such as the amount of time politicians stay in the House of Representatives.
The Erlang distribution gives us a third kind of prediction rule, the Additive Rule: always predict that things will go…
If a casino card-playing enthusiast tells his impatient spouse, for example, that he’ll quit for the day after hitting one more blackjack (the odds of which are about 20 to 1), he might cheerily predict, “I’ll be done in about twenty more hands!” If, an unlucky twenty hands later, she returns, asking how long he’s going to make her wait now, his answer will be unchanged: “I’ll be done in about twenty more hands!” It sounds like our indefatigable card shark has suffered a short-term memory loss—but, in fact, his prediction is entirely correct.…
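The card player's claim can be tested by simulation. This sketch assumes each hand is an independent draw with a 1-in-21 chance of blackjack (one reading of "about 20 to 1"). The function names, trial count, and seed are arbitrary. It compares the average wait from a fresh start with the average remaining wait among players who have already sat through twenty losing hands.

```python
import random


def hands_until_blackjack(rng, chance):
    """Deal independent hands until the first blackjack; return how many it took."""
    hands = 1
    while rng.random() >= chance:
        hands += 1
    return hands


def average_remaining(already_played, chance=1 / 21, trials=20_000, seed=1):
    """Average further hands needed, given `already_played` hands without a win."""
    rng = random.Random(seed)
    remaining = []
    while len(remaining) < trials:
        total = hands_until_blackjack(rng, chance)
        if total > already_played:           # keep only the still-waiting players
            remaining.append(total - already_played)
    return sum(remaining) / len(remaining)


print(average_remaining(0))    # close to 21
print(average_remaining(20))   # also close to 21
```

Both averages should land near the same value: having waited tells the player nothing about how much longer the wait will be.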
Failing the marshmallow test—and being less successful in later life—may not be about lacking willpower. It could be a result of believing that adults are not dependable: that they can’t be trusted to keep their word, that they disappear for intervals of arbitrary length. Learning self-control is important, but it’s equally important to grow up in an environment where adults are consistently present and trustworthy.
There’s a curious tension, then, between communicating with others and maintaining accurate priors about the world. When people talk about what interests them—and offer stories they think their listeners will find interesting—it skews the statistics of our experience. That makes it hard to maintain appropriate prior distributions. And the challenge has only increased with the development of the printing press, the nightly news, and social media—innovations that allow our species to spread language mechanically.
Consider how many times you’ve seen either a crashed plane or a crashed car. It’s entirely possible you’ve seen roughly as many of each—yet many of those cars were on the road next to you, whereas the planes were probably on another continent, transmitted to you via the Internet or television. In the United States, for instance, the total number of people who have lost their lives in commercial plane crashes since the year 2000 would not be enough to fill Carnegie Hall even half full. In contrast, the number of people in the United States killed in car accidents over that same time is greater than the entire population of Wyoming.
Simply put, the representation of events in the media does not track their frequency in the world. As sociologist Barry Glassner notes, the murder rate in the United States declined by 20% over the course of the 1990s, yet during that time period the presence of gun violence on American news increased by 600%.
If you want to be a good intuitive Bayesian—if you want to naturally make good predictions, without having to think about what kind of prediction rule is appropriate—you need to protect your priors. Counterintuitively, that might mean turning off the news.
By contrast, the leveling off predicted by the two-factor model is the forecast most consistent with what psychologists and economists say about marriage and happiness. (They believe, incidentally, that it simply reflects a return to normalcy—to people’s baseline level of satisfaction with their lives—rather than any displeasure with marriage itself.)
Perhaps nowhere, however, is overfitting as powerful and troublesome as in the world of business. “Incentive structures work,” as Steve Jobs put it. “So you have to be very careful of what you incent people to do, because various incentive structures create all sorts of consequences that you can’t anticipate.” Sam Altman, president of the startup incubator Y Combinator, echoes Jobs’s words of caution: “It really is true that the company will build whatever the CEO decides to measure.”
In fact, it’s incredibly difficult to come up with incentives or measurements that do not have some kind of perverse effect. In the 1950s, Cornell management professor V. F. Ridgway cataloged a host of such “Dysfunctional Consequences of Performance Measurements.” At a job-placement firm, staffers were evaluated on the number of interviews they conducted, which motivated them to run through the meetings as quickly as possible, without spending much time actually helping their clients find jobs. At a federal law enforcement agency, investigators given monthly performance quotas were found to pick easy cases at the end of the month rather than the most urgent ones. And at a factory, focusing on production metrics led supervisors to neglect maintenance and repairs, setting up future catastrophe. Such problems can’t simply be dismissed as a failure to achieve management goals. Rather, they are the opposite: the ruthless and clever optimization of the wrong thing.
Similarly, the FBI was forced to change its training after agents were found reflexively firing two shots and then holstering their weapon—a standard cadence in training—regardless of whether their shots had hit the target and whether there was still a threat. Mistakes like these are known in law enforcement and the military as “training scars,” and they reflect the fact that it’s possible to overfit one’s own preparation.
From a statistics viewpoint, overfitting is a symptom of being too sensitive to the actual data we’ve seen. The solution, then, is straightforward: we must balance our desire to find a good fit against the complexity of the models we use to do so. One way to choose among several competing models is the Occam’s razor principle, which suggests that, all things being equal, the simplest possible hypothesis is probably the correct one. Of course, things are rarely completely equal, so it’s not immediately obvious how to apply something like Occam’s razor in a mathematical context. Grappling with this challenge in the 1960s, Russian mathematician Andrey Tikhonov proposed one answer: introduce an additional term to your calculations that penalizes more complex solutions. If we introduce a complexity penalty, then more complex models need to do not merely a better job but a significantly better job of explaining the data to justify their greater complexity. Computer scientists refer to this principle—using constraints that penalize models for their complexity—as Regularization.
So what do these complexity penalties look like? One algorithm, discovered in 1996 by biostatistician Robert Tibshirani, is called the Lasso and uses as its penalty the total weight of the different factors in the model. By putting this downward pressure on the weights of the factors, the Lasso drives as many of them as possible completely to zero. Only the factors that have a big impact on the results remain in the equation—thus potentially transforming, say,…
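One way to see how a penalty on total weight pushes factors to exactly zero is the special case where the factors are standardized and uncorrelated with one another. That is an assumption made here to keep the sketch short; the general Lasso needs an iterative solver. In that case each ordinary least-squares weight is simply shrunk toward zero by the penalty amount, and any weight smaller than the penalty disappears. The weights and the penalty of 0.5 are invented.

```python
def soft_threshold(weight, penalty):
    """Shrink a weight toward zero by `penalty`; small weights become exactly 0."""
    if weight > penalty:
        return weight - penalty
    if weight < -penalty:
        return weight + penalty
    return 0.0


unpenalized = [3.0, -0.2, 0.05, -1.5, 0.4]    # hypothetical fitted weights
shrunk = [soft_threshold(w, 0.5) for w in unpenalized]
print(shrunk)   # [2.5, 0.0, 0.0, -1.0, 0.0]
```

Three of the five factors drop out entirely, leaving only the two with a big impact on the result.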
Techniques like the Lasso are now ubiquitous in machine learning, but the same kind of principle—a penalty for complexity—also appears in nature. Living organisms get a certain push toward simplicity almost automatically, thanks to the constraints of time, memory, energy, and attention. The burden of metabolism, for instance, acts as a brake on the complexity of organisms, introducing a caloric penalty for overly elaborate machinery. The fact that the human brain burns about a fifth of humans’ total daily caloric intake is a testament to the evolutionary advantages that our intellectual abilities provide us with: the brain’s contributions must somehow more than pay for that sizable fuel bill. On the other hand, we can also…
The same kind of process is also believed to play a role at the neural level. In computer science, software models based on the brain, known as “artificial neural networks,” can learn arbitrarily complex functions—they’re even more flexible than our nine-factor model above—but precisely because of this very flexibility they are notoriously vulnerable to overfitting. Actual, biological neural networks sidestep some of this problem because they need to trade off their performance against the costs of maintaining it. Neuroscientists have suggested, for instance, that brains try…
The economist Harry Markowitz won the 1990 Nobel Prize in Economics for developing modern portfolio theory: his groundbreaking “mean-variance portfolio optimization” showed how an investor could make an optimal allocation among various funds and assets to maximize returns at a given level of risk. So when it came time to invest his own retirement savings, it seems like Markowitz should have been the one person perfectly equipped for the job. What did he decide to do? “I should have computed the historical covariances of the asset classes and drawn an efficient frontier. Instead, I visualized my grief if the stock market went way up and I wasn’t in it—or if it went way down and I was completely in it. My intention was to minimize my future regret. So I split my contributions fifty-fifty between bonds and equities.” Why in the world would he do that? The story of the Nobel Prize winner and his investment strategy could be presented as an example of human…
When it comes to portfolio management, it turns out that unless you’re highly confident in the information you have about the markets, you may actually be better off ignoring that information altogether. Applying Markowitz’s optimal portfolio allocation scheme requires having good estimates of the statistical properties of different investments. An error in those estimates can result in very different asset allocations, potentially increasing risk. In contrast, splitting your money evenly across stocks and bonds is not affected at all by what data you’ve…
Many prediction algorithms, for instance, start out by searching for the single most important factor rather than jumping to a multi-factor model. Only after finding that first factor do they look for the next most important factor to add to the model, then the next, and so on. Their models can therefore be kept from becoming overly complex simply by stopping the process short, before overfitting has had a chance to creep in. A related approach to calculating predictions considers one data point at a time, with the model tweaked to account for each new point before more points are added; there, too, the complexity of the model increases gradually, so stopping the process short can help keep it from overfitting.
As with all issues involving overfitting, how early to stop depends on the gap between what you can measure and what really matters. If you have all the facts, they’re free of all error and uncertainty, and you can directly assess whatever is important to you, then don’t stop early. Think long and hard: the complexity and effort are appropriate. But that’s almost never the case. If you have high uncertainty and limited data, then do stop early by all means. If you don’t have a clear read on how your work will be evaluated, and by whom, then it’s not worth the extra time to make it perfect with respect to your own (or anyone else’s) idiosyncratic guess at what perfection might be. The greater the uncertainty, the bigger the gap between what you can measure and what matters, the more you should watch out for overfitting—that is, the more you should prefer simplicity, and the earlier you should stop.
They asserted what’s now known as the Cobham–Edmonds thesis: an algorithm should be considered “efficient” if it runs in what’s called “polynomial time”—that is, O(n²), O(n³), or in fact n to the power of any number at all. A problem, in turn, is considered “tractable” if we know how to solve it using an efficient algorithm. A problem we don’t know how to solve in polynomial time, on the other hand, is considered “intractable.” And at anything but the smallest scales, intractable problems are beyond the reach of solution by computers, no matter how powerful.
When computer scientists are up against a formidable challenge, their minds also turn to relaxation, as they pass around books like An Introduction to Relaxation Methods or Discrete Relaxation Techniques. But they don’t relax themselves; they relax the problem. One of the simplest forms of relaxation in computer science is known as Constraint Relaxation. In this technique, researchers remove some of the problem’s constraints and set about solving the problem they wish they had. Then, after they’ve made a certain amount of headway, they try to add the constraints back in. That is, they make the problem temporarily easier to handle before bringing it back to reality.
The traveling salesman problem, like Meghan Bellows’s search for the best seating arrangement, is a particular kind of optimization problem known as “discrete optimization”—that is, there’s no smooth continuum among its solutions. The salesman goes either to this town or to that one; you’re either at table five or at table six. There are no shades of gray in between.
As we noted, discrete optimization’s commitment to whole numbers—a fire department can have one engine in the garage, or two, or three, but not two and a half fire trucks, or π of them—is what makes discrete optimization problems so hard to solve. In fact, both the fire truck problem and the party invitation problem are intractable: no general efficient solution for them exists. But, as it turns out, there do exist a number of efficient strategies for solving the continuous versions of these problems, where any fraction or decimal is a possible solution.
One day as a child, Brian was complaining to his mother about all the things he had to do: his homework, his chores….“Technically, you don’t have to do anything,” his mother replied. “You don’t have to do what your teachers tell you. You don’t have to do what I tell you. You don’t even have to obey the law. There are consequences to everything, and you get to decide whether you want to face those consequences.” Brian’s kid-mind was blown. It was a powerful message, an awakening of a sense of agency, responsibility, moral judgment. It was something else, too: a powerful computational technique called Lagrangian Relaxation. The idea behind Lagrangian Relaxation is simple. An optimization problem has two parts: the rules and the scorekeeping. In Lagrangian Relaxation, we take some of the problem’s constraints and bake them into the scoring system instead. That is, we take the impossible and downgrade it to costly. (In a wedding seating optimization, for instance, we might relax the constraint that tables each hold ten people max, allowing overfull tables but with some kind of elbow-room penalty.) When an optimization problem’s constraints say “Do it, or else!,” Lagrangian Relaxation replies, “Or else what?” Once we can color outside the lines—even just a little bit, and even at a steep cost—problems become tractable that weren’t tractable before.
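Here is a toy version of the seating example, offered as a sketch rather than a real solver. The guests, the friendships, the three-seat tables, and the penalty values are all invented. The capacity rule no longer forbids anything; it simply subtracts an elbow-room penalty for every guest over the limit.

```python
from itertools import combinations


def seating_score(tables, friends, capacity, elbow_penalty):
    """Friend pairs seated together, minus a cost for each seat over capacity."""
    happy_pairs = sum(
        1
        for table in tables
        for pair in combinations(sorted(table), 2)
        if frozenset(pair) in friends
    )
    overflow = sum(max(0, len(table) - capacity) for table in tables)
    return happy_pairs - elbow_penalty * overflow


# A, B, C, D are all mutual friends; E and F are friends with each other.
friends = {frozenset(p) for p in combinations("ABCD", 2)} | {frozenset("EF")}
by_the_rules = [["A", "B", "C"], ["D", "E", "F"]]
bent = [["A", "B", "C", "D"], ["E", "F"]]

print(seating_score(by_the_rules, friends, capacity=3, elbow_penalty=2))  # 4
print(seating_score(bent, friends, capacity=3, elbow_penalty=2))          # 5
print(seating_score(bent, friends, capacity=3, elbow_penalty=4))          # 3
```

With a mild penalty the overfull table wins; with a steep one the rule-abiding plan does. The answer to "Or else what?" decides the seating.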
There are many ways to relax a problem, and we’ve seen three of the most important. The first, Constraint Relaxation, simply removes some constraints altogether and makes progress on a looser form of the problem before coming back to reality. The second, Continuous Relaxation, turns discrete or binary choices into continua: when deciding between iced tea and lemonade, first imagine a 50–50 “Arnold Palmer” blend and then round it up or down. The third, Lagrangian Relaxation, turns impossibilities into mere penalties, teaching the art of bending the rules (or breaking them and accepting the consequences). A rock band deciding which songs to cram into a limited set, for instance, is up against what computer scientists call the “knapsack problem”—a puzzle that asks one to decide which of a set of items of different bulk and importance to pack into a confined volume. In its strict formulation the knapsack problem is famously intractable, but that needn’t discourage our relaxed rock stars. As demonstrated in several celebrated examples, sometimes it’s better to simply play a bit past the city curfew and incur the related fines than to limit the show to the available slot. In fact, even when you don’t commit the infraction, simply imagining it can be illuminating.
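Continuous Relaxation can be sketched on the rock band's set list. The songs, their lengths, and their "appeal" scores are invented. The relaxed problem pretends a song can be played fractionally, which makes it easy: take songs in order of appeal per minute until the time runs out. Rounding down then drops any partial song.

```python
def relaxed_set_list(songs, minutes_available):
    """Solve the knapsack as if fractions of a song were allowed.

    songs: list of (name, minutes, appeal). Returns [(name, fraction played)].
    """
    plan, remaining = [], minutes_available
    by_appeal_per_minute = sorted(songs, key=lambda s: s[2] / s[1], reverse=True)
    for name, minutes, appeal in by_appeal_per_minute:
        if remaining <= 0:
            break
        fraction = min(1.0, remaining / minutes)
        plan.append((name, fraction))
        remaining -= fraction * minutes
    return plan


def round_down(plan):
    """Return to reality: keep only the songs that fit in full."""
    return [name for name, fraction in plan if fraction == 1.0]


songs = [("Opener", 4, 8), ("Ballad", 6, 9), ("Epic", 10, 12), ("Cover", 3, 3)]
plan = relaxed_set_list(songs, minutes_available=15)
print(plan)              # [('Opener', 1.0), ('Ballad', 1.0), ('Epic', 0.5)]
print(round_down(plan))  # ['Opener', 'Ballad']
```

The rounded list is a workable starting point, not a guaranteed optimum: in this example the three-minute cover would still fit in the five minutes left over.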
It may look strange, given that O(n²) seemed so odious in the sorting context, to call it “efficient” here. But the truth is that even exponential time with an unassumingly small base number, like O(2ⁿ), quickly gets hellish even when compared to a polynomial with a large exponent, like n¹⁰. The exponential will always overtake the polynomial at some problem size—in this case, if you’re sorting more than several dozen items, n¹⁰ starts to look like a walk in the park compared to 2ⁿ. Ever since Cobham and Edmonds’s work, this chasm between “polynomials” (n-to-the-something) and “exponentials” (something-to-the-n) has served as the field’s de facto out-of-bounds marker.
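The crossover point is easy to find by brute force. The function name is an invention for this sketch, and the search starts at 2 to skip the trivial case of a single item, where the exponential is already the larger of the two.

```python
def first_size_where_exponential_wins(power=10, base=2, start=2):
    """Smallest n >= start at which base**n exceeds n**power."""
    n = start
    while base ** n <= n ** power:
        n += 1
    return n


print(first_size_where_exponential_wins())   # 59
```

At 58 items the tenth-power polynomial is still the larger of the two. From 59 onward the exponential pulls ahead and never looks back.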
In contrast to the standard “deterministic” algorithms we typically imagine computers using, where one step follows from another in exactly the same way every time, a randomized algorithm uses randomly generated numbers to solve a problem. Recent work in computer science has shown that there are cases where randomized algorithms can produce good approximate answers to difficult questions faster than all known deterministic algorithms. And while they do not always guarantee the optimal solutions, randomized algorithms can get surprisingly close to them in a fraction of the time, just by strategically flipping a few coins while their deterministic cousins sweat it out.
F. Scott Fitzgerald once wrote that “the test of a first-rate intelligence is the ability to hold two opposing ideas in mind at the same time and still retain the ability to function.”
Ulam’s insight—that sampling can succeed where analysis fails—was also crucial to solving some of the difficult nuclear physics problems that arose at Los Alamos. A nuclear reaction is a branching process, where possibilities multiply just as wildly as they do in cards: one particle splits in two, each of which may go on to strike others, causing them to split in turn, and so on. Exactly calculating the chances of some particular outcome of that process, with many, many particles interacting, is hard to the point of impossibility. But simulating it, with each interaction being like turning over a new card, provides an alternative.
Ulam developed the idea further with John von Neumann, and worked with Nicholas Metropolis, another of the physicists from the Manhattan Project, on implementing the method on the Los Alamos computer. Metropolis named this approach—replacing exhaustive probability calculations with sample simulations—the Monte Carlo Method, after the Monte Carlo casino in Monaco, a place equally dependent on the vagaries of chance. The Los Alamos team was able to use it to solve key problems in nuclear physics. Today the Monte Carlo Method is one of the cornerstones of scientific computing.
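The flavor of the method shows up even in a toy branching process, which is emphatically not real nuclear physics. In this sketch each particle either splits in two (with an assumed 60% chance) or is absorbed, and a chain that reaches 200 particles is treated as having run away. The function names, the cutoff, and the trial count are arbitrary choices. Instead of calculating the chance that a chain fizzles out, the code runs many chains and counts.

```python
import random


def chain_dies_out(rng, split_chance, runaway=200):
    """Follow one chain until it either dies or grows past `runaway` particles."""
    particles = 1
    while 0 < particles < runaway:
        # Each particle independently splits in two or is absorbed.
        particles = sum(2 for _ in range(particles) if rng.random() < split_chance)
    return particles == 0


def estimate_fizzle_chance(split_chance=0.6, trials=5_000, seed=0):
    """Monte Carlo estimate: the fraction of simulated chains that die out."""
    rng = random.Random(seed)
    fizzles = sum(chain_dies_out(rng, split_chance) for _ in range(trials))
    return fizzles / trials


print(estimate_fizzle_chance())   # close to 2/3
```

This toy happens to have an exact answer, (1 - 0.6) / 0.6 = 2/3, which makes it a handy check on the simulation. The problems that motivated the method had no such shortcut.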
For millennia, the study of prime numbers was believed to be, as G. H. Hardy put it, “one of the most obviously useless branches” of mathematics. But it lurched into practicality in the twentieth century, becoming pivotal in cryptography and online security. As it happens, it is much easier to multiply primes together than to factor them back out. With big enough primes—say, a thousand digits—the multiplication can be done in a fraction of a second while the factoring could take literally millions of years; this makes for what is known as a “one-way function.” In modern encryption, for instance, secret primes known only to the sender and recipient get multiplied together to create huge composite numbers that can be transmitted publicly without fear, since factoring the product would take any eavesdropper way too long to be worth attempting. Thus virtually all secure communication online—be it commerce, banking, or email—begins with a hunt for prime numbers.
Though you may have never heard of the Miller-Rabin test, your laptop, tablet, and phone know it well. Several decades after its discovery, it is still the standard method used to find and check primes in many domains. It’s working behind the scenes whenever you use your credit card online, and almost any time secure communications are sent through the air or over wires.
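For readers who want to see the shape of the test, here is a minimal sketch of the textbook Miller-Rabin procedure. The function name, the default of 20 rounds, and the list of small primes are choices made for this example; production cryptographic libraries use their own tuned variants.

```python
import random


def is_probable_prime(n, rounds=20, rng=None):
    """Return False if n is certainly composite, True if n is probably prime."""
    rng = rng or random.Random()
    if n < 2:
        return False
    # Quick exit for small primes and their multiples.
    for small in (2, 3, 5, 7, 11, 13):
        if n % small == 0:
            return n == small
    # Write n - 1 as d * 2**r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = rng.randrange(2, n - 1)  # random candidate witness
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue  # this a reveals nothing; try another
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a is a witness: n is definitely composite
    return True
```

Each round that fails to find a witness makes a composite less likely to slip through, so the number of rounds sets how much uncertainty you are willing to accept.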
And for some other problems, randomness still provides the only known route to efficient solutions. One curious example from mathematics is known as “polynomial identity testing.” If you have two polynomial expressions, such as 2x³ + 13x² + 22x + 8 and (2x + 1) × (x + 2) × (x + 4), working out whether those expressions are in fact the same function—by doing all the multiplication, then comparing the results—can be incredibly time-consuming, especially as the number of variables increases. Here again randomness offers a way forward: just generate some random xs and plug them in. If the two expressions are not the same, it would be a big coincidence if they gave the same answer for some randomly generated input. And an even bigger coincidence if they also gave identical answers for a second random input. And a bigger coincidence still if they did it for three random inputs in a row. Since there is no known deterministic algorithm for efficiently testing polynomial identity, this randomized method—with multiple observations quickly giving rise to near-certainty—is the only practical one we have.
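The plug-in-random-numbers idea takes only a few lines. This sketch uses the two expressions from the passage; the helper names, the number of trials, and the range of random inputs are assumptions made for the example.

```python
import random


def expanded(x):
    return 2 * x**3 + 13 * x**2 + 22 * x + 8


def factored(x):
    return (2 * x + 1) * (x + 2) * (x + 4)


def probably_identical(f, g, trials=10, bound=10**9, seed=None):
    """Compare f and g at random integer inputs.

    A single mismatch proves the two are different functions; agreement on
    every trial only makes it very likely that they are the same.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(-bound, bound)
        if f(x) != g(x):
            return False
    return True


if __name__ == "__main__":
    print(probably_identical(expanded, factored))
```

Python integers do not overflow, so the comparison at each sampled point is exact.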
When we need to make sense of, say, national health care reform—a vast apparatus too complex to be readily understood—our political leaders typically offer us two things: cherry-picked personal anecdotes and aggregate summary statistics. The anecdotes, of course, are rich and vivid, but they’re unrepresentative. Almost any piece of legislation, no matter how enlightened or misguided, will leave someone better off and someone worse off, so carefully selected stories don’t offer any perspective on broader patterns. Aggregate statistics, on the other hand, are the reverse: comprehensive but thin. We might learn, for instance, whether average premiums fell nationwide, but not how that change works out on a more granular level: they might go down for most but, Omelas-style, leave some specific group—undergraduates, or Alaskans, or pregnant women—in dire straits. A statistic can only tell us part of the story, obscuring any underlying heterogeneity. And often we don’t even know which statistic we need.
GiveDirectly, which distributes unconditional cash transfers to people living in extreme poverty in Kenya and Uganda. It has attracted attention for rethinking conventional charity practices on a number of levels: not only in its unusual mission, but in the level of transparency and accountability it brings to its own process. And the latest element of the status quo that it’s challenging is success stories. “If you regularly check our website, blog, or Facebook page,” writes program assistant Rebecca Lange, “you may have noticed something you don’t often see: stories and photos of our recipients.” The problem isn’t that the glowing stories proffered by other charities aren’t true. Rather, the very fact that they were deliberately chosen to showcase successes makes it unclear how much information can be gleaned from them. So GiveDirectly decided to put a twist on this conventional practice as well. Every Wednesday, the GiveDirectly team selects a cash recipient at random, sends out a field officer to interview them, and publishes the field officer’s notes verbatim, no matter what.
Time and space are at the root of the most familiar tradeoffs in computer science, but recent work on randomized algorithms shows that there’s also another variable to consider: certainty. As Harvard’s Michael Mitzenmacher puts it, “What we’re going to do is come up with an answer which saves you in time and space and trades off this third dimension: error probability.”
But what if we only needed to be mostly sure this URL was new to us? That’s where the Bloom filter comes in. Named for its inventor, Burton H. Bloom, a Bloom filter works much like the Miller-Rabin primality test: the URL is entered into a set of equations that essentially check for “witnesses” to its novelty. (Rather than proclaim “n is not prime,” these equations say “I have not seen n before.”) If you’re willing to tolerate an error rate of just 1% or 2%, storing your findings in a probabilistic data structure like a Bloom filter will save you significant amounts of both time and space. And the usefulness of such filters is not confined to search engines: Bloom filters have shipped with a number of recent web browsers to check URLs against a list of known malicious websites, and they are also an important part of cryptocurrencies like Bitcoin.
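A small Bloom filter shows where the savings and the error come from. This is a sketch under assumptions: the class name, the bit-array size, the number of hash positions, and the use of salted SHA-256 digests are all choices made for illustration, not a description of any particular search engine or browser.

```python
import hashlib


class BloomFilter:
    """A set-like structure that can say "definitely new" or "probably seen"."""

    def __init__(self, num_bits=8192, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different salts.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # Any unset bit is a "witness" that the item was never added.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


if __name__ == "__main__":
    seen = BloomFilter()
    seen.add("http://example.com/a")
    print(seen.might_contain("http://example.com/a"))
    print(seen.might_contain("http://example.com/b"))
```

The filter never forgets a URL it has stored, so the only possible mistake is calling a new URL "probably seen." Storing about a kilobyte of bits instead of the URLs themselves is the trade.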
But there’s also a third approach: instead of turning to full-bore randomness when you’re stuck, use a little bit of randomness every time you make a decision. This technique, developed by the same Los Alamos team that came up with the Monte Carlo Method, is called the Metropolis Algorithm. The Metropolis Algorithm is like Hill Climbing, trying out different small-scale tweaks on a solution, but with one important difference: at any given point, it will potentially accept bad tweaks as well as good ones.
Taking the ten-city vacation problem from above, we could start at a “high temperature” by picking our starting itinerary entirely at random, plucking one out of the whole space of possible solutions regardless of price. Then we can start to slowly “cool down” our search by rolling a die whenever we are considering a tweak to the city sequence. Taking a superior variation always makes sense, but we would only take inferior ones when the die shows, say, a 2 or more. After a while, we’d cool it further by only taking a higher-price change if the die shows a 3 or greater—then 4, then 5. Eventually we’d be mostly hill climbing, making the inferior move just occasionally when the die shows a 6. Finally we’d start going only uphill, and stop when we reached the next local max. This approach, called Simulated Annealing, seemed like an intriguing way to map physics onto problem solving. But would it work?
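The cooling schedule can be sketched in code. Note what is swapped in here: instead of the die rolls described above, the sketch uses the usual temperature-based acceptance rule, where a worse itinerary is accepted with probability exp(-delta / temperature). Straight-line distance stands in for price, and the city coordinates, step count, and temperatures are all made-up values for illustration.

```python
import math
import random


def tour_length(tour, cities):
    """Total length of the round trip that visits cities in the given order."""
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))


def anneal(cities, steps=20000, start_temp=1.0, end_temp=1e-3, seed=0):
    rng = random.Random(seed)
    tour = list(range(len(cities)))
    rng.shuffle(tour)  # "high temperature": start from a random itinerary
    cost = tour_length(tour, cities)
    best_tour, best_cost = tour[:], cost
    for step in range(steps):
        # Cool geometrically from start_temp toward end_temp.
        temp = start_temp * (end_temp / start_temp) ** (step / steps)
        # Small tweak: reverse one stretch of the itinerary.
        i, j = sorted(rng.sample(range(len(tour)), 2))
        candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
        delta = tour_length(candidate, cities) - cost
        # Always take improvements; take worse tours less often as it cools.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            tour, cost = candidate, cost + delta
            if cost < best_cost:
                best_tour, best_cost = tour[:], cost
    return best_tour, best_cost


if __name__ == "__main__":
    rng = random.Random(1)
    cities = [(rng.random(), rng.random()) for _ in range(10)]
    print(anneal(cities))
```

The function also remembers the best itinerary seen so far, a common practical addition that is not part of the description above.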
But any distrust regarding the analogy-based approach would soon vanish: at IBM, Kirkpatrick and Gelatt’s simulated annealing algorithms started making better chip layouts than the guru. Rather than keep mum about their secret weapon and become cryptic guru figures themselves, they published their method in a paper in Science, opening it up to others. Over the next few decades, that paper would be cited a whopping thirty-two thousand times. To this day, simulated annealing remains one of the most promising approaches to optimization problems known to the field.
You might worry that making every decision by flipping a coin could lead to trouble, not least with your boss, friends, and family. And it’s true that mainlining randomness into your life is not necessarily a recipe for success. The cult classic 1971 novel The Dice Man by Luke Rhinehart (real name: George Cockcroft) provides a cautionary tale. Its narrator, a man who replaces decision-making with dice rolling, quickly ends up in situations that most of us would probably like to avoid. But perhaps it’s just a case of a little knowledge being a dangerous thing. If the Dice Man had only had a deeper grasp of computer science, he’d have had some guidance. First, from Hill Climbing: even if you’re in the habit of sometimes acting on bad ideas, you should always act on good ones. Second, from the Metropolis Algorithm: your likelihood of following a bad idea should be inversely proportional to how bad an idea it is. Third, from Simulated Annealing: you should front-load randomness, rapidly cooling out of a totally random state, using ever less and less randomness as time goes on, lingering longest as you approach freezing. Temper yourself—literally.
Computer scientists know this concept as the “Byzantine generals problem.” Imagine two generals, on opposite sides of a valley that contains their common enemy, attempting to coordinate an attack. Only by perfect synchronization will they succeed; for either to attack alone is suicide. What’s worse, any messages from one general to the other must be delivered by hand across the very terrain that contains the enemy, meaning there’s a chance that any given message will never arrive. The first general, say, suggests a time for the attack, but won’t dare go for it unless he knows for sure that his comrade is moving, too. The second general receives the orders and sends back a confirmation—but won’t dare attack unless he knows that the first general received that confirmation (since otherwise the first general won’t be going). The first general receives the confirmation—but won’t attack until he’s certain that the second general knows he did. Following this chain of logic requires an infinite series of messages, and obviously that won’t do. Communication is one of those delightful things that work only in practice; in theory it’s impossible.
In TCP, a failure generally leads to retransmission rather than death, so it’s considered enough for a session to begin with what’s called a “triple handshake.” The visitor says hello, the server acknowledges the hello and says hello back, the visitor acknowledges that, and if the server receives this third message, then no further confirmation is needed and they’re off to the races. Even after this initial connection is made, however, there’s still a risk that some later packets may be damaged or lost in transit, or arrive out of order. In the postal mail, package delivery can be confirmed via return receipts; online, packet delivery is confirmed by what are called acknowledgment packets, or ACKs. These are critical to the functioning of the network. The way that ACKs work is both simple and clever. Behind the scenes of the triple handshake, each machine provides the other with a kind of serial number—and it’s understood that every packet sent after that will increment those serial numbers by one each time, like checks in a checkbook. For instance, if your computer initiates contact with a web server, it might send that server, say, the number 100. The ACK sent by the server will in turn specify the serial number at which the server’s own packets will begin (call it 5,000), and also will say “Ready for 101.” Your machine’s ACK will carry the number 101 and will convey in turn “Ready for 5,001.” (Note that these two numbering schemes are totally independent, and the number that begins each sequence is typically chosen at random.) This mechanism offers a ready way to pinpoint when packets have gone astray. If the server is expecting 101 but instead gets 102, it will send an ACK to packet 102 that still says “Ready for 101.” If it next gets packet 103, it will say, again, “Ready for 101.” Three such redundant ACKs in a row would signal to your machine that 101 isn’t just delayed but hopelessly gone, so it will resend that packet. 
At that point, the server (which has kept packets 102 and 103) will send an ACK saying “Ready for 104” to signal that the sequence has been restored. All those acknowledgments can actually add up to a considerable amount of traffic. We think of, say, a large file transfer as a one-way operation, but in fact the recipient is sending hundreds of “control messages” back to the sender.
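The “Ready for 101” bookkeeping can be mimicked with a toy receiver. This is a deliberately simplified sketch, not real TCP: it counts whole packets rather than bytes, and it reads “three redundant ACKs in a row” as three consecutive ACKs carrying the same number. The class and function names are hypothetical.

```python
class Receiver:
    """Tracks which packet number this side is ready for."""

    def __init__(self, next_expected):
        self.next_expected = next_expected
        self.held = set()  # packets that arrived ahead of a gap

    def receive(self, seq):
        """Accept packet seq and return the "Ready for N" number to send back."""
        if seq == self.next_expected:
            self.next_expected += 1
            # Any held packets that are now in sequence count as delivered.
            while self.next_expected in self.held:
                self.held.remove(self.next_expected)
                self.next_expected += 1
        elif seq > self.next_expected:
            self.held.add(seq)
        return self.next_expected


def packet_to_resend(acks, threshold=3):
    """Return the packet to resend once `threshold` ACKs in a row ask for
    the same number; return None if no such run appears."""
    run_value, run_length = None, 0
    for ack in acks:
        if ack == run_value:
            run_length += 1
        else:
            run_value, run_length = ack, 1
        if run_length >= threshold:
            return run_value
    return None


if __name__ == "__main__":
    server = Receiver(next_expected=101)
    acks = [101]                      # "Ready for 101" from the handshake
    acks.append(server.receive(102))  # 101 went missing; still ready for 101
    acks.append(server.receive(103))
    print(packet_to_resend(acks))     # the sender decides what to resend
    print(server.receive(101))        # the gap is filled
```

Running the example follows the scenario in the text: packets 102 and 103 are held, the repeated ACKs point at 101, and once 101 is resent the receiver jumps straight to being ready for 104.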
Ironically, one of the few exceptions to this is in transmitting the human voice. Real-time voice communications, such as Skype, typically do not use TCP, which underlies most of the rest of the Internet. As researchers discovered in the early days of networking, using reliable, robust protocols—with all their ACKs and retransmission of lost packets—to transmit the human voice is overkill. The humans provide the robustness themselves. As Cerf explains, “In the case of voice, if you lose a packet, you just say, ‘Say that again, I missed something.’”
The breakthrough turned out to be increasing the average delay after every successive failure—specifically, doubling the potential delay before trying to transmit again. So after an initial failure, a sender would randomly retransmit either one or two turns later; after a second failure, it would try again anywhere from one to four turns later; a third failure in a row would mean waiting somewhere between one and eight turns, and so on. This elegant approach allows the network to accommodate potentially any number of competing signals. Since the maximum delay length (2, 4, 8, 16…) forms an exponential progression, it’s become known as Exponential Backoff. Exponential Backoff was a huge part of the successful functioning of the ALOHAnet beginning in 1971, and in the 1980s it was baked into TCP, becoming a critical part of the Internet. All these decades later, it still is. As one influential paper puts it, “For a transport endpoint embedded in a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations, only one scheme has any hope of working—exponential backoff.”
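The doubling rule is short enough to write out. The sketch below follows the delays described above (one or two turns after the first failure, one to four after the second, and so on); the shared-channel simulation around it, with every sender starting on the same turn and sending a single message, is an invented setup for illustration.

```python
import random


def backoff_delay(failures, rng=random):
    """Turns to wait after `failures` failed attempts in a row: 1 to 2**failures."""
    if failures < 1:
        raise ValueError("failures must be at least 1")
    return rng.randint(1, 2 ** failures)


def simulate(num_senders, seed=0, max_turns=10_000):
    """Each sender has one message; two transmissions on the same turn collide.

    Returns a dict mapping each sender to the turn on which it got through.
    """
    rng = random.Random(seed)
    next_turn = {s: 0 for s in range(num_senders)}  # everyone tries at turn 0
    failures = {s: 0 for s in range(num_senders)}
    finished = {}
    for turn in range(max_turns):
        ready = [s for s, t in next_turn.items() if t == turn]
        if len(ready) == 1:
            # A lone transmission succeeds.
            finished[ready[0]] = turn
            del next_turn[ready[0]]
        else:
            # Collision: each sender involved backs off over a doubled range.
            for s in ready:
                failures[s] += 1
                next_turn[s] = turn + backoff_delay(failures[s], rng)
        if not next_turn:
            break
    return finished


if __name__ == "__main__":
    print(simulate(20))
```

Because the range of possible delays keeps doubling, senders that keep colliding spread themselves out until each finds a quiet turn.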
At the heart of TCP congestion control is an algorithm called Additive Increase, Multiplicative Decrease, or AIMD. Before AIMD kicks in, a new connection will ramp up its transmission rate aggressively: if the first packet is received successfully it sends out two more, if both of those get through it sends out a batch of four, and so on. But as soon as any packet’s ACK does not come back to the sender, the AIMD algorithm takes over. Under AIMD, any fully received batch of packets causes the number of packets in flight not to double but merely to increase by 1, and dropped packets cause the transmission rate to cut back by half (hence the name Additive Increase, Multiplicative Decrease). Essentially, AIMD takes the form of someone saying, “A little more, a little more, a little more, whoa, too much, cut way back, okay a little more, a little more…” Thus it leads to a characteristic bandwidth shape known as the “TCP sawtooth”—steady upward climbs punctuated by steep drops. Why such a sharp, asymmetrical decrease? As Jacobson and Karels explain, the first time AIMD kicks in is when a connection has experienced the first dropped packet in its initial aggressive ramping-up phase. Because that initial phase involved doubling the rate of transmission with every successful volley, cutting the speed back by half as soon as there’s been a problem is entirely appropriate. And once a transmission is in progress, if it starts to falter again that’s likely to be because some new connection is competing for the network. The most conservative assessment of that situation—namely, assuming you were the only person using the network and now there’s a second person taking half the resources—also leads to cutting back by half. Conservatism here is essential: a network can stabilize only if its users pull back at least as fast as the rate at which it is being overloaded. 
For the same reason, a merely additive increase helps stabilize things for everyone, preventing rapid overload-and-recovery cycles.
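The sawtooth falls out of a few lines of simulation. This is a sketch of the rule as described above, not of any real TCP stack: it assumes a fixed `capacity`, treats any batch larger than that as suffering a drop, and ignores timeouts and other details of actual implementations.

```python
def simulate_aimd(capacity, rounds):
    """Return the number of packets in flight during each round."""
    window = 1
    ramping_up = True  # initial doubling phase, before the first drop
    history = []
    for _ in range(rounds):
        history.append(window)
        if window > capacity:
            # A packet was dropped: multiplicative decrease.
            ramping_up = False
            window = max(1, window // 2)
        elif ramping_up:
            window *= 2   # aggressive start: double after each success
        else:
            window += 1   # additive increase
    return history


if __name__ == "__main__":
    print(simulate_aimd(capacity=20, rounds=13))
    # → [1, 2, 4, 8, 16, 32, 16, 17, 18, 19, 20, 21, 10]
```

The output shows the initial doubling, the first halving, and then the slow climb and sharp drop that repeat for as long as the connection lasts.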
The satirical “Peter Principle,” articulated in the 1960s by education professor Laurence J. Peter, states that “every employee tends to rise to his level of incompetence.” The idea is that in a hierarchical organization, anyone doing a job proficiently will be rewarded with a promotion into a new job that may involve more complex and/or different challenges. When the employee finally reaches a role in which they don’t perform well, their march up the ranks will stall, and they will remain in that role for the rest of their career. Thus it stands to reason, goes the ominous logic of the Peter Principle, that eventually every spot in an organization will come to be filled by someone doing that job badly. Some fifty years before Peter’s formulation, Spanish philosopher José Ortega y Gasset in 1910 voiced the same sentiment. “Every public servant should be demoted to the immediately lower rank,” he wrote, “because they were advanced until they became incompetent.” Some organizations have attempted to remediate the Peter Principle by simply firing employees who don’t advance. The so-called Cravath System, devised by leading law firm Cravath, Swaine & Moore, involves hiring almost exclusively recent graduates, placing them into the bottom ranks, and then routinely either promoting or firing them over the following years. In 1980, the US Armed Forces adopted a similar “up or out” policy with the Defense Officer Personnel Management Act. The United Kingdom has likewise pursued what they call “manning control,” to great controversy.
Understanding the exact function and meaning of human backchannels continues to be an active area of research. In 2014, for instance, UC Santa Cruz’s Jackson Tolins and Jean Fox Tree demonstrated that those inconspicuous “uh-huhs” and “yeahs” and “hmms” and “ohs” that pepper our speech perform distinct, precise roles in regulating the flow of information from speaker to listener—both its rate and level of detail. Indeed, they are every bit as critical as ACKs are in TCP. Says Tolins, “Really, while some people may be worse than others, ‘bad storytellers’ can at least partly blame their audience.” This realization has had the unexpected side effect of taking off some of the pressure when he gives lectures—including, of course, lectures about that very result. “Whenever I give these backchannel talks, I always tell the audience that the way they are backchanneling to my talk right now is changing what I say,” he jokes, “so they’re responsible for how well I do.”
The most prevalent critique of modern communications is that we are “always connected.” But the problem isn’t that we’re always connected; we’re not. The problem is that we’re always buffered. The difference is enormous. The feeling that one needs to look at everything on the Internet, or read all possible books, or see all possible shows, is bufferbloat. You miss an episode of your favorite series and watch it an hour, a day, a decade later. You go on vacation and come home to a mountain of correspondence. It used to be that people knocked on your door, got no response, and went away. Now they’re effectively waiting in line when you come home. Heck, email was deliberately designed to overcome Tail Drop. As its inventor, Ray Tomlinson, puts it: “At the time there was no really good way to leave messages for people. The telephone worked up to a point, but someone had to be there to receive the call. And if it wasn’t the person you wanted to get, it was an administrative assistant or an answering service or something of that sort. That was the mechanism you had to go through to leave a message, so everyone latched onto the idea that you could leave messages on the computer.” In other words, we asked for a system that would never turn a sender away, and for better or worse we got one. Indeed, over the past fifteen years, the move from circuit switching to packet switching has played itself out across society. We used to request dedicated circuits with others; now we send them packets and wait expectantly for ACKs. We used to reject; now we defer. The much-lamented “lack of idleness” one reads about is, perversely, the primary feature of buffers: to bring average throughput up to peak throughput. Preventing idleness is what they do. You check email from the road, from vacation, on the toilet, in the middle of the night. You are never, ever bored. This is the mixed blessing of buffers, operating as advertised.
Computer science illustrates the fundamental limitations of this kind of reasoning with what’s called the “halting problem.” As Alan Turing proved in 1936, a computer program can never tell you for sure whether another program might end up calculating forever without end—except by simulating the operation of that program and thus potentially going off the deep end itself. (Accordingly, programmers will never have automated tools that can tell them whether their software will freeze.) This is one of the foundational results in all of computer science, on which many other proofs hang. Simply put, any time a system—be it a machine or a mind—simulates the workings of something as complex as itself, it finds its resources totally maxed out, more or less by definition. Computer scientists have a term for this potentially endless journey into the hall of mirrors, minds simulating minds simulating minds: “recursion.”
“In poker, you never play your hand,” James Bond says in Casino Royale; “you play the man across from you.” In fact, what you really play is a theoretically infinite recursion. There’s your own hand and the hand you believe your opponent to have; then the hand you believe your opponent believes you have, and the hand you believe your opponent believes you to believe he has…and on it goes. “I don’t know if this is an actual game-theory term,” says the world’s top-rated poker player, Dan Smith, “but poker players call it ‘leveling.’ Level one is ‘I know.’ Two is ‘you know that I know.’ Three, ‘I know that you know that I know.’ There are situations where it just comes up where you are like, ‘Wow, this is a really silly spot to bluff but if he knows that it is a silly spot to bluff then he won’t call me and that’s where it’s the clever spot to bluff.’ Those things happen.”
In one of the seminal results in game theory, the mathematician John Nash proved in 1951 that every two-player game has at least one equilibrium. This major discovery would earn Nash the Nobel Prize in Economics in 1994 (and lead to the book and film A Beautiful Mind, about Nash’s life). Such an equilibrium is now often spoken of as the “Nash equilibrium”—the “Nash” that Dan Smith always tries to keep track
In a game-theory context, knowing that an equilibrium exists doesn’t actually tell us what it is—or how to get there. As UC Berkeley computer scientist Christos Papadimitriou writes, game theory “predicts the agents’ equilibrium behavior typically with no regard to the ways in which such a state will be reached—a consideration that would be a computer scientist’s foremost concern.” Stanford’s Tim Roughgarden echoes the sentiment of being unsatisfied with Nash’s proof that equilibria always exist. “Okay,” he says, “but we’re computer scientists, right? Give us something we can use. Don’t just tell me that it’s there; tell me how to find it.” And so, the original field of game theory begat algorithmic game theory—that is, the study of theoretically ideal strategies for games became the study of how machines (and people) come up with strategies for games.
“If an equilibrium concept is not efficiently computable, much of its credibility as a prediction of the behavior of rational agents is lost.” MIT’s Scott Aaronson agrees. “In my opinion,” he says, “if the theorem that Nash equilibria exist is considered relevant to debates about (say) free markets versus government intervention, then the theorem that finding those equilibria is [intractable] should be considered relevant also.” The predictive abilities of Nash equilibria only matter if those equilibria can actually be found by the players. To quote eBay’s former director of research, Kamal Jain, “If your laptop cannot find it, neither can the market.”
In fact, this makes defection not merely the equilibrium strategy but what’s known as a dominant strategy. A dominant strategy avoids recursion altogether, by being the best response to all of your opponent’s possible strategies—so you don’t even need to trouble yourself getting inside their head at all. A dominant strategy is a powerful thing.
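Checking for a dominant strategy is a mechanical comparison. In this sketch the payoff numbers are invented stand-ins, chosen only so that they rank the outcomes plausibly (walking free with all the money is best, a long sentence alone is worst); the function and variable names are likewise hypothetical.

```python
def dominant_strategy(payoff, strategies):
    """Return a strategy that is a best response to every opposing strategy,
    or None if no such strategy exists.

    payoff[(mine, theirs)] is my payoff when I play `mine` and they play `theirs`.
    """
    for mine in strategies:
        if all(payoff[(mine, theirs)] >= payoff[(other, theirs)]
               for theirs in strategies
               for other in strategies):
            return mine
    return None


# Assumed utilities for one prisoner (higher is better).
PRISONERS_DILEMMA = {
    ("cooperate", "cooperate"): 5,    # both go free and split the money
    ("cooperate", "defect"): -10,     # I take the long sentence alone
    ("defect", "cooperate"): 10,      # I go free with everything
    ("defect", "defect"): -5,         # both serve time
}

if __name__ == "__main__":
    print(dominant_strategy(PRISONERS_DILEMMA, ["cooperate", "defect"]))
    # → defect
```

No guessing about the other player is needed: the check compares each option against every possible opposing move, which is exactly why a dominant strategy sidesteps the recursion.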
But now we’ve arrived at the paradox. If everyone does the rational thing and follows the dominant strategy, the story ends with both of you serving five years of hard time—which, compared to freedom and a cool half million apiece, is dramatically worse for everyone involved. How could that have happened? This has emerged as one of the major insights of traditional game theory: the equilibrium for a set of players, all acting rationally in their own interest, may not be the outcome that is actually best for those players. Algorithmic game theory, in keeping with the principles of computer science, has taken this insight and quantified it, creating a measure called “the price of anarchy.” The price of anarchy measures the gap between cooperation (a centrally designed or coordinated solution) and competition (where each participant is independently trying to maximize the outcome for themselves). In a game like the prisoner’s dilemma, this price is effectively infinite: increasing the amount of cash at stake and lengthening the jail sentences can make the gap between possible outcomes arbitrarily wide, even as the dominant strategy stays the same. There’s no limit to how painful things can get for the players if they don’t coordinate. But in other games, as algorithmic game theorists would discover, the price of anarchy is not nearly so bad.
Surprisingly, Tim Roughgarden and Cornell’s Éva Tardos proved in 2002 that the “selfish routing” approach has a price of anarchy that’s a mere 4/3. That is, a free-for-all is only 33% worse than perfect top-down coordination. Roughgarden and Tardos’s work has deep implications both for urban planning of physical traffic and for network infrastructure. Selfish routing’s low price of anarchy may explain, for instance, why the Internet works as well as it does without any central authority managing the routing of individual packets. Even if such coordination were possible, it wouldn’t add very much.
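The 4/3 figure can be seen in a standard two-road textbook example, usually attributed to Pigou. The example is not taken from the passage above, and the travel times are assumptions: one road always takes an hour, while the other takes x hours when a fraction x of all drivers use it.

```python
def average_travel_time(fraction_on_variable_road):
    """Average trip time when a fraction x of drivers take the variable road.

    Variable road: takes x hours when a fraction x of drivers use it.
    Fixed road: always takes 1 hour.
    """
    x = fraction_on_variable_road
    return x * x + (1 - x) * 1.0


# Selfish routing: the variable road is never slower than the fixed one,
# so every driver takes it and everyone's trip lasts a full hour.
selfish = average_travel_time(1.0)

# Coordinated routing: search for the split that minimizes the average.
coordinated = min(average_travel_time(i / 1000) for i in range(1001))

price_of_anarchy = selfish / coordinated

if __name__ == "__main__":
    print(selfish, coordinated, price_of_anarchy)
```

With an even split the average trip drops to three-quarters of an hour, so the free-for-all is 4/3 as slow as the coordinated plan.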
When it comes to traffic of the human kind, the low price of anarchy cuts both ways. The good news is that the lack of centralized coordination is making your commute at most only 33% worse. On the other hand, if you’re hoping that networked, self-driving autonomous cars will bring us a future of traffic utopia, it may be disheartening to learn that today’s selfish, uncoordinated drivers are already pretty close to optimal. It’s true that self-driving cars should reduce the number of road accidents and may be able to drive more closely together, both of which would speed up traffic. But from a congestion standpoint, the fact that anarchy is only 4/3 as congested as perfect coordination means that perfectly coordinated commutes will only be 3/4 as congested as they are now. It’s a bit like the famous line by James Branch Cabell: “The optimist proclaims that we live in the best of all possible worlds; and the pessimist fears this is true.”
Quantifying the price of anarchy has given the field a concrete and rigorous way to assess the pros and cons of decentralized systems, which has broad implications across any number of domains where people find themselves involved in game-playing (whether they know it or not). A low price of anarchy means the system is, for better or worse, about as good on its own as it would be if it were carefully managed. A high price of anarchy, on the other hand, means that things have the potential to turn out fine if they’re carefully coordinated—but that without some form of intervention, we are courting disaster. The prisoner’s dilemma is clearly of this latter type. Unfortunately, so are many of the most critical games the world must play.
The logic of this type of game is so pervasive that we don’t even have to look to misdeeds to see it running amok. We can just as easily end up in a terrible equilibrium with a clean conscience. How? Look no further than your company vacation policy. In America, people work some of the longest hours in the world; as the Economist put it, “nowhere is the value of work higher and the value of leisure lower.” There are few laws mandating that employers provide time off, and even when American employees do get vacation time they don’t use it. A recent study showed that the average worker takes only half of the vacation days granted them, and a stunning 15% take no vacation at all. At the present moment, the Bay Area (where the two of us live) is attempting to remedy this sorry state of affairs by going through a radical paradigm shift when it comes to vacation policy—a shift that is very well meaning and completely, apocalyptically doomed. The premise sounds innocent enough: instead of metering out some fixed arbitrary number of days for each employee, then wasting HR man-hours making sure no one goes over their limit, why not just let your employees free? Why not simply allow them unlimited vacation? Anecdotal reports thus far are mixed—but from a game-theoretic perspective, this approach is a nightmare. All employees want, in theory, to take as much vacation as possible. But they also all want to take just slightly less vacation than each other, to be perceived as more loyal, more committed, and more dedicated (hence more promotion-worthy). Everyone looks to the others for a baseline, and will take just slightly less than that. The Nash equilibrium of this game is zero. As the CEO of software company Travis CI, Mathias Meyer, writes, “People will hesitate to take a vacation as they don’t want to seem like that person who’s taking the most vacation days. It’s a race to the bottom.”
So what can we, as players, do when we find ourselves in such a situation—either the two-party prisoner’s dilemma, or the multi-party tragedy of the commons? In a sense, nothing. The very stability that these bad equilibria have, the thing that makes them equilibria, becomes damnable. By and large we cannot shift the dominant strategies from within. But this doesn’t mean that bad equilibria can’t be fixed. It just means that the solution is going to have to come from somewhere else.
The prisoner’s dilemma has been the focal point for generations of debate and controversy about the nature of human cooperation, but University College London game theorist Ken Binmore sees at least some of that controversy as misguided. As he argues, it’s “just plain wrong that the Prisoner’s Dilemma captures what matters about human cooperation. On the contrary, it represents a situation in which the dice are as loaded against the emergence of cooperation as they could possibly be.”
This brings us to a branch of game theory known as “mechanism design.” While game theory asks what behavior will emerge given a set of rules, mechanism design (sometimes called “reverse game theory”) works in the other direction, asking: what rules will give us the behavior we want to see? And if game theory’s revelations—like the fact that an equilibrium strategy might be rational for each player yet bad for everyone—have proven counterintuitive, the revelations of mechanism design are even more so. Let’s return you and your bank-robbing co-conspirator to the jail cell for another go at the prisoner’s dilemma, with one crucial addition: the Godfather. Now you and your fellow thief are members of a crime syndicate, and the don has made it, shall we say, all too clear that any informants will sleep with the fishes. This alteration of the game’s payoffs has the effect of limiting the actions you can take, yet ironically makes it far more likely that things will end well, both for you and your partner. Since defection is now less attractive (to put it mildly), both prisoners are induced to cooperate, and both will confidently walk away half a million dollars richer. Minus, of course, a nominal tithe to the don. The counterintuitive and powerful thing here is we can worsen every outcome—death on the one hand, taxes on the other—yet make everyone’s lives better by shifting the equilibrium.
And adding divine force to injunctions against other kinds of antisocial behavior, such as murder, adultery, and theft, is likewise a way to solve some of the game-theoretic problems of living in a social group. God happens to be even better than government in this respect, since omniscience and omnipotence provide a particularly strong guarantee that taking bad actions will have dire consequences. It turns out there’s no Godfather quite like God the Father.
The redwoods of California are some of the oldest and most majestic living things on the planet. From a game-theoretic standpoint, though, they’re something of a tragedy. The only reason they’re so tall is that they’re trying to be taller than each other, up to the point where the harms of overextension are finally even worse than the harms of getting shaded out. As Richard Dawkins puts it: “The canopy can be thought of as an aerial meadow, just like a rolling grassland prairie, but raised on stilts. The canopy is gathering solar energy at much the same rate as a grassland prairie would. But a substantial portion of the energy is ‘wasted’ by being fed straight into the stilts, which do nothing more useful than loft the ‘meadow’ high in the air, where it picks up exactly the same harvest of photons as it would, at far lower cost, if it were laid flat on the ground.”
Emotion, for the bitter, retaliatory consumer and for the convenience-store hero alike, is our own species taking over the controls for a minute. “Morality is herd instinct in the individual,” wrote Nietzsche. Paraphrasing slightly, we might hazard that emotion is mechanism design in the species. Precisely because feelings are involuntary, they enable contracts that need no outside enforcement. Revenge almost never works out in favor of the one who seeks it, and yet someone who will respond with “irrational” vehemence to being taken advantage of is for that very reason more likely to get a fair deal. As Cornell economist Robert Frank puts it, “If people expect us to respond irrationally to the theft of our property, we will seldom need to, because it will not be in their interests to steal it. Being predisposed to respond irrationally serves much better here than being guided only by material self-interest.”
(Lest you think that civilized modern humans have legal contracts and rule of law instead of retribution, recall that it’s often more work and suffering to sue or prosecute someone than the victim could ever hope to recover in material terms. Lawsuits are the means for self-destructive retaliation in a developed society, not the substitute.)
A game-theoretic argument for love would highlight one further point: marriage is a prisoner’s dilemma in which you get to choose the person with whom you’re in cahoots. This might seem like a small change, but it potentially has a big effect on the structure of the game you’re playing. If you knew that, for some reason, your partner in crime would be miserable if you weren’t around—the kind of misery even a million dollars couldn’t cure—then you’d worry much less about them defecting and leaving you to rot in jail. So the rational argument for love is twofold: the emotions of attachment not only spare you from recursively overthinking your partner’s intentions, but by changing the payoffs actually enable a better outcome altogether. What’s more, being able to fall involuntarily in love makes you, in turn, a more attractive partner to have. Your capacity for heartbreak, for sleeping with the emotional fishes, is the very quality that makes you such a trusty accomplice.
On the other hand, learning from others doesn’t always seem particularly rational. Fads and fashions are the result of following others’ behavior without being anchored to any underlying objective truth about the world. What’s worse, the assumption that other people’s actions are a useful guide can lead to the sort of herd-following that precipitates economic disaster. If everybody else is investing in real estate, it seems like a good idea to buy a house; after all, the price is only going to go up. Isn’t it?
In such a situation, it seems natural to look closely at your opponents’ bids, to augment your own meager private information with the public information. But this public information might not be nearly as informative as it seems. You don’t actually get to know the other bidders’ beliefs—only their actions. And it is entirely possible that their behavior is based on your own, just as your behavior is being influenced by theirs. It’s easy to imagine a bunch of people all going over a cliff together because “everyone else” was acting as though it’d all be fine—when in reality each person had qualms, but suppressed them because of the apparent confidence of everyone else in the group. Just as with the tragedy of the commons, this failure is not necessarily the players’ fault. An enormously influential paper by the economists Sushil Bikhchandani, David Hirshleifer, and Ivo Welch has demonstrated that under the right circumstances, a group of agents who are all behaving perfectly rationally and perfectly appropriately can nonetheless fall prey to what is effectively infinite misinformation. This has come to be known as an “information cascade.”
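A minimal sketch of how such a cascade can form, using a simplified sequential model rather than the paper's exact formulation. It assumes each agent receives one private signal (+1 or -1) of equal reliability, sees only the actions of earlier agents, and follows its own signal when the evidence is tied. The function name and the encoding are hypothetical.

```python
def run_cascade(signals):
    """Return each agent's action (+1 or -1) given private signals in order."""
    evidence = 0  # net count of signals the public can infer from past actions
    actions = []
    for signal in signals:
        if evidence >= 2:
            # Public evidence outweighs any single private signal: follow the herd.
            action = 1
        elif evidence <= -2:
            action = -1
        else:
            # The action still reveals the private signal, so evidence updates.
            action = signal
            evidence += signal
        actions.append(action)
    return actions

if __name__ == "__main__":
    # Two early positive signals, then four negative ones.
    print(run_cascade([1, 1, -1, -1, -1, -1]))
```

In this run the first two agents act on positive signals, after which every later agent rationally ignores its own negative signal. Four of the six private signals point the other way, but none of that information ever reaches the public record, because actions inside a cascade reveal nothing about beliefs.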
Information cascades offer a rational theory not only of bubbles, but also of fads and herd behavior more generally. They offer an account of how it’s easily possible for any market to spike and collapse, even in the absence of irrationality, malevolence, or malfeasance. The takeaways are several. For one, be wary of cases where public information seems to exceed private information, where you know more about what people are doing than why they’re doing it, where you’re more concerned with your judgments fitting the consensus than fitting the facts. When you’re mostly looking to others to set a course, they may well be looking right back at you to do the same. Second, remember that actions are not beliefs; cascades get caused in part when we misinterpret what others think based on what they do. We should be especially hesitant to overrule our own doubts—and if we do, we might want to find some way to broadcast those doubts even as we move forward, lest others fail to distinguish the reluctance in our minds from the implied enthusiasm in our actions. Last, we should remember from the prisoner’s dilemma that sometimes a game can have irredeemably lousy rules. There may be nothing we can do once we’re in it, but the theory of information cascades may help us to avoid such a game in the first place.
And if you’re the kind of person who always does what you think is right, no matter how crazy others think it is, take heart. The bad news is that you will be wrong more often than the herd followers. The good news is that sticking to your convictions creates a positive externality, letting people make accurate inferences from your behavior. There may come a time when you will save the entire herd from disaster.
In fact, there’s one auction design in particular that cuts through the burden of mental recursion like a hot knife through butter. It’s called the Vickrey auction. Named for Nobel Prize–winning economist William Vickrey, the Vickrey auction, just like the first-price auction, is a “sealed bid” auction process. That is, every participant simply writes down a single number in secret, and the highest bidder wins. However, in a Vickrey auction, the winner ends up paying not the amount of their own bid, but that of the second-place bidder. That is to say, if you bid $25 and I bid $10, you win the item at my price: you only have to pay $10. To a game theorist, a Vickrey auction has a number of attractive properties. And to an algorithmic game theorist in particular, one property especially stands out: the participants are incentivized to be honest. In fact, there is no better strategy than just bidding your “true value” for the item, that is, exactly what you think the item is worth. Bidding any more than your true value is obviously silly, as you might end up stuck buying something for more than you think it’s worth. And bidding any less than your true value (i.e., shading your bid) risks losing the auction for no good reason, since it doesn’t save you any money: if you win, you’ll only be paying the value of the second-highest bid, regardless of how high your own was. This makes the Vickrey auction what mechanism designers call “strategy-proof,” or just “truthful.” In the Vickrey auction, honesty is literally the best policy.
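The pricing rule and the honesty property can be checked with a short sketch. The function names, the bidder labels, and the tie-breaking rule (the earliest-listed bidder wins a tie) are assumptions made for illustration. The final check compares truthful bidding against every whole-dollar alternative for a few rival bids.

```python
def vickrey_auction(bids):
    """bids maps bidder name to bid; returns (winner, price paid)."""
    if len(bids) < 2:
        raise ValueError("a second-price auction needs at least two bids")
    # sorted() is stable, so ties go to the bidder listed first (an assumption).
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1]  # the winner pays the second-highest bid
    return winner, price

def utility(true_value, my_bid, rival_bids):
    """Payoff to 'me': value minus price if I win, otherwise zero."""
    bids = {"me": my_bid}
    bids.update({f"rival{i}": bid for i, bid in enumerate(rival_bids)})
    winner, price = vickrey_auction(bids)
    return true_value - price if winner == "me" else 0

if __name__ == "__main__":
    print(vickrey_auction({"you": 25, "me": 10}))
    true_value = 20
    for rivals in ([15], [25], [20]):
        honest = utility(true_value, true_value, rivals)
        best = max(utility(true_value, bid, rivals) for bid in range(41))
        print(rivals, honest, best)
```

In each case the honest bid does as well as the best alternative. Bidding above the true value only adds outcomes where the winner pays more than the item is worth, and bidding below only adds outcomes where a profitable win is lost.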
Now, it seems like the Vickrey auction would cost the seller some money compared to the first-price auction, but this isn’t necessarily true. In a first-price auction, every bidder is shading their bid down to avoid overpaying; in the second-price Vickrey auction, there’s no need to—in a sense, the auction itself is optimally shading their bid for them. In fact, a game-theoretic principle called “revenue equivalence” establishes that over time, the average expected sale price in a first-price auction will converge to precisely the same as in a Vickrey auction. Thus the Vickrey equilibrium involves the same bidder winning the item for the same price—without any strategizing by any of the bidders whatsoever. As Tim Roughgarden tells his Stanford students, the Vickrey auction is “awesome.”
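Revenue equivalence can be illustrated numerically under one textbook set of assumptions, none of which come from the passage itself: bidders' values are drawn independently and uniformly from 0 to 1, and in the first-price auction each of the n bidders shades their bid to (n - 1)/n of their value, which is the standard equilibrium for that setting. The function name and parameters are hypothetical.

```python
import random

def average_revenues(n_bidders=3, rounds=100_000, seed=0):
    """Return (first-price average, second-price average) seller revenue."""
    rng = random.Random(seed)
    shade = (n_bidders - 1) / n_bidders  # assumed equilibrium bid shading
    first_total = 0.0
    second_total = 0.0
    for _ in range(rounds):
        values = sorted(rng.random() for _ in range(n_bidders))
        # First-price: highest-value bidder wins and pays their own shaded bid.
        first_total += shade * values[-1]
        # Vickrey: truthful bids, winner pays the second-highest value.
        second_total += values[-2]
    return first_total / rounds, second_total / rounds

if __name__ == "__main__":
    first, second = average_revenues()
    print(round(first, 3), round(second, 3))
```

With three bidders both averages should land near 0.5, the value (n - 1)/(n + 1) predicted for this setting. The same bidder wins in both formats, and the shading that bidders perform for themselves in the first-price auction matches, on average, the discount the second-price rule hands the winner.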
In fact, the lesson here goes far beyond auctions. In a landmark finding called the “revelation principle,” Nobel laureate Roger Myerson proved that any game that requires strategically masking the truth can be transformed into a game that requires nothing but simple honesty. Paul Milgrom, Myerson’s colleague at the time, reflects: “It’s one of those results that as you look at it from different sides, on the one side, it’s just absolutely shocking and amazing, and on the other side, it’s trivial. And that’s totally wonderful, it’s so awesome: that’s how you know you’re looking at one of the best things you can see.” The revelation principle may seem hard to accept on its face, but its proof is actually quite intuitive. Imagine that you have an agent or a lawyer who will be playing the game for you. If you trust them to represent your interests, you’re going to simply tell them exactly what you want, and let them handle all of the strategic bid-shading and the recursive strategizing on your behalf. In the Vickrey auction, the game itself performs this function. And the revelation principle just expands this idea: any game that can be played for you by agents to whom you’ll tell the truth, it says, will become an honesty-is-best game if the behavior you want from your agent is incorporated into the rules of the game itself. As Nisan puts it, “The basic thing is if you don’t want your clients to optimize against you, you’d better optimize for them. That’s the whole proof….If I design an algorithm that already optimizes for you, there is nothing you can do.”
Even the best strategy sometimes yields bad results—which is why computer scientists take care to distinguish between “process” and “outcome.” If you followed the best possible process, then you’ve done all you can, and you shouldn’t blame yourself if things didn’t go your way. Outcomes make news headlines—indeed, they make the world we live in—so it’s easy to become fixated on them. But processes are what we have control over. As Bertrand Russell put it, “it would seem we must take account of probability in judging of objective rightness….The objectively right act is the one which will probably be most fortunate. I shall define this as the wisest act.” We can hope to be fortunate—but we should strive to be wise. Call it a kind of computational Stoicism.
Likewise, seemingly innocuous language like “Oh, I’m flexible” or “What do you want to do tonight?” has a dark computational underbelly that should make you think twice. It has the veneer of kindness about it, but it does two deeply alarming things. First, it passes the cognitive buck: “Here’s a problem, you handle it.” Second, by not stating your preferences, it invites the others to simulate or imagine them. And as we have seen, the simulation of the minds of others is one of the biggest computational challenges a mind (or machine) can ever face.
In such situations, computational kindness and conventional etiquette diverge. Politely withholding your preferences puts the computational problem of inferring them on the rest of the group. In contrast, politely asserting your preferences (“Personally, I’m inclined toward x. What do you think?”) helps shoulder the cognitive load of moving the group toward resolution.
One of the chief goals of design ought to be protecting people from unnecessary tension, friction, and mental labor. (This is not just an abstract concern; when mall parking becomes a source of stress, for instance, shoppers may spend less money and return less frequently.) Urban planners and architects routinely weigh how different lot designs will use resources such as limited space, materials, and money. But they rarely account for the way their designs tax the computational resources of the people who use them. Recognizing the algorithmic underpinnings of our daily lives—in this case, optimal stopping—would not only allow drivers to make the best decisions when they’re in a particular scenario, but also encourage planners to be more thoughtful about the problems they’re forcing drivers into in the first place. There are a number of other cases where computationally kinder designs suggest themselves. For example, consider restaurant seating policies. Some restaurants have an “open seating” policy, where waiting customers simply hover until a table opens up, and the first to sit down gets the table. Others will take your name, let you have a drink at the bar, and notify you when a table is ready. These approaches to the management of scarce shared resources mirror the distinction in computer science between “spinning” and “blocking.” When a processing thread requests a resource and can’t get it, the computer can either allow that thread to “spin”—to continue checking for the resource in a perpetual “Is it ready yet?” loop—or it can “block”: halt that thread, work on something else, and then come back around whenever the resource becomes free. To a computer scientist, this is a practical tradeoff: weighing the time lost to spinning against the time lost in context switching. But at a restaurant, not all of the resources being traded off are their own. 
A policy of “spinning” fills empty tables faster, but the CPUs being worn out in the meantime are the minds of their customers, trapped in a tedious but consuming vigilance.
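The two waiting policies can be sketched with Python's standard `threading` module. This is an illustration under stated assumptions rather than a model of any real scheduler: the "table" is a `threading.Event`, the spinning waiter polls with a short sleep instead of burning the CPU continuously, and the function names and timings are hypothetical.

```python
import threading
import time

def wait_by_spinning(table_ready, poll_interval=0.001):
    """Keep asking 'Is it ready yet?'; return how many times we checked."""
    checks = 1
    while not table_ready.is_set():
        checks += 1
        time.sleep(poll_interval)
    return checks

def wait_by_blocking(table_ready):
    """Hand the wait to the system and get woken up; one request in total."""
    table_ready.wait()
    return 1

def demo(delay=0.05):
    results = {}
    for name, waiter in (("spin", wait_by_spinning), ("block", wait_by_blocking)):
        ready = threading.Event()
        # The 'host' frees the table after a delay, on another thread.
        timer = threading.Timer(delay, ready.set)
        timer.start()
        results[name] = waiter(ready)
        timer.join()
    return results

if __name__ == "__main__":
    print(demo())
```

Both waiters get the table at about the same moment. The difference is how much attention each one spent in the meantime: the spinning waiter checked many times, while the blocking waiter made a single request and was free to do something else until notified.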
The limitations of a classical conception of rationality—which assumes infinite computational capacity and infinite time to solve a problem—were famously pointed out by the psychologist, economist, and artificial intelligence pioneer Herbert Simon in the 1950s (Simon, Models of Man), ultimately leading to a Nobel Prize. Simon argued that “bounded rationality” could provide a better account of human behavior. Simon’s insight has been echoed in mathematics and computer science. Alan Turing’s colleague I. J. Good (famous for the concept of “the singularity” and for advising Stanley Kubrick about HAL 9000 for 2001: A Space Odyssey) called this sort of thinking “Type II Rationality.” Whereas classic, old-fashioned Type I Rationality just worries about getting the right answer, Type II Rationality takes into account the cost of getting that answer, recognizing that time is just as important a currency as accuracy.
Artificial intelligence experts of the twenty-first century have also argued that “bounded optimality”—choosing the algorithm that best trades off time and error—is the key to developing functional intelligent agents. This is a point made by, for instance, UC Berkeley computer scientist Stuart Russell—who literally cowrote the book on artificial intelligence (the bestselling textbook Artificial Intelligence: A Modern Approach)—and by Eric Horvitz, managing director at Microsoft Research. See, for example, Russell and Wefald, Do the Right Thing, and Horvitz and Zilberstein, “Computational Tradeoffs Under Bounded Resources.” Tom and his colleagues have used this approach to develop models of human cognition; see Griffiths, Lieder, and Goodman, “Rational Use of Cognitive Resources.”
Selling a house is similar: The house-selling problem is analyzed in Sakaguchi, “Dynamic Programming of Some Sequential Sampling Design”; Chow and Robbins, “A Martingale System Theorem and Applications”; and Chow and Robbins, “On Optimal Stopping Rules.” We focus on the case where there are potentially infinitely many offers, but these authors also provide optimal strategies when the number of potential offers is known and finite (which are less conservative—you should have a lower threshold if you only have finitely many opportunities). In the infinite case, you should set a threshold based on the expected value of waiting for another offer, and take the first offer that exceeds that threshold.
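The threshold rule for the infinite case can be made concrete under assumptions that are not spelled out in this note: offers are drawn uniformly from 0 to 1 (as a fraction of the price range), each additional offer costs a fixed amount to wait for, and there is no discounting. At the threshold, the expected improvement from one more offer exactly equals the cost of waiting for it. The function names are hypothetical.

```python
import math

def expected_gain_from_waiting(current_offer):
    """E[max(X - current_offer, 0)] for X uniform on [0, 1]."""
    return (1.0 - current_offer) ** 2 / 2.0

def threshold(wait_cost):
    """Offer level at which waiting and accepting are equally good.

    Solves (1 - t)^2 / 2 = wait_cost for t, floored at 0.
    """
    if wait_cost < 0:
        raise ValueError("wait_cost must be non-negative")
    return max(0.0, 1.0 - math.sqrt(2.0 * wait_cost))

if __name__ == "__main__":
    for cost in (0.0, 0.005, 0.02, 0.125, 0.5):
        print(cost, round(threshold(cost), 3))
```

In this sketch the threshold falls as waiting becomes more expensive. When waiting is free the seller holds out for the top of the range, and once each offer costs half the range or more, the first offer should be accepted.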
Chapter 17 of Shoup’s The High Cost of Free Parking discusses the optimal on-street parking strategy when pricing creates an average of one free space per block, which, as Shoup notes, “depends on the conflict between greed and sloth” (personal correspondence). The question of whether to “cruise” for cheap on-street spots or to pay for private parking spaces is taken up in Shoup’s chapter 13.