davidstephens.io

The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World

Pedro Domingos
Ended: June 1, 2017
        
              Once the inevitable happens and learning algorithms become the middlemen, power becomes concentrated in them. Google’s algorithms largely determine what information you find, Amazon’s what products you buy, and Match.com’s who you date. The last mile is still yours—choosing from among the options the algorithms present you with—but 99.9 percent of the selection was done by them. The success or failure of a company now depends on how much the learners like its products, and the success of a whole economy—whether everyone gets the best products for their needs at the best price—depends on how good the learners are.
            
              In the same way that a bank without databases can’t compete with a bank that has them, a company without machine learning can’t keep up with one that uses it. While the first company’s experts write a thousand rules to predict what its customers want, the second company’s algorithms learn billions of rules, a whole set of them for each individual customer. It’s about as fair as spears against machine guns.
            
              As knowledge grows, scientists specialize ever more narrowly, but no one is able to put the pieces together because there are far too many pieces. Scientists collaborate, but language is a very slow medium of communication. Scientists try to keep up with others’ research, but the volume of publications is so high that they fall farther and farther behind. Often, redoing an experiment is easier than finding the paper that reported it. Machine learning comes to the rescue, scouring the literature for relevant information, translating one area’s jargon into another’s, and even making connections that scientists weren’t aware of. Increasingly, machine learning acts as a giant hub, through which modeling techniques invented in one field make their way into others.
            
              There’s a further twist: once a learned program is deployed, the bad guys change their behavior to defeat it. This contrasts with the natural world, which always works the same way. The solution is to marry machine learning with game theory, something I’ve worked on in the past: don’t just learn to defeat what your opponent does now; learn to parry what he might do against your learner. Factoring in the costs and benefits of different actions, as game theory does, can also help strike the right balance between privacy and security.
            
              consider Naïve Bayes, a learning algorithm that can be expressed as a single short equation. Given a database of patient records—their symptoms, test results, and whether or not they had some particular condition—Naïve Bayes can learn to diagnose the condition in a fraction of a second, often better than doctors who spent many years in medical school. It can also beat medical expert systems that took thousands of person-hours to build. The same algorithm is widely used to learn spam filters, a problem that at first sight has nothing to do with medical diagnosis. Another simple learner, called the nearest-neighbor algorithm, has been used for everything from handwriting recognition to controlling robot hands to recommending books and movies you might like. And decision tree learners are equally apt at deciding whether your credit-card application should be accepted, finding splice junctions in DNA, and choosing the next move in a game of chess.
            
              all the major learners—including nearest-neighbor, decision trees, and Bayesian networks, a generalization of Naïve Bayes—are universal in the following sense: if you give the learner enough of the appropriate data, it can approximate any function arbitrarily closely—which is math-speak for learning anything. The catch is that “enough data” could be infinite. Learning from finite data requires making assumptions, as we’ll see, and different learners make different assumptions, which makes them good for some things but not others.
            
              All of this is evidence that the brain uses the same learning algorithm throughout, with the areas dedicated to the different senses distinguished only by the different inputs they are connected to (e.g., eyes, ears, nose). In turn, the associative areas acquire their function by being connected to multiple sensory regions, and the “executive” areas acquire theirs by connecting the associative areas and motor output.
            
              The cerebellum, the evolutionarily older part of the brain responsible for low-level motor control, has a clearly different and very regular architecture, built out of much smaller neurons, so it would seem that at least motor learning uses a different algorithm. If someone’s cerebellum is injured, however, the cortex takes over its function. Thus it seems that evolution kept the cerebellum around not because it does something the cortex can’t, but just because it’s more efficient.
            
              The computations taking place within the brain’s architecture are also similar throughout. All information in the brain is represented in the same way, via the electrical firing patterns of neurons. The learning mechanism is also the same: memories are formed by strengthening the connections between neurons that fire together, using a biochemical process known as long-term potentiation. All this is not just true of humans: different animals have similar brains. Ours is unusually large, but seems to be built along the same principles as other animals’. Another line of argument for the unity of the cortex comes from what might be called the poverty of the genome. The number of connections in your brain is over a million times the number of letters in your genome, so it’s not physically possible for the genome to specify in detail how the brain is wired.
            
              Evolution is an algorithm. Paraphrasing Charles Babbage, the Victorian-era computer pioneer, God created not species but the algorithm for creating species. The “endless forms most beautiful” Darwin spoke of in the conclusion of The Origin of Species belie a most beautiful unity: all of those forms are encoded in strings of DNA, and all of them come about by modifying and combining those strings. Who would have guessed, given only a description of this algorithm, that it could produce you and me? If evolution can learn us, it can conceivably also learn everything that can be learned, provided we implement it on a powerful enough computer. Indeed, evolving programs by simulating natural selection is a popular endeavor in machine learning. Evolution, then, is another promising path to the Master Algorithm.
            
              A related, frequently heard objection is “Data can’t replace human intuition.” In fact, it’s the other way around: human intuition can’t replace data. Intuition is what you use when you don’t know the facts, and since you often don’t, intuition is precious. But when the evidence is before you, why would you deny it? Statistical analysis beats talent scouts in baseball (as Michael Lewis memorably documented in Moneyball), it beats connoisseurs at wine tasting, and every day we see new examples of what it can do.
            
              What in my job can be done by a learning algorithm, what can’t, and—most important—how can I take advantage of machine learning to do it better? The computer is your tool, not your adversary. Armed with machine learning, a manager becomes a supermanager, a scientist a superscientist, an engineer a superengineer. The future belongs to those who understand at a very deep level how to combine their unique expertise with what algorithms do best.
            
              Control of data and ownership of the models learned from it is what many of the twenty-first century’s battles will be about—between governments, corporations, unions, and individuals. But you also have an ethical duty to share data for the common good. Machine learning alone will not cure cancer; cancer patients will, by sharing their data for the benefit of future patients.
            
              A theory is a set of constraints on what the world could be, not a complete description of it. To obtain the latter, you have to combine the theory with data.
            
              The power of a theory lies in how much it simplifies our description of the world. Armed with Newton’s laws, we only need to know the masses, positions, and velocities of all objects at one point in time; their positions and velocities at all times follow. So Newton’s laws reduce our description of the world by a factor of the number of distinguishable instants in the history of the universe, past and future.
            
              Overfitting happens when you have too many hypotheses and not enough data to tell them apart.
            
              Perceptrons were invented in the late 1950s by Frank Rosenblatt, a Cornell psychologist. A charismatic speaker and lively character, Rosenblatt did more than anyone else to shape the early days of machine learning. The name perceptron derives from his interest in applying his models to perceptual tasks like speech and character recognition. Rather than implement perceptrons in software, which was very slow in those days, Rosenblatt built his own devices. The weights were implemented by variable resistors like those found in dimmable light switches, and weight learning was carried out by electric motors that turned the knobs on the resistors.
            
              A perceptron models only a single neuron’s learning, however, and although Minsky and Papert acknowledged that layers of interconnected neurons should be capable of more, they didn’t see a way to learn them. Neither did anyone else. The problem is that there’s no clear way to change the weights of the neurons in the “hidden” layers to reduce the errors made by the ones in the output layer. Every hidden neuron influences the output via multiple paths, and every error has a thousand fathers. Who do you blame? Or, conversely, who gets the credit for correct outputs? This credit-assignment problem shows up whenever we try to learn a complex model and is one of the central problems in machine learning.
            
              Machine learning at the time was associated mainly with neural networks, and most researchers (not to mention funders) concluded that the only way to build an intelligent system was to explicitly program it. For the next fifteen years, knowledge engineering would hold center stage, and machine learning seemed to have been consigned to the ash heap of history.
            
              Backpropagation, as this algorithm is known, is phenomenally more powerful than the perceptron algorithm. A single neuron could only learn straight lines. Given enough hidden neurons, a multilayer perceptron, as it’s called, can represent arbitrarily convoluted frontiers. This makes backpropagation—or simply backprop—the connectionists’ master algorithm.
            
              Hyperspace is a double-edged sword. On the one hand, the higher dimensional the space, the more room it has for highly convoluted surfaces and local optima. On the other hand, to be stuck in a local optimum you have to be stuck in every dimension, so it’s more difficult to get stuck in many dimensions than it is in three. In hyperspace there are mountain passes all over the (hyper) place.
            
              Backprop was invented in 1986 by David Rumelhart, a psychologist at the University of California, San Diego, with the help of Geoff Hinton and Ronald Williams.
            
              Nonlinear models are important far beyond the stock market. Scientists everywhere use linear regression because that’s what they know, but more often than not the phenomena they study are nonlinear, and a multilayer perceptron can model them. Linear models are blind to phase transitions; neural networks soak them up like a sponge.
            
              In truth, connectionists have made genuine progress. One of the protagonists of this latest twist in the connectionist roller coaster is an unassuming little device called an autoencoder. An autoencoder is a multilayer perceptron whose output is the same as its input. In goes a picture of your grandmother and out comes—the same picture of your grandmother. At first this seems like a silly idea: What use could such a contraption possibly be? The key is to make the hidden layer much smaller than the input and output layers, so the network can’t just learn to copy the input to the hidden layer and the hidden layer to the output, in which case we may as well throw the whole thing out. But if the hidden layer is small, something interesting happens: the network is forced to encode the input in fewer bits, so it can be represented in the hidden layer, and then decode those bits back to full size. It could, for example, learn to encode a million-pixel image of your grandmother as just the seven-character word grandma, or some such short code invented by itself, and simultaneously learn to decode “grandma” into an image of dear old granny. So an autoencoder is not unlike a file compression tool, with two important advantages: it figures out how to compress things on its own, and like Hopfield networks, it can turn a noisy, distorted image into a nice clean one.
            
              autoencoders didn’t really catch on. The trick that took over a decade to discover was to make the hidden layer larger than the input and output ones. Huh? Actually, that’s only half the trick: the other half is to force all but a few of the hidden units to be off at any given time. This still prevents the hidden layer from just copying the input, and—crucially—it makes learning much easier. If we allow different bits to represent different inputs, the inputs no longer have to compete to set the same bits. Also, the network now has many more parameters, so the hyperspace you’re in has many more dimensions, and you have many more ways to get out of what would otherwise be local maxima. This is called a sparse autoencoder, and it’s a neat trick.
            
              The next clever idea is to stack sparse autoencoders on top of each other like a club sandwich. The hidden layer of the first autoencoder becomes the input/output layer of the second one, and so on. Because the neurons are nonlinear, each hidden layer learns a more sophisticated representation of the input, building on the previous one. Given a large set of face images, the first autoencoder learns to encode local features like corners and spots, the second uses those to encode facial features like the tip of a nose or the iris of an eye, the third one learns whole noses and eyes, and so on. Finally, the top layer can be a conventional perceptron that learns to recognize your grandmother from the high-level features provided by the layer below it—much easier than using only the crude information provided by a single hidden layer or than trying to backpropagate through all the layers at once. The Google…
            
              Notice how much genetic algorithms differ from multilayer perceptrons. Backprop entertains a single hypothesis at any given time, and the hypothesis changes gradually until it settles into a local optimum. Genetic algorithms consider an entire population of hypotheses at each step, and these can make big jumps from one generation to the next, thanks to crossover. Backprop proceeds deterministically after setting the initial weights to small random values. Genetic algorithms, in contrast, are full of random choices: which hypotheses to keep alive and cross over (with fitter hypotheses being more likely candidates), where to cross two strings, which bits to mutate. Backprop learns weights for a predefined network architecture; denser networks are more flexible but also harder to learn. Genetic algorithms make no a priori assumptions about the structures they will learn, other than their general form.
            
              No one is sure why sex is pervasive in nature, either. Several theories have been proposed, but none is widely accepted. The leader of the pack is the Red Queen hypothesis, popularized by Matt Ridley in the eponymous book. As the Red Queen said to Alice in Through the Looking Glass, “It takes all the running you can do, to keep in the same place.” In this view, organisms are in a perpetual arms race with parasites, and sex helps keep the population varied, so that no single germ can infect all of it. If this is the answer, then sex is irrelevant to machine learning, at least until learned programs have to vie with computer viruses for processor time and memory. (Intriguingly, Danny Hillis claims that deliberately introducing coevolving parasites into a genetic algorithm can help it escape local maxima by gradually ratcheting up the difficulty, but no one has followed up on this yet.) Christos Papadimitriou and colleagues have shown that sex optimizes not fitness but what they call mixability: a gene’s ability to do well on average when combined with other genes. This can be useful when the fitness function is either not known or not constant, as in natural selection, but in machine learning and optimization, hill climbing tends to do better.
            
              In biology, this is called the Baldwin effect, after J. M. Baldwin, who proposed it in 1896. In Baldwinian evolution, behaviors that are first learned later become genetically hardwired. If dog-like mammals can learn to swim, they have a better chance to evolve into seals—as they did—than if they drown. Thus individual learning can influence evolution without recourse to Lamarckism. Geoff Hinton and Steven Nowlan demonstrated the Baldwin effect in machine learning by using genetic algorithms to evolve neural network structure and observing that fitness increased over time only when individual learning was allowed.
            
              Evolution searches for good structures, and neural learning fills them in: this combination is the easiest of the steps we’ll take toward the Master Algorithm.
            
              A learner that uses Bayes’ theorem and assumes the effects are independent given the cause is called a Naïve Bayes classifier. That’s because, well, that’s such a naïve assumption.
            
              If the states and observations are continuous variables instead of discrete ones, the HMM becomes what’s known as a Kalman filter. Economists use Kalman filters to remove noise from time series of quantities like GDP, inflation, and unemployment. The “true” GDP values are the hidden states; at each time step, the true value should be similar to the observed one, but also to the previous true value, since the economy seldom makes abrupt jumps. The Kalman filter trades off these two, yielding a smoother curve that still accords with the observations. When a missile cruises to its target, it’s a Kalman filter that keeps it on track. Without it, there would have been no man on the moon.
            
              Perhaps in a future cyber-court, in session somewhere on Amazon’s cloud, a robo-lawyer will beat the speeding ticket that RoboCop issued to your driverless car, all while you go to the beach, and Leibniz’s dream of reducing all argument to calculation will finally have come true.
            
              So if you give me a hundred numbers or so, that should be enough to re-create a picture of a face. Conversely, Robby’s brain should be able to take in a picture of a face and quickly reduce it to the hundred numbers that really matter. Machine learners call this process dimensionality reduction because it reduces a large number of visible dimensions (the pixels) to a few implicit ones (expression, facial features). Dimensionality reduction is essential for coping with big data—like the data coming in through your senses every second.
            
              Principal-component analysis (PCA), as this process is known, is one of the key tools in the scientist’s toolkit. You could say PCA is to unsupervised learning what linear regression is to the supervised variety. The famous hockey-stick curve of global warming, for example, is the result of finding the principal component of various temperature-related data series (tree rings, ice cores, etc.) and assuming it’s the temperature.
            
              Psychologists have found that personality boils down to five dimensions—extroversion, agreeableness, conscientiousness, neuroticism, and openness to experience—which they can infer from your tweets and blog posts.
            
              Applying PCA to congressional votes and poll data shows that, contrary to popular belief, politics is not mainly about liberals versus conservatives. Rather, people differ along two main dimensions: one for economic issues and one for social ones. Collapsing these into a single axis mixes together populists and libertarians, who are polar opposites, and creates the illusion of lots of moderates in the middle. Trying to appeal to them is an unlikely winning strategy. On the other hand, if liberals and libertarians overcame their mutual aversion, they could ally themselves on social issues, where both favor individual freedom.
            
              Children’s play is a lot more serious than it looks; if evolution made a creature that is helpless and a heavy burden on its parents for the first several years of its life, that extravagant cost must be for the sake of an even bigger benefit. In effect, reinforcement learning is a kind of speeded-up evolution—trying, discarding, and refining actions within a single lifetime instead of over generations—and by that standard it’s extremely efficient.
            
              More recently, a reinforcement learner from DeepMind, a London-based startup, beat an expert human player at Pong and other simple arcade games. It used a deep network to predict actions’ values from the console screen’s raw pixels. With its end-to-end vision, learning, and control, the system bore at least a passing resemblance to an artificial brain. This may help explain why Google paid half a billion dollars for DeepMind, a company with no products, no revenues, and few employees.
            
              In Isaac Asimov’s Foundation, the scientist Hari Seldon manages to mathematically predict the future of humanity and thereby save it from decadence. Paul Krugman, among others, has confessed that this seductive dream was what made him become an economist. According to Seldon, people are like molecules in a gas, and the law of large numbers ensures that even if individuals are unpredictable, whole societies aren’t. Relational learning reveals why this is not the case. If people were independent, each making decisions in isolation, societies would indeed be predictable, because all those random decisions would add up to a fairly constant average. But when people interact, larger assemblies can be less predictable than smaller ones, not more. If confidence and fear are contagious, each will dominate for a while, but every now and then an entire society will swing from one to the other. It’s not all bad news, though. If we can measure how strongly people influence each other, we can estimate how long it will be before a swing occurs, even if it’s the first one—another way in which black swans are not necessarily unpredictable.
            
              The Netflix Prize winner used metalearning to combine hundreds of different learners. Watson uses it to choose its final answer from the available candidates. Nate Silver combines polls in a similar way to predict election results.
            
              An even simpler metalearner is bagging, invented by the statistician Leo Breiman. Bagging generates random variations of the training set by resampling, applies the same learner to each one, and combines the results by voting. The reason to do this is that it reduces variance: the combined model is much less sensitive to the vagaries of the data than any single one, making this a remarkably easy way to improve accuracy.
            
              If the models are decision trees and we further vary them by withholding a random subset of the attributes from consideration at each node, the result is a so-called random forest. Random forests are some of the most accurate classifiers around.
            
              One of the cleverest metalearners is boosting, created by two learning theorists, Yoav Freund and Rob Schapire. Instead of combining different learners, boosting repeatedly applies the same classifier to the data, using each new model to correct the previous ones’ mistakes. It does this by assigning weights to the training examples; the weight of each misclassified example is increased after each round of learning, causing later rounds to focus more on it. The name boosting comes from the notion that this process can boost a classifier that’s only slightly better than random guessing, but consistently so, into one that’s almost perfect.
            
              You can download the learner I’ve just described from alchemy.cs.washington.edu.
            
              One of Alchemy’s largest applications to date was to learn a semantic network (or knowledge graph, as Google calls it) from the web. A semantic network is a set of concepts (like planets and stars) and relations among those concepts (planets orbit stars). Alchemy learned over a million such patterns from facts extracted from the web (e.g., Earth orbits the sun). It discovered concepts like planet all by itself. The version we used was more advanced than the basic one I’ve described here, but the essential ideas are the same. Various research groups have used Alchemy or their own MLN implementations to solve problems in natural language processing, computer vision, activity recognition, social network analysis, molecular biology, and many other areas.
            
              The effort to build what will ultimately become CanceRx is already under way. Researchers in the new field of systems biology model whole metabolic networks rather than individual genes and proteins. One group at Stanford has built a model of a whole cell. The Global Alliance for Genomics and Health promotes data sharing among researchers and oncologists, with a view to large-scale analysis. CancerCommons.org assembles cancer models and lets patients pool their histories and learn from similar cases. Foundation Medicine pinpoints the mutations in a patient’s tumor cells and suggests the most appropriate drugs. A decade ago, it wasn’t clear if, or how, cancer would ever be cured. Now we can see how to get there. The road is long, but we have found it.
            
              Your digital future begins with a realization: every time you interact with a computer—whether it’s your smart phone or a server thousands of miles away—you do so on two levels. The first one is getting what you want there and then: an answer to a question, a product you want to buy, a new credit card. The second level, and in the long run the most important one, is teaching the computer about you. The more you teach it, the better it can serve you—or manipulate you. Life is a game between you and the learners that surround you. You can refuse to play, but then you’ll have to live a twentieth-century life in the twenty-first. Or you can play to win. What model of you do you want the computer to have? And what data can you give it that will produce that model? Those two questions should always be in the back of your mind whenever you interact with a learning algorithm
            
              Companies like Acxiom collate and sell information about you, but if you inspect it (which in Acxiom’s case you can, at aboutthedata.com), it’s not much, and some of it is wrong.
            
              The kind of company I’m envisaging would do several things in return for a subscription fee. It would anonymize your online interactions, routing them through its servers and aggregating them with its other users’. It would store all the data from all your life in one place—down to your 24/7 Google Glass video stream, if you ever get one. It would learn a complete model of you and your world and continually update it. And it would use the model on your behalf, always doing exactly what you would, to the best of the model’s ability. The company’s basic commitment to you is that your data and your model will never be used against your interests. Such a guarantee can never be foolproof—you yourself are not guaranteed to never do anything against your interests, after all. But the company’s life would depend on it as much as a bank’s depends on the guarantee that it won’t lose your money, so you should be able to trust it as much as you trust your bank.