Like, for example, the fact that the size of a fruit fly’s brain—when the number of neurons is plotted logarithmically—lies almost exactly halfway between the human brain and no brain at all. “When I started my scientific career, I studied the brain of the fly,” says Poggio. Nowadays, investigating that space between “brain” and “no brain” is what drives Poggio, the Eugene McDermott Professor of Brain and Cognitive Sciences, as he directs the Center for Brains, Minds, and Machines (CBMM), a multi-institutional collaboration headquartered at MIT’s McGovern Institute for Brain Research.
CBMM’s mission, in Poggio’s words, is no less than “understanding the problem of intelligence—not only the engineering problem of building an intelligent machine, but the scientific problem of what intelligence is, how our brain works and how it creates the mind.” To Poggio, whose multidisciplinary background also includes physics, mathematics, and computer science, the question of how intelligence mysteriously arises out of certain arrangements of matter and not others “is not only one of the great problems in science, like the origin of the universe—it’s actually the greatest of all, because it means understanding the very tool we use to understand everything else: our mind.”
One of Poggio’s primary fascinations is the behavior of so-called “deep-learning” neural networks. These computer systems are very roughly modeled on the arrangement of neurons in certain regions of biological brains. A neural network is termed “deep” when it passes information among multiple layers of digital connections in between the input and the output. These hidden layers may number anywhere from the dozens to the thousands, and their unusual pattern-matching capabilities power many of today’s “artificial intelligence” applications—from the speech recognition algorithms in smartphones to the software that helps guide self-driving cars. “It’s intriguing to me that these software models, which are based on such a coarse approximation of neurons and have very few biologically based constraints, not only perform well in a number of difficult pattern-recognition problems—they also seem to be predicting some of the properties of actual neurons in the visual cortex of monkeys,” Poggio explains. The question is: why?
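The “depth” the article describes can be sketched in a few lines of plain Python. This is a toy, untrained network with random placeholder weights (nothing here comes from CBMM’s work); it only shows the structural idea: an input passing through several hidden layers before reaching the output.

```python
import random

def relu(v):
    # A common neuron activation: negative sums are clipped to zero
    return [max(0.0, x) for x in v]

def dense(v, weights, biases):
    """One layer: a weighted sum of the inputs plus a bias, through ReLU."""
    return relu([sum(w * x for w, x in zip(row, v)) + b
                 for row, b in zip(weights, biases)])

def forward(x, layers):
    """A network is 'deep' when several such layers sit between input and output."""
    for weights, biases in layers:
        x = dense(x, weights, biases)
    return x

random.seed(0)
sizes = [4, 8, 8, 8, 2]  # a 4-D input, three hidden layers, a 2-D output
layers = [([[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)],
           [0.0] * m)
          for n, m in zip(sizes, sizes[1:])]

out = forward([1.0, 0.5, -0.5, 2.0], layers)
print(len(out))  # 2
```

Real systems differ in scale and training, but the layered pass shown here is the architecture the theory papers address.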
The truth is, nobody knows—even as the technology of deep learning accelerates at an ever-quickening pace. “The theoretical understanding of these systems is lagging behind the application,” says Lorenzo Rosasco, a machine learning researcher who collaborates with Poggio at CBMM. To Poggio, this gap in fundamental theory is “pretty typical” for doing groundbreaking science. “People didn’t really understand at first why a battery works or what electricity is—it was just experimentally found,” he explains. “Then from studying it, there is a theory that develops, and this is what is important for further progress.”
What Coulomb and Ohm did for electricity, Poggio wants to do for deep neural networks: to begin defining a theory. He, Rosasco, and a dozen other CBMM collaborators recently published a set of three papers that does just that. The field of machine learning already has several decades’ worth of theoretical understanding applied to what Poggio calls “shallow” neural networks—generally, systems with only one layer in between the input and output. But deep-learning networks are much more powerful (as the latest tech-industry headlines readily confirm). “Basically there is no good theory for why deep networks work better than these one-layer networks,” Poggio says. Each of his three papers addresses one piece of that theoretical puzzle—from the technical details all the way up to their (in Poggio’s words) “philosophical” implications.
Breaking the Curse
The first paper in the trio has a disarmingly layman-friendly title: “Why and When Can Deep—but Not Shallow—Networks Avoid the Curse of Dimensionality: A Review.” This “curse” may sound like something J. K. Rowling might dream up if she were writing a physics textbook. But it’s actually a well-known mathematical thorn in the side of any researcher who’s had to tangle with large, complex sets of data—precisely the kind of so-called big data that deep-learning networks are increasingly being used to make sense of in science and industry.
“Dimensionality” refers to the number of parameters that a data point contains. A point in physical space, for example, exists in three dimensions defined by length, height, and depth. Many phenomena of interest to science, however—for example, gene expression in an organism or ecological interactions in an environment—generate data with thousands (or more) parameters for every point. “These parameters are like knobs or dials” on a complicated machine, says Poggio. To model these “high-dimensional” systems mathematically, equations are needed that can specify every possible state of every available “knob.” Mathematicians have proven that a one-layer neural network can—in theory—model any kind of system to any degree of accuracy, no matter how many of these dimensions (or “knobs”) it contains. There’s just one problem: “it will take an enormous amount of resources” in time and computing power, Poggio says.
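Poggio’s “knobs” make the cost concrete. To tabulate every combination of settings for d knobs with k positions each, you need k to the power d entries, and that count explodes as dimensions are added (the numbers below are illustrative, not drawn from any particular system):

```python
k = 10  # positions per knob (a hypothetical value for illustration)
for d in (3, 10, 100):
    # k**d combinations must be covered to model the system exhaustively
    print(d, k ** d)
```

Three knobs give a thousand combinations; a hundred knobs give 10^100, more combinations than there are atoms in the observable universe. This exponential blow-up is the “enormous amount of resources” a one-layer network runs into.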
Deep neural networks, however, seem to be able to escape this “curse of dimensionality” under certain conditions. Take image-classifying software, for example. A deep neural network trained to detect the image of a school bus in a 32-by-32 grid of pixels would be considered primitive by contemporary standards—after all, smartphone apps can routinely recognize faces in photos containing millions of pixels. And yet the number of parameters, or “knobs,” in even that 32-by-32 pixel grid is astronomical: “a one followed by a thousand zeros,” says Poggio. Why can deep-learning networks handle such seemingly intractable tasks with aplomb?
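Poggio’s “one followed by a thousand zeros” is easy to check. Assuming, purely for illustration, that each pixel takes one of ten gray levels (the article does not specify the pixel depth), the number of distinct 32-by-32 images is 10 to the power 1024:

```python
pixels = 32 * 32   # 1024 pixels in the grid
levels = 10        # hypothetical gray levels per pixel
images = levels ** pixels  # every distinct image the grid can display
print(pixels)                 # 1024
print(len(str(images)) - 1)   # 1024 zeros after the leading 1
```

That is a one followed by 1024 zeros, in line with the quoted figure; other pixel depths change the base but not the astronomical character of the count.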
To Poggio and Rosasco (who co-authored the first paper with colleagues from the California Institute of Technology and Claremont Graduate University), the answer may reside in a special set of mathematical relationships called compositional functions.
A function is any equation that transforms an input to an output: for example, f(x) = 2x means “for any number given as an input, the output will be double that number.” A compositional function behaves the same way, except that instead of using numbers as inputs, it uses other functions—creating a structure that resembles a tree, with functions composed from other functions, and so on.
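The idea above can be shown directly in code: composition builds a new function whose input is the output of another function (the small helper functions here are invented examples, echoing the f(x) = 2x from the text):

```python
def double(x):
    return 2 * x      # f(x) = 2x, as in the text

def inc(x):
    return x + 1      # a second, hypothetical constituent function

def compose(f, g):
    """Return the function that feeds g's output into f."""
    return lambda x: f(g(x))

h = compose(double, inc)  # h(x) = 2 * (x + 1)
print(h(3))  # 8
```

Nesting `compose` calls inside one another produces exactly the tree of functions-built-from-functions the paragraph describes.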
The mathematics of this tree can become incredibly complicated. But, significantly, the hierarchical structure of compositional functions mirrors the architecture of deep neural networks—a dense web of layered connections. And it just so happens that computational tasks that involve classifying patterns composed of constituent parts—like recognizing the features of a school bus or a face in an array of pixels—are described by compositional functions, too. Something about this hand-in-glove “fit” among the structures of deep neural networks, compositional functions, and pattern-recognition tasks causes the curse of dimensionality to disappear.
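That hierarchical “fit” can be made concrete with a toy compositional function evaluated as a binary tree: each level merges neighboring values with a small two-input function, building wholes from parts the way successive network layers do. The combining rule below is invented for illustration; in a trained network, each level’s rule would be learned.

```python
def combine(a, b):
    # A hypothetical two-input constituent function
    return max(a, b) + 0.1 * min(a, b)

def tree_eval(values):
    """Evaluate a binary tree of two-input functions, pairing neighbors
    at each level -- one level of the tree per 'layer' of processing."""
    while len(values) > 1:
        values = [combine(values[i], values[i + 1])
                  for i in range(0, len(values), 2)]
    return values[0]

print(tree_eval([1, 2, 3, 4, 5, 6, 7, 8]))
```

Eight inputs collapse to one output in three levels; doubling the inputs adds only one more level, which is the kind of efficiency a matched deep architecture can exploit.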
Not only does Poggio’s theory provide a roadmap for what kinds of problems deep-learning networks are ideally equipped to solve—it also sheds light on what kinds of tasks these networks probably won’t handle especially well. In an age when “artificial intelligence” is often hyped as a technological panacea, Poggio’s work demystifies neural networks. “There’s often a suggestion that there is something ‘magical’ in the way deep-learning systems can learn,” says Rosasco. “This paper is basically saying, ‘Okay, there are also some other theoretical considerations that actually seem to be able to, at least qualitatively, make sense of this.’” In other words, if a complicated task or problem can be described using compositional functions, a deep neural network may be the best computational tool to approach it with. But if the problem’s complexity doesn’t match the language of compositional functions, neural networks won’t “magically” handle it any better than other computer architectures will.
Poggio’s other two theoretical papers also use clever mathematics to attempt to bring some other “magical”-seeming features of deep neural networks down to earth. The second paper uses a result from algebraic geometry called Bézout’s theorem to explain how these networks can be successfully trained (or “optimized”) using what conventional statistical practice would deem low-quality data. The third explains why deep-learning systems, once trained, are able to make relatively accurate predictions about data they haven’t been exposed to before, using a method that Poggio likens to a machine-learning version of Occam’s razor (the philosophical principle that simpler explanations for a phenomenon are more likely to be true than complicated ones).
For Poggio, the implications of these theories raise “some interesting philosophical questions” about the similarities between our own brains and the deep neural networks that “crudely” (in his words) model them. The fact that both deep-learning networks and our own cognitive machinery seem to “prefer” processing compositional functions, for example, strikes Poggio as more than mere coincidence. “For certain problems like vision, it’s kind of obvious that you can recognize objects and then put them together in a scene,” he says. “Text and speech have this structure, too. You have letters, you have words, then you compose words in sentences, sentences in paragraphs, and so on. Compositionality is what language is.” If deep neural networks and our own brains are “wired up” in similar ways, Poggio says, “then you would expect our brains to do well with problems that are compositional”—just as deep-learning systems do.
Can a working theory of deep neural networks begin to crack the puzzle of intelligence itself? “Success stories in this area are not that many,” admits Rosasco. “But Tommy [Poggio] is older and braver than me, so I decided, ‘Yeah, I’ll follow him into it.’” Speaking for himself, Poggio certainly sounds like an enthusiastic pioneer. “You want a theory for two reasons,” he asserts. “One is basic curiosity: Why does it work? The second reason is hope: that it can tell you where to go next.”