From the earliest days of computing, computer scientists believed computers would one day rival human intelligence. In 1949, four years before IBM shipped its first commercial computer, the renowned mathematician Alan Turing predicted, “I do not see why it [the machine] should not enter any one of the fields normally covered by the human intellect, and eventually compete on equal terms. I do not think you can even draw the line about sonnets. . .” (Turing, an Englishman, was perhaps referring to the 154 Shakespearean sonnets considered among the greatest works in English literature.)
But for the next sixty years, progress in Artificial Intelligence was slow. By 1957, the powerful IBM 704 computer could only play beginner-level chess. Taking eight minutes to calculate each move, the IBM 704 perhaps defeated its opponents through sheer boredom.
It would be another forty years, in 1997, before IBM’s Deep Blue supercomputer defeated World Chess Champion Garry Kasparov in a six-game chess match held under standard rules.
As impressive as its win over Kasparov was, Deep Blue played chess much as its IBM 704 predecessor had in 1957. While the IBM 704 looked two moves ahead (each move comprising one White and one Black half-move), Deep Blue typically looked six to eight moves ahead. That may not sound like much of an improvement, but the number of positions to evaluate grows exponentially with each additional move. Deep Blue analyzed 200 million chess positions a second, allowing it to evaluate 36 billion board positions in the three minutes allotted for a chess player’s turn.
Kasparov resented Deep Blue’s robotic approach. “Deep Blue,” Kasparov later wrote, “was intelligent the way your programmable alarm clock is intelligent. Not that losing to a $10 million alarm clock made me feel any better.”
Kasparov’s sentiments were understandable. Still, Deep Blue’s win over the world’s best chess player was a significant milestone in the evolution of Artificial Intelligence.
Another very public milestone in Artificial Intelligence came in 2011, when IBM’s Watson defeated Jeopardy’s top champions, Brad Rutter and Ken Jennings. Millions of Jeopardy fans, cheering for Rutter and Jennings, were astonished: for decades, Jeopardy had been regarded as the most challenging quiz show on television.
But Watson was undeniably impressive, deftly responding to natural-language questions (what Jeopardy calls “answers”) filled with puns, wordplay, and subtle associations. During the two-day match, Watson answered sixty-six questions correctly and nine incorrectly, winning $77,147 compared to its rivals’ $45,600 in combined winnings.
Watson’s impressive performance, though, was not quite what it appeared to be. For five years, a team of IBM computer scientists had labored on a massive effort to build a system specifically to win at Jeopardy. The work was tedious: downloading and indexing millions of articles from the internet sources IBM knew Jeopardy drew its questions from.
Implicitly acknowledging it had constructed a database specifically for Jeopardy, IBM later declared, “Watson’s main innovation centered on its ability to quickly execute hundreds of algorithms to simultaneously analyze a question from many directions…” In short, Watson’s breakthrough was understanding the question, not answering it.
IBM’s Deep Blue and Watson were each developed to play one game brilliantly: chess and Jeopardy, respectively. But both were savants, systems with exceptional skill in a specific area yet unable to function outside their narrow expertise.
To fulfill Alan Turing’s prediction that Artificial Intelligence would someday “compete on equal terms” with humans, a radical new approach would be needed.
In 1951, Marvin Minsky published his master’s thesis at Princeton. In it, Minsky proposed that networks of simple processing units—modeled after biological neurons—could simulate human learning. Unlike step-by-step computer programs, Minsky’s artificial neural networks could learn by trial and error, similar to how biological organisms adapt to their environment.
To demonstrate his theory, Minsky built a simple neural network from electrical components that simulated how a rat, through trial and error, learned to navigate a maze. Minsky’s neural networks would become the key to today’s powerful Artificial Intelligence.
For decades after Minsky’s seminal thesis, research in neural networks proceeded in fits and starts. It wasn’t until the late 1980s that neural networks, programmed to run on conventional computers, solved a useful problem: optical character recognition.
Humans can typically recognize written letters and numbers regardless of how poorly they are written. This, though, is extremely difficult for a computer programmed to solve problems step by step. But artificial neural networks, like those in our biological brains, excel at this seemingly simple but actually quite complex task.
In the figure above, the number “9” is represented within a 28 by 28 image grid where each of the 784 (28 x 28) grid elements is a number from zero to 100 based on the brightness of that particular grid element. A totally black element would have a value of zero while a totally white element would be 100 with gradations in between. A medium gray element, for instance, might have a value of 50.
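The pixel encoding just described can be sketched in a few lines of Python. Everything here follows the article’s 0-to-100 brightness scale; the 0-to-255 starting range is an assumption, since image files commonly store grayscale that way:

```python
def to_brightness(pixel):
    """Rescale a 0-255 grayscale value to the 0-100 scale described above."""
    return round(pixel * 100 / 255)

def flatten_grid(grid):
    """Flatten a 28 by 28 grid into 784 input values,
    ordered from the top-left element to the bottom-right."""
    return [to_brightness(p) for row in grid for p in row]

# A tiny example: an all-black grid with one fully white pixel at top left.
grid = [[0] * 28 for _ in range(28)]
grid[0][0] = 255

inputs = flatten_grid(grid)
print(len(inputs), inputs[0], inputs[1])  # → 784 100 0
```

A medium-gray pixel (128 on the 0-255 scale) lands at 50, matching the example in the text.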
The four columns on the right represent a simplified neural network. The leftmost column, the input column, consists of 784 data cells. Each cell contains the value of a specific grid element ordered from the top left to the bottom right element of the image grid.
The data cells in the rightmost column are output cells. Each cell represents a digit from zero to nine. The value placed in an output cell is the neural network’s numerical estimate (from zero to 100) that the digit in the image grid is the one that cell represents.
The two middle columns, known as “layers” to computer scientists, are the heart of the neural network. Let’s call the first layer the shapes layer and the second layer the assembly layer. Each cell in the two layers is an artificial neuron.
Together, neurons in the shapes layer identify simple shapes within the image grid: a diagonal line on the right, an oval at the bottom, a vertical line on the left, a curved line in the middle, for example. These shapes constitute the parts of a digit that might be located in the image grid. For example, a “9” consists of an oval connected on the right to either a vertical or diagonal line which may be either straight or gently curved.
The output of the shapes layer—lots of bits and pieces—is then evaluated by the assembly layer which attempts to assemble a fully formed digit from the shapes identified in the shapes layer.
The ten output cells summarize the data collected in the assembly layer. The output cell gathering the strongest responses from the assembly layer determines which digit the neural network believes is located in the image grid.
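The four columns can be sketched as code. The sizes below follow the text (784 inputs, a shapes layer, an assembly layer, ten output cells), but the width of sixteen neurons per middle layer, the random starting weights, and the sigmoid squashing function are illustrative assumptions; this is an untrained sketch, not a working digit recognizer:

```python
import math
import random

random.seed(0)  # reproducible "random" starting weights

def make_layer(n_inputs, n_neurons):
    """One layer: each neuron gets a weight per input, plus a bias."""
    return [([random.uniform(-0.01, 0.01) for _ in range(n_inputs)],
             random.uniform(-0.01, 0.01))
            for _ in range(n_neurons)]

def forward(layer, inputs):
    """Each neuron: weighted sum of its inputs, plus bias, squashed to (0, 1)."""
    return [1 / (1 + math.exp(-(sum(w * x for w, x in zip(weights, inputs)) + bias)))
            for weights, bias in layer]

shapes_layer   = make_layer(784, 16)   # finds bits and pieces of digits
assembly_layer = make_layer(16, 16)    # assembles pieces into digit candidates
output_layer   = make_layer(16, 10)    # one output cell per digit, 0 through 9

pixels = [random.randint(0, 100) for _ in range(784)]  # a stand-in image
out = forward(output_layer,
              forward(assembly_layer,
                      forward(shapes_layer, pixels)))
guess = max(range(10), key=lambda d: out[d])  # strongest output cell wins
print(guess)
```

Before training, the guess is essentially random; the point is only the shape of the data flow, input cells to shapes layer to assembly layer to output cells.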
How are the values in the shapes and assembly layers calculated? Even for the seemingly simple case of ten digits, recognizing a single digit is complex since both its shape and location within the image grid can vary tremendously.
The calculations made in the shapes and assembly layers are determined through a process of “training.” In neural networks, training replaces the tedious, manual programming done in conventional computers. Today’s commercial AI systems are trained on massive datasets—encyclopedias, books, legal texts, movie scripts, and much more—swept up from the internet.
Our simple neural network, though, can be trained using a standardized, publicly available dataset: MNIST. This off-the-shelf database contains 70,000 images of handwritten digits. Each image is a 28 by 28 grayscale grid similar to our sample above.
For our simple neural network, training consists of submitting each of the 70,000 images to the neural network, assessing the output cells for accuracy, and then making incremental adjustments to the neurons in the shapes and assembly layers to increase the network’s accuracy.
Each neuron in the shapes layer is connected to all the input cells. The influence each of the 784 inputs has on a neuron is controlled by adjusting that input’s weight. Like a water faucet, an input’s weight controls how much of the input “flows” into the neuron. The neuron’s 784 weighted inputs are then summed. The sum is next shifted by a bias value and passed through an activation function, which further shapes the neuron’s output. That output is then routed to the assembly layer.
A similar process is followed by the neurons in the assembly layer.
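The faucet arithmetic for a single neuron can be written out explicitly. The numbers are invented purely to show the mechanics, with three inputs standing in for 784, and a sigmoid as the activation function (one common choice; the article does not commit to a specific one):

```python
import math

inputs  = [80, 10, 55]       # brightness values from three input cells
weights = [0.4, -0.2, 0.1]   # how much of each input "flows" into the neuron
bias    = -30                # shifts the neuron's threshold up or down

# Weighted sum of the inputs, plus the bias:
# 0.4*80 + (-0.2)*10 + 0.1*55 - 30 = 32 - 2 + 5.5 - 30 = 5.5
weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias

# The sigmoid activation squashes the sum into the range (0, 1).
output = 1 / (1 + math.exp(-weighted_sum))

print(round(weighted_sum, 1), round(output, 3))  # → 5.5 0.996
```

Turning any one “faucet” (weight), or shifting the bias, changes the output; training is nothing more than making those adjustments systematically.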
For each image, this iterative process may repeat hundreds of times until the neural network has “learned” the image. This process is repeated for each of the 70,000 training images. Once the neural network has “learned” all the training images, it can quickly and accurately identify handwritten digits scanned from ZIP codes, tax forms, and the like.
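A toy version of this assess-and-adjust loop, shrunk to a single neuron with two inputs, might look like the sketch below. The update rule here is the simple delta rule; real multi-layer networks like the one described above are trained with backpropagation, which extends the same idea across layers. The task and numbers are invented:

```python
import math
import random

random.seed(1)

# Toy task: output 1 when the first input is bright, 0 when the second is.
examples = [([1.0, 0.0], 1), ([0.9, 0.2], 1),
            ([0.1, 0.8], 0), ([0.0, 1.0], 0)]

weights = [random.uniform(-0.5, 0.5) for _ in range(2)]
bias = 0.0
rate = 0.5  # how large each incremental adjustment is

def predict(x):
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 / (1 + math.exp(-s))          # sigmoid output in (0, 1)

for _ in range(1000):                      # repeat the loop many times
    for x, target in examples:
        error = target - predict(x)        # assess the output for accuracy
        for i in range(len(weights)):      # nudge each weight toward the target
            weights[i] += rate * error * x[i]
        bias += rate * error

print([round(predict(x), 2) for x, _ in examples])
```

After enough passes, the predictions settle close to the targets (near 1, 1, 0, 0): the neuron has “learned” the examples without ever being programmed step by step.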
It’s this iterative training process that gives today’s Artificial Intelligence its remarkable power to learn, whether the subject is simple digits, the meaning of a Shakespearean sonnet, or even an entire language.
Confused? That’s not surprising. It took computer scientists 75 years to develop Marvin Minsky’s original theory into today’s powerful Artificial Intelligence. I’ll discuss this evolution over the next few weeks.