March 2017

Common Sense, the Turing Test, and the Quest for Real AI

The long tail and the limits to training.

By Hector J. Levesque

In trying to make sense of intelligent behavior, it is tempting to try something like this: We begin by looking at the most common cases of the behavior we can think of, and figure out what it would take to handle them. Then we build on it: We refine our account to handle more and more. We stop when our account crosses some threshold and appears to handle (say) 99.9 percent of the cases we are concerned with.

This might be called an engineering strategy. We produce a rough form of the behavior we are after, and then we engineer it to make it work better, handle more. We see this quite clearly in mechanical design. Given a rocket with a thrust of X, how can it be refined to produce a thrust of Y? Given a bridge that will support load X, how can it be bolstered to support load Y?

This engineering strategy does work well with a number of phenomena related to intelligent behavior. For example, when learning to walk, we do indeed start with simple, common cases, like walking on the floor or on hard ground, and eventually graduate to walking on trickier surfaces like soft sand and ice. Similarly, when learning a first language, we start by listening to baby talk, not the latest episode of The McLaughlin Group (or the Dana Carvey parody, for that matter). 

Step by Step: Learning is easier when one starts with simple concepts—much like learning to walk is easiest when one starts on a smooth, flat surface.
Photo: Donnie Ray Jones

There are phenomena where this engineering strategy does not work well at all, however. We call a distribution of events long-tailed if there are events that are extremely rare individually but remain significant overall. These rare events are what Nassim Taleb calls “black swans.” (At one time, it was believed in Europe that all swans were white.) If we start by concentrating on the common cases, and then turn our attention to the slightly less common ones, and so on, we may never get to see black swans, which can be important nonetheless. From a purely statistical point of view, it might appear that we are doing well, but in fact, we may be doing quite poorly. 

A numeric example

To get a better sense of what it is like to deal with a long-tailed phenomenon, it is useful to consider a somewhat extreme imaginary example involving numbers.

Suppose we are trying to estimate the average of a very large collection of numbers. To make things easy, I am going to start by giving away the secret about these numbers: There are a trillion of them in the collection, and the average will turn out to be 100,000. But most of the numbers in the collection will be very small. The reason the average is so large is that there will be 1,000 numbers among the trillion whose values are extremely large, in the range of 100 trillion. (I am using the American names here: “trillion” means 10¹² and “billion” means 10⁹.)
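
As a quick back-of-the-envelope check of that setup, here is the arithmetic as a small Python sketch, under the assumption (true by construction here) that the trillion small numbers contribute almost nothing to the total:

```python
trillion = 10**12

# The 1,000 very large numbers, each around 100 trillion, dominate the sum;
# the remaining (tiny) numbers are treated as contributing roughly nothing.
total_from_rare = 1_000 * (100 * trillion)
print(f"{total_from_rare / trillion:,.0f}")   # 100,000
```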

Let us now suppose we know nothing of this, and that we are trying to get a sense of what a typical number is by sampling. The first 10 numbers we sample from the collection are the following:

2, 1, 1, 54, 2, 1, 3, 1, 934, 1.

The most common number here is 1. The median (that is, the midpoint number) also happens to be 1. But the average does appear to be larger than 1. In looking for the average, we want to find a number where some of the numbers are below and some are above, but where the total differences below are balanced by the total differences above. We can calculate a “sample average” by adding all the numbers seen so far and dividing by the number of them. After the first five numbers, we get a sample average of 12. We might have thought that this pattern would continue. But after 10 numbers, we see that 12 as the average is too low; the sample average over the 10 numbers is in fact 100. There is only one number so far that is larger than 100, but it is large enough to compensate for the other nine that are much closer to 100.
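
Here is that running calculation written out as a small Python sketch (just the arithmetic from the paragraph above):

```python
samples = [2, 1, 1, 54, 2, 1, 3, 1, 934, 1]

total = 0
for i, x in enumerate(samples, start=1):
    total += x
    if i in (5, 10):
        print(f"sample average after {i} numbers: {total / i:g}")

# sample average after 5 numbers: 12
# sample average after 10 numbers: 100
```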

But we continue to sample to see if this guess is correct. For a while, the sample average hovers around 100. But suppose that after 1,000 sample numbers, we get a much larger one: 1 million. Once we take this number into account, our estimate of the overall average now ends up being more like 1,000.

Imagine that this pattern continues: The vast majority of the numbers we see look like the first 10 shown above, but every thousand samples or so, we get a large one (in the range of 1 million). After seeing 1 million samples, we are just about ready to stop and announce the overall average to be about 1,000, when suddenly we see an even larger number: 10 billion. This is an extremely rare occurrence. We have not seen anything like it in 1 million tries. But because the number is so big, the sample average now has to change to 10,000. Because this was so unexpected, we decide to continue sampling, until finally, after having seen a total of 1 billion samples, the sample average appears to be quite steady at 10,000.
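
To make those milestones concrete, here is a minimal sketch of the arithmetic, assuming the ordinary draws are worth about 100 each and the rare values arrive at exactly the stated rates (the figures in the text are rounded):

```python
def sample_average(counts):
    """counts maps a value to how many times that value has been drawn."""
    total = sum(value * n for value, n in counts.items())
    return total / sum(counts.values())

million, billion = 10**6, 10**9

# Roughly a million draws: small numbers worth about 100 each, about a
# thousand draws of a million, and the single draw of ten billion.
print(f"{sample_average({100: million, million: 1_000, 10 * billion: 1}):,.0f}")
# ~11,000 -- the "about 10,000" in the text

# Roughly a billion draws, with the rare values arriving at the same rates.
print(f"{sample_average({100: billion, million: million, 10 * billion: 1_000}):,.0f}")
# ~11,000 -- still "about 10,000"
```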

This is what it is like to sample from a distribution with a long tail. For a very long time, we might feel that we understand the collection well enough to be confident about its statistical properties. We might say something like this:

Although we cannot calculate properties of the collection as a whole, we can estimate those properties by sampling. After extensive sampling, we see that the average is about 10,000, although in fact, most of the numbers are much smaller, well under 100. There were some very rare large outliers, of course, but they never exceeded 10 billion, and a number of that size was extremely rare, a once-in-a-million occurrence. This has now been confirmed with over 1 billion test samples. So we can confidently talk about what to expect.

But this is quite wrong! The problem with a long-tailed phenomenon is that the longer we looked at it, the less we understood what to expect. The more we sampled, the bigger the average turned out to be. Why should we think that stopping at 1 billion samples will be enough?
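
One way to feel this effect is to simulate it. The sketch below is a scaled-down toy version, not the chapter's exact setup: the rates and values are my own smaller ones, chosen so it runs in a few seconds. Most draws are tiny, about one in a thousand is worth a million, and about one in a hundred thousand is worth ten billion; the running average tends to look settled for long stretches and then jump whenever a rare draw finally lands.

```python
import random

random.seed(0)   # fix the run; different seeds place the rare draws differently

total = 0
for i in range(1, 1_000_001):
    r = random.random()
    if r < 1 / 100_000:            # the rarest, largest draws
        total += 10_000_000_000
    elif r < 1 / 1_000:            # rare large draws
        total += 1_000_000
    else:                          # the common case: a small number
        total += random.randint(1, 5)
    if i in (10, 1_000, 100_000, 1_000_000):
        print(f"after {i:>9,} draws the sample average is about {total / i:,.0f}")
```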

To see these numbers in more vivid terms, imagine that we are considering using some new technology that appears to be quite beneficial. The technology is being considered to combat some problem that kills about 36,000 people per year. (This is the number of traffic-related deaths in the United States in 2012.) We cannot calculate exactly how many people will die once the new technology is introduced, but we can perform some simulations and see how they come out. Suppose that each number sampled above corresponds to the number of people who will die in one year with the technology in place (in one run of the simulation). The big question here is: Should the new technology be introduced?


According to the sampling above, most of the time, we would get well below 100 deaths per year with the new technology, which would be phenomenally better than the current 36,000. In fact, the simulations show that 99.9 percent of the time, we would get well below 10,000 deaths. So that looks terrific. Unfortunately, the simulations also show us that there is a one-in-a-thousand chance of getting 1 million deaths, which would be indescribably horrible. And if that were not all, there also appears to be a one-in-a-million chance of wiping out all of humanity. Not so good after all!

Someone might say:

We have to look at this realistically, and not spend undue time on events that are incredibly rare. After all, a comet might hit the Earth too! Forget about those black swans; they are not going to bother us. Ask yourself: What do we really expect to happen overall? Currently 36,000 people will die if we do not use the technology. Do we expect to be better off with or without it?

This is not an unreasonable position. We cannot live our lives by considering only the very worst eventualities. The problem with a long-tailed phenomenon is getting a sense of what a more typical case would be like. But just how big is a typical case? Half of the numbers seen were below 10, but this is misleading because the other half was wildly different. Ninety-nine percent of the numbers were below 1,000, but the remaining 1 percent was wildly different. Just how much of the sampling can we afford to ignore? The sample average has the distinct advantage of representing the sampling process as a whole. Not only are virtually all the numbers below 10,000, but the total weight of all the numbers above 10,000, big as they were, is matched by the total weight of those below. And 10,000 deaths as a typical number is still much better than 36,000 deaths.

But suppose, for the sake of argument, that after 1 billion runs, the very next number we see is one of the huge ones mentioned at the outset: 100 trillion. (It is not clear that there can be a technology that can kill 100 trillion things, but let that pass.) Even though this is a one-in-a-billion chance, because the number is so large, we have to adjust our estimate of the average yet again, this time to be 100,000. And getting 100,000 deaths would be much worse than 36,000 deaths.
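
The adjustment itself is a single line of arithmetic, using round figures: a billion draws whose running total averaged about 10,000, plus the one enormous draw.

```python
billion, trillion = 10**9, 10**12

previous_total = 10_000 * billion             # a billion draws averaging ~10,000
new_total = previous_total + 100 * trillion   # plus the single draw of 100 trillion
print(f"{new_total / (billion + 1):,.0f}")    # ~110,000 -- the "about 100,000" in the text
```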

This, in a nutshell, is the problem of trying to deal in a practical way with a long-tailed phenomenon. If all of your expertise derives from sampling, you may never get to see events that are rare but can make a big difference overall.


From Common Sense, the Turing Test, and the Quest for Real AI by Hector J. Levesque, published by the MIT Press.

Lead image courtesy of Mopic / Shutterstock
