Chapter 2 · The counting era

A longer memory

The previous machine remembered only the last letter you typed. Let's give it more memory and see how far it goes.

12 min · keep counting

01 Looking further back

In the last chapter we left our machine trapped in a perpetual present. The bigram model could only remember the last character it had read or written. The moment it moved on to the next one, it forgot everything that came before.

That was exactly the source of many of its mistakes: situations that to us are different looked, to the machine, exactly the same.

How can we give it a little more context? The simplest idea is to let it look a little further back. Does that change its bets?

A little more context, and things change quite a bit.

From now on, so we don't keep saying "a two-letter window" or "a three-letter window", we'll call the window size N: the number of previous characters the machine can see when it places its bet. Our old bigram model was simply a model with N=1. Let's see how to build one with N=2.

02 Boxes within boxes

If we want to give the machine a two-character memory, we have to think about how to adapt the tool we built in the last chapter. The question is how much we need to change.

Do we need a new machine? Or can we reuse the one we already have? Let's see it with a small example.

If you watch step by step, the counting mechanic is identical. The only thing changing here is that now we record pairs instead of single letters. That looks like a minor detail, but it has an enormous impact on the size of what we're building. Let's see it on a real Shakespeare text and watch how the table grows, starting with just the letter "a".

That number doesn't come out of nowhere. To understand where it comes from, let's set the whole table aside and look at a much smaller part of it.

Put one table next to the other. This whole block takes up exactly the same space as our old bigram model. It's as if each letter needed its own bigram; and since there are 27 possible letters, we need 27 copies. That's why the table grows like this.
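The arithmetic behind that growth can be sketched in a few lines. This is only an illustration: it assumes a dense table with one row per possible context and one column per possible next character, it takes the figure of 27 symbols from the text, and the name `table_cells` is made up for the example.

```python
ALPHABET_SIZE = 27  # taken from the text; the exact symbol set is an assumption


def table_cells(n: int, alphabet_size: int = ALPHABET_SIZE) -> int:
    """Cells in a dense count table for a window of n characters.

    One row per possible context (alphabet_size ** n) and
    one column per possible next character (alphabet_size).
    """
    rows = alphabet_size ** n
    return rows * alphabet_size


bigram_cells = table_cells(1)    # 27 * 27
two_char_cells = table_cells(2)  # 27 * 27 * 27

# The N=2 table is exactly 27 copies of the N=1 table.
assert two_char_cells == 27 * bigram_cells
```

Under these assumptions, every extra character of memory multiplies the table by the alphabet size again.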

And what happens if we keep widening the window?

When we raise the window size to N=3 or N=4, all we're doing is adding more rows to the same structure. The model we designed in the first chapter wasn't useless; it simply lacked the room to hold a little more context. We've confirmed the architecture scales. Now let's see whether it really writes better, and whether the extra size is worth it.
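To make "the same structure, with more rows" concrete, here is a minimal sketch of the counting step for any window size. The function name `build_counts` and the sample sentence are invented for illustration, and unlike the full table described above, this dictionary only stores the rows that actually appear in the text.

```python
from collections import Counter, defaultdict


def build_counts(text: str, n: int) -> dict[str, Counter]:
    """Count which character follows each n-character context in text."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for i in range(len(text) - n):
        context = text[i:i + n]   # the window the machine remembers
        next_char = text[i + n]   # the letter that came right after it
        counts[context][next_char] += 1
    return counts


sample = "to be or not to be"
one_char = build_counts(sample, 1)  # the old bigram table
two_char = build_counts(sample, 2)  # same loop, longer keys

print(one_char["t"])   # what followed "t"
print(two_char["to"])  # what followed "to"
```

The only thing that changes between the two calls is the length of the key; the loop is identical.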

03 Generation

We already have our table full of counts and percentages drawn from Shakespeare. Now comes the interesting part: using that information to write new text. How exactly does a machine that only knows how to read numbers off a table manage that?

If the interface looked familiar, it's because you're seeing exactly the same process as before: we look up the current pair in our big table, turn its counts into percentages, and roll the virtual die to pick the next letter. The loop starts again with no new rule.
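That loop can be sketched roughly like this. It is one possible version, not the code behind the demo: `build_counts` and `generate` are hypothetical names, the training sentence is a stand-in for the Shakespeare text, and the "virtual die" is assumed to be a weighted random choice.

```python
import random
from collections import Counter, defaultdict


def build_counts(text, n):
    """Count which character follows each n-character context."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n):
        counts[text[i:i + n]][text[i + n]] += 1
    return counts


def generate(counts, seed, length, rng):
    """Extend seed one character at a time using the count table."""
    n = len(seed)  # the window size is taken from the seed
    out = seed
    for _ in range(length):
        row = counts.get(out[-n:])  # look up the current context
        if not row:                 # context never seen: nothing to sample
            break
        chars = list(row.keys())
        weights = list(row.values())  # counts work as unnormalised percentages
        out += rng.choices(chars, weights=weights, k=1)[0]  # roll the die
    return out


text = "to be or not to be that is the question"
counts = build_counts(text, 2)
result = generate(counts, "to", 30, random.Random(0))
print(result)
```

Passing a seeded `random.Random` keeps the run repeatable; a larger N only changes the length of the key that gets looked up.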

And since this same logic works for any window size, we can set several models racing at once and really see the difference.

If you stop and think about it, it's pretty astonishing. We started from nothing: a blank machine that didn't know a single rule of grammar. Using nothing but statistics and letter counts, it has ended up writing sentences that fool the eye. The machine isn't thinking, but it certainly works.

Seeing such an obvious improvement from a bigram to a six-letter model, intuition tells us the next step. If we want a perfect model, we just have to raise N. With a 40-letter memory window (enough to remember a whole sentence) we should have a system capable of writing long texts without slipping. Let's try to build it.

04 The price of memory

To build a model with N=40 we just follow our own rule: take the size of the previous model and multiply it by 27. Let's nudge the dial up little by little.

Exponential growth is relentless. There comes a point where the table is physically impossible to store on any computer on the planet.
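To get a feel for how fast that multiplication runs away, here is a small sketch that counts the rows for a few window sizes. It assumes the 27-symbol alphabet used in the text; `rows_for_window` is a hypothetical helper name.

```python
ALPHABET_SIZE = 27  # from the text


def rows_for_window(n: int) -> int:
    """Number of distinct contexts (rows) for a window of n characters."""
    return ALPHABET_SIZE ** n


for n in (1, 2, 3, 6, 10, 20, 40):
    rows = rows_for_window(n)
    # len(str(rows)) - 1 is the power of ten just below the row count.
    print(f"N={n:>2}: about 10^{len(str(rows)) - 1} rows")
```

For N=40 the row count is a number with 58 digits, and each of those rows still needs 27 cells.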

Fine, size is a problem. But storage always gets cheaper. Let's imagine for a moment that we have infinite hard drives and can store whatever table we want.

And notice what we're asking when we store it. With a forty-letter window (a whole sentence), each row no longer answers "which letter follows th"; it answers something far more ambitious: what comes after this entire sentence. The key to each row is, practically, a whole sentence.

Even so, even without the physical limit, the method has a far more serious flaw.

After reading all the text on the Internet, the vast majority of the table is still empty. And it makes sense that it is: there are letter combinations that simply never go together because they mean nothing. If they never appear in the real world, it seems like we shouldn't care that their cell is a zero. What, exactly, is the problem with the table being empty?
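One rough way to see why the table stays empty: a text can fill at most one row per window it contains. The sketch below puts an upper bound on the filled share of the table. The corpus size is a made-up round number, not a measurement of the Internet, and `max_filled_fraction` is a hypothetical name.

```python
ALPHABET_SIZE = 27       # from the text
CORPUS_CHARS = 10 ** 15  # hypothetical corpus size, chosen only for illustration


def max_filled_fraction(corpus_chars: int, n: int) -> float:
    """Upper bound on the share of rows that a corpus can touch.

    A text of corpus_chars characters contains at most corpus_chars - n
    windows of length n that are followed by another character, so it
    cannot fill more rows than that.
    """
    total_rows = ALPHABET_SIZE ** n
    windows = max(corpus_chars - n, 0)
    return min(windows, total_rows) / total_rows


for n in (2, 6, 12, 40):
    print(f"N={n:>2}: at most {max_filled_fraction(CORPUS_CHARS, n):.3g} of the rows")
```

This is only a ceiling: real text repeats itself constantly, so the true share is smaller still.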

Try it yourself: type a perfectly ordinary sentence and watch which row it looks for.

This is the problem with having an empty table. "The quantum elephant" is a strange phrase, but there's nothing wrong with it. Even if you've never heard it before, you could probably make up several reasonable ways to continue it. Our machine can't.

By relying only on exact matches, it needs to have seen that specific combination at some point in its training. When it tries to look it up in the table and finds nothing, it's left with no information to lean on.
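Here is what that dead end looks like in a toy version. Everything in it is an assumption made for the example: the training text, the window size of 12, and the helper names.

```python
from collections import Counter, defaultdict


def build_counts(text, n):
    """Count which character follows each n-character context."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n):
        counts[text[i:i + n]][text[i + n]] += 1
    return counts


N = 12  # hypothetical window size
training_text = "the quiet elephant walked home. the quantum computer hummed."
counts = build_counts(training_text, N)

# A context that appears in the training text has a row to read from.
seen = counts.get("the quiet el")

# The last N characters of a sentence the machine never read have no row,
# even though every word in it appears somewhere in the training text.
unseen = counts.get("the quantum elephant"[-N:])

print(seen)    # a row of counts
print(unseen)  # None: no row, so nothing to bet on
```

The lookup either finds the exact key or finds nothing; there is no notion of a "close enough" row.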

And this problem doesn't go away no matter how much data we add. However much text we gather, there will always be new sentences, unexpected combinations, expressions nobody had written before. Language is constantly changing; the table is not. So when the machine runs into something that falls outside what it has memorized, it's left with no answer.

But here an interesting doubt appears. Is the problem really the holes in the table? Or is there something deeper we're missing?

05 The end of counting

The quantum-elephant example showed us what happens when the machine runs into a combination it has never seen. We might think that's the main problem: if the table is empty, of course it doesn't know what to do.

But it turns out the machine keeps failing even when the information it needs does exist in the table. Let's see it with an example far more everyday than the elephant.

To us it's obvious that if a dog sleeps, a cat can too. We know they're both animals and share behaviors; we don't need to have read the exact sentence to guess how it continues.

But look at the table. To the model, "the dog" lands in one row and "the cat" in another, thousands of rows away. They're sealed boxes, with no bridge between them. The machine couldn't care less that both are animals: to it they're two different strings of letters with nothing to do with one another.
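The sealed boxes are easy to reproduce with the dog and the cat. In this sketch the training text and the window size of 8 are invented for the example, and `build_counts` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def build_counts(text, n):
    """Count which character follows each n-character context."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n):
        counts[text[i:i + n]][text[i + n]] += 1
    return counts


N = 8  # hypothetical window size, just long enough to hold "the dog "
training_text = "the dog sleeps. the dog sleeps. the cat eats."
counts = build_counts(training_text, N)

dog_row = counts.get("the dog ")
cat_row = counts.get("the cat ")

print(dog_row)  # the machine has seen the dog sleep twice
print(cat_row)  # the cat row only knows about eating

# Nothing learned in the dog row ever reaches the cat row.
print("s" in cat_row)
```

Each row is filled only by its own exact key, so what the machine read about the dog adds nothing to what it can say about the cat.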

If it hasn't memorized the exact match, it goes blind. It can't generalize, and it can't carry what it knows from one case over to another.

It is, in effect, a parrot with a very good memory.

We've taken the method of counting letters to its physical and logical limit. And although it carried us a long way, this wall is impassable.

If we want to build a true artificial intelligence that can survive sentences it has never seen, we have to abandon the idea of using giant tables. We have to change the paradigm entirely.

Instead of building a machine that counts and memorizes the past, we need to build one that can try to guess, realize it got it wrong, and adjust its internal gears so it doesn't fail again. We need a system that, instead of counting, is able to learn.

And to achieve that, engineers had to stop looking at statistics textbooks and start looking at biology.

Welcome to the age of learning.

Counting brought us this far. Crossing the wall means giving up counting.

The idea we're missing is that things which look alike should be treated alike. That's what neural networks are about.