The bigram model
Teaching a machine to write, from scratch and just by counting.
The trick: predicting
When we're small, nobody hands us a manual for learning to talk. We learn by living: we hear the people around us, we tie a tone of voice to a smiling face, and little by little the meaning of things sinks in.
But how do you teach writing to a machine that has never lived a single day? To a box of wires and silicon, the word apple isn't sweet or red. It means nothing. If it can't understand the world, writing about it looks impossible.
So before we build anything, a small experiment.
You probably filled every blank without much effort. But look at the last one: you had no idea what Fli fli fla means. You just looked at what came before, sensed the logic, and guessed what came next.
You just found the trick. Since engineers couldn't teach machines to understand the world the way we do, they changed the rules of the game. Instead of teaching them to reflect, they taught them to predict.
Today, the big models do this across whole sentences, one piece of a word at a time. But to really understand the magic behind it, we'll go to the most basic thing of all: predicting the next letter.
Our final goal is to build exactly this: you give it a letter, and it bets on the one that comes next.
It looks like magic, but underneath it's just very basic math. Now the big question: how do we build this from scratch, when the machine can't even read?
Hunting the pattern
Our language isn't chaos. Mash the keyboard at random and you get something like asdfghjkl, which means nothing. We write following an invisible structure: nobody ever explained that a «q» is almost always followed by a «u», or that long runs of consonants are rare. Your brain just absorbed it from reading and listening.
Since the language already hides that pattern, all we need is for the machine to read text and notice who goes hand in hand with whom. Let's start with a simple sentence.
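Here is a minimal sketch of that counting in Python. The sentence and the name `count_pairs` are made up for illustration; they are not from the original demo.

```python
from collections import Counter

def count_pairs(text):
    """Count every pair of neighbouring characters in text."""
    # zip(text, text[1:]) walks the text two characters at a time:
    # (1st, 2nd), (2nd, 3rd), (3rd, 4th), ...
    return Counter(zip(text, text[1:]))

sentence = "the cat sat on the mat"  # illustrative sentence (an assumption)
pairs = count_pairs(sentence)

# Show the three pairs that go hand in hand most often.
for (first, second), n in pairs.most_common(3):
    print(repr(first + second), n)
```

Spaces count as characters too, so the machine also notices which letters tend to end a word.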
We've seen how it hunts for little pairs of letters. To really get it, let's focus on a single one: the «t». We'll give it different sentences and see which letter it decides is the «t»'s best partner.
Change the text and watch which letter wins after the «t».
Depending on the text you give it, it learns a different rule, and with so little text the count lies. Feed it very short texts and its view of the world is limited and biased. To learn the real rules, it needs far more information: a giant text.
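You can watch the count lie with a few lines of Python. This is only a sketch: the three tiny texts and the helper names `followers` and `best_partner` are invented for the example.

```python
from collections import Counter

def followers(text, letter):
    """Count which characters come right after `letter` in text."""
    return Counter(nxt for cur, nxt in zip(text, text[1:]) if cur == letter)

def best_partner(text, letter):
    """Return the most frequent follower of `letter`, or None if it has none."""
    counts = followers(text, letter)
    return counts.most_common(1)[0][0] if counts else None

# Three tiny texts, three different "rules" for the same letter.
texts = ["the other brother", "to stop a storm", "it is what it is"]
for text in texts:
    print(repr(text), "->", repr(best_partner(text, "t")))
```

Each text crowns a different winner («h», «o», and the space), which is exactly why a short text can't be trusted.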
So let's get serious. Let's have our machine read all of Shakespeare.
That row is everything there is about the «t» in all of Shakespeare. The machine pulled out every one of its connections on its own, just by counting.
This process, handing a machine a giant text so it reads, counts, and builds its own tables of rules, is called training, and the text itself is the training data. You just watched, first-hand, how a model is trained. (One catch: it learned from Shakespeare, so it'll talk like it's 400 years ago. Change the book and you change the machine.)
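In code, training can be as small as one loop. The sketch below assumes a short stand-in sentence instead of all of Shakespeare, and the function name `train` and the file path in the comment are hypothetical.

```python
from collections import Counter, defaultdict

def train(text):
    """Build the table of rules: for each character, a count of what follows it."""
    table = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        table[cur][nxt] += 1
    return table

# Stand-in for a giant text. With a real corpus you would read a file instead,
# e.g. open("shakespeare.txt", encoding="utf-8").read()  (hypothetical path).
corpus = "to be or not to be that is the question"
table = train(corpus)

# Everything this tiny corpus knows about the letter t.
print(table["t"])
```

Swap the corpus for a different book and the same loop learns a different table, with no other change.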
Too predictable
Now the machine knows that after a «t», «h» shows up in droves and «o» far less. But a fistful of loose counts is useless for writing: 7,071 means nothing if you don't know out of how many. Those numbers have to become probabilities.
And it's plain old division: take the «t» row, add up everything in it, and see what slice each partner gets.
There it is: the same row, now as percentages that add up to 100%. The «h» takes about 36%, the space 29%, the «o» 10%… Those are the machine's bets.
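That division fits in a couple of lines. The counts below are made up so the arithmetic is easy to check by hand; they are not the Shakespeare numbers, and `to_probabilities` is just an illustrative name.

```python
from collections import Counter

def to_probabilities(row):
    """Divide each count by the row total so the shares add up to 1."""
    total = sum(row.values())
    return {nxt: n / total for nxt, n in row.items()}

# Made-up counts for illustration only (10 sightings in total).
t_row = Counter({"h": 5, " ": 3, "o": 2})
probs = to_probabilities(t_row)

for nxt, p in probs.items():
    print(repr(nxt), f"{p:.0%}")
# 'h' 50%
# ' ' 30%
# 'o' 20%
```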
And now the real question: with those bets on the table, which letter does it pick? The safe move would be to always take the highest one. Let's see what happens.
Always «h». The safe choice turns out to be the deadest one: this way, every «t» would be followed by an «h», every single time, and nothing different would ever show up. To write with any life, the machine needs a pinch of randomness, but not just any kind.
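To see how dead the safe choice is, here is a sketch that always takes the top bet. The toy corpus and the names `greedy_next` and `write_greedy` are assumptions for the example.

```python
from collections import Counter, defaultdict

def train(text):
    """Count, for each character, what follows it."""
    table = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        table[cur][nxt] += 1
    return table

def greedy_next(table, letter):
    """Always take the single most frequent follower (the 'safe move')."""
    return table[letter].most_common(1)[0][0]

def write_greedy(table, start, length):
    out = start
    for _ in range(length):
        if not table.get(out[-1]):  # letter never seen with a follower: stop
            break
        out += greedy_next(table, out[-1])
    return out

table = train("the thin thing then the them")  # toy corpus (an assumption)
print(repr(write_greedy(table, "t", 11)))  # 'the the the '
```

The same letter always leads to the same next letter, so the text falls into a loop and never climbs out.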
The engineers' idea was a die. A loaded one, of course: loads of «h» faces, plenty of space, the odd «o», and almost none of the rare letters. So the likely thing usually comes up, but every so often it surprises you. Roll it yourself and see.
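One way to build that loaded die is a weighted random draw. In this sketch the three percentages are the ones quoted above, "?" is a stand-in that lumps together every other letter, and the name `roll` and the fixed seed are choices made for the example.

```python
import random
from collections import Counter

def roll(row, rng):
    """Roll the loaded die: each letter comes up in proportion to its weight."""
    letters = list(row)
    weights = list(row.values())
    return rng.choices(letters, weights=weights, k=1)[0]

# h, space and o use the percentages from the text;
# "?" lumps together all the remaining letters (an assumption).
t_row = {"h": 36, " ": 29, "o": 10, "?": 25}

rng = random.Random(0)  # fixed seed so the run is repeatable
rolls = Counter(roll(t_row, rng) for _ in range(10_000))
print(rolls.most_common())
```

Over many rolls «h» still wins most often, but any single roll can surprise you.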
Look at that! We've got the whole trick for the «t»: count it, turn it into percentages, and pick with a spark of randomness. And the question asks itself: what if we did exactly this for every letter at once?
The matrix is born
We have a row for «t». What about «a»? And «h»? And all the rest?
Stack one row per letter and look what comes out: a grid. Each row is the letter you start from. Each column, the one that could come next. Each cell, how many times we saw it.
You just built something with a name of its own: a transition table.
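As a rough sketch, the grid can be a plain list of lists. The alphabet here is lowercase letters plus the space, and the corpus and the name `build_grid` are assumptions for illustration.

```python
import string

def build_grid(text, alphabet):
    """Return a square grid: grid[row][col] = times alphabet[col] followed alphabet[row]."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    grid = [[0] * len(alphabet) for _ in alphabet]
    for cur, nxt in zip(text, text[1:]):
        if cur in index and nxt in index:  # skip characters outside the alphabet
            grid[index[cur]][index[nxt]] += 1
    return grid, index

alphabet = string.ascii_lowercase + " "  # 26 letters plus the space
grid, index = build_grid("the thin thing then the them", alphabet)

print(len(grid), "x", len(grid[0]))   # 27 x 27
print(grid[index["t"]][index["h"]])   # 6
```

Most cells stay at zero, even in this tiny corner of the language.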
And that's only one corner of the language: lowercase. Capitals, periods, commas, numbers are still missing. Count them all and the table grows to its real size, with far more hidden rules inside.
This is the whole table, for real. It looks like a mess of light, but it's the manual of a language written in numbers. Every lit cell is a rule; every dark gap, a pair that almost never happens. And nobody taught it any of them.
Let's write!
The table now holds every rule the language follows. And we already know how to pick one letter: look at its row, roll the die. Here is that step in slow motion, where each letter that lands becomes the starting point for the next.
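Put together, the writing loop might look like the sketch below. It assumes a tiny stand-in corpus and a fixed random seed, and the names `train` and `write` are illustrative.

```python
import random
from collections import Counter, defaultdict

def train(text):
    """Count, for each character, what follows it."""
    table = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        table[cur][nxt] += 1
    return table

def write(table, start, length, rng):
    """Look at the last letter's row, roll the loaded die, append, repeat."""
    out = start
    for _ in range(length):
        row = table.get(out[-1])
        if not row:  # dead end: this letter was never seen with a follower
            break
        out += rng.choices(list(row), weights=list(row.values()), k=1)[0]
    return out

corpus = "to be or not to be that is the question"  # stand-in corpus
table = train(corpus)
text = write(table, "t", 40, random.Random(0))
print(text)
```

Every pair of neighbouring letters in the output is a pair the machine saw during training; it can recombine pairs, but it can't invent new ones.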
And there it is: a machine that writes on its own. Nobody taught it spelling, or grammar, or a single rule. It just counted pairs of letters in a pile of text, and everything came out of that. You built it, from scratch.
You've seen it in slow motion. At full speed it's this: one letter after another, no brakes, whole phrases pour out at once.
And now the bad news. Read it again and it almost sounds like a real language: the letters fit together… but they aren't words. Babble with a good accent. We pulled it off: it writes on its own. But what a mess, right? Why does it write so badly?
And what you built has a name:
a bigram model
The simplest language model there is. And it's the first brick of everything else. ChatGPT included.
The bigram's ceiling
Before we fix it, let's understand why it writes so badly. What comes after «th»? The machine couldn't care less about the «t»: it only looks at the «h». To it, «th», «sh» and «wh» are exactly the same thing.
It's not forgetful. It's blind from birth. No matter how much text you give it, it will never tell «th» from «sh». This isn't a bug you fix with more data. It's the ceiling of the model.
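The blindness is easy to see in code: the lookup only ever touches the last character. This is a sketch with a made-up corpus and a hypothetical helper named `bets`.

```python
from collections import Counter, defaultdict

def train(text):
    """Count, for each character, what follows it."""
    table = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        table[cur][nxt] += 1
    return table

def bets(table, context):
    """The bigram model's bets: everything before the last character is ignored."""
    return table[context[-1]]

table = train("the ship where she what when then shop")  # toy corpus

# Three different contexts, one and the same row of bets.
print(bets(table, "th") == bets(table, "sh") == bets(table, "wh"))  # True
```

More text changes the numbers in the «h» row, but the three lookups still land on that same row.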
What if it could see more than one letter? Your turn: a word reveals itself one letter at a time, and you bet on the next.
Did you feel it? With one letter you were guessing blind. With almost the whole word in front of you, almost certain. More context, better prediction. That's exactly what our model is missing: it only sees one piece back. Just like you with «hi»: it reacts to the last thing it heard, with no idea about the rest.
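A small experiment hints at what more context buys. The sketch below counts followers for contexts of one letter and of two letters; the corpus and the name `train_context` are assumptions, and this is a peek at the idea rather than the next model itself.

```python
from collections import Counter, defaultdict

def train_context(text, k):
    """Count what follows each run of k characters."""
    table = defaultdict(Counter)
    for i in range(k, len(text)):
        table[text[i - k:i]][text[i]] += 1
    return table

corpus = "the ship where she what when then shop"  # toy corpus
one = train_context(corpus, 1)  # sees one letter back, like the bigram
two = train_context(corpus, 2)  # sees two letters back

print(sorted(one["h"]))   # ['a', 'e', 'i', 'o']
print(sorted(two["th"]))  # ['e']
print(sorted(two["sh"]))  # ['e', 'i', 'o']
```

After «h» alone there are four candidates; after «th» in this toy text there is only one.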
What if we teach it to look at two letters? Three? Five? That's already a different model. And it's the next one.