How a frontier AI model really works, end to end
No magic, no hype. Tokens, attention, training, and inference, explained the way you would explain it to a smart friend.
A modern AI model can write code, explain physics, and draft a contract. It feels like it understands. Underneath, it is doing one almost embarrassingly simple thing, billions of times, very fast. Once you see the trick, the whole field stops being magic and starts making sense.
The one trick: predict the next piece
A language model does exactly one thing: given some text, it predicts what comes next. That is it. “The capital of France is” → “Paris.” Everything else, the essays, the code, the reasoning, is this single ability applied over and over, each predicted word fed back in to predict the next.
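To make "predict what comes next" concrete, here is a minimal sketch that uses a bigram counter, which is far simpler than a neural network: it only counts which word followed which in a tiny made-up corpus. The corpus and the names `train_bigrams` and `predict_next` are illustrative assumptions, not how a frontier model is built.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real model reads trillions of words.
CORPUS = (
    "the capital of france is paris . "
    "the capital of italy is rome . "
    "the capital of france is paris ."
)

def train_bigrams(text):
    """Count, for every word, which words followed it and how often."""
    counts = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

counts = train_bigrams(CORPUS)
print(predict_next(counts, "capital"))
print(predict_next(counts, "is"))
```

A real model replaces the counting table with billions of learned weights and looks at the whole preceding text rather than one word, but the job is the same: given what came before, rank what comes next.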
Step 1: turn text into numbers (tokens)
Computers only do math, so the first job is to turn text into numbers. The model breaks text into tokens, chunks that are often a word, sometimes a part of a word (“un-believ-able”), sometimes punctuation. Each distinct token has an ID number. “Hello world” might become [15496, 995].
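A rough sketch of the idea, assuming a tiny hand-written vocabulary and a greedy longest-match rule. Production tokenizers learn their vocabulary from data and use more elaborate algorithms, so the vocabulary, the IDs, and the function names here are all hypothetical.

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
VOCAB = {"un": 0, "believ": 1, "able": 2, "hello": 3, " ": 4, "world": 5, "!": 6}

def tokenize(text):
    """Greedy longest-match: at each position, take the longest vocabulary entry that fits."""
    ids = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            # No vocabulary entry starts at this position.
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids

def detokenize(ids):
    """Map IDs back to their text pieces and join them."""
    inverse = {token_id: piece for piece, token_id in VOCAB.items()}
    return "".join(inverse[token_id] for token_id in ids)

print(tokenize("unbelievable"))
print(detokenize(tokenize("hello world!")))
```

The point to notice is that a word the vocabulary has never seen whole can still be represented by stitching together smaller pieces.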
Step 2: give each token a meaning (embeddings)
An ID number is arbitrary; it does not capture that “king” and “queen” are related. So each token is turned into a long list of numbers called an embedding. Think of it as coordinates in a space of meaning. In that space, similar things sit close together. The famous example: the direction from “king” to “queen” is roughly the same as from “man” to “woman.” The model is not told this; it discovers it by reading.
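The king and queen example can be sketched with hand-picked three-number vectors. These values are assumptions chosen so the arithmetic works out cleanly; real embeddings have hundreds or thousands of dimensions, are learned rather than written by hand, and only approximately show this pattern.

```python
import math

# Hand-made toy embeddings: (royalty, femaleness, edibility). Purely illustrative.
EMBEDDINGS = {
    "king":  [0.9, 0.1, 0.0],
    "queen": [0.9, 0.9, 0.0],
    "man":   [0.1, 0.1, 0.0],
    "woman": [0.1, 0.9, 0.0],
    "apple": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity: near 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Find the word closest to vector(b) - vector(a) + vector(c), excluding the inputs."""
    target = [vb - va + vc for va, vb, vc in
              zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = [word for word in EMBEDDINGS if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(EMBEDDINGS[word], target))

# "man is to king as woman is to ...?"
print(analogy("man", "king", "woman"))
```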
Step 3: let each word look at the others (attention)
Words only mean something in context. In “the bank of the river” versus “the bank approved the loan,” the word “bank” needs to look at its neighbors to know which meaning is intended. The mechanism that lets each token look at the other tokens and decide which ones matter is called attention, and it is the central idea behind the transformer models that power modern AI.
For each word, attention asks: “to predict what comes next, which earlier words should I pay attention to, and how much?” It then mixes in information from those words, weighted by how much each one matters. “River” pulls “bank” toward the water meaning; “loan” pulls it toward the money meaning. Do this for every word, in parallel, and the model builds a context-aware understanding of the whole passage.
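The question attention asks can be written out as a small calculation. The sketch below assumes the common scaled dot-product form with a causal rule (each token sees only itself and earlier tokens, matching "which earlier words" above). The query, key, and value vectors are made-up numbers; in a real model they are computed from the embeddings by learned weights.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """For each token, weight the tokens up to and including itself, then mix their values."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        # How well does this token's query match each visible token's key?
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted average of the visible tokens' value vectors.
        mixed = [
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

TOKENS = ["the", "river", "bank"]
QUERIES = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]   # made-up numbers
KEYS    = [[0.0, 0.0], [2.0, 0.0], [0.5, 0.0]]
VALUES  = [[0.0, 0.1], [1.0, 0.0], [0.2, 0.8]]

outputs, weights = causal_attention(QUERIES, KEYS, VALUES)
for token, row in zip(TOKENS, weights):
    print(token, [round(w, 2) for w in row])
```

With these toy numbers, the row for "bank" puts its largest weight on "river", which is the "pulls it toward the water meaning" effect in miniature.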
Step 4: stack it deep (the transformer)
One round of attention is useful. The breakthrough architecture, the transformer, stacks dozens to over a hundred of these layers. Each layer refines the representation a little. As a rough picture, early layers catch grammar and simple relationships, middle layers assemble facts and meaning, and late layers shape the actual next-word prediction. The text passes up through the stack, getting “understood” a little more at each step.
At the very top, the model produces a probability for every possible next token, a ranked guess. “The capital of France is” might yield “Paris” at 91%, “Lyon” at 3%, and so on. The values that do all this, the model's parameters or “weights,” number in the billions. A “70B model” has 70 billion of these dials.
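The last step, turning the model's raw scores into a ranked list of probabilities, is usually done with a function called softmax. The sketch below assumes four candidate tokens and invented scores (often called logits); a real model scores every token in its vocabulary.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that are positive and sum to 1."""
    highest = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for "The capital of France is ..."
CANDIDATES = ["Paris", "Lyon", "London", "banana"]
LOGITS = [6.0, 2.5, 2.0, -3.0]

probabilities = softmax(LOGITS)
ranked = sorted(zip(CANDIDATES, probabilities), key=lambda pair: pair[1], reverse=True)
for token, probability in ranked:
    print(f"{token:>8}  {probability:.1%}")
```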
Where the dials come from: training
The model is not programmed; it is trained. It happens in stages, and they are genuinely different:
- 1. Pretraining: read almost everything. Show the model trillions of words and have it guess the next token, again and again. Each wrong guess nudges billions of dials a hair in the right direction. After months on thousands of GPUs, it has absorbed grammar, facts, code, and reasoning patterns. This stage is most of the cost.
- 2. Fine-tuning: learn to be helpful. A pretrained model will happily continue your text but is not yet an assistant. Show it many examples of good question-and-answer behavior so it learns to follow instructions and respond, not just autocomplete.
- 3. RLHF (reinforcement learning from human feedback): learn manners and judgement. Humans (and other models) rank responses as better or worse, and the model is tuned to prefer the better ones. This is where helpfulness, honesty, and refusing harmful requests get reinforced. It is the difference between raw capability and something you can ship.
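The "nudge the dials a hair" idea from pretraining can be sketched with a deliberately tiny stand-in: three dials, one per candidate token, adjusted by gradient descent on the standard cross-entropy loss. Real training adjusts billions of weights spread across many layers; the three-dial setup, the learning rate, and the function names are assumptions for illustration.

```python
import math

CANDIDATES = ["Paris", "Lyon", "London"]

def softmax(logits):
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def training_step(dials, target_index, learning_rate=0.1):
    """One guess, one correction: returns the nudged dials and the loss before the nudge."""
    probabilities = softmax(dials)
    # Cross-entropy loss: large when the correct token was given low probability.
    loss = -math.log(probabilities[target_index])
    # Gradient of that loss with respect to each dial: probability minus 1 for the
    # correct token, probability minus 0 for every other token.
    gradients = [
        p - (1.0 if i == target_index else 0.0)
        for i, p in enumerate(probabilities)
    ]
    nudged = [d - learning_rate * g for d, g in zip(dials, gradients)]
    return nudged, loss

dials = [0.0, 0.0, 0.0]   # start with no preference at all
losses = []
for _ in range(100):
    # The training text says the right answer is "Paris" (index 0).
    dials, loss = training_step(dials, target_index=0)
    losses.append(loss)

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
print(f"probability of Paris after training: {softmax(dials)[0]:.2f}")
```

Each step changes the dials only slightly, but repeated over many examples the probability of the correct token climbs, which is the whole mechanism of pretraining in miniature.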
What happens when you actually use it: inference
Using a trained model is called inference, and it is a loop. Your prompt goes in, the model predicts the next token, that token is added to the text, and it predicts again, one token at a time. This is why answers stream out word by word: you are watching the loop run.
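The loop itself is short. In the sketch below, `predict_next` is a hypothetical stand-in for the trained model: a lookup table keyed on the last token, where a real model would run the whole text so far through its layers. The `<end>` marker and the token limit are also assumptions.

```python
END = "<end>"

# Stand-in for a trained model: maps the last token to the next one.
NEXT_TOKEN = {
    "The": "capital",
    "capital": "of",
    "of": "France",
    "France": "is",
    "is": "Paris",
    "Paris": END,
}

def predict_next(tokens):
    """Hypothetical model call: return the predicted next token for the text so far."""
    return NEXT_TOKEN.get(tokens[-1], END)

def generate(prompt_tokens, max_new_tokens=20):
    """Yield new tokens one at a time, feeding each one back in before predicting again."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == END:
            return
        tokens.append(next_token)   # the prediction becomes part of the input
        yield next_token

for token in generate(["The", "capital"]):
    print(token, end=" ", flush=True)   # tokens appear as they are produced
print()
```

Writing `generate` as a generator mirrors the streaming behavior: each token is handed to the caller as soon as it exists, before the rest of the answer has been computed.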
Two knobs matter a lot here:
- The context window is how much text the model can see at one time, its working memory. Everything outside it is simply invisible to the model. Bigger windows let it consider whole documents, but cost more to run.
- Temperature controls how adventurous the next-token choice is. Low temperature strongly favors the most likely token, and at zero it always picks it (focused, repetitive); higher temperature samples more freely (creative, riskier). Same model, very different personality.
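Both knobs can be sketched in a few lines. The code assumes the common convention of dividing the model's raw scores by the temperature before converting them to probabilities, and treats a temperature of zero as "always take the top token". The scores, the token list, and the function names are invented for illustration.

```python
import math
import random

TOKENS = ["Paris", "Lyon", "Marseille"]
LOGITS = [4.0, 1.0, 0.5]   # made-up raw scores from the model

def visible_context(tokens, window):
    """Context window: keep only the most recent `window` tokens; the rest is invisible."""
    return tokens[-window:] if window > 0 else []

def softmax_with_temperature(logits, temperature):
    """Divide scores by the temperature, then convert to probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive here; zero is handled in sample()")
    scaled = [x / temperature for x in logits]
    highest = max(scaled)
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(tokens, logits, temperature, rng):
    """Pick the next token: the top one at temperature 0, otherwise a weighted random draw."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return tokens[best]
    probabilities = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probabilities, k=1)[0]

for temperature in (0.2, 1.0, 2.0):
    probabilities = softmax_with_temperature(LOGITS, temperature)
    print(temperature, [round(p, 3) for p in probabilities])

rng = random.Random(0)   # seeded so repeated runs draw the same samples
print([sample(TOKENS, LOGITS, 2.0, rng) for _ in range(5)])
print(visible_context(["a", "b", "c", "d", "e"], window=3))
```

Lower temperatures pile almost all the probability onto the top token, while higher ones flatten the distribution so that less likely tokens get picked more often.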
Why it sometimes makes things up
A model that only ever predicts plausible next tokens will sometimes produce something plausible but false, a hallucination. It is not lying; it has no notion of truth, only of what sounds right. This is structural, not a bug to be fully patched. It is also why serious AI systems pair the model with retrieval (give it the real documents), tools (let it run a calculation or a query), and evaluation (check the output), rather than trusting it alone.
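As an illustration of the retrieval idea, the sketch below picks the stored document that shares the most words with the question and places it in the prompt, so the model has the real text in front of it. Word overlap is a deliberately crude stand-in; production systems typically compare embeddings instead. The documents, the prompt wording, and the function names are all assumptions.

```python
DOCUMENTS = [
    "Paris is the capital of France.",
    "The Rhine is a river in Europe.",
    "Canberra is the capital of Australia.",
]

def words(text):
    """Lowercased set of words with surrounding punctuation stripped."""
    return {word.strip(".,?!").lower() for word in text.split()}

def retrieve(question, documents):
    """Return the document sharing the most words with the question."""
    question_words = words(question)
    return max(documents, key=lambda doc: len(question_words & words(doc)))

def build_prompt(question, documents):
    """Put the retrieved source in front of the question before it goes to the model."""
    source = retrieve(question, documents)
    return (
        "Answer using only this source.\n"
        f"Source: {source}\n"
        f"Question: {question}\n"
        "Answer:"
    )

print(build_prompt("What is the capital of Australia?", DOCUMENTS))
```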
What it is, and what it is not
- It is an extraordinary pattern-matcher over language, trained on a huge slice of what humans have written.
- It is not a database, a search engine, or a mind. It does not “know” in the way you do; it has learned what tends to follow what.
- It is astonishingly capable precisely because so much of human knowledge and reasoning leaves a statistical fingerprint in text.
Why this still matters in ten years
The models will get bigger, faster, multimodal (images, audio, video), and wired into tools and other agents. But the core ideas (predict the next piece, use attention to decide what matters, stack layers deep, train on data, loop at inference) are the foundation the whole field is building on. Understand this and you can reason about the next ten years of AI instead of being surprised by it.
Written by the Stratiflux engineering team
We build and run this kind of infrastructure and AI for companies, and train the engineers who do it. If a piece of this is on your plate, we can help.