AI · 13 min read · June 13, 2026

How a frontier AI model really works, end to end

No magic, no hype. Tokens, attention, training, and inference, explained the way you would explain it to a smart friend.

A modern AI model can write code, explain physics, and draft a contract. It feels like it understands. Underneath, it is doing one almost embarrassingly simple thing, billions of times, very fast. Once you see the trick, the whole field stops being magic and starts making sense.

The one trick: predict the next piece

A language model does exactly one thing: given some text, it predicts what comes next. That is it. “The capital of France is” → “Paris.” Everything else, the essays, the code, the reasoning, is this single ability applied over and over, each predicted word fed back in to predict the next.

The key idea
The model is not looking anything up. It has no database of facts inside it. It has learned, from enormous amounts of text, an astonishingly detailed sense of what tends to follow what. “Knowledge” is stored as patterns, not records.

Step 1: turn text into numbers (tokens)

Computers only do math, so the first job is to turn text into numbers. The model breaks text into tokens, chunks that are often a word, sometimes a part of a word (“un-believ-able”), sometimes punctuation. Each distinct token has an ID number. “Hello world” might become [15496, 995].

Text: "Hello world" → Tokens: "Hello", " world" → IDs: [15496, 995]
Text becomes tokens becomes numbers. The model only ever sees and produces numbers; the words are for us.
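To make the text-to-numbers step concrete, here is a minimal sketch of a tokenizer. It assumes a tiny hand-written vocabulary and uses greedy longest-match, which is a simplification swapped in for brevity; production tokenizers typically learn their vocabulary from data (for example with byte-pair encoding). Apart from the two IDs taken from the example above, every ID and the names `VOCAB`, `tokenize`, and `detokenize` are made up for illustration.

```python
# Hypothetical vocabulary. Only 15496 and 995 come from the article's example;
# the single-character entries and their IDs are invented for this sketch.
VOCAB = {
    "Hello": 15496, " world": 995,
    "H": 1, "e": 2, "l": 3, "o": 4, " ": 5, "w": 6, "r": 7, "d": 8,
}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def tokenize(text: str) -> list[int]:
    """Split text into token IDs using greedy longest-match against VOCAB."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first so whole words beat single letters.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


def detokenize(ids: list[int]) -> str:
    """Map IDs back to their token strings and join them."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)


print(tokenize("Hello world"))        # [15496, 995]
print(detokenize(tokenize("world")))  # world
```

Note that "world" without its leading space falls back to single-character tokens here, which mirrors how a real tokenizer can split the same letters differently depending on context.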

Step 2: give each token a meaning (embeddings)

An ID number is arbitrary; it does not capture that “king” and “queen” are related. So each token is turned into a long list of numbers called an embedding. Think of it as coordinates in a space of meaning. In that space, similar things sit close together. The famous example: the direction from “king” to “queen” is roughly the same as from “man” to “woman.” The model is not told this; it discovers it by reading.

Note
This is also how search and “retrieval” work in AI apps: turn your documents and your question into embeddings, then find the documents whose coordinates are nearest the question. Meaning becomes distance.
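The "meaning becomes distance" idea can be sketched with a few hand-written vectors. Everything below is an assumption for illustration: real embeddings are learned and have hundreds or thousands of dimensions, and the three axes here (roughly royalty, male, female) were chosen by hand so the king/queen example works. Cosine similarity is one common way to measure closeness; the helper names `cosine` and `nearest` are hypothetical.

```python
import math

# Hypothetical 3-number embeddings, hand-written for this sketch.
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.8, 0.1],
    "woman": [0.1, 0.1, 0.8],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(vector, exclude=()):
    """Return the word whose embedding points most nearly the same way as vector."""
    candidates = {w: v for w, v in EMBEDDINGS.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(vector, candidates[w]))


# Start at "king", remove the "man" direction, add the "woman" direction.
analogy = [
    k - m + w
    for k, m, w in zip(EMBEDDINGS["king"], EMBEDDINGS["man"], EMBEDDINGS["woman"])
]
print(nearest(analogy, exclude={"king", "man", "woman"}))  # queen
```

Retrieval in an AI app follows the same shape: embed the documents and the question, then call something like `nearest` to pick the closest documents.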

Step 3: let each word look at the others (attention)

Words only mean something in context. In “the bank of the river” versus “the bank approved the loan,” the word “bank” needs to look at its neighbors to know which meaning is intended. The mechanism that lets each token look at the other tokens in its context and decide which ones matter is called attention, and it is arguably the single idea that made modern AI work.

For each word, attention asks: “to predict what comes next, which earlier words should I pay attention to, and how much?” It then mixes in information from those words, weighted by how much each one matters. “River” pulls “bank” toward the water meaning; “loan” pulls it toward the money meaning. Do this for every word, in parallel, and the model builds a context-aware understanding of the whole passage.
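Here is a minimal sketch of that "which earlier words, and how much" computation, using scaled dot-product attention restricted to earlier positions. It makes several simplifying assumptions: the token vectors are hypothetical 2-number embeddings with axes (water, money), and they are used directly as queries, keys, and values, whereas real transformers apply learned projections and run many attention heads in parallel.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(vectors):
    """Each position mixes in itself and earlier positions, weighted by similarity."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for i, query in enumerate(vectors):
        visible = vectors[: i + 1]  # a token cannot look at tokens after it
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in visible
        ]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, visible)) for j in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical vectors for the phrase "river bank"; axes are (water, money).
river = [1.0, 0.0]
bank = [0.5, 0.5]  # ambiguous on its own

outputs, weights = causal_self_attention([river, bank])
print(outputs[1])  # [0.75, 0.25]
```

After attention, the vector for "bank" leans toward the water axis because it mixed in "river". Swapping in a money-flavored neighbor would pull it the other way.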

Step 4: stack it deep (the transformer)

One round of attention is useful. The breakthrough architecture, the transformer, stacks dozens to over a hundred of these layers. Each layer refines the representation a little: early layers catch grammar and simple relationships; middle layers assemble facts and meaning; late layers shape the actual next-word prediction. The text passes up through the stack, getting “understood” a little more at each step.

Output: probabilities for the next token (e.g. "Paris" 91%, "Lyon" 3%...)
Layer N: shapes the final prediction
... dozens of layers ...: facts, meaning, reasoning assemble here
Layer 2: phrases and relationships
Layer 1: grammar, nearby words
Input: token embeddings (words as coordinates of meaning)
A transformer is a tall stack of identical layers. Each one looks at the whole sentence (attention) and then thinks about each word (a small neural network), passing a richer version upward.

At the very top, the model produces a probability for every possible next token, a ranked guess. “The capital of France is” yields “Paris” at 91%, “Lyon” at 3%, and so on. The numbers that do all this, the model's parameters or “weights,” number in the billions. A “70B model” has 70 billion of these dials.
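The step from raw scores to a ranked list of probabilities is usually a softmax. The sketch below assumes a vocabulary of just four candidate tokens and made-up scores (often called logits); a real model scores every token in its vocabulary, and the numbers here are not meant to reproduce the percentages above.

```python
import math


def softmax(logits):
    """Convert a {token: score} mapping into probabilities that sum to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - m) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Hypothetical raw scores after "The capital of France is".
logits = {"Paris": 5.0, "Lyon": 1.6, "London": 1.0, "the": 0.2}

probs = softmax(logits)
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
for token, p in ranked:
    print(f"{token!r}: {p:.1%}")
```

Because the exponential exaggerates gaps, a token that scores a few points higher than the rest ends up with most of the probability.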

Where the dials come from: training

The model is not programmed; it is trained. It happens in stages, and they are genuinely different:

  1. Pretraining: read almost everything
    Show the model trillions of words and have it guess the next token, again and again. Each wrong guess nudges billions of dials a hair in the right direction. After months on thousands of GPUs, it has absorbed grammar, facts, code, and reasoning patterns. This stage is most of the cost.
  2. Fine-tuning: learn to be helpful
    A pretrained model will happily continue your text but is not yet an assistant. Show it many examples of good question-and-answer behavior so it learns to follow instructions and respond, not just autocomplete.
  3. RLHF (reinforcement learning from human feedback): learn manners and judgement
    Humans (and other models) rank responses as better or worse, and the model is tuned to prefer the better ones. This is where helpfulness, honesty, and refusing harmful requests get reinforced. It is the difference between raw capability and something you can ship.
Three stages turn a pile of random numbers into a useful assistant. Each stage teaches something the previous one could not.
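The guess-then-nudge loop of pretraining can be sketched on a very small scale. The example below is an illustration under heavy assumptions, not how a real model is built: the "model" is a plain table with one dial per (previous token, next token) pair instead of a transformer, the corpus is a single six-word sentence, and the learning rate of 0.5 is arbitrary. It only covers the next-token objective; fine-tuning and RLHF are not shown.

```python
import math

corpus = "the capital of france is paris".split()
vocab = sorted(set(corpus))
index = {tok: i for i, tok in enumerate(vocab)}
V = len(vocab)

# The "dials": one score per (previous token, next token) pair, all starting at zero.
weights = [[0.0] * V for _ in range(V)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_epoch(lr=0.5):
    """One pass over the corpus: guess the next token, then nudge the dials."""
    total_loss = 0.0
    for prev, nxt in zip(corpus, corpus[1:]):
        p, n = index[prev], index[nxt]
        probs = softmax(weights[p])
        total_loss += -math.log(probs[n])  # cross-entropy: penalty for a bad guess
        # Gradient of cross-entropy w.r.t. the scores is (probs - one_hot(target)).
        for j in range(V):
            grad = probs[j] - (1.0 if j == n else 0.0)
            weights[p][j] -= lr * grad
    return total_loss / (len(corpus) - 1)


def predict(prev):
    """Return the token the trained table considers most likely after prev."""
    row = weights[index[prev]]
    return vocab[row.index(max(row))]


losses = [train_epoch() for _ in range(50)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print(predict("is"))  # paris
```

The loss starts at the value for a uniform guess over six tokens and falls as the dials move, which is the same dynamic at a vastly larger scale in real pretraining.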

What happens when you actually use it: inference

Using a trained model is called inference, and it is a loop. Your prompt goes in, the model predicts the next token, that token is added to the text, and it predicts again, one token at a time. This is why answers stream out word by word: you are watching the loop run.

Your prompt (tokens in) → Predict next token (one step) → Append it (add to context) → Repeat (until done)
Inference is a loop: predict one token, append it, predict the next. That is why responses appear gradually rather than all at once.
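The loop itself is short enough to write out. In this sketch the trained model is replaced by a hypothetical lookup table, `NEXT`, that maps the last token to a single next token; a real model would compute probabilities from the whole context. The `<end>` marker and the `max_new_tokens` limit are also assumptions standing in for whatever stopping rule a real system uses.

```python
# Hypothetical stand-in for a trained model: last token -> most likely next token.
NEXT = {
    "The": " capital",
    " capital": " of",
    " of": " France",
    " France": " is",
    " is": " Paris",
    " Paris": "<end>",
}


def predict_next(context):
    """Pretend model call: look only at the last token of the context."""
    return NEXT.get(context[-1], "<end>")


def generate(prompt_tokens, max_new_tokens=20):
    """Predict one token, append it, and repeat until the model signals the end."""
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = predict_next(context)
        if token == "<end>":
            break
        context.append(token)  # the new token becomes part of the next input
    return context


print("".join(generate(["The", " capital"])))  # The capital of France is Paris
```

A streaming interface simply shows each token as soon as `context.append` would run, instead of waiting for the loop to finish.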

Two knobs matter a lot here:

  • The context window is how much text the model can consider at once, its working memory. Everything outside it is simply invisible to the model. Bigger windows let it consider whole documents, but cost more to run.
  • Temperature controls how adventurous the next-token choice is. Low temperature almost always picks the most likely token (focused, repetitive); higher temperature samples more freely (creative, riskier). Same model, very different personality.
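Both knobs are simple to sketch. Temperature is commonly applied by dividing the raw scores by a number before the softmax, and the simplest way to respect a context window is to keep only the most recent tokens. The scores below are made up, and `clip_to_window` is a hypothetical helper; real systems may manage long contexts in more elaborate ways.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide scores by the temperature, then normalize into probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, temperature, rng):
    """Draw one token at random according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


def clip_to_window(token_ids, window):
    """Keep only the most recent tokens; anything older is invisible to the model."""
    if window <= 0:
        raise ValueError("window must be positive")
    return token_ids[-window:]


tokens = ["Paris", "Lyon", "London"]
logits = [5.0, 1.6, 1.0]  # hypothetical raw scores

for t in (0.2, 1.0, 2.0):
    top = softmax_with_temperature(logits, t)[0]
    print(f"T={t}: P('Paris') = {top:.3f}")

rng = random.Random(0)  # seeded so repeated runs draw the same sample
print(sample_token(tokens, logits, 1.0, rng))
print(clip_to_window([1, 2, 3, 4, 5], window=3))  # [3, 4, 5]
```

Lower temperatures concentrate probability on the top token, and higher ones spread it across the alternatives, which is where the change in "personality" comes from.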

Why it sometimes makes things up

A model that only ever predicts plausible next tokens will sometimes produce something plausible but false, a hallucination. It is not lying; it has no notion of truth, only of what sounds right. This is structural, not a bug to be fully patched. It is also why serious AI systems pair the model with retrieval (give it the real documents), tools (let it run a calculation or a query), and evaluation (check the output), rather than trusting it alone.

What it is, and what it is not

  • It is an extraordinary pattern-matcher over language, trained on a huge slice of what humans have written.
  • It is not a database, a search engine, or a mind. It does not “know” in the way you do; it has learned what tends to follow what.
  • It is astonishingly capable precisely because so much of human knowledge and reasoning leaves a statistical fingerprint in text.

Why this still matters in ten years

The models will get bigger, faster, multimodal (images, audio, video), and wired into tools and other agents. But the core (predict the next piece, use attention to decide what matters, stack layers deep, train on data, loop at inference) is the foundation the whole field is building on. Understand this and you can reason about the next ten years of AI instead of being surprised by it.

The one thing to remember
An AI model is a next-token predictor that got so good at the game that prediction started to look like understanding. Treat it as a brilliant, fast, slightly unreliable pattern engine, give it the right context and tools, check its work, and it becomes genuinely powerful.

Written by the Stratiflux engineering team

