AI · 10 min read · June 23, 2026

Six things every AI engineer should know (that get increasingly niche)

Anyone can call an API. These six judgment calls are what separate someone who uses AI from someone who can actually build and operate it in 2026.

In 2026, almost anyone can wire up an LLM and get a demo working. What actually separates an AI engineer from someone who pasted an API key is judgment: knowing why a system misbehaves and what to change. Here are six things every AI engineer should know. They get increasingly niche, but every one of them is a judgment call you make on real systems, and each is explained from scratch below.

The one mental model that ties them together

A modern AI feature is almost never just “the model.” It is a system: a retriever, a prompt, some memory, tools, an output parser, and an evaluation harness, all wrapped around a model that you mostly cannot change. Beginners blame the model. Engineers debug the system. Keep that in mind, because five of the six calls below are really about telling the model apart from the machinery around it.

1. Most hallucinations are retrieval failures, not model failures

A hallucination is when a model states something false with confidence. In a system that looks things up before answering, a pattern called RAG (Retrieval-Augmented Generation), the model is meant to answer from the text it was handed. So when the answer is wrong, the usual culprit is not the model inventing things; it is that the right information was never retrieved and put in front of it.

Query (the question) → Retrieve (search your data) → Rerank (optional) → Context (top chunks) → Model (generates) → Answer

In a RAG system the model only ever sees the retrieved context. If the right chunk never arrives, even a flawless model will 'hallucinate' a guess.

The fixes are almost all upstream of the model: better chunking (how you split documents), better embeddings (how meaning is encoded), query rewriting, adding a reranker, or widening what you fetch. Reach for fine-tuning or a bigger model only after you have proven the right context was actually retrieved.

The judgment call

Before you blame the model, look at exactly what was retrieved for the failing question. Far more often than not, the “hallucination” is the retriever handing the model the wrong page, or no page at all.
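A minimal sketch of that check in Python. The field names are hypothetical: it assumes you log the chunk IDs returned for each query and have hand-labelled which chunk actually contains the answer for each failing question.

```python
from dataclasses import dataclass


@dataclass
class FailingCase:
    question: str
    gold_id: str              # chunk that holds the answer (hand-labelled, hypothetical)
    retrieved_ids: list[str]  # chunk IDs the retriever returned, best first


def classify_failure(case: FailingCase, k: int = 5) -> str:
    """Label a wrong answer as a retrieval problem or a possible model problem."""
    if case.gold_id not in case.retrieved_ids:
        return "retrieval miss: gold chunk never retrieved"
    if case.gold_id not in case.retrieved_ids[:k]:
        return "ranking miss: gold chunk retrieved but below the top k"
    return "context was present: inspect prompt, parsing, then the model"


cases = [
    FailingCase("What is the refund window?", "doc-7", ["doc-2", "doc-9", "doc-4"]),
    FailingCase("Which plan includes SSO?", "doc-3", ["doc-3", "doc-8"]),
    FailingCase("What is the rate limit?", "doc-5", ["doc-1", "doc-6", "doc-5"]),
]
labels = [classify_failure(case, k=2) for case in cases]
```

Running this over a batch of failures gives a rough count of how many are retrieval or ranking misses before anyone touches the model.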

2. When to use semantic search versus hybrid search

Semantic search turns text into vectors (lists of numbers that capture meaning) and finds passages that are about the same thing, even if they share no words. It is brilliant for fuzzy, conceptual questions. Its weakness is that it can miss exact tokens (a product SKU, an error code, a person's name, a rare acronym), because the vector blurs distinctions that matter a great deal to the person searching.
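To make "finds passages that are about the same thing" concrete, here is a toy sketch that ranks passages by cosine similarity. The three-dimensional vectors are made up for illustration; a real system would get them from an embedding model, which is not shown here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; real ones have hundreds or thousands of dimensions.
passages = {
    "Ways to reduce your monthly cloud bill": [0.9, 0.1, 0.0],
    "Resetting a forgotten password": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stands in for "how do I keep costs down?"

ranked = sorted(
    passages,
    key=lambda text: cosine_similarity(query_vector, passages[text]),
    reverse=True,
)
```

The query shares no words with the top passage, yet it ranks first because the vectors point the same way. The same mechanism is why two different SKUs can end up looking nearly identical.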

Keyword search (the classic BM25) does the opposite: it nails exact terms but is blind to paraphrase. Hybrid search runs both and fuses the results, usually then passing them through a reranker. You get meaning and exact matches.
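One common way to fuse the two result lists is reciprocal rank fusion (RRF). The article does not name a fusion method, so treat this choice as an assumption; the document IDs are hypothetical, and k = 60 is a widely used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document scores 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


semantic_hits = ["doc-a", "doc-b", "doc-c"]  # from vector search, best first
keyword_hits = ["doc-c", "doc-a", "doc-d"]   # from BM25, best first

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

Documents that both retrievers agree on rise to the top, and RRF needs only ranks, so the two score scales never have to be reconciled. A reranker would then reorder the fused list.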

Pure semantic is fine when

Questions are conceptual and paraphrased ("how do I keep costs down?")
The corpus is prose: docs, articles, support tickets
Exact identifiers rarely matter

Reach for hybrid when

Exact tokens matter: codes, SKUs, names, legal/medical terms
Users search with jargon or acronyms
Recall is critical and you cannot afford to miss the one right doc

Pick by the kind of query. In production, hybrid + rerank is the safe default.

The judgment call

Default to hybrid search with a reranker in production. Drop to pure semantic only when you have confirmed exact-term matches do not matter for your data.

3. When an agent needs memory, and when memory becomes a liability

An agent is an LLM that runs in a loop, taking steps and using tools. Memory is whatever it carries between steps or sessions: conversation history, a scratchpad, a long-term store of facts. Memory is what makes an assistant feel continuous and personalized. But it is not free, and more of it is not better.

Memory becomes a liability when

A wrong fact gets stored and poisons every future answer
Stale state makes the agent act on things that changed
History balloons the context window: more cost, more latency
It quietly retains personal data you now have to govern

Memory earns its place when

The task genuinely needs continuity across turns or sessions
Personalization changes the answer in a way users value
It is scoped and retrievable, not "dump everything into the prompt"

Memory is a tool with a cost. Add it deliberately, scope it tightly, and expire it.

The mature pattern is to treat memory like a database with retention rules: store little, make it retrievable rather than always-on, and expire or curate it. A smaller, cleaner context usually beats a giant pile of history.
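A small sketch of "memory as a database with retention rules", under stated assumptions: the class and method names are hypothetical, retrieval is a naive word-overlap match standing in for real search, and timestamps are passed in explicitly so the expiry behaviour is easy to see.

```python
import time
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    text: str
    expires_at: float  # absolute timestamp after which the item is ignored


@dataclass
class ScopedMemory:
    ttl_seconds: float
    max_items_per_scope: int = 50
    items: dict = field(default_factory=dict)  # scope -> list of MemoryItem

    def remember(self, scope, text, now=None):
        now = time.time() if now is None else now
        bucket = self.items.setdefault(scope, [])
        bucket.append(MemoryItem(text, now + self.ttl_seconds))
        if len(bucket) > self.max_items_per_scope:
            bucket.pop(0)  # store little: drop the oldest item

    def recall(self, scope, query, now=None):
        now = time.time() if now is None else now
        live = [m for m in self.items.get(scope, []) if m.expires_at > now]
        self.items[scope] = live  # expire: stale items are deleted, not just hidden
        words = set(query.lower().split())
        # Retrievable, not always-on: return only items that overlap the query.
        return [m.text for m in live if words & set(m.text.lower().split())]


memory = ScopedMemory(ttl_seconds=3600)
memory.remember("user-42", "prefers metric units", now=0)
memory.remember("user-42", "timezone is UTC+2", now=0)

fresh = memory.recall("user-42", "which units to use", now=10)
stale = memory.recall("user-42", "which units to use", now=7200)
```

Only the matching fact enters the prompt, and after the retention window nothing does.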

The judgment call

Ask: does this task actually need to remember? If not, statelessness is a feature: it is cheaper, faster, and cannot be poisoned.

4. When structured outputs are worth the loss of reasoning flexibility

A structured output forces the model to answer in a fixed shape, usually JSON matching a schema, so your code can consume it reliably. The trade-off is real: tightly constraining the format can shorten the model's “thinking room” and dent quality on genuinely hard reasoning.

Use structured output when the result feeds another system (an API call, a database write, a UI component), where reliability and parseability matter more than prose nuance.
Let it reason freely when the task is hard analysis or multi-step logic, then extract the structure in a second step.

The judgment call

For hard tasks, separate thinking from formatting: let the model reason in free text first, then make a second, cheap call (or a parser) that turns the reasoning into clean JSON. You keep both the reasoning quality and the reliable shape.
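A sketch of the two-step pattern using the parser variant. The `reason_freely` function is a hypothetical stand-in that returns canned text so the example runs without a model; the `DECISION:` and `CONFIDENCE:` line convention is also an assumption, something you would ask for in the prompt.

```python
import json
import re

ALLOWED_DECISIONS = {"approve", "reject", "escalate"}


def reason_freely(question: str) -> str:
    """Hypothetical step 1: a free-text model call with no schema imposed."""
    return (
        "The invoice total matches the purchase order. "
        "The vendor is on the approved list and the amount is under the limit.\n"
        "DECISION: approve\n"
        "CONFIDENCE: 0.9"
    )


def extract_structure(reasoning: str) -> dict:
    """Step 2: turn the free text into a validated, fixed-shape record."""
    decision = re.search(r"^DECISION:\s*(\w+)\s*$", reasoning, re.MULTILINE)
    confidence = re.search(r"^CONFIDENCE:\s*([0-9.]+)\s*$", reasoning, re.MULTILINE)
    if decision is None or confidence is None:
        raise ValueError("reasoning lacks DECISION or CONFIDENCE line")
    value = decision.group(1).lower()
    if value not in ALLOWED_DECISIONS:
        raise ValueError(f"unexpected decision: {value}")
    return {"decision": value, "confidence": float(confidence.group(1))}


reasoning = reason_freely("Should we pay invoice 1182?")
payload = json.dumps(extract_structure(reasoning))
```

The downstream system only ever sees `payload`. If extraction fails, it fails loudly with a `ValueError` instead of passing malformed data along.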

5. When latency matters more than accuracy

Not every call should chase the best possible answer. Latency is how long the user waits. For some features a fast, slightly-less-perfect answer is strictly better, because a slow one is abandoned before it arrives.

Latency wins

Interactive UX: chat first-token, autocomplete, voice
High-volume, low-stakes calls where "good enough" is fine
Anything a human is actively waiting on

Accuracy wins

Offline or batch work: reports, analysis, data pipelines
High-stakes answers: medical, legal, financial, irreversible actions
Anything fed to automation that acts without review

Map each call to its job. Many real systems route: fast model for the live path, strong model for the rest.

When latency rules, your levers are smaller/faster models, caching, streaming the answer, and trimming retrieval. When accuracy rules, you can afford bigger models, more retrieval, and techniques like asking the model several times and taking the consensus.
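The "ask several times and take the consensus" idea can be sketched as a majority vote. The `ask` callable is hypothetical and stands in for a sampled model call; here it replays canned answers so the example runs on its own.

```python
from collections import Counter
from typing import Callable


def consensus_answer(
    ask: Callable[[str], str], question: str, samples: int = 5
) -> tuple[str, float]:
    """Ask the same question several times; return the top answer and its vote share."""
    answers = [ask(question).strip().lower() for _ in range(samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / samples


# Canned replies standing in for five sampled model outputs.
canned = iter(["42", "42", "41", "42 ", "40"])
answer, agreement = consensus_answer(lambda q: next(canned), "What is 6 x 7?")
```

This multiplies cost and latency by the number of samples, which is why it belongs on the accuracy side of the trade-off. A low `agreement` value is also a useful signal to escalate to a human or a stronger model.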

The judgment call

Give every LLM call a latency budget tied to its user experience. A 3-second answer that ships beats a 9-second answer nobody waits for.
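One way to make a latency budget operational is a small router that picks the best model that fits the budget. The model names, latencies, and quality scores below are invented for illustration; in practice they would come from your own measurements and evals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    typical_latency_s: float  # hypothetical measured latency for this call type
    quality: int              # hypothetical eval score, higher is better


def pick_model(options: list[ModelOption], budget_s: float) -> ModelOption:
    """Return the highest-quality model within budget, else the fastest one."""
    within_budget = [m for m in options if m.typical_latency_s <= budget_s]
    if not within_budget:
        return min(options, key=lambda m: m.typical_latency_s)
    return max(within_budget, key=lambda m: m.quality)


options = [
    ModelOption("small-fast", 0.8, 6),
    ModelOption("mid", 3.0, 8),
    ModelOption("large-strong", 9.0, 10),
]

chat_choice = pick_model(options, budget_s=3.0)          # a human is waiting
batch_choice = pick_model(options, budget_s=60.0)        # offline report
autocomplete_choice = pick_model(options, budget_s=0.3)  # nothing fits the budget
```

The budget is attached to the call site, not to the model, so the same question can take the fast path in chat and the strong path in a nightly job.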

6. When an evaluation failure is the model versus the system around it

Evaluation (“evals”) is how you measure whether your AI is actually good, ideally on a fixed set of test cases. When a score drops, the instinct is to swap the model. Resist it. The failure is often in the machinery: a broken prompt template, bad chunking, a flaky tool, a parsing bug, or even a faulty eval (wrong gold answers, a miscalibrated LLM-as-judge).

1. Check the eval itself. Are the expected answers correct? Is the judge calibrated? A bad test fails good systems.
2. Check what was retrieved. Did the right context reach the model? (See point 1; this is the most common cause.)
3. Check the prompt and parsing. A changed template, a stray instruction, or a brittle JSON parser breaks results with no model change.
4. Check the tools. A tool returning an error or stale data makes the whole answer wrong through no fault of the model.
5. Only then, suspect the model. If the inputs were all correct and the answer is still wrong, now it is a model problem.

Trace a failing eval from the outside in. Most 'the model is dumb' verdicts turn out to be a pipeline bug.
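The outside-in order can be encoded as a triage function that returns the first layer whose check fails. The trace fields are hypothetical: the sketch assumes each eval run logs whether the gold answer was verified, what was retrieved, which prompt template was used, whether parsing succeeded, and the status of each tool call.

```python
def triage(trace: dict) -> str:
    """Return the first suspect layer for a failing eval case, model last."""
    checks = [
        ("eval", lambda t: t["gold_answer_verified"]),
        ("retrieval", lambda t: t["gold_chunk_id"] in t["retrieved_ids"]),
        (
            "prompt/parsing",
            lambda t: t["template_hash"] == t["expected_template_hash"]
            and t["parse_ok"],
        ),
        ("tools", lambda t: all(call["ok"] for call in t["tool_calls"])),
    ]
    for layer, passed in checks:
        if not passed(trace):
            return layer
    return "model"  # every input was correct and the answer is still wrong


# Hypothetical logged trace for one failing test case.
clean_trace = {
    "gold_answer_verified": True,
    "gold_chunk_id": "c1",
    "retrieved_ids": ["c1", "c4"],
    "template_hash": "abc",
    "expected_template_hash": "abc",
    "parse_ok": True,
    "tool_calls": [{"name": "search", "ok": True}],
}

verdict_clean = triage(clean_trace)
verdict_missing_chunk = triage({**clean_trace, "retrieved_ids": ["c4"]})
verdict_bad_tool = triage({**clean_trace, "tool_calls": [{"name": "search", "ok": False}]})
```

Only a trace that passes every upstream check is attributed to the model.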

The judgment call

Never change the model first. Trace the failure end to end. Swapping models to fix a retrieval or prompt bug just hides the real problem and costs you more.

The through-line: judgment, not tools

Notice the pattern. Every one of these is the same skill in a different costume: separating the model from the system around it, and choosing deliberately between retrieval and model, semantic and hybrid, memory and none, structure and freedom, speed and accuracy. Tools change every few months. This judgment is what makes you an AI engineer in 2026 and keeps you valuable as the tools churn.

If you want to build this judgment on real systems, with machine-verified missions instead of tutorials, that is exactly what the Stratiflux Academy is for.

Written by the Stratiflux engineering team

We build and run this kind of infrastructure and AI for companies, and train the engineers who do it. If a piece of this is on your plate, we can help.
