AI · 14 min read · June 13, 2026

How GPUs actually run AI (and why latency is everything)

The stack underneath every model: GPU hardware, parallel training, the interconnect, and low-latency inference, in plain language.

Models get the headlines. Underneath them sits a stack of hardware and software that most people never see, and it is where the real engineering happens. The people who keep GPUs busy, training fast, and inference cheap are some of the most sought-after engineers in the world. Here is what they actually understand.

Note

This is the foundation of our premium AI Infrastructure & GPU Performance track. If the rest of this article excites you, that is the job.

Why a GPU, and not a CPU

A CPU is a few very clever workers who can do complicated, varied tasks quickly. A GPU is thousands of simpler workers that all do the same kind of math at the same time. AI is almost entirely one kind of math, matrix multiplication, done at gigantic scale. A transformer is, underneath, a tower of matrix multiplies and a thing called attention. That is exactly the workload a GPU was built to devour.

Inside the chip, just enough vocabulary

You do not need to write GPU code to reason about performance, but a few terms unlock everything else:

SMs (streaming multiprocessors) are the GPU's compute units. A big GPU has 100+ of them, each running many threads at once.
Warps are groups of 32 threads that execute together in lockstep. The whole machine is built around doing the same operation across 32 lanes.
HBM is the GPU's main memory, very fast, but still the usual bottleneck. Moving data to and from it costs more than the math itself, surprisingly often.
Tensor Cores are special hardware just for matrix multiply, and they reach full speed only at lower precision (FP16/BF16/FP8). Run everything in plain FP32 and you leave most of the chip's math throughput unused.
Occupancy is how many warps an SM keeps in flight relative to the most it can hold. With too few warps resident, the SM has nothing to switch to while it waits on memory, so it stalls.

The key idea

The whole game is keeping those thousands of workers fed. A GPU at “90% utilization” can still be wasting most of its potential if it is running the wrong precision or stalling on memory. The first question a performance engineer asks is always: what is it waiting for?
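One rough way to answer "what is it waiting for?" is a roofline-style estimate: compare how much math an operation does per byte it moves against the ratio the hardware can sustain. The sketch below is illustrative only. The `Gpu` figures are assumed round numbers, not the specs of any real part, and the model assumes each matrix is read or written exactly once.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    peak_flops: float     # peak math throughput in FLOP/s (assumed figure)
    mem_bandwidth: float  # HBM bandwidth in bytes/s (assumed figure)


def matmul_intensity(m, n, k, bytes_per_elem):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], each matrix touched once."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def bound_by(gpu, intensity):
    """Classify an operation as compute-bound or memory-bound."""
    ridge = gpu.peak_flops / gpu.mem_bandwidth  # FLOPs the chip can do per byte
    return "compute" if intensity >= ridge else "memory"


gpu = Gpu(peak_flops=1.0e15, mem_bandwidth=3.0e12)  # hypothetical accelerator
big = matmul_intensity(4096, 4096, 4096, 2)   # large square matmul in BF16
skinny = matmul_intensity(1, 4096, 4096, 2)   # one row times a weight matrix
print("big matmul is", bound_by(gpu, big), "bound")
print("skinny matmul is", bound_by(gpu, skinny), "bound")
```

Under these assumptions the large square multiply lands on the compute side and the single-row multiply on the memory side, which is the same split that shows up later between prefill and decode.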

What actually makes training slow

Training repeats a simple loop billions of times: run the model forward, measure how wrong it is, compute corrections backward, update the weights. The cost hides in four places:

Load data: feed the GPU
Forward: predict
Backward: compute gradients
Sync + update: across GPUs

The training loop. The forward and backward passes are the compute; everything else is overhead that a tuned system overlaps or hides.

The usual culprits, and their tells:

Wrong precision. FP32 ignores the Tensor Cores. Switching to mixed precision (BF16/FP16) is often the single biggest free win.
Memory pressure. The activations saved for the backward pass can blow past HBM. Activation checkpointing trades a little recomputation for a lot of memory, and sharding trades some extra communication for the same relief.
Starved input. If the data pipeline can't keep up, the GPU sits idle between steps. The fix is cheap: more loader workers, prefetching, faster storage.
Communication. The moment you use more than one GPU, they must agree on the new weights, and waiting on each other can dominate. This is the deep one.
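To make the starved-input case concrete, here is a minimal timing model, not a measurement. It assumes every batch takes the same `load_s` seconds to load and `compute_s` seconds to compute, and that prefetching lets the next batch load while the current one runs. The function names and the numbers are made up for illustration.

```python
def run_seconds(load_s, compute_s, steps, prefetch):
    """Wall-clock time for `steps` training steps in a two-stage model."""
    if steps <= 0:
        return 0.0
    if not prefetch:
        # Load, then compute, strictly one after the other.
        return steps * (load_s + compute_s)
    # First load is exposed; after that the slower stage sets the pace;
    # the final compute has nothing left to overlap with.
    return load_s + (steps - 1) * max(load_s, compute_s) + compute_s


def gpu_busy_fraction(load_s, compute_s, steps, prefetch):
    """Share of wall-clock time the GPU spends computing."""
    return steps * compute_s / run_seconds(load_s, compute_s, steps, prefetch)


for prefetch in (False, True):
    busy = gpu_busy_fraction(0.08, 0.10, 100, prefetch)
    print(f"prefetch={prefetch}: GPU busy {busy:.0%}")
```

In this model, prefetching only hides the loader when `load_s` is at most `compute_s`. If loading is the slower stage, the GPU still waits, and the remedy is a faster loader rather than more overlap.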

Many GPUs, one model: the kinds of parallelism

Frontier models are far too big for one GPU, so the work is split. There is more than one way to split it, and real training runs combine several:

Data parallelism: every GPU has the full model, different data; sync gradients
Tensor parallelism: split one giant layer across GPUs
Pipeline parallelism: different layers on different GPUs, like an assembly line
Context parallelism: split a very long input sequence across GPUs
Expert parallelism: route to different experts on different GPUs (MoE)

The forms of parallelism. Big training runs stack several of these at once, and the art is balancing compute against communication.
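Tensor parallelism is the easiest of these to show in a few lines. The toy below uses plain Python lists in place of GPUs and splits a weight matrix by columns. Each shard computes its own slice of the output, and joining the slices gives the same result as the unsplit multiply. It is a sketch of the idea only: real frameworks also split by rows and fuse the communication, and all helper names here are hypothetical.

```python
def matmul(a, b):
    """Plain matrix multiply on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w, parts):
    """Cut weight matrix w into `parts` equal column shards, one per 'GPU'."""
    cols = len(w[0])
    assert cols % parts == 0, "columns must divide evenly across shards"
    width = cols // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]


def tensor_parallel_matmul(x, w, parts):
    """Each shard computes a column slice of x @ w; slices are then joined."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Stand-in for an all-gather: concatenate the column slices row by row.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2, 3], [4, 5, 6]]
w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 2, 0, 1]]
print(tensor_parallel_matmul(x, w, parts=2) == matmul(x, w))
```

The concatenation step is where the communication cost lives: every shard must send its slice to the others before the next layer can run.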

A related trick, sharding (FSDP, or ZeRO), splits not just the work but the model's parameters, gradients, and optimizer state across GPUs so none of them has to hold the whole thing. It is how a model that could never fit on one GPU trains on a cluster of them.
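A back-of-the-envelope calculation shows why sharding matters. The sketch assumes mixed-precision training with Adam at roughly 16 bytes per parameter: 2 for BF16 weights, 2 for gradients, and 12 for FP32 master weights plus two optimizer moments. It follows the usual ZeRO stage numbering and ignores activations and temporary buffers, so treat the output as a lower bound.

```python
def per_gpu_gib(params, gpus, stage):
    """Approximate model-state memory per GPU, in GiB, for a ZeRO-style stage.

    stage 0: nothing sharded (plain data parallelism)
    stage 1: optimizer state sharded
    stage 2: optimizer state and gradients sharded
    stage 3: optimizer state, gradients and weights sharded
    """
    weights = 2 * params   # BF16 weights (assumed)
    grads = 2 * params     # BF16 gradients (assumed)
    optim = 12 * params    # FP32 master weights + Adam moments (assumed)
    if stage >= 1:
        optim /= gpus
    if stage >= 2:
        grads /= gpus
    if stage >= 3:
        weights /= gpus
    return (weights + grads + optim) / 2**30


for stage in range(4):
    print(f"stage {stage}: {per_gpu_gib(7e9, 8, stage):.1f} GiB per GPU")
```

For a hypothetical 7-billion-parameter model on 8 GPUs, the unsharded state alone exceeds the memory of most single GPUs, while the fully sharded version is one eighth of that.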

The interconnect: the part nobody sees until it hurts

Once GPUs must talk constantly, how they are wired together becomes as important as how fast they compute. The library that coordinates the talking is NCCL, and its most important operation is the all-reduce: every GPU shares its gradients and ends up with the combined result. Do that badly and your expensive GPUs spend their time waiting.

NVLink / NVSwitch: very fast direct links between GPUs in the same server. You want your chattiest GPUs here.
PCIe: the ordinary bus, much slower. Gradient sync over PCIe is a classic, hidden bottleneck.
InfiniBand / RoCE: high-speed networking between servers, so a training run can span hundreds of machines.
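The all-reduce described above is often implemented as a ring, which is one of several algorithms a communication library may choose. The simulation below is a teaching sketch in plain Python, not NCCL's code. Each "GPU" holds one value per chunk, the ring first reduces and scatters the chunks, then gathers them back. The cost function covers the bandwidth term only and ignores latency.

```python
def ring_all_reduce(data):
    """Simulate a ring all-reduce (sum). data[r] is rank r's vector, one value per chunk."""
    n = len(data)
    assert all(len(row) == n for row in data), "need one chunk per rank"
    buf = [list(row) for row in data]

    # Reduce-scatter: after n-1 steps, rank r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, buf[r][(r - step) % n]) for r in range(n)]
        for r, chunk, value in sends:  # all ranks send to their right neighbour
            buf[(r + 1) % n][chunk] += value

    # All-gather: pass the finished chunks around the ring until everyone has all of them.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, buf[r][(r + 1 - step) % n]) for r in range(n)]
        for r, chunk, value in sends:
            buf[(r + 1) % n][chunk] = value
    return buf


def ring_all_reduce_seconds(num_bytes, gpus, link_bytes_per_s):
    """Bandwidth-only estimate: each rank sends 2*(n-1)/n of the buffer."""
    return 2 * (gpus - 1) / gpus * num_bytes / link_bytes_per_s


grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
print(ring_all_reduce(grads)[0])
```

The cost formula is why link speed dominates: the data each GPU must send barely shrinks as the ring grows, so the same gradient buffer takes proportionally longer over a slower link.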

The signature of a comms bottleneck

GPU utilization that sawtooths (busy, idle, busy, idle) in time with each training step. The GPUs are computing, then all stopping to wait for the slowest one to finish talking. The fix is topology (use NVLink), overlap (sync while still computing), and tuning NCCL.

Inference is a different animal

Serving a trained model has its own physics, and it splits into two phases that behave nothing alike:

1. Prefill: read the prompt. The model processes your entire prompt at once. This is compute-heavy and parallel, so the GPU is busy. A long prompt means a long prefill, which is most of your time-to-first-token.
2. Decode: write the answer. The model then generates the answer one token at a time, each step depending on the last. This is memory-bound, not compute-bound, and it is where most of the wall-clock time of a long answer goes.

The two phases of LLM inference. They compete for the same GPU, which is why naive serving is slow, and why the right framework matters so much.
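A rough estimate makes the difference between the two phases visible. The sketch uses two common approximations: prefill costs about 2 FLOPs per parameter per prompt token and is limited by math throughput, while decode at batch size 1 must stream every weight from memory for each token and is limited by bandwidth. The hardware figures are assumed round numbers, and KV cache reads are ignored.

```python
def prefill_seconds(params, prompt_tokens, flops_per_s):
    """Compute-bound estimate: ~2 FLOPs per parameter per prompt token."""
    return 2 * params * prompt_tokens / flops_per_s


def decode_seconds_per_token(params, bytes_per_param, mem_bytes_per_s):
    """Memory-bound estimate at batch size 1: read all weights once per token."""
    return params * bytes_per_param / mem_bytes_per_s


PARAMS = 7e9      # hypothetical 7B-parameter model
FLOPS = 1.0e15    # assumed achievable FLOP/s
BANDWIDTH = 3.0e12  # assumed HBM bytes/s

ttft = prefill_seconds(PARAMS, prompt_tokens=2000, flops_per_s=FLOPS)
per_token = decode_seconds_per_token(PARAMS, bytes_per_param=2, mem_bytes_per_s=BANDWIDTH)
print(f"prefill of 2000 tokens: {ttft * 1000:.0f} ms")
print(f"decode of 500 tokens:   {500 * per_token:.2f} s")
```

Under these assumptions, reading 2000 prompt tokens takes far less time than writing 500 answer tokens. Batching several requests together lets one pass over the weights serve many tokens, which is the lever the serving techniques in this section pull.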

The concepts that make serving fast all come from managing those two phases:

KV cache. So it doesn't re-read the whole conversation every token, the model caches the attention keys and values. This cache can be huge and is usually what limits how many requests fit on a GPU.
PagedAttention. Manage that cache in small pages instead of one big block, so memory isn't wasted and many requests of different lengths can share the GPU.
Continuous batching. Instead of a fixed batch where everyone waits for the slowest, add and remove requests on the fly. This alone often multiplies throughput several times over.
Quantization. Run the model at lower precision (INT8, FP8). It frees memory, fits more concurrent requests, and speeds decode, at a small, measurable quality cost.
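The KV cache and paging ideas above come down to simple arithmetic. The sketch below estimates cache size per token from the model shape, then compares reserving a full maximum-length slot per request against allocating fixed-size pages on demand. The model shape, block size, and request lengths are made-up examples, and the helper names are hypothetical.

```python
import math


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """One key and one value vector per KV head, per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def contiguous_tokens(request_lengths, max_len):
    """Token slots reserved if every request gets a full max_len buffer."""
    return len(request_lengths) * max_len


def paged_tokens(request_lengths, block_size):
    """Token slots reserved if each request takes only the pages it needs."""
    blocks = sum(math.ceil(n / block_size) for n in request_lengths)
    return blocks * block_size


per_token = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128, bytes_per_elem=2)
lengths = [100, 900, 30]  # current tokens held by three hypothetical requests
reserved = contiguous_tokens(lengths, max_len=2048)
paged = paged_tokens(lengths, block_size=16)
print(f"{per_token / 2**20:.2f} MiB of cache per token")
print(f"contiguous: {reserved * per_token / 2**30:.2f} GiB, paged: {paged * per_token / 2**30:.2f} GiB")
```

With paging, the waste per request is at most one partly filled page, so the memory saved can hold more concurrent requests.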

This is why serving frameworks like vLLM, TensorRT-LLM, and SGLang exist: they implement exactly these tricks. Putting one in front of a model is frequently a 2-4x throughput win on the same hardware, the cheapest performance you will ever buy.
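Continuous batching is the easiest of these tricks to simulate. The toy scheduler below counts decode steps for a queue of requests, first with fixed batches that wait for their slowest member, then with slots that are refilled the moment a request finishes. It assumes every step costs the same regardless of how many slots are busy and that every request needs at least one step, so it shows the scheduling effect only.

```python
from collections import deque


def static_batch_steps(lengths, slots):
    """Fixed batches: each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batch_steps(lengths, slots):
    """Refill free slots from the queue before every decode step."""
    queue = deque(lengths)
    active = []  # remaining decode steps for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Each running request emits one token; finished requests free their slot.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


requests = [100, 10, 10, 10, 100, 10, 10, 10]  # tokens to generate per request
print("static:    ", static_batch_steps(requests, slots=4))
print("continuous:", continuous_batch_steps(requests, slots=4))
```

In the static case the short requests sit idle behind the long one in each batch. In the continuous case their slots go straight to the next waiting request, which is where the throughput gain comes from.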

How an expert actually finds the problem

The skill that ties it all together is profiling, and the method is always the same: top-down, cheapest tool first.

1. Watch the high-level numbers. Throughput (tokens or samples per second), GPU utilization, memory, and step-time breakdown. The tools are nvidia-smi and DCGM dashboards. Is the GPU even busy?
2. Profile the framework. PyTorch Profiler shows which operations cost the most and separates CPU overhead from GPU time. Most "my training is slow" answers are here.
3. Look at the whole timeline. Nsight Systems shows the CPU, the GPU, memory copies, and NCCL on one timeline, exposing idle gaps, stalls, and communication waits.
4. Open the kernel. If one operation dominates, Nsight Compute inspects it: occupancy, memory access, Tensor Core usage. This is the deepest, rarest rung.

The profiling ladder. You only descend a rung when the rung above points you deeper. Most problems are caught in the first two.
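The first rung can start with a simple step-time breakdown. The helper below is a minimal standard-library sketch with a hypothetical `StepTimer` class. It measures wall-clock time only, and the `time.sleep` calls stand in for real work. GPU kernels launch asynchronously, so in a real training loop you would synchronize the device before stopping each timer (for example with `torch.cuda.synchronize()` in PyTorch) or the numbers will mislead.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StepTimer:
    """Accumulates wall-clock seconds per named phase of a training step."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def breakdown(self):
        """Fraction of measured time spent in each phase."""
        total = sum(self.totals.values())
        return {name: t / total for name, t in self.totals.items()} if total else {}

    def slowest(self):
        """Name of the phase with the largest accumulated time."""
        return max(self.totals, key=self.totals.get)


timer = StepTimer()
for _ in range(3):
    with timer.phase("data"):
        time.sleep(0.02)   # stand-in for loading a batch
    with timer.phase("compute"):
        time.sleep(0.005)  # stand-in for forward + backward
print(timer.slowest(), timer.breakdown())
```

If one phase dominates the breakdown, that tells you which rung to descend to next.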

And it all runs on Kubernetes

In a real cluster, none of this is hand-placed. GPUs are exposed to Kubernetes through a device plugin; the NVIDIA GPU Operator installs the drivers and runtime; DCGM exports the GPU metrics that fill the dashboards. Before a new cluster is trusted with a million-dollar training run, it is acceptance-tested: hardware health, driver compatibility, single-GPU performance, multi-GPU NCCL bandwidth, multi-node scaling, and a long stability soak to shake out the GPUs that will fail at 3am. This is exactly where cloud-and-Kubernetes skills meet GPU performance, and why the role is so hard to fill.
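For a sense of what "exposed through a device plugin" looks like from the workload side, a pod asks for GPUs as a resource limit. The sketch below builds such a manifest as a Python dictionary and prints it as JSON, which Kubernetes accepts alongside YAML. `nvidia.com/gpu` is the resource name the NVIDIA device plugin advertises. The pod name, image, and `gpu_pod` helper are hypothetical placeholders.

```python
import json


def gpu_pod(name, image, gpus):
    """Build a minimal Pod manifest that requests `gpus` NVIDIA GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,  # placeholder image, not a real registry path
                    "command": ["nvidia-smi"],  # quick check that the GPU is visible
                    # GPUs are requested as a limit; the scheduler places the pod
                    # only on a node that still has this many GPUs free.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


manifest = gpu_pod("gpu-smoke-test", "example.com/cuda-base:latest", gpus=1)
print(json.dumps(manifest, indent=2))
```

A pod like this, run once per node, is a common shape for the simplest acceptance check: does the container see a healthy GPU at all?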

Why this still matters in ten years

The chips will get faster and the frameworks friendlier, but the questions do not change: is it compute, memory, or communication bound? Is the precision right? Is the cache managed? Is the interconnect the bottleneck? Models will keep growing, which only makes the infrastructure underneath them more valuable, not less. The engineers who understand this stack will be the ones building, and operating, the systems everyone else depends on.

The one thing to remember

A model's speed and cost are decided less by the model and more by the stack it runs on: the precision, the parallelism, the interconnect, and the way inference is batched and cached. Master that stack and you control the two numbers everyone cares about: latency and cost.

Written by the Stratiflux engineering team

We build and run this kind of infrastructure and AI for companies, and train the engineers who do it. If a piece of this is on your plate, we can help.
