How GPUs actually run AI (and why latency is everything)
The stack underneath every model: GPU hardware, parallel training, the interconnect, and low-latency inference, in plain language.
Models get the headlines. Underneath them sits a stack of hardware and software that most people never see, and it is where the real engineering happens. The people who keep GPUs busy, training fast, and inference cheap are some of the most sought-after engineers in the world. Here is what they actually understand.
Why a GPU, and not a CPU
A CPU is a few very clever workers who can do complicated, varied tasks quickly. A GPU is thousands of simpler workers that all do the same kind of math at the same time. AI is almost entirely one kind of math, matrix multiplication, done at gigantic scale. A transformer is, underneath, a tower of matrix multiplies and a thing called attention. That is exactly the workload a GPU was built to devour.
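To make "the same kind of math at the same time" concrete, here is a minimal pure-Python sketch of a matrix multiply. It is an illustration, not how a GPU library does it; the function names `matmul` and `matmul_flops` are made up for this example. The point to notice is that every output cell is computed without looking at any other output cell, which is what lets thousands of workers run at once.

```python
def matmul(a, b):
    """Multiply an m x k matrix by a k x n matrix (both given as lists of rows)."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # Cell (i, j) depends only on row i of a and column j of b.
            # No cell waits on another, so each could go to its own GPU thread.
            out[i][j] = sum(a[i][p] * b[p][j] for p in range(k))
    return out


def matmul_flops(m, k, n):
    """Each of the m * n output cells needs k multiplies and k adds."""
    return 2 * m * k * n


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
product = matmul(a, b)
print(product)
print(matmul_flops(4096, 4096, 4096), "floating-point operations for one 4096-square multiply")
```

A single 4096 by 4096 multiply is already over a hundred billion operations, and a transformer performs many of them per step.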
Inside the chip, just enough vocabulary
You do not need to write GPU code to reason about performance, but a few terms unlock everything else:
- SMs (streaming multiprocessors) are the GPU's compute units. A big GPU has 100+ of them, each running many threads at once.
- Warps are groups of 32 threads that execute together in lockstep. The whole machine is built around doing the same operation across 32 lanes.
- HBM is the GPU's main memory, very fast, but still the usual bottleneck. Moving data to and from it costs more than the math itself, surprisingly often.
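- Tensor Cores are special hardware just for matrix multiply, and they deliver their full throughput only at lower precision (FP16/BF16/FP8). Run everything in plain FP32 and you leave most of the chip's math capability unused.
- Occupancy is how many warps an SM has in flight compared with the most it can hold. With too few, the SM has nothing else to run while one warp waits on memory, so it stalls. (A GPU that sits idle waiting for the CPU or for other GPUs shows up as low utilization, a related but different number.)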
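The claim that moving data often costs more than the math can be checked with a back-of-the-envelope "roofline" calculation. The sketch below is a rough model under stated assumptions: `PEAK_FLOPS` and `PEAK_BANDWIDTH` are hypothetical round numbers, not the specification of any real GPU, and the byte count assumes each matrix is read or written exactly once.

```python
# Hypothetical hardware figures, chosen only to make the arithmetic readable.
PEAK_FLOPS = 1.0e15       # low-precision math throughput, FLOP per second
PEAK_BANDWIDTH = 3.0e12   # HBM bandwidth, bytes per second


def matmul_cost(m, k, n, bytes_per_value=2):
    """FLOPs and bytes for an (m x k) @ (k x n) multiply, ideal data reuse assumed."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_value * (m * k + k * n + m * n)
    return flops, bytes_moved


def bound_by(flops, bytes_moved, peak_flops=PEAK_FLOPS, peak_bandwidth=PEAK_BANDWIDTH):
    """Compare the work's FLOPs-per-byte with what the chip can sustain per byte."""
    intensity = flops / bytes_moved
    ridge = peak_flops / peak_bandwidth  # FLOPs the chip can do per byte it loads
    return "compute" if intensity >= ridge else "memory"


big_square = bound_by(*matmul_cost(4096, 4096, 4096))
matrix_vector = bound_by(*matmul_cost(1, 4096, 4096))
print("4096-square matmul is", big_square, "bound")
print("matrix-vector product is", matrix_vector, "bound")
```

Under these assumptions the large multiply reuses each loaded value thousands of times and is limited by math, while the matrix-vector product does about one operation per byte and is limited by memory.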
What actually makes training slow
Training repeats a simple loop billions of times: run the model forward, measure how wrong it is, compute corrections backward, update the weights. The cost hides in four places, each with its own tell:
- Wrong precision. Plain FP32 leaves most of the Tensor Core throughput on the table. Switching to mixed precision (BF16/FP16) is often the single biggest free win.
- Memory pressure. The activations saved for the backward pass can blow past HBM. Techniques like activation checkpointing and sharding trade a little compute for a lot of memory.
- Starved input. If the data pipeline can't keep up, the GPU sits idle between steps. The fix is cheap: more loader workers, prefetching, faster storage.
- Communication. The moment you use more than one GPU, they must agree on the new weights, and waiting on each other can dominate. This is the deep one.
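For the precision item above, it helps to see what the 16-bit formats actually do to a number. The following is a small sketch using only Python's `struct` module. The helpers `to_bf16` and `to_fp16` are hypothetical names, and `to_bf16` truncates rather than rounds to nearest, which is a simplification for brevity.

```python
import struct


def to_bf16(x):
    """Round-trip x through bfloat16 by keeping the top 16 bits of its float32 form.

    Assumption: plain truncation. Real hardware normally rounds to nearest.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def to_fp16(x):
    """Round-trip x through IEEE half precision; out-of-range values become inf."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


for value in (1.0, 70000.0, 1e-8):
    print(value, "-> bf16:", to_bf16(value), " fp16:", to_fp16(value))
```

BF16 keeps the same exponent range as FP32 and gives up precision instead, so large and tiny values survive in coarser form. FP16 has more precision but a narrow range, so the same values overflow or vanish, which is why FP16 training usually needs loss scaling.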
Many GPUs, one model: the kinds of parallelism
Frontier models are far too big for one GPU, so the work is split. There is more than one way to split it, and real training runs combine several.
A related trick, sharding (FSDP, or ZeRO), splits not just the work but the model's parameters, gradients, and optimizer state across GPUs so none of them has to hold the whole thing. It is how a model that could never fit on one GPU trains on a cluster of them.
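A rough estimate shows why sharding matters. The sketch below assumes mixed-precision training with an Adam-style optimizer, using a common accounting of 16 bytes per parameter: 2 for the 16-bit weights, 2 for the gradients, and 12 for the optimizer's FP32 weight copy and two moment estimates. Those byte counts and the function name are assumptions for illustration, and activations are deliberately left out.

```python
BYTES_WEIGHTS = 2     # 16-bit working copy of the parameters
BYTES_GRADS = 2       # 16-bit gradients
BYTES_OPTIMIZER = 12  # FP32 master weights plus two FP32 moment estimates


def model_state_bytes_per_gpu(num_params, num_gpus,
                              shard_optimizer=False,
                              shard_grads=False,
                              shard_weights=False):
    """Bytes of model state each GPU holds; a sharded piece is divided across GPUs."""
    def share(bytes_per_param, sharded):
        return bytes_per_param / num_gpus if sharded else bytes_per_param

    per_param = (share(BYTES_WEIGHTS, shard_weights)
                 + share(BYTES_GRADS, shard_grads)
                 + share(BYTES_OPTIMIZER, shard_optimizer))
    return num_params * per_param


params = 7_000_000_000  # a hypothetical 7-billion-parameter model
replicated = model_state_bytes_per_gpu(params, 8)
fully_sharded = model_state_bytes_per_gpu(params, 8, True, True, True)
print("replicated on every GPU: %.0f GB" % (replicated / 1e9))
print("fully sharded over 8 GPUs: %.0f GB" % (fully_sharded / 1e9))
```

Under this accounting the replicated model state alone is 112 GB per GPU, while full sharding over eight GPUs brings it to 14 GB each, paid for with extra communication to gather the pieces when they are needed.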
The interconnect: the part nobody sees until it hurts
Once GPUs must talk constantly, how they are wired together becomes as important as how fast they compute. The library that coordinates the talking is NCCL, and its most important operation is the all-reduce: every GPU shares its gradients and ends up with the combined result. Do that badly and your expensive GPUs spend their time waiting.
- NVLink / NVSwitch: very fast direct links between GPUs in the same server. You want your chattiest GPUs here.
- PCIe: the ordinary bus, much slower. Gradient sync over PCIe is a classic, hidden bottleneck.
- InfiniBand / RoCE: high-speed networking between servers, so a training run can span hundreds of machines.
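To show what an all-reduce involves, here is a toy simulation of the classic ring algorithm, with plain Python lists standing in for GPUs. It is a sketch of one well-known approach, not a description of what NCCL does internally, and the function names are invented for this example.

```python
def ring_all_reduce(vectors):
    """Sum equal-length vectors across n simulated ranks arranged in a ring."""
    n = len(vectors)
    length = len(vectors[0])
    bufs = [list(v) for v in vectors]
    bounds = [c * length // n for c in range(n + 1)]  # chunk c = bounds[c]:bounds[c+1]

    def exchange(chunk_for_rank, combine):
        # Snapshot every outgoing message first so all ranks "send at once".
        sends = []
        for rank in range(n):
            c = chunk_for_rank(rank)
            sends.append((rank, c, bufs[rank][bounds[c]:bounds[c + 1]]))
        for rank, c, payload in sends:
            dst = (rank + 1) % n
            for offset, value in enumerate(payload):
                i = bounds[c] + offset
                bufs[dst][i] = combine(bufs[dst][i], value)

    # Phase 1, reduce-scatter: after n - 1 steps each rank owns one fully summed chunk.
    for step in range(n - 1):
        exchange(lambda r: (r - step) % n, lambda mine, theirs: mine + theirs)
    # Phase 2, all-gather: pass the finished chunks around until everyone has all of them.
    for step in range(n - 1):
        exchange(lambda r: (r + 1 - step) % n, lambda mine, theirs: theirs)
    return bufs


def ring_bytes_sent_per_rank(size_bytes, n):
    """Each rank sends 2 * (n - 1) chunks of size_bytes / n."""
    return 2 * (n - 1) * size_bytes / n


grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
result = ring_all_reduce(grads)
print(result)
```

Each rank sends close to twice the gradient size no matter how many GPUs join the ring, so the time for a sync is set largely by the slowest link in it. That is why placement on NVLink versus PCIe matters so much.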
Inference is a different animal
Serving a trained model has its own physics, and it splits into two phases that behave nothing alike:
- 1. Prefill: read the prompt. The model processes your entire prompt at once. This is compute-heavy and parallel, so the GPU is busy. A long prompt means a long prefill, which is most of your time-to-first-token.
- 2. Decode: write the answer. The model then generates the answer one token at a time, each step depending on the last. This is memory-bound, not compute-bound, and it is where most of the wall-clock time of a long answer goes.
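The difference between the two phases can be put into rough numbers. The sketch below uses two common approximations, both assumptions here: a forward pass costs about 2 floating-point operations per parameter per token, and a single-request decode step must stream every weight from memory once. The hardware figures and the model size are hypothetical round numbers.

```python
PEAK_FLOPS = 1.0e15       # hypothetical math throughput, FLOP per second
PEAK_BANDWIDTH = 3.0e12   # hypothetical HBM bandwidth, bytes per second


def prefill_seconds(num_params, prompt_tokens, peak_flops=PEAK_FLOPS):
    """Lower bound: prefill is limited by math, about 2 FLOPs per parameter per token."""
    return 2 * num_params * prompt_tokens / peak_flops


def decode_seconds_per_token(num_params, bytes_per_param, bandwidth=PEAK_BANDWIDTH):
    """Lower bound for one request: every weight is read from memory once per token."""
    return num_params * bytes_per_param / bandwidth


params = 7e9  # hypothetical 7-billion-parameter model stored in 16 bits
prefill_s = prefill_seconds(params, prompt_tokens=2000)
per_token_s = decode_seconds_per_token(params, bytes_per_param=2)
answer_s = 500 * per_token_s
print("prefill of 2000 tokens: at least %.3f s" % prefill_s)
print("decode of 500 tokens:   at least %.3f s" % answer_s)
```

Even as best-case bounds, the numbers show the shape of the problem: two thousand prompt tokens go through in a few hundredths of a second, while five hundred generated tokens take seconds, because each one pays for a full read of the weights.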
The concepts that make serving fast all come from managing those two phases:
- KV cache. So it doesn't re-read the whole conversation every token, the model caches the attention keys and values. This cache can be huge and is usually what limits how many requests fit on a GPU.
- PagedAttention. Manage that cache in small pages instead of one big block, so memory isn't wasted and many requests of different lengths can share the GPU.
- Continuous batching. Instead of a fixed batch where everyone waits for the slowest, add and remove requests on the fly. This alone often multiplies throughput several times over.
- Quantization. Run the model at lower precision (INT8, FP8). It frees memory, fits more concurrent requests, and speeds decode, at a small, measurable quality cost.
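The size of the KV cache, and the saving from paging it, both follow from simple arithmetic. The sketch below is illustrative only: the model shape (32 layers, 32 key-value heads, head size 128), the 16-token page, and the 4096-token reservation are hypothetical values, and the function names are invented for this example.

```python
import math


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """One key vector and one value vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_cached_tokens(free_bytes, per_token_bytes):
    """How many tokens of cache fit in the memory left over after the weights."""
    return free_bytes // per_token_bytes


def pages_needed(tokens, page_size=16):
    return math.ceil(tokens / page_size)


def wasted_slots_contiguous(tokens, reserved_len):
    """One big block reserved up front for the longest possible request."""
    return reserved_len - tokens


def wasted_slots_paged(tokens, page_size=16):
    """Paging wastes at most the unused tail of the last page."""
    return pages_needed(tokens, page_size) * page_size - tokens


per_token = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128)
print("KV cache per token:", per_token, "bytes")
print("tokens that fit in 40 GiB:", max_cached_tokens(40 * 2**30, per_token))
print("100-token request, contiguous waste:", wasted_slots_contiguous(100, 4096))
print("100-token request, paged waste:", wasted_slots_paged(100))
```

With these assumed shapes every cached token costs half a mebibyte, so the cache, not the math, decides how many conversations fit. Paging reduces the waste for a short request from thousands of reserved slots to a handful.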
This is why serving frameworks like vLLM, TensorRT-LLM, and SGLang exist: they implement exactly these tricks. Putting one in front of a model is frequently a 2-4x throughput win on the same hardware, the cheapest performance you will ever buy.
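Continuous batching is the easiest of these tricks to see in a toy model. The simulation below counts decode steps only, with one token per active request per step; it ignores prefill, memory limits, and everything else a real scheduler handles. The request lengths are made up, and this is not how any of the named frameworks is implemented.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Fixed batches: each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths, slots):
    """Refill a slot from the queue as soon as its request finishes."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1  # one decode step produces one token for every active request
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical output lengths, in tokens, for eight requests arriving together.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)
print("static batching:    ", static_steps, "steps")
print("continuous batching:", continuous_steps, "steps")
```

In this example the short requests no longer wait behind the long ones, and the same eight requests finish in 110 steps instead of 200. The gain depends entirely on how mixed the request lengths are.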
How an expert actually finds the problem
The skill that ties it all together is profiling, and the method is always the same: top-down, cheapest tool first.
- 1. Watch the high-level numbers. Throughput (tokens or samples per second), GPU utilization, memory, and step-time breakdown. nvidia-smi and DCGM dashboards. Is the GPU even busy?
- 2. Profile the framework. PyTorch Profiler shows which operations cost the most and separates CPU overhead from GPU time. Most "my training is slow" answers are here.
- 3. Look at the whole timeline. Nsight Systems shows the CPU, the GPU, memory copies, and NCCL on one timeline, exposing idle gaps, stalls, and communication waits.
- 4. Open the kernel. If one operation dominates, Nsight Compute inspects it: occupancy, memory access, Tensor Core usage. This is the deepest, rarest rung.
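The first rung of that ladder can be scripted in a few lines. The sketch below parses the CSV output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` and flags GPUs that look idle. The `SAMPLE` text is invented so the example runs on a machine without a GPU, and the 50 percent threshold is an arbitrary assumption, not a recommended value.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Turn one CSV line per GPU into a dictionary of numbers."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, used, total = [field.strip() for field in line.split(",")]
        stats.append({
            "gpu": int(index),
            "util_pct": float(util),
            "mem_used_mib": float(used),
            "mem_total_mib": float(total),
        })
    return stats


def idle_gpus(stats, threshold_pct=50.0):
    """GPUs whose utilization is below the (arbitrary) threshold."""
    return [s["gpu"] for s in stats if s["util_pct"] < threshold_pct]


def read_gpu_stats():
    """Query the real driver. Needs nvidia-smi on the PATH; not called below."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)


# Made-up sample output: GPU 0 is busy, GPU 1 has full memory but is barely working.
SAMPLE = "0, 97, 70212, 81559\n1, 12, 70190, 81559\n"
sample_stats = parse_gpu_stats(SAMPLE)
print("idle GPUs:", idle_gpus(sample_stats))
```

A GPU like number 1 in the sample, full of data but mostly idle, is the cue to move down to the next rung and find out what it is waiting for.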
And it all runs on Kubernetes
In a real cluster, none of this is hand-placed. GPUs are exposed to Kubernetes through a device plugin; the NVIDIA GPU Operator installs the drivers and runtime; DCGM exports the GPU metrics that fill the dashboards. Before a new cluster is trusted with a million-dollar training run, it is acceptance-tested: hardware health, driver compatibility, single-GPU performance, multi-GPU NCCL bandwidth, multi-node scaling, and a long stability soak to shake out the GPUs that will fail at 3am. This is exactly where cloud-and-Kubernetes skills meet GPU performance, and why the role is so hard to fill.
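From the workload's side, asking Kubernetes for a GPU is a one-line resource request. The sketch below builds a minimal Pod manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The resource name `nvidia.com/gpu` is the one the NVIDIA device plugin advertises; the pod name and the image `registry.example.com/cuda-smoke-test:latest` are hypothetical placeholders.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """A minimal Pod that requests whole GPUs and runs nvidia-smi once."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,  # hypothetical image; assumed to contain nvidia-smi
                "command": ["nvidia-smi"],
                # GPUs are requested under limits, in whole units.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


manifest = gpu_pod_manifest("gpu-smoke-test",
                            "registry.example.com/cuda-smoke-test:latest",
                            gpus=2)
print(json.dumps(manifest, indent=2))
```

A pod like this, scheduled onto each node in turn, is one simple way to start the hardware-health part of an acceptance test; the bandwidth and soak tests need considerably more than this.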
Why this still matters in ten years
The chips will get faster and the frameworks friendlier, but the questions do not change: is it compute, memory, or communication bound? Is the precision right? Is the cache managed? Is the interconnect the bottleneck? Models will keep growing, which only makes the infrastructure underneath them more valuable, not less. The engineers who understand this stack will be the ones building, and operating, the systems everyone else depends on.
Written by the Stratiflux engineering team