The Modern AI Glossary & Kernel Deep Dive
A deep dive into how AI data centers truly work, the kernel-level magic of networking, and the definitive glossary of modern AI infrastructure buzzwords.
1. AI Datacenter Networking: InfiniBand vs Ethernet
In a traditional web datacenter, servers occasionally talk to each other to fetch database records. In an AI datacenter, thousands of GPUs must constantly synchronize gradients during model training. If one GPU is waiting for data, all of them stall.
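The stall described above can be sketched in a few lines. This is a minimal illustration, assuming a synchronous data-parallel step in which every worker must contribute its gradient before any worker proceeds; the function names and timings are hypothetical and not taken from any particular framework.

```python
def all_reduce_mean(worker_grads):
    """Average gradients element-wise across workers (the result of an all-reduce)."""
    n = len(worker_grads)
    return [sum(vals) / n for vals in zip(*worker_grads)]


def synchronous_step_time(compute_times_ms, sync_time_ms):
    """A synchronous step finishes only when the slowest worker is done."""
    return max(compute_times_ms) + sync_time_ms


# Hypothetical gradients from three workers for a two-parameter model.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
averaged = all_reduce_mean(grads)

# Four workers that each need 10 ms, then the same group with one straggler.
healthy = synchronous_step_time([10, 10, 10, 10], sync_time_ms=2)
with_straggler = synchronous_step_time([10, 10, 10, 25], sync_time_ms=2)

print(averaged)        # [3.0, 4.0]
print(healthy)         # 12
print(with_straggler)  # 27
```

One slow worker more than doubles the step time for the whole group, which is why predictable network latency matters as much as raw bandwidth.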
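Standard Ethernet, without careful tuning, suffers from jitter (unpredictable latency) and packet loss under congestion. Many AI clusters therefore use InfiniBand, a specialized interconnect designed for high throughput and low, predictable latency. More importantly, it supports RDMA (Remote Direct Memory Access), which is also available on Ethernet through RoCE (RDMA over Converged Ethernet). With RDMA, and GPUDirect RDMA on NVIDIA hardware, the network adapter can write data directly into the memory of a GPU in a completely different server rack, keeping the CPU and the operating system kernel out of the data path.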
2. The Kernel & High Availability (HA)
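The kernel is the core of the operating system (Linux, for example). It arbitrates access to the hardware, deciding which program gets memory and CPU time. In AI workloads, every time the CPU has to switch into kernel mode to handle a network packet (through a system call or an interrupt, often followed by a copy between kernel and user memory), it wastes precious microseconds. This is why technologies like RDMA and eBPF matter. RDMA lets an application hand work to the network adapter directly from user space, bypassing the kernel on the data path. eBPF takes the opposite approach: it runs small, verified programs inside the kernel, so packets can be handled early without a round trip to user space.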
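High Availability (HA) ensures that when a server inevitably dies (and they die often in massive clusters running hot), the system as a whole keeps working. In Kubernetes, this means running multiple control plane nodes so that losing one does not take down the cluster. Training jobs need their own protection: the job saves checkpoints of the model weights and optimizer state at regular intervals. If a GPU burns out during a massive training run, the cluster evicts the bad node and restarts training on healthy hardware from the most recent checkpoint, losing only the work done since that checkpoint.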
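A minimal sketch of the checkpoint-and-resume pattern, assuming a toy training loop with a single weight and a JSON file standing in for real checkpoint storage. The function names, the `fail_at` parameter, and the file layout are all hypothetical; real jobs would use their framework's checkpoint format and a scheduler to relaunch the job.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, weights):
    """Write to a temp file, then rename, so a crash mid-write cannot corrupt it."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "weights": weights}, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (step, weights) from the last checkpoint, or a fresh start."""
    if not os.path.exists(path):
        return 0, [0.0]
    with open(path) as f:
        state = json.load(f)
    return state["step"], state["weights"]


def train(path, total_steps, checkpoint_every, fail_at=None):
    """Toy training loop: each step adds 1.0 to the single weight."""
    step, weights = load_checkpoint(path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated GPU failure")
        weights = [w + 1.0 for w in weights]
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, weights)
    return step, weights


ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(ckpt, total_steps=10, checkpoint_every=3, fail_at=7)
except RuntimeError:
    pass  # a scheduler would evict the bad node and relaunch the job here

resumed_from, _ = load_checkpoint(ckpt)
final_step, final_weights = train(ckpt, total_steps=10, checkpoint_every=3)
print(resumed_from, final_step, final_weights)  # 6 10 [10.0]
```

The job fails after completing step 7 but resumes from step 6, so one step of work is repeated. The checkpoint interval is the trade-off: frequent checkpoints cost I/O time, while infrequent ones cost more recomputation after a failure.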
3. The Buzzword Dictionary
The AI industry moves fast, and the terminology moves faster. Here is a decoded list of what engineers actually mean:
LoRA (Low-Rank Adaptation)
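Instead of updating every weight of a massive 70-billion-parameter model (which needs a multi-GPU cluster and a large budget), LoRA freezes the original weights and adds a pair of small, trainable low-rank matrices alongside selected layers. Only those small matrices are trained, which cuts the number of trainable parameters and the memory needed by orders of magnitude. For smaller models, often combined with quantization, that makes it possible to fine-tune on a single consumer GPU in an afternoon.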
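A rough sketch of the idea for a single linear layer, in plain Python so it runs without any ML library. It assumes the common formulation in which the adapted output is W x + (alpha / r) * B (A x), with B initialized to zero; the layer sizes, `alpha`, and helper names are illustrative choices.

```python
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); W is frozen, A and B are trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d_out, d_in, r):
    """Return (full fine-tuning, LoRA) trainable parameter counts for one layer."""
    return d_out * d_in, r * (d_in + d_out)


d_out, d_in, r, alpha = 8, 8, 2, 4
rng = random.Random(0)

W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # r x d_in
B = [[0.0] * r for _ in range(d_out)]                               # d_out x r, zeros
x = [rng.gauss(0, 1) for _ in range(d_in)]

# With B at zero, the adapted layer starts out identical to the frozen layer.
same_at_init = lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

# Trainable parameters for a hypothetical 4096 x 4096 layer at rank 8.
full, lora = trainable_params(4096, 4096, r=8)

print(same_at_init)  # True
print(full, lora)    # 16777216 65536
```

At rank 8, the adapter for this hypothetical layer trains 1/256 of the parameters that full fine-tuning would.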
KV Cache (Key-Value Cache)
When an LLM such as ChatGPT generates text, each new token attends to every previous token. Recomputing the key and value vectors for all previous tokens at every step is horribly slow. The KV cache stores those keys and values in the GPU's memory so each token is processed only once. The cache grows with every token in the context, so if you run out of VRAM for it, generation slows dramatically or fails outright.
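A toy single-head sketch of the saving, assuming a stand-in `project_kv` function in place of a model's learned key and value projections. It counts how many times tokens are projected with and without a cache, and estimates cache size with the usual formula (2 tensors x layers x KV heads x head dimension x sequence length x bytes per value). The model dimensions at the end are hypothetical.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]


def project_kv(embedding, counter):
    """Stand-in for the key/value projections; counts how often it runs."""
    counter[0] += 1
    return [0.5 * e for e in embedding], [2.0 * e for e in embedding]


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value):
    """Keys and values (hence the factor 2) for every layer, head, and position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

# Without a cache: every step re-projects all tokens seen so far.
naive_calls, naive_outputs = [0], []
for t in range(1, len(tokens) + 1):
    kv = [project_kv(e, naive_calls) for e in tokens[:t]]
    naive_outputs.append(attend(tokens[t - 1], [k for k, _ in kv], [v for _, v in kv]))

# With a cache: each token is projected once and reused afterwards.
cached_calls, cached_outputs = [0], []
cache_keys, cache_values = [], []
for e in tokens:
    key, value = project_kv(e, cached_calls)
    cache_keys.append(key)
    cache_values.append(value)
    cached_outputs.append(attend(e, cache_keys, cache_values))

print(naive_calls[0], cached_calls[0])  # 10 4
print(naive_outputs == cached_outputs)  # True

# Hypothetical model: 32 layers, 32 KV heads of size 128, 4096 tokens, 16-bit values.
print(kv_cache_bytes(32, 32, 128, 4096, 2) / 2**30)  # 2.0 (GiB per sequence)
```

The projection count drops from quadratic to linear in sequence length, at the price of memory that grows linearly with every token kept in context.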
MoE (Mixture of Experts)
Instead of one giant neural network where every parameter is used for every token, MoE splits parts of the model into many smaller "expert" networks. A small router network scores the experts for each token and activates only the top few. This is how massive models run fast: a model might have 100B parameters in total but use only about 15B for any given token. All experts still have to be held in memory, so MoE saves compute rather than VRAM.
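A small sketch of top-k routing, assuming the router scores arrive as a list of logits and each "expert" is a trivial function that scales its input. In a real model the logits come from a learned router and routing typically happens per token inside each MoE layer; everything named here is illustrative.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalise their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    chosen = route_top_k(router_logits, k)
    out = [0.0] * len(token)
    for idx, gate in chosen:
        expert_out = experts[idx](token)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, [idx for idx, _ in chosen]


# Four hypothetical experts; each just scales the token by a different factor.
experts = [lambda t, s=s: [s * v for v in t] for s in (1.0, 2.0, 3.0, 4.0)]

token = [1.0, -1.0]
router_logits = [0.1, 2.0, 0.3, 2.0]  # stand-in for a learned router's output

output, active = moe_layer(token, experts, router_logits, k=2)
print(active)  # [1, 3]
print(output)  # [3.0, -3.0]
```

Only two of the four experts run for this token, so the compute cost is roughly half that of running all of them, even though all four must stay loaded.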
RAG (Retrieval-Augmented Generation)
LLMs confidently hallucinate facts they don't know. RAG reduces this by intercepting your prompt, searching a document store (typically a vector database) for relevant documents such as your company wiki, and pasting that context into the prompt before the model answers. In effect, the model takes an "open book" exam instead of answering from memory.
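The retrieve-then-prompt flow can be sketched with the standard library alone. This toy version assumes bag-of-words counts in place of a learned embedding model and an in-memory list in place of a vector database; the wiki entries and the prompt wording are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts (real systems use a learned embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, top_k=1):
    """Return the top_k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, documents, top_k=1):
    """Paste the retrieved context ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, top_k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


wiki = [
    "The VPN certificate is renewed every 90 days by the platform team.",
    "Expense reports are due on the fifth day of each month.",
    "The staging cluster is rebuilt every Sunday night.",
]

question = "How often is the VPN certificate renewed?"
prompt = build_prompt(question, wiki)
print(prompt)
```

The assembled prompt would then be sent to the model. Retrieval quality matters as much as the model itself: if the wrong document is retrieved, the model is handed the wrong book.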
Quantization (INT8, INT4)
Neural networks are typically trained and stored using 16-bit or 32-bit floating-point numbers. Quantization maps these values to 8-bit or 4-bit integers plus a scale factor. It shrinks the model to a half or a quarter of its 16-bit size and usually speeds up inference, often with a surprisingly small loss in accuracy. This is how you run big models on a MacBook.
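A minimal sketch of one simple scheme, symmetric per-tensor INT8 quantization, where a single scale factor maps the largest weight to 127. Production methods usually quantize per channel or per group and are more careful about outliers; the weights below are made-up values.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers and scale."""
    return [v * scale for v in q]


def model_size_gb(num_params, bits_per_param):
    """Weight storage only, ignoring scales and other overhead."""
    return num_params * bits_per_param / 8 / 1e9


weights = [0.02, -1.27, 0.64, 0.0, 0.333]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)                        # [2, -127, 64, 0, 33]
print(max_error <= scale / 2 + 1e-12)  # True: rounding error is at most half a step

# Weight storage for a 70-billion-parameter model at different precisions.
print(model_size_gb(70e9, 16))  # 140.0
print(model_size_gb(70e9, 4))   # 35.0
```

The rounding error per weight is bounded by half the scale, which is why a single large outlier hurts: it stretches the scale and coarsens every other weight in the tensor.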
Written by the Stratiflux engineering team