AI · 13 min read · June 13, 2026

How a frontier AI cluster comes online

Thousands of GPUs in different buildings, on different clouds, become one machine. Here is the software that makes that happen.

Training a frontier model needs thousands of GPUs working as one. But they arrive as crates of metal, in different buildings, sometimes on different clouds. Turning that into a single, reliable, secure machine, and keeping it that way as parts fail, is one of the hardest and most valuable jobs in computing. Here is what it takes.

Note
This is the world of our premium AI Cluster & Inference Platform track: the platform engineering that frontier labs like Anthropic and GPU clouds like Nebius hire whole teams to do.

A cluster is software, not a room of computers

Racks of GPUs are not a cluster any more than a pile of bricks is a house. A cluster is what you get when software wires those machines together: gives them a network, a shared identity, a scheduler that places work, and the ability to notice and route around their own broken parts. The hardware is the easy bit. The platform is the product.

The cluster lifecycle, and why it must be automated

At hyperscale (think hundreds of clusters and tens of thousands of nodes) nothing is done by hand. The whole life of a cluster is code that runs itself, increasingly driven by agents:

  1. Provision
    Stand up the cluster from infrastructure-as-code: nodes, network, identity, storage, and the scheduler, the same way every time, reproducibly, across clouds and on-prem.
  2. Validate
    Acceptance-test before trusting it: hardware health, GPU-to-GPU bandwidth, multi-node scaling, security posture, and a stability soak to find the nodes that will fail at 3am.
  3. Update
    Roll out new drivers, kernels, and configs safely, a few nodes at a time, with the ability to halt and roll back.
  4. Drain & recover
    When a node or link fails, automatically move work off it, repair or replace it, and bring it back, ideally before anyone notices.
  5. Decommission
    When hardware ages out or a lease ends, tear it down cleanly: reclaim addresses, revoke credentials, wipe data, release capacity.
The cluster lifecycle. Every stage is defined in code (Terraform) and orchestrated as a durable workflow (Temporal, Argo) so it can run, pause, and recover without a human babysitting it.
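The Update stage is the easiest one to picture in code. What follows is a minimal sketch, not how Temporal or Argo express it: the callbacks `apply`, `healthy`, and `rollback` are hypothetical stand-ins for whatever actually installs a driver, probes a node, and reverts it. It touches a few nodes at a time, halts on the first unhealthy batch, and undoes everything it changed.

```python
from typing import Callable, List, Sequence


def rolling_update(
    nodes: Sequence[str],
    apply: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    batch_size: int = 2,
) -> List[str]:
    """Update nodes in small batches; halt and roll back on the first bad batch.

    Returns the nodes left on the new version (empty if the rollout was undone).
    All three callbacks are assumptions made for illustration.
    """
    updated: List[str] = []
    for start in range(0, len(nodes), batch_size):
        batch = list(nodes[start:start + batch_size])
        for node in batch:
            apply(node)
        updated.extend(batch)
        if not all(healthy(node) for node in batch):
            # Halt: later batches are never touched. Undo newest changes first.
            for node in reversed(updated):
                rollback(node)
            return []
    return updated
```

A durable workflow engine adds what this sketch lacks: the loop's progress survives a crash of the process running it, so a rollout can pause for hours and resume from the same batch.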
The key idea
The skill is treating a fleet of clusters like cattle, not pets. Any single node, or whole cluster, can die, and the system keeps running because provisioning and recovery are just code that runs again. That mindset, and the workflow engines that make it durable, is the heart of the job.
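"Code that runs again" usually means a reconcile step: compare what should exist with what is observed, and emit the actions that close the gap. The sketch below assumes a deliberately simple model, where `desired` is a set of node names and `observed` maps each node seen in the fleet to a health flag. Both shapes are illustrative and not taken from any particular tool.

```python
from typing import Dict, List, Set


def reconcile(desired: Set[str], observed: Dict[str, bool]) -> Dict[str, List[str]]:
    """Return the actions needed to move the observed fleet toward the desired one.

    observed maps node name -> True if healthy. The function has no side effects,
    so running it again on the same inputs yields the same plan. That is what
    makes retrying after a crash safe.
    """
    plan: Dict[str, List[str]] = {
        "provision": [],
        "drain_and_replace": [],
        "decommission": [],
    }
    for node in sorted(desired):
        if node not in observed:
            plan["provision"].append(node)          # missing: create it
        elif not observed[node]:
            plan["drain_and_replace"].append(node)  # sick: move work off, rebuild
    for node in sorted(set(observed) - desired):
        plan["decommission"].append(node)           # no longer wanted: tear down
    return plan


plan = reconcile({"gpu-a", "gpu-b", "gpu-c"}, {"gpu-a": True, "gpu-b": False, "gpu-d": True})
print(plan)
```

No node is special in this model. A dead node is just a difference between two sets, and the next run of the loop removes that difference.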

The network: three fabrics stitched together

For GPUs to train together they must exchange enormous amounts of data constantly. The network is not one thing but three nested fabrics, each faster and more local than the last:

Inside a server: NVLink / NVSwitch (fastest, GPU-to-GPU in one box)
Inside a cluster: InfiniBand / RoCE (high-bandwidth, node-to-node)
Between clusters & clouds: Interconnect / peering (VPC peering, Transit Gateway, Direct Connect, BGP)
The three tiers of cluster networking. A performance problem at any tier stalls the GPUs above it, which is why network engineering is inseparable from AI infrastructure.

That third tier is where cloud networking gets deep. Connecting clusters across regions and providers means designing VPCs and peering them, using a Transit Gateway or Shared VPC as a hub, laying private links like Cloud Interconnect or Direct Connect, and controlling the routes between them all with BGP, the same protocol that runs the internet. Get it right and a model can train across buildings as if they were one room. Get it wrong and your GPUs sit idle waiting on a saturated link.
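A back-of-the-envelope calculation shows why a slow tier leaves GPUs idle. In a ring all-reduce, a common way to sum gradients across N GPUs, each GPU sends and receives about 2(N-1)/N times the payload size. The sketch below turns that into a time estimate. It ignores latency and overlap with compute, and the three bandwidth figures are illustrative placeholders, not vendor specifications.

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int, link_gbytes_per_s: float) -> float:
    """Bandwidth-only estimate of one ring all-reduce.

    Each GPU moves 2 * (N - 1) / N * payload bytes over its link.
    Latency and compute overlap are ignored.
    """
    if num_gpus < 2:
        return 0.0
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic_bytes / (link_gbytes_per_s * 1e9)


# Hypothetical per-GPU bandwidths in GB/s, chosen only to show the spread between tiers.
ASSUMED_TIERS = {"inside a server": 400.0, "inside a cluster": 50.0, "between clusters": 5.0}

payload = 10e9  # 10 GB of gradients, an assumed figure
for tier, bandwidth in ASSUMED_TIERS.items():
    seconds = ring_allreduce_seconds(payload, num_gpus=8, link_gbytes_per_s=bandwidth)
    print(f"{tier}: {seconds:.3f} s per all-reduce")
```

With these assumed numbers the same exchange takes roughly 0.04 s, 0.35 s, and 3.5 s. If a training step needs one such exchange, the slowest link on the path sets the pace for every GPU involved.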

Inside the cluster: the host network

Within a Kubernetes cluster, every pod needs an address and a policy for who it may talk to. The modern stack here is worth knowing by name:

  • CNI (Cilium): gives pods networking. Cilium is built on eBPF, a way to run safe, fast custom logic inside the Linux kernel, so routing and security are enforced in the kernel with very low overhead.
  • NetworkPolicy: the firewall rules between pods. Kubernetes allows all pod traffic until a policy selects a pod, so the usual pattern is to apply a default-deny policy first, then allow only what is needed.
  • Service mesh (Istio, Linkerd): manages service-to-service traffic and enforces mTLS, so every internal call is encrypted and authenticated, not trusted just because it is “inside.” Envoy is the proxy Istio commonly uses as its data plane.
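The default-deny-then-allow pattern for NetworkPolicy fits in two small manifests. The sketch below builds them as Python dictionaries and prints JSON, which `kubectl apply -f` accepts alongside YAML. The field names follow the `networking.k8s.io/v1` NetworkPolicy schema. The namespace, labels, and port are hypothetical.

```python
import json


def default_deny_policy(namespace: str) -> dict:
    """Select every pod in the namespace and allow no ingress or egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector = all pods in the namespace
            "policyTypes": ["Ingress", "Egress"],
        },
    }


def allow_ingress_policy(namespace: str, app: str, from_app: str, port: int) -> dict:
    """Allow pods labelled app=<from_app> to reach pods labelled app=<app> on one TCP port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-{from_app}-to-{app}", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": app}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": {"app": from_app}}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


# Hypothetical names: a "training" namespace where the scheduler may call the trainer.
for policy in (
    default_deny_policy("training"),
    allow_ingress_policy("training", app="trainer", from_app="scheduler", port=8080),
):
    print(json.dumps(policy, indent=2))
```

Because the first policy also denies egress, a real cluster would need further allow rules, for example for DNS, before workloads function. Policies are only enforced when the CNI plugin supports them.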

Secure by default, not as an afterthought

When your clusters hold the most valuable models in the world, security cannot be a checklist bolted on later. “Secure by default” means a freshly provisioned cluster is locked down before any workload touches it:

  • Pod security standards + admission control: the cluster refuses to run privileged or misconfigured workloads at the door.
  • RBAC + least-privilege IAM: every human and service gets the minimum access it needs, and no more.
  • Node and container hardening: minimal images, no root, read-only filesystems, locked-down kernels.
  • Image provenance: only run images you can prove you built, signed and verified, so a poisoned dependency cannot sneak in (supply-chain security).
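To make "refuses at the door" concrete, here is a simplified sketch of the kind of check an admission controller performs. It is an illustration only: production clusters rely on Kubernetes Pod Security Admission or a policy engine, and real provenance checks verify signatures rather than just requiring a digest. The `securityContext` field names are the real Pod spec fields. The function and its rules are assumptions.

```python
from typing import List


def admission_violations(pod: dict) -> List[str]:
    """Return the reasons a pod manifest should be rejected; empty list means admit.

    Simplification: only container-level securityContext is inspected, although
    Kubernetes also lets some of these fields be set at pod level.
    """
    violations: List[str] = []
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext") or {}
        if ctx.get("privileged", False):
            violations.append(f"{name}: privileged containers are not allowed")
        if ctx.get("runAsNonRoot") is not True:
            violations.append(f"{name}: runAsNonRoot must be true")
        if ctx.get("readOnlyRootFilesystem") is not True:
            violations.append(f"{name}: readOnlyRootFilesystem must be true")
        if "@sha256:" not in container.get("image", ""):
            # A mutable tag such as ":latest" can point at different bits tomorrow.
            violations.append(f"{name}: image must be pinned by digest")
    return violations


# Hypothetical manifest that breaks every rule above.
bad_pod = {
    "spec": {
        "containers": [
            {"name": "debug", "image": "ubuntu:latest", "securityContext": {"privileged": True}}
        ]
    }
}
for reason in admission_violations(bad_pod):
    print("DENY", reason)
```

The design choice worth noting is the default. A missing setting counts as a violation, so a workload has to opt in to being safe rather than opt out.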

Serving the model: inference across clouds

Training is half the story. Once a model exists, it must answer millions of users, and they are spread across AWS, GCP, and Azure, each with different GPUs, networks, and quirks. The job flips from “build one big machine” to “route demand to the cheapest capable capacity, everywhere, reliably.”

Request (from anywhere) → Route (by latency + cost) → Cheapest GPU/region (across CSPs) → Serve (autoscaled)
Multi-cloud inference: a request is routed to the region and accelerator that can serve it fastest and cheapest, with capacity matched to demand.
  • Cost-aware routing: send each request to the most cost-effective accelerator and region that can meet its latency target. The same compute can vary several-fold in price across providers and hardware.
  • Capacity planning + autoscaling: match expensive GPU supply to spiky demand without leaving capacity idle (waste) or short (dropped requests).
  • CI/CD for models: ship new model versions to millions of users with validation pipelines that catch regressions before they reach anyone.
  • Cross-provider abstractions: hide the differences between clouds behind one interface, so the serving system does not need rewriting for each new platform.
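Cost-aware routing reduces to a filter followed by a minimum: drop the backends that cannot meet the latency target or have no free capacity, then take the cheapest of what remains. The sketch below assumes a hypothetical `Backend` record. The field names, prices, and latencies are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Backend:
    """One pool of accelerators in one region (illustrative fields only)."""
    name: str
    cost_per_1k_tokens: float
    p95_latency_ms: float
    free_slots: int


def pick_backend(backends: List[Backend], latency_target_ms: float) -> Optional[Backend]:
    """Return the cheapest backend that has capacity and meets the latency target."""
    eligible = [
        b for b in backends
        if b.free_slots > 0 and b.p95_latency_ms <= latency_target_ms
    ]
    if not eligible:
        return None  # caller decides: queue, shed load, or relax the target
    # Cheapest first; break price ties in favour of lower latency.
    return min(eligible, key=lambda b: (b.cost_per_1k_tokens, b.p95_latency_ms))


# Hypothetical fleet spanning three providers.
fleet = [
    Backend("cloud-a/us-east", cost_per_1k_tokens=0.40, p95_latency_ms=120, free_slots=3),
    Backend("cloud-b/eu-west", cost_per_1k_tokens=0.25, p95_latency_ms=310, free_slots=8),
    Backend("cloud-c/us-west", cost_per_1k_tokens=0.30, p95_latency_ms=180, free_slots=0),
]
choice = pick_backend(fleet, latency_target_ms=200)
print(choice.name if choice else "no capacity")
```

In this example the cheapest backend is too slow and the next cheapest is full, so the request goes to the most expensive one. That is the trade-off capacity planning tries to avoid: when cheap capacity is short, every request pays the premium.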

Operational excellence: the part that never ends

A platform this large is never “done.” What separates a great team is the discipline around failure: clear incident response, blameless postmortems that turn every outage into a fix, and on-call rotations that stay healthy because the system is built to page humans only when it truly needs them. The goal is the same as in a good data center: failure that never reaches the user, achieved through automation rather than heroics.

Why this still matters in ten years

The amount of compute the world points at AI is growing faster than almost anything in tech history, and someone has to bring it online, wire it together, secure it, and keep it serving. The hardware and the clouds will change. The discipline (lifecycle as code, fabrics stitched across providers, secure by default, route to the cheapest capable capacity, recover automatically) will only become more valuable. The people who practice it quietly make frontier AI possible.

The one thing to remember
At scale, a cluster is a software system that happens to have GPUs in it. The leverage is in the automation, the network design, and the operational discipline, not in any single machine. Master those and you can run compute that most companies cannot even buy.

Written by the Stratiflux engineering team

We build and run this kind of infrastructure and AI for companies, and train the engineers who do it. If a piece of this is on your plate, we can help.
