How a frontier AI cluster comes online
Thousands of GPUs in different buildings, on different clouds, become one machine. Here is the software that makes that happen.
Training a frontier model needs thousands of GPUs working as one. But they arrive as crates of metal, in different buildings, sometimes on different clouds. Turning that into a single, reliable, secure machine, and keeping it that way as parts fail, is one of the hardest and most valuable jobs in computing. Here is what it takes.
A cluster is software, not a room of computers
Racks of GPUs are not a cluster any more than a pile of bricks is a house. A cluster is what you get when software wires those machines together: it gives them a network, a shared identity, a scheduler that places work, and the ability to notice and route around their own broken parts. The hardware is the easy bit. The platform is the product.
The cluster lifecycle, and why it must be automated
At hyperscale (think hundreds of clusters and tens of thousands of nodes) nothing is done by hand. The whole life of a cluster is code that runs itself, increasingly driven by agents:
- 1. Provision: Stand up the cluster from infrastructure-as-code: nodes, network, identity, storage, and the scheduler, the same way every time, reproducibly, across clouds and on-prem.
- 2. Validate: Acceptance-test before trusting it: hardware health, GPU-to-GPU bandwidth, multi-node scaling, security posture, and a stability soak to find the nodes that will fail at 3am.
- 3. Update: Roll out new drivers, kernels, and configs safely, a few nodes at a time, with the ability to halt and roll back.
- 4. Drain & recover: When a node or link fails, automatically move work off it, repair or replace it, and bring it back, ideally before anyone notices.
- 5. Decommission: When hardware ages out or a lease ends, tear it down cleanly: reclaim addresses, revoke credentials, wipe data, release capacity.
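The Update step lends itself to a short illustration. The following is a minimal sketch, not a production rollout controller: the `apply`, `healthy`, and `rollback` callables are hypothetical hooks standing in for whatever a real platform uses to change a node, probe it, and restore it.

```python
from typing import Callable, List


def rolling_update(
    nodes: List[str],
    apply: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    batch_size: int = 2,
) -> List[str]:
    """Update nodes a few at a time; halt and roll back on the first bad batch.

    Returns the nodes left on the new version (empty if the rollout was undone).
    """
    updated: List[str] = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            apply(node)
        updated.extend(batch)
        if not all(healthy(node) for node in batch):
            # Halt: later batches are never touched. Undo newest-first.
            for node in reversed(updated):
                rollback(node)
            return []
    return updated
```

Keeping the batch small limits the blast radius: a bad driver takes out at most `batch_size` nodes before the rollout stops.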
The network: three fabrics stitched together
For GPUs to train together they must exchange enormous amounts of data constantly. The network is not one thing but three nested fabrics, from the fastest and most local to the slowest and widest-reaching.
The third, outermost tier, which connects clusters across regions and providers, is where cloud networking gets deep. It means designing VPCs and peering them, using an AWS Transit Gateway or a Google Cloud Shared VPC as a hub, laying private links like Google Cloud Interconnect or AWS Direct Connect, and controlling the routes between them all with BGP, the same protocol that runs the internet. Get it right and a model can train across buildings as if they were one room. Get it wrong and your GPUs sit idle waiting on a saturated link.
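One small check that usually comes before any peering is confirming that the address ranges being connected do not overlap, since overlapping ranges make routes between the networks ambiguous. The sketch below uses Python's standard `ipaddress` module. The network names and CIDR blocks in the usage example are invented for illustration.

```python
import ipaddress
from itertools import combinations
from typing import Dict, List, Tuple


def find_overlaps(networks: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return pairs of network names whose CIDR ranges overlap."""
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in networks.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(parsed), 2)
        if parsed[a].overlaps(parsed[b])
    ]


if __name__ == "__main__":
    # Hypothetical address plan; none of these names come from a real deployment.
    plan = {
        "aws-train": "10.0.0.0/16",
        "gcp-train": "10.1.0.0/16",
        "azure-serve": "10.0.128.0/17",
    }
    for a, b in find_overlaps(plan):
        print(f"cannot peer {a} and {b}: address ranges overlap")
```

Running a check like this in the provisioning pipeline catches an address-plan mistake before any cross-cloud link is ordered.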
Inside the cluster: the host network
Within a Kubernetes cluster, every pod needs an address and a policy for who it may talk to. The modern stack here is worth knowing by name:
- CNI (Cilium): gives pods networking. Cilium is built on eBPF, a way to run safe, fast custom logic inside the Linux kernel, so routing and security add very little overhead.
- NetworkPolicy: the firewall rules between pods. Start from a default-deny policy, then allow only what is needed.
- Service mesh (Istio, Linkerd; Envoy is the proxy Istio runs underneath): manages service-to-service traffic and enforces mTLS, so every internal call is encrypted and authenticated, not trusted just because it is “inside.”
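The default-deny pattern for NetworkPolicy can be sketched as two manifests: one that selects every pod in a namespace and allows nothing, and one that opens a single path. The code below builds them as plain Python dictionaries and prints JSON, which `kubectl apply -f` accepts. The namespace, the `app` label key, the app names, and the port are all hypothetical.

```python
import json
from typing import Any, Dict


def default_deny(namespace: str) -> Dict[str, Any]:
    """Select every pod in the namespace and allow no ingress or egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        # An empty podSelector matches all pods; listing both policy types
        # with no rules blocks traffic in both directions.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_ingress(namespace: str, target_app: str, source_app: str, port: int) -> Dict[str, Any]:
    """Allow TCP traffic from pods labelled source_app to pods labelled target_app."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-{source_app}-to-{target_app}", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": target_app}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": {"app": source_app}}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


if __name__ == "__main__":
    for policy in (default_deny("training"), allow_ingress("training", "trainer", "scheduler", 8080)):
        print(json.dumps(policy, indent=2))
```

Policies are additive, so the allow rule widens what the default-deny policy left closed rather than overriding it.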
Secure by default, not as an afterthought
When your clusters hold the most valuable models in the world, security cannot be a checklist bolted on later. “Secure by default” means a freshly provisioned cluster is locked down before any workload touches it:
- Pod security standards + admission control: the cluster refuses to run privileged or misconfigured workloads at the door.
- RBAC + least-privilege IAM: every human and service gets the minimum access it needs, and no more.
- Node and container hardening: minimal images, no root, read-only filesystems, locked-down kernels.
- Image provenance: only run images you can prove you built, signed and verified, so a poisoned dependency cannot sneak in (supply-chain security).
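To make "refuses at the door" concrete, here is a simplified sketch of the kind of check an admission controller performs. It is not a real webhook: the pod is a pared-down dictionary, and `trusted_digests` is a hypothetical allowlist standing in for actual signature verification. The `privileged`, `runAsNonRoot`, and `readOnlyRootFilesystem` names mirror Kubernetes `securityContext` fields.

```python
from typing import Any, Dict, List, Set


def admission_violations(pod: Dict[str, Any], trusted_digests: Set[str]) -> List[str]:
    """Return the reasons a pod should be rejected; an empty list means admit."""
    violations: List[str] = []
    for container in pod.get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext", {})
        if ctx.get("privileged", False):
            violations.append(f"{name}: privileged containers are not allowed")
        if not ctx.get("runAsNonRoot", False):
            violations.append(f"{name}: must set runAsNonRoot")
        if not ctx.get("readOnlyRootFilesystem", False):
            violations.append(f"{name}: root filesystem must be read-only")
        # Require image@sha256:... so the reference cannot be silently repointed,
        # then check the digest against the allowlist (a stand-in for signatures).
        _, sep, digest = container.get("image", "").partition("@")
        if not sep or digest not in trusted_digests:
            violations.append(f"{name}: image is not pinned to a trusted digest")
    return violations
```

The check denies by default: a container that omits a setting is treated the same as one that sets it unsafely.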
Serving the model: inference across clouds
Training is half the story. Once a model exists, it must answer millions of users, and the capacity to serve them is spread across AWS, GCP, and Azure, each with different GPUs, networks, and quirks. The job flips from “build one big machine” to “route demand to the cheapest capable capacity, everywhere, reliably.”
- Cost-aware routing: send each request to the most cost-effective accelerator and region that can meet its latency target. The same compute can vary several-fold in price across providers and hardware.
- Capacity planning + autoscaling: match expensive GPU supply to spiky demand without leaving capacity idle (waste) or short (dropped requests).
- CI/CD for models: ship new model versions to millions of users with validation pipelines that catch regressions before they reach anyone.
- Cross-provider abstractions: hide the differences between clouds behind one interface, so the serving system does not need rewriting for each new platform.
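Cost-aware routing from the list above reduces to a simple rule: filter to the capacity that can meet the latency target, then pick the cheapest. The sketch below illustrates that rule only. The `Pool` type, its fields, and any numbers fed into it are assumptions for illustration, not measurements from a real provider.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Pool:
    """A hypothetical pool of serving capacity in one provider and region."""
    name: str
    cost_per_1k_tokens: float  # illustrative unit price
    p95_latency_ms: float      # observed 95th-percentile latency
    free_slots: int            # request slots currently available


def route(pools: List[Pool], latency_target_ms: float) -> Optional[Pool]:
    """Pick the cheapest pool with free capacity that meets the latency target."""
    capable = [
        p for p in pools
        if p.free_slots > 0 and p.p95_latency_ms <= latency_target_ms
    ]
    # default=None covers the case where no pool qualifies.
    return min(capable, key=lambda p: p.cost_per_1k_tokens, default=None)
```

When `route` returns `None`, the caller has to decide between queueing, relaxing the latency target, or shedding the request, which is where capacity planning and autoscaling come in.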
Operational excellence: the part that never ends
A platform this large is never “done.” What separates a great team is the discipline around failure: clear incident response, blameless postmortems that turn every outage into a fix, and on-call rotations that stay healthy because the system is built to page humans only when it truly needs them. The goal is the same as in a good data center: failure that never reaches the user, achieved through automation rather than heroics.
Why this still matters in ten years
The amount of compute the world points at AI is growing faster than almost anything in tech history, and someone has to bring it online, wire it together, secure it, and keep it serving. The hardware and the clouds will change. The discipline (lifecycle as code, fabrics stitched across providers, secure by default, route to the cheapest capable capacity, recover automatically) will only become more valuable. The people who practice it quietly make frontier AI possible.
Written by the Stratiflux engineering team
We build and run this kind of infrastructure and AI for companies, and train the engineers who do it. If a piece of this is on your plate, we can help.