The Silicon Supply Chain & Low-Level Math: ASML to Tensor Cores
From lasers vaporizing tin droplets to matrix multiplication at the speed of light. Here is the extreme engineering stack that makes modern AI possible, starting from the sand and going all the way up to CUDA.
1. The Genesis: ASML and Extreme Ultraviolet Lithography (EUV)
Modern AI chips like Nvidia's H100 are so dense with transistors (80 billion) that their finest features cannot practically be printed with visible light, or even with the 193 nm deep-ultraviolet light used by older lithography machines. The wavelength is too big for the features; it's like trying to draw a microscopic portrait with a thick Sharpie.
Enter ASML, a Dutch company that is currently the only supplier of EUV lithography machines. To produce light with a short enough wavelength (13.5 nanometers), these machines fire droplets of molten tin through a vacuum chamber at roughly 50,000 droplets per second and hit each one with a laser twice: a weaker pre-pulse that flattens the droplet, then a high-power main pulse that vaporizes it. The resulting plasma emits EUV light. That light is bounced off extraordinarily smooth mirrors and a patterned mask, projecting unimaginably small transistor patterns onto the light-sensitive coating of a silicon wafer. If ASML stopped shipping machines tomorrow, progress in leading-edge AI hardware would stall.
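Why the wavelength matters can be sketched with the Rayleigh criterion, the standard rule of thumb for optical resolution: smallest feature = k1 * wavelength / numerical aperture. The values below are assumptions for illustration and do not come from this article: a process factor k1 of 0.3, a numerical aperture of 1.35 for a 193 nm immersion scanner, and 0.33 for an EUV scanner. Treat the results as rough single-exposure estimates, not specifications of any particular machine.

```python
def min_feature_nm(wavelength_nm: float, numerical_aperture: float, k1: float = 0.3) -> float:
    """Rayleigh criterion: approximate smallest printable feature, in nanometers."""
    return k1 * wavelength_nm / numerical_aperture


# Assumed, typical optics values (illustrative only).
duv = min_feature_nm(193.0, 1.35)  # deep-ultraviolet immersion scanner
euv = min_feature_nm(13.5, 0.33)   # EUV scanner at the 13.5 nm wavelength

print(f"DUV single exposure: ~{duv:.1f} nm")
print(f"EUV single exposure: ~{euv:.1f} nm")
print(f"EUV resolves features ~{duv / euv:.1f}x smaller under these assumptions")
```

Under these assumptions, a single EUV exposure resolves features several times smaller than a single deep-ultraviolet exposure.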
2. The Foundry: TSMC
Nvidia does not actually make their own chips; they design them. They are "fabless." The actual manufacturing is done by TSMC (Taiwan Semiconductor Manufacturing Company).
TSMC takes the EUV machines from ASML and Nvidia's blueprints, and runs wafers through a process of roughly 1,000 steps over several months. Yield is everything here. A single speck of dust can ruin a chip. TSMC's mastery is in achieving high "yields" (the percentage of functional chips on a wafer) at leading-edge nodes marketed as 4nm and 3nm (these are process names, not literal feature sizes), making them the most geopolitically important manufacturer on Earth.
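To see why yield is so sensitive to die size, here is a minimal sketch using the simple Poisson yield model, yield = exp(-defect density * die area). The defect density of 0.1 defects per square centimeter and both die areas are assumed numbers chosen for illustration; they are not TSMC figures, and real fabs use more refined models.

```python
import math


def poisson_yield(defect_density_per_cm2: float, die_area_mm2: float) -> float:
    """Fraction of dies expected to have zero killer defects (Poisson model)."""
    die_area_cm2 = die_area_mm2 / 100.0  # 100 mm^2 per cm^2
    return math.exp(-defect_density_per_cm2 * die_area_cm2)


D0 = 0.1  # assumed defects per cm^2 (illustrative, not a real fab number)

small_die = poisson_yield(D0, 100.0)  # hypothetical phone-class chip
large_die = poisson_yield(D0, 800.0)  # hypothetical large datacenter GPU die

print(f"100 mm^2 die yield: {small_die:.1%}")
print(f"800 mm^2 die yield: {large_die:.1%}")
```

With the same defect density, the large die yields less than half as often as the small one, which is why very large GPU dies are so demanding to manufacture.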
3. The Hardware Bottlenecks: PCIe and Motherboards
Once you have a massive GPU, you have to plug it into a computer. The GPU plugs into a motherboard and communicates with the CPU and system memory via PCIe (Peripheral Component Interconnect Express) lanes.
PCIe acts as a highway. If the GPU is a supercomputer, PCIe is the dirt road leading to it. For AI workloads, feeding data (like LLM weights) from system RAM to the GPU over PCIe is slow compared with how fast the GPU can consume it. This is why Nvidia created NVLink, a proprietary interconnect that lets GPUs bypass PCIe and talk directly to each other at up to 900 GB/s per GPU on the H100 generation, so software can treat their combined VRAM much like one giant pool.
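A back-of-the-envelope comparison makes the gap concrete. This sketch assumes a hypothetical 70-billion-parameter model stored in 16-bit precision, and a peak of about 64 GB/s in one direction for a PCIe 5.0 x16 slot. Both are assumptions added here; only the 900 GB/s NVLink figure comes from the text above. Peak rates are used throughout, so real transfers would take longer.

```python
def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time at peak bandwidth, ignoring latency and overhead."""
    return size_gb / bandwidth_gb_per_s


params = 70e9        # hypothetical 70B-parameter model
bytes_per_param = 2  # 16-bit weights
weights_gb = params * bytes_per_param / 1e9

PCIE_GEN5_X16_GB_PER_S = 64.0  # assumed approximate one-direction peak
NVLINK_GB_PER_S = 900.0        # figure quoted in the article

pcie_s = transfer_seconds(weights_gb, PCIE_GEN5_X16_GB_PER_S)
nvlink_s = transfer_seconds(weights_gb, NVLINK_GB_PER_S)

print(f"Weights: {weights_gb:.0f} GB")
print(f"PCIe:   {pcie_s:.2f} s")
print(f"NVLink: {nvlink_s:.2f} s ({pcie_s / nvlink_s:.1f}x faster)")
```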
4. The Silicon Engine: Registers, SMs, and Tensor Cores
A CPU has a few very smart cores optimized for complex logic. A GPU has thousands of "dumb" cores optimized for doing the exact same math simultaneously.
- Registers: The fastest, smallest memory on the chip. Data must be in a register before the GPU can do math on it.
- Streaming Multiprocessors (SMs): The building blocks of the GPU. Each SM contains many cores and its own fast memory (L1 Cache/Shared Memory).
- Tensor Cores: The magic behind the AI boom. While standard CUDA cores do math one operation at a time, Tensor Cores do matrix math (multiply and accumulate) in massive chunks.
Traditional Core (Scalar):
c = a * b (one multiplication per clock cycle).
Tensor Core:
D = A * B + C, where A, B, and C are 4x4 matrices (the size used by first-generation Tensor Cores; later generations process larger tiles).
In a single clock cycle, such a Tensor Core performs 64 fused multiply-add operations, which is 64 multiplications plus 64 additions. The circuit is wired so that the output of each multiplication flows directly into an addition, with no separate instruction in between. This hardwiring of linear algebra is a major reason GPUs train models orders of magnitude faster than CPUs.
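The arithmetic inside one 4x4 multiply-accumulate can be counted with a short sketch. This is a plain sequential Python loop, written only to show what D = A * B + C computes and how many multiplications and additions it involves; the hardware performs this work in parallel. The function name `mma_4x4` is made up for this example.

```python
def mma_4x4(A, B, C):
    """Return (D, multiplies, adds) for D = A * B + C on 4x4 nested lists."""
    n = 4
    multiplies = 0
    adds = 0
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = C[i][j]  # start from the accumulator matrix
            for k in range(n):
                acc += A[i][k] * B[k][j]  # one fused multiply-add
                multiplies += 1
                adds += 1
            D[i][j] = acc
    return D, multiplies, adds


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
A = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
C = [[1.0] * 4 for _ in range(4)]

D, multiplies, adds = mma_4x4(A, identity, C)
print(multiplies, adds)  # 64 64
```

Each of the 16 output elements needs 4 multiply-adds, so the whole operation takes 64 multiplications and 64 additions.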
The Tensor Core Math
A neural network layer multiplies the inputs (X) by the weights (W) to compute the output (Y), and then applies an activation function (like ReLU) to introduce non-linearity.
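The computation described here can be sketched in a few lines: multiply the weights W by the inputs X, then pass each result through ReLU. The specific numbers and the function names `relu` and `dense_layer` are invented for illustration, and the sketch omits a bias term because the text does not mention one.

```python
def relu(x: float) -> float:
    """ReLU activation: negative values become zero."""
    return max(0.0, x)


def dense_layer(W, X):
    """Y = ReLU(W * X) for a weight matrix W (one row per output) and input vector X."""
    return [relu(sum(w * x for w, x in zip(row, X))) for row in W]


W = [[0.5, -1.0],
     [1.5, 2.0]]
X = [2.0, 3.0]

Y = dense_layer(W, X)
print(Y)  # [0.0, 9.0]
```

The first row produces -2.0 before activation, which ReLU clamps to zero; the second row produces 9.0, which passes through unchanged.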
5. The Software Layer: CUDA & GPU Operator
Hardware is useless without software. In 2006, Nvidia introduced CUDA (Compute Unified Device Architecture), a parallel computing platform and programming model that lets developers write C and C++ code, with a few extensions, and run it directly on the GPU's thousands of cores. Before CUDA, programmers had to disguise math problems as graphics rendering tasks to use a GPU. That head start of well over a decade is arguably Nvidia's true moat: every major AI framework (PyTorch, TensorFlow) is deeply optimized for CUDA.
When you run massive clusters of GPUs in a datacenter via Kubernetes, manually installing GPU drivers and CUDA libraries on every machine is a nightmare. Enter the NVIDIA GPU Operator. It automates the lifecycle of the software components needed to provision GPU worker nodes, deploying the drivers, the NVIDIA Container Toolkit, and the Kubernetes device plugin as containers, turning raw hardware into cloud-ready AI compute.
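The core idea of the CUDA programming model is that one small function (a kernel) runs once per thread, and each thread works out which element it owns from its block and thread indices. The sketch below is a CPU-only Python emulation of that indexing scheme, not real CUDA: actual kernels are written in C++ and run in parallel on the GPU. The names `launch` and `vector_add` are hypothetical helpers for this example.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a kernel launch: call `kernel` once per (block, thread) pair, sequentially."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def vector_add(block_idx, block_dim, thread_idx, a, b, out):
    """Each emulated thread adds exactly one pair of elements."""
    i = block_idx * block_dim + thread_idx  # global index, as computed in CUDA kernels
    if i < len(out):  # guard: the last block may have more threads than elements
        out[i] = a[i] + b[i]


n = 10
a = [float(i) for i in range(n)]
b = [10.0 * i for i in range(n)]
out = [0.0] * n

block_dim = 4                                 # threads per block
grid_dim = (n + block_dim - 1) // block_dim   # enough blocks to cover n elements

launch(vector_add, grid_dim, block_dim, a, b, out)
print(out)
```

On a real GPU the two loops in `launch` do not exist; the hardware schedules all of those threads across its SMs, which is where the speedup comes from.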
Written by the Stratiflux engineering team