AI | 16 min read | June 14, 2026

The Silicon Supply Chain & Low-Level Math: ASML to Tensor Cores

From lasers vaporizing tin drops to matrix multiplication at the speed of light. Here is the extreme engineering stack that makes modern AI possible, starting from the sand, all the way up to CUDA.

1. The Genesis: ASML and Extreme Ultraviolet Lithography (EUV)

Modern AI chips like Nvidia's H100 are so dense with transistors (80 billion) that you cannot practically draw their finest circuits with conventional deep-ultraviolet light. At 193 nanometers, the wavelength is too big; it's like trying to draw a microscopic portrait with a thick Sharpie.

Enter ASML, a Dutch company that is the world's only supplier of EUV lithography machines. To make light with a short enough wavelength (13.5 nanometers), these machines drop molten tin into a vacuum chamber and blast each droplet with a high-power laser twice, 50,000 droplets a second. The first pulse flattens the droplet and the second turns it into a plasma that emits EUV light. That light is bounced off some of the smoothest mirrors ever made and projects unimaginably small transistor patterns onto the light-sensitive coating of a silicon wafer. If ASML stopped shipping machines tomorrow, progress on leading-edge AI hardware would stall.
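The "thick Sharpie" intuition can be put into rough numbers with the Rayleigh criterion, a standard rule of thumb in lithography: smallest printable feature = k1 * wavelength / NA. The sketch below is only an estimate. The process factor k1 = 0.4 and the numerical apertures (1.35 for immersion deep-ultraviolet, 0.33 for EUV) are assumed typical values, not figures from this article.

```python
def min_feature_nm(wavelength_nm: float, numerical_aperture: float, k1: float = 0.4) -> float:
    """Rayleigh estimate of the smallest printable feature in one exposure.

    k1 is a process-dependent factor; 0.4 is an assumed, illustrative value.
    """
    return k1 * wavelength_nm / numerical_aperture


# Assumed optics: 193 nm immersion DUV with NA 1.35, 13.5 nm EUV with NA 0.33.
duv_feature = min_feature_nm(193.0, 1.35)
euv_feature = min_feature_nm(13.5, 0.33)

print(f"DUV single exposure: ~{duv_feature:.0f} nm")
print(f"EUV single exposure: ~{euv_feature:.0f} nm")
print(f"EUV prints features ~{duv_feature / euv_feature:.1f}x smaller")
```

Under these assumptions, the shorter wavelength alone buys a several-fold finer pen, even though EUV optics have a smaller numerical aperture.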

2. The Foundry: TSMC

Nvidia does not actually make their own chips; they design them. They are "fabless." The actual manufacturing is done by TSMC (Taiwan Semiconductor Manufacturing Company).

TSMC takes the EUV machines from ASML and Nvidia's blueprints, and runs wafers through a process of roughly 1,000 steps over several months. Yield is everything here: a single speck of dust can ruin a chip. TSMC's mastery is in achieving high yields (the percentage of functional chips on a wafer) on its 4nm- and 3nm-class process nodes, which makes it arguably the most geopolitically important manufacturer on Earth.
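To see why yield dominates the economics of big chips, here is a minimal sketch using the classic Poisson yield model, yield = exp(-die area * defect density). This is a textbook simplification, not TSMC's actual model, and the defect density below is a made-up illustrative number. The 8.14 cm^2 die area corresponds to the H100's roughly 814 mm^2 die.

```python
import math


def poisson_yield(die_area_cm2: float, defects_per_cm2: float) -> float:
    """Fraction of dies expected to have zero killer defects (Poisson model)."""
    return math.exp(-die_area_cm2 * defects_per_cm2)


# Hypothetical defect density, chosen only for illustration.
ASSUMED_DEFECTS_PER_CM2 = 0.1

big_gpu_die = poisson_yield(8.14, ASSUMED_DEFECTS_PER_CM2)    # ~814 mm^2, H100-sized
small_phone_die = poisson_yield(1.0, ASSUMED_DEFECTS_PER_CM2)  # ~100 mm^2

print(f"Large GPU die yield:  {big_gpu_die:.0%}")
print(f"Small mobile die yield: {small_phone_die:.0%}")
```

The same defect density hurts a large die far more than a small one, which is why giant AI GPUs are so sensitive to process cleanliness.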

3. The Hardware Bottlenecks: PCIe and Motherboards

Once you have a massive GPU, you have to plug it into a computer. The GPU sits on a motherboard and communicates with the CPU and system memory via PCIe (Peripheral Component Interconnect Express) lanes.

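PCIe is the road between the GPU and the rest of the machine. If the GPU is a supercomputer, PCIe is the dirt road leading to it. A PCIe 5.0 x16 slot moves about 64 GB/s in each direction, while the H100's own on-package memory delivers roughly 3 TB/s, so for AI workloads, feeding data (like LLM weights) from system RAM to the GPU over PCIe is painfully slow by comparison. This is one reason Nvidia created NVLink, a proprietary interconnect that allows GPUs to bypass PCIe and talk directly to each other at up to 900 GB/s per GPU (total, both directions, on the H100), so software can treat their combined VRAM almost as if it were one giant pool.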
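A quick back-of-the-envelope calculation shows the size of the gap. The sketch below assumes a hypothetical 70-billion-parameter model stored in 16-bit precision (2 bytes per weight), a nominal 64 GB/s for PCIe 5.0 x16 in one direction, and the 900 GB/s NVLink figure quoted above. These are peak numbers; real transfers are slower.

```python
def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return size_gb / bandwidth_gb_per_s


# Hypothetical model: 70 billion parameters at 2 bytes each (FP16/BF16).
PARAMS = 70e9
BYTES_PER_PARAM = 2
weights_gb = PARAMS * BYTES_PER_PARAM / 1e9

# Assumed nominal peak bandwidths in GB/s.
PCIE5_X16_GBPS = 64.0
NVLINK_GBPS = 900.0

pcie_time = transfer_seconds(weights_gb, PCIE5_X16_GBPS)
nvlink_time = transfer_seconds(weights_gb, NVLINK_GBPS)

print(f"Weights: {weights_gb:.0f} GB")
print(f"Over PCIe 5.0 x16: {pcie_time:.2f} s")
print(f"Over NVLink:       {nvlink_time:.2f} s")
print(f"NVLink advantage:  {pcie_time / nvlink_time:.1f}x")
```

A couple of seconds sounds harmless for a one-time load, but the same ratio applies every time GPUs exchange activations or gradients during training, which happens constantly.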

4. The Silicon Engine: Registers, SMs, and Tensor Cores

A CPU has a few very smart cores optimized for complex logic. A GPU has thousands of "dumb" cores optimized for doing the exact same math simultaneously.

Registers: The fastest, smallest memory on the chip. Data must be in a register before the GPU can do math on it.
Streaming Multiprocessors (SMs): The building blocks of the GPU. Each SM contains many cores and its own fast memory (L1 Cache/Shared Memory).
Tensor Cores: The magic behind the AI boom. While a standard CUDA core handles one scalar operation at a time, a Tensor Core does matrix math (multiply and accumulate) on a whole small tile of numbers at once.

The Low-Level Math of Tensor Cores

AI is mostly matrix multiplication: multiplying rows of numbers by columns of numbers and summing the results.

Traditional Core (Scalar): c = a * b (one multiplication per clock cycle).
Tensor Core: D = A * B + C, where A, B, C, and D are 4x4 matrices.
In a single clock cycle, a first-generation (Volta) Tensor Core performs 64 fused multiply-add operations, which is 128 floating-point operations if you count each multiply and each add separately. Newer generations, such as the H100's, process larger tiles per instruction. The circuit is physically wired so that the output of one multiplication flows directly into an addition circuit. This hardware-level hardcoding of linear algebra is a large part of why GPUs train models orders of magnitude faster than CPUs.
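Where does the number 64 come from? Here is a minimal pure-Python sketch of the D = A * B + C operation on 4x4 matrices that counts every multiply-accumulate. It mirrors the arithmetic only; a real Tensor Core does all of these in parallel in hardware, and the function name is made up for illustration.

```python
def mma_4x4(A, B, C):
    """Compute D = A * B + C for 4x4 matrices and count multiply-accumulates."""
    n = 4
    D = [row[:] for row in C]  # start from C, then accumulate products into it
    fma_count = 0
    for i in range(n):          # each output row
        for j in range(n):      # each output column
            for k in range(n):  # dot product of row i of A with column j of B
                D[i][j] += A[i][k] * B[k][j]
                fma_count += 1
    return D, fma_count


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
B = [[float(4 * i + j) for j in range(4)] for i in range(4)]
C = [[1.0] * 4 for _ in range(4)]

D, fma_count = mma_4x4(identity, B, C)
print(f"multiply-accumulates: {fma_count}")
print(f"D[0] = {D[0]}")
```

There are 16 outputs and each needs a 4-element dot product, so 16 * 4 = 64 multiply-accumulates. A scalar core would need 64 separate steps for what the Tensor Core treats as one.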

The Tensor Core Math in a Neural Network

[Interactive demo omitted: adjustable Weights (W), Inputs (X), and Bias (B), with the resulting Output (Y).] A neural network layer computes the output Y = W * X + B, and then applies an Activation Function (like ReLU, max(0, val)) to introduce non-linearity.
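The computation described here, Y = W * X + B followed by an optional ReLU, fits in a few lines. This is a minimal sketch with made-up example values; real frameworks run the same arithmetic on the GPU over much larger matrices.

```python
def relu(value: float) -> float:
    """ReLU activation: max(0, val)."""
    return max(0.0, value)


def dense_layer(W, X, B, apply_relu=True):
    """Compute Y = W * X + B for a weight matrix W, input vector X, bias vector B."""
    Y = []
    for weights_row, bias in zip(W, B):
        # Each output is a dot product of one weight row with the inputs, plus a bias.
        y = sum(w * x for w, x in zip(weights_row, X)) + bias
        Y.append(relu(y) if apply_relu else y)
    return Y


# Illustrative values only.
W = [[1.0, -2.0],
     [0.5, 1.0]]
X = [3.0, 2.0]
B = [0.0, -1.0]

print("Without ReLU:", dense_layer(W, X, B, apply_relu=False))  # [-1.0, 2.5]
print("With ReLU:   ", dense_layer(W, X, B, apply_relu=True))   # [0.0, 2.5]
```

The negative output is clipped to zero by ReLU while the positive one passes through unchanged. That clipping is the non-linearity.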

5. The Software Layer: CUDA & GPU Operator

Hardware is useless without software. In 2006, Nvidia introduced CUDA (Compute Unified Device Architecture). It is a parallel computing platform and programming model that allows developers to write C++ code, with a few extensions, and have it execute directly on the GPU's thousands of cores. Before CUDA, programmers had to disguise math problems as graphics rendering tasks to use a GPU. CUDA's nearly two-decade head start is Nvidia's true moat: every major AI framework (PyTorch, TensorFlow) is deeply optimized for CUDA.
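The core idea of the CUDA programming model is that you write one small function (a "kernel") and the GPU runs it once per thread, with each thread using its own index to pick which piece of data to work on. The sketch below simulates that idea in plain Python. It is not CUDA and it runs sequentially; `launch` and `saxpy_kernel` are made-up names, though the block index, block size, and thread index arithmetic mirrors how a real kernel computes its position.

```python
def saxpy_kernel(block_idx, block_dim, thread_idx, a, x, y, out):
    """What ONE GPU thread would do: compute a single element of a*x + y."""
    i = block_idx * block_dim + thread_idx  # this thread's global index
    if i < len(x):                          # guard: the last block may overshoot
        out[i] = a * x[i] + y[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Stand-in for a GPU launch: a real GPU runs these threads in parallel."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


n = 10
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n

block_dim = 4                              # threads per block
grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n elements

launch(saxpy_kernel, grid_dim, block_dim, 2.0, x, y, out)
print(out)
```

Note that the kernel contains no loop over the data. The parallelism comes entirely from launching many threads, which is what lets the same code scale from ten elements to billions.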

When you run large clusters of GPUs in a datacenter via Kubernetes, manually installing Nvidia drivers and supporting software on every machine is a nightmare. Enter Nvidia's GPU Operator. It automates the lifecycle of the software components needed to provision GPU worker nodes, deploying the driver, the NVIDIA Container Toolkit, and the Kubernetes device plugin automatically, turning raw hardware into cloud-ready AI compute.
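Once the device plugin is running, each node advertises its GPUs to Kubernetes under the resource name `nvidia.com/gpu`, and a workload simply asks for some. The sketch below builds a minimal Pod manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The pod name, container name, and image are hypothetical placeholders, not real artifacts.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpu_count: int = 1) -> dict:
    """Build a minimal Kubernetes Pod manifest that requests Nvidia GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # GPUs are requested as an extended resource under limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
                }
            ],
        },
    }


# Hypothetical name and image, for illustration only.
manifest = gpu_pod_manifest("gpu-smoke-test", "example.com/my-cuda-app:latest")
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

The scheduler then places the pod only on a node with a free GPU, and the application never has to know which physical card it was given.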

Written by the Stratiflux engineering team
