April 23, 2026
Compute Trades [In Progress]
In Claude Code:
"Add dark mode functionality to my blog."
A few minutes later, seven files were edited, tests passed, and dark mode worked. From keystroke to commit, maybe 4 minutes of wall-clock time.
But what actually happened in those 4 minutes? What hardware processed it? What physical infrastructure made it possible? And most importantly — who captures the margin when millions of users run the same kind of agentic loop simultaneously?
This post traces a single inference request through the serving stack that produces it, down the supply chain that builds the hardware, and into the margin stack that decides where capital should go.
Inspired by Daniel Gross's AGI Trades. Partially to deeply understand the stack. Partially to try and make some money.
What actually happened
Here's the literal path of one request — eleven stations, from keystroke on my laptop to the first token painting on my screen. Read the prose paragraphs and you'll understand the trace top to bottom; the diagrams show what's actually on the wire; the sidenotes (right margin on desktop, between paragraphs on mobile) define any term that wasn't introduced yet. Zero prior LLM knowledge required — every jargon term gets glossed the first time it appears.
1. You hit return — a 50K-token request leaves your laptop
Claude Code packages the message, system prompt, tool schemas, and conversation history into a JSON body.
~50K tokens of context
Where it runs
- hw: Your laptop CPU (M-series / x86) · Node.js process
- net: localhost, no network yet
- site: your desk
POST https://api.anthropic.com/v1/messages
{
"model": "claude-opus-4-7",
"max_tokens": 4096,
"stream": true,
"system": "You are Claude Code, Anthropic's official CLI...",
"tools": [
{ "name": "Read", "input_schema": { ... } },
{ "name": "Edit", "input_schema": { ... } },
{ "name": "Bash", "input_schema": { ... } },
...
],
"messages": [
{ "role": "user",
"content": "Add dark mode functionality to my blog." }
]
}
x-api-key, anthropic-version: 2023-06-01, and content-type: application/json headers attached. The payload looks tiny, but the system prompt and tool schemas alone are ~5–8K tokens before the user types anything.
The thing that looks tiny on screen, "Add dark mode functionality to my blog.", is the smallest piece of what gets sent. Claude Code wraps your message in a JSON body that also carries the system prompt (Anthropic's pre-loaded instructions for how Claude Code should behave), every tool schema (the JSON shape of every action the model can take: Read, Edit, Bash, etc.), and the entire conversation history of this session. In a working session that bundle is ~50,000 tokens before your new message is counted. Your dark-mode sentence itself is 7 tokens. The other 49,993 are context the model needs to act on those 7.
Why does this matter? Because every cost in the rest of this section scales with how big that bundle is. GPU compute, memory bandwidth, network latency, cache hit rate — all of them are downstream of "how many tokens did you just send."
2. TLS to Cloudflare's edge — the boring half-second
api.anthropic.com resolves to an Anthropic-owned IP block, terminating at Cloudflare's edge.
TLS 1.3 · ALPN h2
Where it runs
- hw: Cloudflare edge server (commodity x86 + Linux), acting as TLS terminator
- net: your ISP → public internet → BGP-anycast to nearest Cloudflare PoP (e.g. DEN, SJC, IAD)
- site: Cloudflare PoP, ~10–30 ms RTT from you
$ dig +short api.anthropic.com
160.79.104.10    ← Anthropic-owned /24 (ARIN: AP-2440)
$ curl -vI https://api.anthropic.com
* TLSv1.3 (OUT), Client hello
* SSL connection using TLSv1.3 / AEAD-CHACHA20-POLY1305-SHA256
* ALPN: server accepted h2
* Server certificate: CN=api.anthropic.com, issuer=WE1 (Google Trust Services)
< HTTP/2 200
< server: cloudflare
< cf-ray: 9f32cff6...-DEN
Anthropic announces its own /24 through Cloudflare's edge (BYOIP / Magic Transit pattern). TLS 1.3 + HTTP/2 over TCP. No HTTP/3 (Alt-Svc not advertised). The cf-ray tells you the PoP — -DEN is Denver.
The JSON body becomes bytes on a TLS-encrypted HTTP/2 connection to api.anthropic.com, which resolves (via DNS) to an Anthropic-owned IP block fronted by Cloudflare's edge network. Round-trip from your laptop to the nearest Cloudflare PoP is ~10–30 ms. This is the same plumbing every HTTPS request on the internet uses; I mention it only because the next station is where things stop being generic.
3. Routing — the cache decides the economics
Past the edge, the request is authenticated, rate-checked, then steered to the replica that already holds this prefix's KV cache.
~5 min cache TTL
Where it runs
- hw: Anthropic frontend pods: x86 CPUs running Envoy / Go / Python services. No accelerators yet.
- net: Cloudflare → Anthropic origin over Magic Transit, then internal VPC mesh (AWS us-east-1 / us-west-2 or GCP us-central1)
- site: AWS or GCP region, the same region as the GPU/Trainium fleet, sub-1 ms to it
prefix_hash = sha256(system + tools[] + messages[0..N-1])
replica = consistent_hash(prefix_hash) → "decode-pool/shard-43"
cache_lookup(prefix_hash) → HIT
  ↳ skip prefill for the 49,950 cached tokens
  ↳ only the new 50-token tail needs prefill
Prompt caching is the load-bearing economics for agents. Anthropic's docs commit to per-replica prefix locality (otherwise the cache couldn't hit), so the LB has to be prefix-aware. Cache reads bill at ~10% of input price; cache writes at ~125%. Claude Code's 84–92% hit rate is the difference between agents being routine and uneconomic.
Past the edge, Anthropic's serving frontend authenticates your API key, checks rate limits, and then does the move that makes agents economic: prefix-affinity routing. It hashes the system prompt + tools + conversation history, looks up which GPU replica recently saw that exact prefix, and steers your request to the same replica. That replica still has the K/V projections for those 49,950 prefix tokens sitting in its KV cache (the model's working memory of prior tokens). It skips redoing prefill on them. Only the 50-token tail — your new dark-mode sentence plus the immediately preceding turn — needs fresh compute.
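A minimal sketch of what prefix-affinity routing could look like, following the pseudo-formula in the diagram (hash everything except the newest message). This is an illustration, not Anthropic's actual router: the replica names are hypothetical, and rendezvous hashing stands in for whatever consistent-hashing scheme a real load balancer uses.

```python
import hashlib
import json

def prefix_hash(system, tools, messages):
    """Hash system prompt + tools + all messages except the newest one."""
    stable = json.dumps(
        {"system": system, "tools": tools, "messages": messages[:-1]},
        sort_keys=True,
    )
    return hashlib.sha256(stable.encode("utf-8")).hexdigest()

def pick_replica(key, replicas):
    """Rendezvous hashing: score every replica against the key, keep the max."""
    def score(replica):
        return hashlib.sha256(f"{key}:{replica}".encode("utf-8")).digest()
    return max(replicas, key=score)

# Hypothetical replica pool and session contents, for illustration only.
replicas = [f"decode-pool/shard-{i}" for i in range(64)]
system = "You are Claude Code..."
tools = [{"name": "Read"}, {"name": "Edit"}, {"name": "Bash"}]
history = [{"role": "user", "content": "Add dark mode functionality to my blog."}]

# Two requests that share a history but differ in their newest message.
turn_a = history + [{"role": "user", "content": "tool result A"}]
turn_b = history + [{"role": "user", "content": "tool result B"}]

same_prefix = prefix_hash(system, tools, turn_a) == prefix_hash(system, tools, turn_b)
replica_a = pick_replica(prefix_hash(system, tools, turn_a), replicas)
replica_b = pick_replica(prefix_hash(system, tools, turn_b), replicas)
print(same_prefix, replica_a == replica_b)
```

Both requests land on the same replica because only the stable prefix feeds the hash. A production system would more likely hash the prefix in fixed-size blocks so that a growing conversation keeps matching its earlier cache entries.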
The economics: cache reads bill at ~10% of the base input rate; cache writes bill at ~125%. For a Claude Code session with a stable 50K prefix, the effective input rate drops from $3/M to roughly $0.50/M, a ~6× reduction. Without this trick, every remaining station would run on the full 50,000 tokens of context, ~22 times per task. Your bill would be ~6× higher, and the user-perceived latency would jump from ~150 ms to ~3 seconds before the first token streamed back.
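The amortization can be checked with a few lines of arithmetic. This sketch assumes a 20-turn session, a 50-token delta per turn billed at the full rate, and ignores that each delta later joins the cached prefix; the function and parameter names are illustrative.

```python
def effective_input_rate(base_per_mtok, turns, prefix_tokens, delta_tokens,
                         write_mult=1.25, read_mult=0.10):
    """Blended $/M input tokens for a session with a stable cached prefix."""
    billed = write_mult * prefix_tokens                  # turn 1 writes the cache
    billed += (turns - 1) * read_mult * prefix_tokens    # later turns read it
    billed += turns * delta_tokens                       # new tail at full rate
    sent = turns * (prefix_tokens + delta_tokens)        # tokens actually sent
    return base_per_mtok * billed / sent

rate = effective_input_rate(3.00, turns=20, prefix_tokens=50_000, delta_tokens=50)
reduction = 3.00 / rate
print(f"effective rate ${rate:.2f}/M, {reduction:.1f}x cheaper than uncached")
```

Under these assumptions the blended rate comes out near $0.48/M, about 6.3× below the base rate, which is where the ~6× figure comes from.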
4. Tokenization — words become integers (this is BPE)
The new tail of the conversation is split into subword tokens. Byte-level BPE — same family as cl100k_base / o200k_base.
"add dark mode functionality" = 4 tokens
Where it runs
- hw: x86 CPU on the same serving pod; pure-CPU op, often Rust/C++ behind Python
- net: in-process function call, no RPC
- site: Anthropic serving frontend, same datacenter rack as the accelerators
Input string
"add dark mode functionality"
BPE merges (cl100k_base trace)
bytes: a d d ␣ d a r k ␣ m o d e ␣ f u n c t i o n a l i t y
↓ merge "ad" + "d" → "add"
↓ merge " d" + "a" + "r" + "k" → " dark"
↓ merge " m" + "o" + "d" + "e" → " mode"
↓ merge " function" + "ality" → " functionality" (1 token! no leading space → 2)
Final tokens
"add"
id: 723
" dark"
id: 6453
" mode"
id: 3941
" functionality"
id: 15293
IDs above are cl100k_base (GPT-4) — Claude's tokenizer is private, but its English token counts come within ~5–10% of cl100k. The leading space is part of each mid-sentence token. " functionality" is one token; the space-less form splits to function + ality.
The new tail of your conversation is a string. Models don't read strings; they read sequences of integers from a fixed vocabulary. The tokenizer's job is to break the string into the largest matching pieces in its vocabulary and emit the corresponding integer ID for each. Claude uses BPE (byte-pair encoding), the same family as OpenAI's cl100k_base and o200k_base tokenizers.
For our request, "add dark mode functionality" becomes 4 integers: "add" (723), " dark" (6453), " mode" (3941), " functionality" (15293). Note the leading spaces: they're part of the token. " functionality" is one token; without the space it splits to function + ality. Anchor this: every later step's cost scales with how many of these integers you produce. A model with a denser tokenizer (more characters per token on average) is cheaper for the same prompt.
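The merge loop behind BPE can be sketched in a few lines. This is a toy: it works on characters rather than bytes, and the merge table is made up for the example (real tokenizers such as cl100k_base ship tens of thousands of learned merges).

```python
def bpe_encode(text, merge_ranks):
    """Greedy BPE: repeatedly merge the adjacent pair with the lowest rank."""
    parts = list(text)
    while len(parts) > 1:
        best = None  # (rank, index) of the best mergeable pair seen so far
        for i in range(len(parts) - 1):
            rank = merge_ranks.get((parts[i], parts[i + 1]))
            if rank is not None and (best is None or rank < best[0]):
                best = (rank, i)
        if best is None:
            break  # no adjacent pair appears in the merge table
        _, i = best
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return parts

# Hypothetical merge table; lower rank means the merge was learned earlier.
toy_merges = {
    ("a", "d"): 0,
    ("ad", "d"): 1,
    (" ", "d"): 2,
    (" d", "a"): 3,
    (" da", "r"): 4,
    (" dar", "k"): 5,
}

tokens = bpe_encode("add dark", toy_merges)
print(tokens)
```

With this table the string collapses to two pieces, "add" and " dark", and the leading space stays attached to the second token, as in the trace above.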
5. Embedding lookup — integers become vectors
Each token ID selects one row of the embedding matrix: an 8,192-dim BF16 vector. This is the token's entire presence on the GPU from now on.
128,256 × 8,192 · BF16 · 2.10 GB
Where it runs
- hw: Accelerator: H100/H200 (Hopper, TSMC N4), Trainium2, or TPU v5p. embed_tokens lives in HBM3/HBM3E next to the die.
- net: PCIe Gen5 host→device the first time the model is loaded; after that, on-chip.
- site: Inside one GPU's HBM stacks; the gather op never leaves the package.
hidden_states = embed_tokens[ [723, 6453, 3941, 15293] ]
↑
row gather from a 128,256 × 8,192 matrix
shape: [batch=1, seq_len=4, hidden=8192]
dtype: bfloat16
Row 6453: what " dark" literally is
embed_tokens[6453] = [
  0.0237, -0.1418, 0.0091, -0.0203, 0.0044, 0.0612, -0.0089, -0.0341,
  0.0118, -0.0276, 0.0501, -0.0157, 0.0024, 0.0193, -0.0408, 0.0612,
  ...   // 8,192 BF16 floats total, most in [-0.05, 0.05], σ ≈ 0.013
  ...   0.0177, -0.0264, 0.0512
]
On the GPU it's a gather: one CUDA block per token, one thread per dim. Under tensor parallelism (Megatron VocabParallelEmbedding), the vocab dim is sharded across ranks; an all-reduce stitches the result. Most weights live in the band [-0.05, 0.05]; a few outlier dimensions run 5–10× larger (the ones quantization schemes have to special-case).
Each integer ID becomes a row of a giant lookup table called the embedding matrix. For Llama-3 70B that table is 128,256 rows × 8,192 columns in BF16 — 2.1 GB living in HBM next to the GPU die. The lookup is a simple gather: pull row 723, pull row 6453, pull row 3941, pull row 15293. Each row is an 8,192-float vector.
That 8,192-float vector for " dark" (row 6453) is the token's entire presence on the GPU from this point on. Whatever " dark" "means" to Llama-3, that meaning is 8,192 specific BF16 numbers, most clustered in [-0.05, 0.05], with a few outlier dimensions running 5–10× larger. The diagram above shows the first few values of that row. Everything downstream is math on these vectors.
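As a toy illustration of the lookup, here is a row gather over a tiny random table. The sizes and values are made up (the real table is 128,256 × 8,192 in BF16); only the access pattern is the point.

```python
import random

VOCAB, HIDDEN = 16, 8   # toy sizes, not the real model's
random.seed(0)
embed_tokens = [
    [random.gauss(0.0, 0.013) for _ in range(HIDDEN)] for _ in range(VOCAB)
]

def embed(token_ids):
    """Row gather: each token ID selects one row of the table."""
    return [embed_tokens[t] for t in token_ids]

hidden_states = embed([7, 2, 2, 11])   # shape [seq_len=4, hidden=8]
print(len(hidden_states), len(hidden_states[0]))
```

Token 2 appears at positions 1 and 2 and gets the identical vector both times, which is why position has to be injected separately in the next station.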
6. RoPE — adding position by rotating
Position information isn't added — it's rotated in. Each Q/K vector pair gets twisted by an angle proportional to position.
θᵢ = pos · 500,000^(−2i/128)
Where it runs
- hw: Same GPU's tensor cores; fused into the attention kernel, runs on Q/K tensors already in registers/SRAM
- net: none (on-chip)
- site: Inside the streaming multiprocessors (SMs) of one GPU
for each attention head (head_dim = 128):
  pair dims (0,1), (2,3), ..., (126,127)
  rotate pair i by angle θᵢ(pos) = pos · base^(−2i/128)
Llama-3 base = 500,000 (raised from 10K to extend context to 128K)
Rotation angles (radians) by position × dim-pair
| position | pair (0,1) | pair (32,33) | pair (64,65) | pair (126,127) |
|---|---|---|---|---|
| 0 ("add") | 0.000 | 0.000 | 0.000 | 0.000 |
| 1 (" dark") | 1.000 | 0.0376 | 1.4e-3 | ~2.5e-6 |
| 3 (" functionality") | 3.000 | 0.113 | 4.2e-3 | ~7.4e-6 |
| 128,000 | 128k mod 2π | 4813 | 181 | 0.314 |
High-frequency pairs (low pair index) spin fast and encode local position. Low-frequency pairs (high pair index) barely move and encode coarse, long-range position. This is why context-window extension is software, not architecture: rescale the base, fine-tune briefly, and a model trained at 8K reaches 128K.
The embedding table is position-blind: row 6453 for " dark" is the same vector whether " dark" appears at position 0 or position 50,000. But position matters: "add dark mode" and "mode dark add" should produce very different attention patterns. RoPE (rotary position embedding) injects position by rotating each token's Q and K vectors by an angle that grows with position. Pair up the 128 dimensions of each attention head, rotate pair i by angle θᵢ = pos · base^(-2i/128), done.
This is the trick that lets context-window extension be cheap. A model trained on 8K-token sequences can be stretched to 128K or 200K by rescaling the rotation base and fine-tuning on under 1% of original pretraining tokens. For our request, at position 0 ("add") nothing rotates. At position 1 (" dark") every pair rotates by a position-1 angle. At position 50,000 (deep in the conversation history) most pairs have spun around many times; only the lowest-frequency pairs (the ones that encode long-range information) still carry coherent signal.
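The rotation itself is short enough to write out. The sketch below applies the formula θᵢ = pos · base^(−2i/head_dim) to adjacent dimension pairs in plain Python; the query and key values are arbitrary test vectors, and real kernels fuse this into attention rather than looping.

```python
import math

def rope_angle(pos, i, head_dim=128, base=500_000.0):
    """Rotation angle (radians) for dimension pair i at a given position."""
    return pos * base ** (-2.0 * i / head_dim)

def rope(vec, pos, base=500_000.0):
    """Rotate each adjacent pair (2i, 2i+1) of vec by its position angle."""
    head_dim = len(vec)
    out = []
    for i in range(head_dim // 2):
        theta = rope_angle(pos, i, head_dim, base)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out.extend((x0 * c - x1 * s, x0 * s + x1 * c))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Arbitrary stand-ins for one head's query and key vectors.
q = [0.1 * (j % 7 - 3) for j in range(128)]
k = [0.05 * (j % 5 - 2) for j in range(128)]

near = dot(rope(q, 10), rope(k, 7))        # offset 3, early in the context
far = dot(rope(q, 5010), rope(k, 5007))    # offset 3, deep in the context
print(near, far)
```

The two scores agree to floating-point precision: after rotation, the Q·K product depends only on the distance between the two tokens, not on their absolute positions. That relative-position property is what makes rescaling the frequencies a workable way to stretch context.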
7. The transformer block — 80 layers of attention + MLP
Each layer: RMSNorm → GQA Attention → residual → RMSNorm → SwiGLU MLP (or MoE) → residual.
Llama-3 70B · GQA 64:8 · head_dim 128
Where it runs
- hw: Tensor parallelism across 8 GPUs in one node (e.g. H100 SXM5, 80 GB HBM3 each, 989 TFLOPS BF16). Weights sharded by column.
- net: NVLink 4 (900 GB/s bidirectional) for the 2 all-reduces per layer × 80 layers = 160 collectives per forward pass
- site: One HGX node: 8 GPUs on one baseboard, sharing NVSwitch fabric. Same physical chassis, ~50 cm of copper.
for layer in 0..79:
    # ── attention ─────────────────────────────────────
    h = rms_norm(x)
    Q = h @ W_q    # [1, S, 8192] · [8192, 8192] → 64 query heads × 128 dim
    K = h @ W_k    # [1, S, 8192] · [8192, 1024] → 8 KV heads (GQA 8:1)
    V = h @ W_v    # [1, S, 8192] · [8192, 1024] → 8 KV heads
    Q, K = rope(Q, position), rope(K, position)
    KV_cache[layer].append(K, V)    # ← persistent state
    scores = (Q @ K_cache.T) / sqrt(128)
    scores = scores + causal_mask
    attn = softmax(scores) @ V_cache    # FlashAttention tiles in SRAM
    x = x + attn @ W_o
    # ── MLP (or MoE) ──────────────────────────────────
    h = rms_norm(x)
    x = x + (silu(h @ W_gate) * (h @ W_up)) @ W_down    # SwiGLU, intermediate=28,672
# final: rms_norm → lm_head → logits over 128,256 tokens
Same code path runs in both phases. S=4 at prefill (whole prompt in parallel, compute-bound, ~50–70% MFU) → S=1 at decode (one token at a time, bandwidth-bound, ~5–15% MFU). For MoE models (DeepSeek-V3, Llama 4, GPT-OSS), the MLP step becomes a router → top-K expert selection → all-to-all dispatch across the cluster → per-expert SwiGLU → all-to-all combine.
This is the part that costs the money. Inside Llama-3 70B, each of the 4 dark-mode tokens (plus the 49,996 prefix tokens) goes through 80 identical transformer layers in sequence. Each layer is, in order: normalize → attention → add residual → normalize → MLP → add residual. Same operation, 80 stacked copies.
The diagram above shows the literal Python-like pseudocode for one layer. The same code path runs in both phases of inference, but with one critical asymmetry: at prefill all 4 input tokens flow through this layer together in one matrix multiplication (the GPU's tensor cores light up at ~50–70% peak utilization — they were built for this). At decode, only one new token at a time flows through (the matmul shrinks to a vector-matrix product, tensor cores sit ~90% idle, and the bottleneck shifts from compute to memory bandwidth — reading the 140 GB of weights becomes the wall-clock floor).
This is the asymmetry that every modern serving technique is a response to:
The single asymmetry
Prefill and decode run on the same GPU but hit different walls. Every serving technique of the last five years is a response.
| | Prefill | Decode |
|---|---|---|
| Arithmetic intensity | 200–400 FLOPs / byte | 1–2 FLOPs / byte |
| Tensor core utilization (MFU) | ~50–70% | ~5–15% |
| Wall-clock floor | Scales with prompt length² | model_weight_bytes / HBM_bandwidth |
Prefill: All input tokens processed in parallel. Tensor cores light up. This phase looks a lot like training.
Decode: One output token at a time. Full weight + KV read per step. ~42 ms/token for Llama-3 70B at FP16 on H100.
Consequence: The two phases want different hardware. Colocating them wastes either FLOPs (during decode) or bandwidth (during prefill). Disaggregated serving, FP4 inference, PagedAttention, and continuous batching are all solutions to this one asymmetry.
Why does the asymmetry matter? Because if you size your fleet for prefill, its FLOPs sit idle during decode. If you size it for decode, its bandwidth sits idle during prefill. Continuous batching, PagedAttention, FP4 weights on Blackwell, speculative decoding, disaggregated serving: every line on the curve of token-price decline over the last three years is a clever response to running both phases on one piece of hardware. We unpack each later in this section.
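The decode floor and the prefill/decode crossover both fall out of two spec-sheet numbers. This back-of-envelope sketch uses the H100 and H200 figures quoted in this section and assumes ~70B parameters at 2 bytes each; it ignores KV reads, kernel overheads, and the fact that 140 GB does not fit on one 80 GB card.

```python
def decode_floor_ms(weight_bytes, hbm_bytes_per_s):
    """Lower bound on time per decode step: every weight is read once."""
    return 1e3 * weight_bytes / hbm_bytes_per_s

def ridge_flops_per_byte(peak_flops, hbm_bytes_per_s):
    """Arithmetic intensity at which a kernel stops being bandwidth-bound."""
    return peak_flops / hbm_bytes_per_s

weights = 70e9 * 2                      # ~70B params x 2 bytes = 140 GB
h100_ms = decode_floor_ms(weights, 3.35e12)
h200_ms = decode_floor_ms(weights, 4.8e12)
h100_ridge = ridge_flops_per_byte(989e12, 3.35e12)

print(f"H100 floor {h100_ms:.1f} ms/token, H200 floor {h200_ms:.1f} ms/token")
print(f"H100 ridge point {h100_ridge:.0f} FLOPs/byte")
```

The H100 floor lands at about 42 ms per token, and the ridge point near 295 FLOPs/byte sits inside the 200–400 range where prefill operates. Decode at 1–2 FLOPs/byte is two orders of magnitude below that ridge, which is why batching more requests per weight read is the main lever.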
8. The KV cache — what long context literally costs
Every K and V projection is kept in HBM so decode doesn't recompute. The cache, not the weights, is what bounds context.
320 KB / token · per request
Where it runs
- hw: HBM3 / HBM3E stacks soldered next to the GPU die via TSMC CoWoS interposer. 80 GB per H100, 141 GB per H200, 192 GB per B200.
- net: Reads stream over the GPU↔HBM link at 3.35 TB/s (H100) or 4.8 TB/s (H200) every decode step
- site: Same package as the GPU die, millimeters apart on a silicon interposer
per token, per layer (BF16, GQA 8:1):
  K = 8 KV heads × 128 head_dim × 2 bytes = 2,048 B
  V = 8 KV heads × 128 head_dim × 2 bytes = 2,048 B
across 80 layers:
  80 × (2,048 + 2,048) = 327,680 B ≈ 320 KB / token
What that means in practice
| Scenario | Context | KV cache |
|---|---|---|
| Our prompt | 4 tokens | 1.28 MB |
| Claude Code session | 50,000 tokens | ~16 GB |
| 1M-token context | 1,000,000 tokens | ~320 GB (must shard) |
vLLM's PagedAttention chunks this into 16-token blocks (5 MB each), so two requests sharing a 50K prefix point at the same physical pages — copy-on-write only on divergence. The cached prefix for our request was already resident on this replica from the prior turn, which is why prefill skipped 49,950 of those tokens.
At each of the 80 attention layers, the K and V projections of every token are kept in HBM so the next decode step doesn't recompute them. The size formula is mechanical: per token, per layer, with Llama-3 70B's grouped-query attention (8 KV heads, head dim 128, BF16), the cache costs 2 × 8 × 128 × 2 = 4,096 bytes. Across 80 layers: ~320 KB per token.
At small scale the numbers are small. At Claude Code's working scale (50K tokens of context) the cache is ~16 GB per session, so roughly nine concurrent sessions take as much HBM as the 140 GB of weights. At 1M tokens of context, the cache is ~320 GB, larger than any single GPU's HBM, and it has to be sharded across multiple GPUs to fit. KV-cache size, not weight size, is what bounds how long a context window can be in production. The downstream consequences (sliding-window attention, attention sinks, INT8/FP8/NVFP4 KV quantization, DeepSeek's MLA) are all attempts to shrink it.
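The sizing rule is easy to turn into a helper. The defaults below are the Llama-3 70B figures used in this section (80 layers, 8 KV heads, head dim 128, 2-byte BF16); the function name is illustrative.

```python
def kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes of KV cache one token occupies; the leading 2 covers K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token()
scenarios = {
    "our prompt": 4,
    "Claude Code session": 50_000,
    "1M-token context": 1_000_000,
}
sizes = {name: tokens * per_token for name, tokens in scenarios.items()}

for name, size in sizes.items():
    print(f"{name}: {size / 1e9:.3f} GB")
```

Because the result is linear in every argument, halving `dtype_bytes` (FP8 or INT8) or `n_kv_heads` halves the cache, which is the arithmetic behind most of the compression techniques that follow.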
9. lm_head + sample — picking the next token
The final hidden vector is projected back to 128,256 logits. Softmax. Sample. That ID is the next token.
128,256 logits · T ≈ 0.0 for code
Where it runs
- hw: Final matmul on tensor cores, then softmax + sampling kernel. Often the lm_head matrix is sharded across the same TP group and ends with one all-reduce.
- net: NVLink for the all-reduce, then PCIe back to host CPU for the sampled int (sometimes kept on-device)
- site: Same HGX node; the sampled token ID hops to the host CPU for stream framing
final_hidden : [1, 1, 8192] # last token's residual stream
final_hidden = rms_norm(final_hidden)
logits = final_hidden @ lm_head.T # lm_head: [128256, 8192], BF16, 2.10 GB
# → [1, 1, 128256] real-valued
# sampling (Claude Code: T ≈ 0.0 for deterministic code generation; as T → 0 this reduces to argmax)
probs = softmax(logits / T)
probs = top_p(top_k(probs, 50), 0.9)
next_token_id = multinomial(probs, 1) # → e.g. 40, "I"
Top-5 candidates after softmax (illustrative)
"I" 61% · "Looking" 18% · "Let" 9% · "To" 5% · "First" 3%
Detokenize: tokenizer.decode(40) → "I". Append to output buffer. Loop: pass the new token back through the full 80-layer stack, but only the new token, because the KV cache carries everything before it. Stop on eos_token_id or a tool_use block close.
At the end of the 80th layer, the model has produced one 8,192-vector per input token. To generate the next output token, take the last token's vector and project it through lm_head, a 128,256 × 8,192 matmul that produces 128,256 raw scores called logits, one per possible next token. Apply softmax to turn the logits into probabilities. Apply a sampling rule (greedy, top-k, or top-p, also known as nucleus sampling). Pick one integer ID. Detokenize.
The diagram above shows the top-5 candidates after softmax for the first response token after our dark-mode prompt: "I" at 61%, "Looking" at 18%, "Let" at 9%, "To" at 5%, "First" at 3%. With T ≈ 0, "I" wins. Detokenize: integer ID 40 → the literal character I. That's the first output token. Then loop: feed "I" back through the 80-layer stack, but only "I", because the KV cache still holds everything else.
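A small sampler makes the greedy versus top-k/top-p distinction concrete. This is a sketch in plain Python over five made-up logits (the logs of the illustrative probabilities above), not any serving framework's sampler; treating T ≤ 0 as argmax is a common convention that avoids dividing by zero.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract max for stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, temperature=0.0, top_k=50, top_p=0.9, rng=random):
    """Pick a token index: argmax at T <= 0, else top-k then top-p sampling."""
    if temperature <= 0.0:
        return max(range(len(logits)), key=logits.__getitem__)
    probs = softmax(logits, temperature)
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, mass = [], 0.0
    for idx in ranked[:top_k]:               # keep at most top_k candidates
        kept.append(idx)
        mass += probs[idx]
        if mass >= top_p:                    # stop once the nucleus is covered
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

candidates = ["I", "Looking", "Let", "To", "First"]
logits = [math.log(p) for p in (0.61, 0.18, 0.09, 0.05, 0.03)]

greedy = sample(logits, temperature=0.0)
nucleus = sample(logits, temperature=1.0, rng=random.Random(0))
print(candidates[greedy], candidates[nucleus])
```

At T = 0 the result is always "I". At T = 1 with top_p = 0.9, only the first three candidates survive the nucleus cut, so "To" and "First" can never be drawn.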
10. Streaming back — SSE token by token
Every token is flushed the instant it's sampled. The HTTP/2 stream stays open, and each event goes out as a new DATA frame.
delta per token · ~50–100 ms cadence
Where it runs
- hw: Frontend pod CPU detokenizes the int → bytes, frames the SSE event, writes to socket. Cloudflare passes the bytes through unchanged.
- net: Datacenter VPC → Cloudflare PoP → public internet → your ISP → your laptop NIC. Same TCP connection that opened in step 02.
- site: Reverses the path through every layer of step 02–03
event: message_start
data: {"type":"message_start","message":{"id":"msg_01ABC...",
"model":"claude-opus-4-7","usage":{"input_tokens":50012,
"cache_read_input_tokens":49950,"output_tokens":1}}}
event: content_block_start
data: {"type":"content_block_start","index":0,
"content_block":{"type":"text","text":""}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,
"delta":{"type":"text_delta","text":"I"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,
"delta":{"type":"text_delta","text":"'ll"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,
"delta":{"type":"text_delta","text":" start"}}
...
Each text_delta is one decoded token. Latency from sampler → terminal is sub-100 ms; that's why text appears word by word. cache_read_input_tokens: 49,950 is the prompt-cache hit being billed at the discounted rate. ping events interleave as keepalives for long generations.
The instant a token is sampled, it's flushed to your screen via Server-Sent Events (SSE) over the same HTTP/2 connection from station 2. There's no buffering for effect — the word-by-word appearance you see is the literal sampler-to-terminal cadence. Each text_delta event carries one decoded token. The cache_read_input_tokens: 49,950 field in the first frame is the prompt-cache hit from station 3, billed at the discounted rate.
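On the client side, reassembling the text is a matter of splitting the stream on blank lines and concatenating the text_delta payloads. The sketch below parses an in-memory string shaped like the frames above; a real client would read incrementally from the socket, and the `frame` helper exists only to build test input.

```python
import json

def iter_sse_events(raw):
    """Yield (event_name, payload) pairs; events are separated by blank lines."""
    for block in raw.strip().split("\n\n"):
        name, data_lines = None, []
        for line in block.splitlines():
            if line.startswith("event:"):
                name = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if data_lines:
            yield name, json.loads("\n".join(data_lines))

def collect_text(raw):
    """Concatenate every text_delta in arrival order, ignoring other events."""
    return "".join(
        payload["delta"]["text"]
        for name, payload in iter_sse_events(raw)
        if name == "content_block_delta"
        and payload["delta"].get("type") == "text_delta"
    )

def frame(event, payload):
    """Test helper: serialize one SSE event."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"

def text_delta(text):
    return {"type": "content_block_delta", "index": 0,
            "delta": {"type": "text_delta", "text": text}}

stream = (
    frame("content_block_delta", text_delta("I"))
    + frame("content_block_delta", text_delta("'ll"))
    + frame("ping", {"type": "ping"})
    + frame("content_block_delta", text_delta(" start"))
)
text = collect_text(stream)
print(text)
```

The ping keepalive is skipped by name, so the reassembled text reads "I'll start".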
11. Tool use — the loop that turns one request into twenty-two
When the model emits a tool_use block, the stream stops, Claude Code runs the tool, and a brand new /v1/messages request goes back out.
10–50+ tool calls per task
Where it runs
- hw: Your laptop CPU again: the Node.js process spawns subshells, reads/writes files on local SSD, then re-opens an HTTPS connection
- net: loopback while running the tool · then back out across the public internet for the next request
- site: your desk → Cloudflare PoP → AWS/GCP region → GPU node; same loop, ~22 times for our task
event: content_block_start
data: {"type":"content_block_start","index":1,
"content_block":{"type":"tool_use","id":"toolu_01XYZ",
"name":"Read","input":{}}}
event: content_block_delta
data: {"delta":{"type":"input_json_delta",
"partial_json":"{\"file_path\":\"app/glob"}}
event: content_block_delta
data: {"delta":{"type":"input_json_delta",
"partial_json":"als.css\"}"}}
event: message_delta
data: {"delta":{"stop_reason":"tool_use"}}
────────── STREAM CLOSED ──────────
# Claude Code parses the tool call, runs Read("app/globals.css"),
# appends the result, and fires a NEW request:
POST /v1/messages
{
"messages": [
...prior 50K tokens...,
{ "role": "assistant", "content": [{ "type": "tool_use", ... }] },
{ "role": "user",
"content": [{ "type": "tool_result",
"tool_use_id": "toolu_01XYZ",
"content": "@tailwind base;\n@tailwind ..." }] }
]
}
→ back to step 03. The prefix is still cached. Loop.
Each turn is a stateless request. The illusion of continuous reasoning is the loop, not the model. Seven file edits and four minutes of wall-clock time later, the conversation history has grown by ~80K tokens, the model has been called ~22 times, and the prompt-cache hit rate stayed above 88%.
For a chat request, station 10 is the end. For an agent, it's the start of a loop. When the model emits a tool_use block (e.g. Read("app/globals.css")), the stream closes with stop_reason: tool_use. Claude Code parses the tool call, runs it locally, appends the result to the conversation, and fires a brand new /v1/messages request — back to station 3 (cache hit, route to the same replica), through all 80 layers again, sample a few more tokens, possibly emit another tool call, repeat.
For our seven-word dark-mode prompt: ~22 round-trips through the entire eleven-station stack, ~80K tokens of conversation grown over four minutes, ~88% cache hit rate the whole way. Seven files modified, tests passing. The model itself never "ran for four minutes": it ran in twenty-two separate requests, each a short burst of forward passes, and the loop in your terminal is what made the time feel continuous.
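The control flow of that loop fits in a dozen lines. This is a schematic with stubbed-out pieces, not Claude Code's implementation: `call_model` and `run_tool` are hypothetical callables standing in for the HTTPS request and the local tool runner, and the fake versions below exist only to make the sketch runnable.

```python
def run_agent(user_prompt, call_model, run_tool, max_turns=50):
    """Drive the request/tool loop until the model stops asking for tools."""
    messages = [{"role": "user", "content": user_prompt}]
    for turn in range(1, max_turns + 1):
        reply = call_model(messages)          # one stateless request
        messages.append({"role": "assistant", "content": reply["content"]})
        if reply["stop_reason"] != "tool_use":
            return messages, turn
        results = [
            {"type": "tool_result", "tool_use_id": block["id"],
             "content": run_tool(block["name"], block["input"])}
            for block in reply["content"] if block["type"] == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    raise RuntimeError("agent did not finish within max_turns")

def fake_model(messages):
    """Stub: ask for a file twice, then answer."""
    seen = sum(1 for m in messages
               if m["role"] == "user" and isinstance(m["content"], list))
    if seen < 2:
        return {"stop_reason": "tool_use", "content": [
            {"type": "tool_use", "id": f"toolu_{seen}", "name": "Read",
             "input": {"file_path": "app/globals.css"}}]}
    return {"stop_reason": "end_turn",
            "content": [{"type": "text", "text": "Dark mode added."}]}

def fake_tool(name, tool_input):
    return f"(contents of {tool_input['file_path']})"

messages, turns = run_agent(
    "Add dark mode functionality to my blog.", fake_model, fake_tool
)
print(turns, len(messages))
```

Every iteration resends the whole `messages` list, which is why the conversation only grows and why prefix caching matters so much for this access pattern.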
That's the full trace. The rest of this section is the deeper math behind stations 5 (embedding), 7 (the transformer block), 8 (the KV cache), and the serving techniques that make all of it economic — read on for the numbers an investor needs; skip ahead to The demand side if the trace was enough.
Embeddings — what happens when a token first hits the GPU
Before attention, before MLPs, every token takes a trip through the embedding matrix. The tokenizer produces an integer ID; the embedding layer looks up row ID of a vocab_size × hidden_size table and returns a vector. That vector is the token's entire presence on the GPU from that point on — everything downstream is math on these.
The matrix is bigger than people intuit. Llama-3 70B: vocab_size = 128,256, hidden_size = 8,192. At BF16 that's 128,256 × 8,192 × 2 bytes ≈ 2.10 GB. Because Llama-3 does not tie input and output embeddings, a second matrix of identical size (lm_head) sits at the end of the stack — ~4 GB combined dedicated to embedding lookup and unembedding projection. Under tensor parallelism this gets sharded along the vocab dimension (Megatron's VocabParallelEmbedding): at TP=8, each rank holds 16,032 vocab rows × 8,192 cols and an all-reduce (forward embedding) or all-gather (final lm_head) sits on the critical path.
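The sizes quoted above are straightforward to reproduce. The sketch assumes the Llama-3 70B shapes given in the text and an even split of vocab rows across 8 tensor-parallel ranks.

```python
VOCAB, HIDDEN, BF16_BYTES, TP = 128_256, 8_192, 2, 8

embed_bytes = VOCAB * HIDDEN * BF16_BYTES       # input embedding table
untied_total = 2 * embed_bytes                  # plus a separate lm_head
rows_per_rank = VOCAB // TP                     # vocab-parallel shard
shard_bytes = rows_per_rank * HIDDEN * BF16_BYTES

print(f"embedding: {embed_bytes / 1e9:.2f} GB, untied pair: {untied_total / 1e9:.2f} GB")
print(f"per-rank shard at TP={TP}: {rows_per_rank} rows, {shard_bytes / 1e6:.0f} MB")
```

Each rank ends up holding 16,032 rows, roughly 263 MB of the 2.10 GB table.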
Tying matters more than people think. The "tied embeddings" trick (share weights between input embedding and output projection) was formalized by Press & Wolf (2017). The 2026 pattern across open-weights models: large models untie, with one important exception. Llama-3/3.1, DeepSeek-V3, Qwen 2.5 at 7B+, GPT-OSS 120B — all untied (verified from their config.json files). The exception is Gemma 2, which ties even at 27B — because Gemma's vocab is 256,000 tokens (roughly 2× Llama), making an untied lm_head disproportionately expensive. Tying saves ~2–8 GB per model depending on size. For inference servers juggling many models per GPU fleet, that saving adds up.
The lm_head cost at decode time is model-dependent. For dense Llama-3 70B it's ~2 billion FLOPs per output token (2 × 8,192 × 128,256, counting a multiply-add as two FLOPs), roughly 1.5% of a per-step decode compute dominated by the 80 layers of attention and MLP. Small number, not interesting. For MoE models, where only ~5% of parameters activate per token (GPT-OSS 120B: 117B total, ~5B active), lm_head can become a visible fraction of decode work because the active body shrank. For models with very large vocabs (Gemma 2's 256K) combined with small bodies, the same holds. "lm_head is a meaningful decode cost" is true in the MoE + large-vocab regime; it is not true universally.
Positional encoding is where long-context extension actually happens. Modern models use Rotary Position Embedding (RoPE) rather than learned position embeddings — positional information is injected at each attention layer by rotating the Q and K vectors as a function of position. The point of RoPE for inference is that a model trained at 8K context can be extended to 128K or 200K without retraining from scratch — you rescale the rotation frequencies. Three approaches compete:
- NTK-aware scaling (zero-shot; no training required) — originated in a r/LocalLLaMA community post rather than a paper, later formalized. Works with some quality degradation at the high end.
- YaRN (Peng et al., 2023) — the published SOTA for RoPE extension. Per the paper: matches SOTA quality after fine-tuning on under 0.1% of original pretraining tokens, 10× fewer tokens than prior methods. Not training-free — a common misconception. Used by DeepSeek-V3 and GPT-OSS 120B.
- Meta's "llama3" scaling for Llama-3.1 128K — a custom frequency-dependent scheme (low-frequency components scaled by factor 8, high-frequency left alone, smooth interpolation between). Shipped with ~800B tokens of gradual continued pre-training in 6 stages, not a fine-tune.
The investment-relevant fact: long context is a software extension, not an architectural rebuild. A lab can ship a 200K-context variant of a trained model by burning under 1% of original pretraining compute on RoPE scaling. This is why context windows keep growing faster than model generations do, and why the KV cache sized below — which scales linearly with context length — dominates HBM for production serving.
The KV cache — where long context actually costs
The KV cache stores the K and V projections for every past token so decode doesn't recompute them. The size formula is exact: KV_bytes = 2 × n_layers × n_kv_heads × head_dim × seq_len × batch × dtype_bytes (the leading 2 is because you cache both K and V; n_kv_heads is the post-GQA count, not query heads). For Llama-3 70B (80 layers, 8 KV heads, head_dim 128, BF16): per-token cache is 2 × 80 × 8 × 128 × 2 = 327,680 bytes ≈ 320 KB/token. At 128K context, one user: ~40 GB. The weights themselves take ~140 GB at BF16. On an 80 GB H100, one concurrent long-context user eats the remaining headroom after weight sharding; a realistic long-context 70B rig is 4–8 GPUs per replica, and the deciding factor is KV, not weights. A hypothetical MHA variant (n_kv_heads = 64 query heads) would demand ~2.5 MB/token — 335 GB at 128K, larger than the weights. GQA is the reason the model is deployable at all.
This is why:
- DeepSeek's MLA (Multi-head Latent Attention) is as important as their MoE. MLA projects K and V into a shared low-rank latent (DeepSeek-V3 uses d_c=512 against a 16,384-wide effective attention space) and reconstructs at attention time. Per-token cache: ~70 KB, vs ~320 KB on a Llama-3-style GQA-8 baseline at comparable dimensions. The DeepSeek-V2 paper reports ~14× compression against MHA — but MHA is the wrong comparison class since every frontier model uses GQA. The honest framing is ~4× smaller than frontier GQA at matched quality. The tradeoff: MLA's up-projections burn extra compute per decode step and FlashAttention doesn't support the MLA shape yet, which is why adoption outside DeepSeek has been slow despite ACL 2025 work showing MLA can be retrofitted via distillation.
- Anthropic's prompt-caching discount effectively prices KV-cache residency. Cached input token reads bill at ~10% of base input pricing; cache writes bill at ~125%. Working it through for a 20-turn Claude Code session with a 50K-token stable prefix: turn 1 pays 1.25× on 50K (a write), turns 2–20 pay 0.1× on 50K (reads), plus each turn's small delta at full rate. Amortized input rate drops from $3/M to roughly $0.50/M — the ~6× effective reduction that turns agents from uneconomic to routine. The gotcha: the 1.25× write premium means any inadvertent cache invalidation (a timestamp in the system prompt, tool-result ordering changes, UUID leaks) silently inflates cost 2–5× because you're writing instead of reading. Claude Code's 84–92% cache hit rate is the outcome of aggressive prefix-stability engineering, not a feature that works by default.
- Google's TurboQuant (Feb 2026) demonstrated lossy KV compression via random rotation plus precomputed Gaussian codebooks. Memory stocks (Micron, Western Digital, SanDisk) were widely reported to have sold off on publication, but the exact daily move should be checked before citing.
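The prompt-caching arithmetic in the list above is easy to check by hand. Here is a minimal sketch, assuming the 1.25× write and 0.1× read multipliers quoted there and ignoring each turn's small uncached delta. The function name and the `rewrites` knob are mine, not part of any billing API.

```python
def amortized_input_rate(base_per_m, prefix_tokens, turns,
                         write_mult=1.25, read_mult=0.10, rewrites=0):
    """Effective $/M input tokens for the cached prefix across a session.

    rewrites (hypothetical knob): later turns where the cache was invalidated
    and the prefix is billed as a write again instead of a read.
    """
    writes = 1 + rewrites                 # turn 1 always writes
    reads = turns - writes                # every other turn reads
    billed = prefix_tokens * (writes * write_mult + reads * read_mult)
    raw = prefix_tokens * turns           # tokens sent without any caching
    return base_per_m * billed / raw


clean = amortized_input_rate(3.00, 50_000, turns=20)
leaky = amortized_input_rate(3.00, 50_000, turns=20, rewrites=3)
print(f"stable prefix:   ${clean:.2f}/M ({3.00 / clean:.1f}x below list)")
print(f"3 invalidations: ${leaky:.2f}/M ({leaky / clean:.1f}x the stable cost)")
```

With a perfectly stable prefix the rate lands just under $0.50/M. Three accidental invalidations in a 20-turn session are enough to roughly double it, which is the low end of the 2–5× inflation range.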
KV cache quantization is the other lever — and it's what production actually runs. MLA and TurboQuant are architectural; KV quantization is a post-hoc compression applied to the cache itself at serving time. The levers, in order of production maturity:
- INT8 KV cache — shipping in vLLM (kv_cache_dtype="fp8_e4m3" or "int8") and TensorRT-LLM (--int8_kv_cache with calibration). 2× smaller than BF16 with minimal quality degradation on standard benchmarks, per both frameworks' docs. Specific perplexity deltas depend on model and task — don't treat "sub-half-percent drift" as a universal figure.
- FP8 KV cache — also in vLLM and TRT-LLM; requires Hopper / Ada / Blackwell. NVIDIA explicitly recommends FP8 over INT8 on Hopper-class hardware because FP8's dynamic range preserves activation outliers better than INT8's linear quantization.
- NVFP4 KV on Blackwell — SM100 (datacenter Blackwell) only today; NVIDIA's own blog reports 50% KV memory reduction versus FP8 at under 1% accuracy loss using ModelOpt offline calibration, currently paired with FP8 weights. Consumer Blackwell (SM120) support is still an open TRT-LLM issue. This is the lever Blackwell-era inference throughput wins depend on most.
- INT4 / 2-bit KV — research-grade. KIVI (Liu et al., ICML 2024) is the notable result: 2-bit asymmetric quantization with per-channel for keys and per-token for values, tuning-free, near-lossless on LongBench / GSM8K per the paper's tables. Not in production serving frameworks as a first-class feature as of this writing.
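To see what each precision step buys, the KV size formula from earlier can be evaluated directly. This is a rough sketch under stated assumptions: the Llama-3 70B shape quoted in the text, a context of 128,000 tokens, one user, and a 4-bit row that ignores the scale-factor metadata real 4-bit formats carry. The function name is mine.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bits):
    """Per-token KV footprint; the leading 2 covers both K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * bits // 8


# Llama-3 70B shape as given in the text; context rounded to 128,000 tokens.
SHAPE = dict(n_layers=80, n_kv_heads=8, head_dim=128)
CONTEXT = 128_000

for label, bits in [("BF16", 16), ("FP8 / INT8", 8), ("4-bit", 4)]:
    per_token = kv_bytes_per_token(bits=bits, **SHAPE)
    total_gb = per_token * CONTEXT / 1e9
    print(f"{label:<10} {per_token / 1024:6.0f} KB/token  {total_gb:5.1f} GB at 128K")

# Hypothetical MHA variant: every one of the 64 query heads keeps its own K/V.
mha = kv_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128, bits=16)
print(f"MHA BF16   {mha / 1024:6.0f} KB/token  {mha * CONTEXT / 1e9:5.1f} GB at 128K")
```

Each halving of precision halves the per-user footprint, so the same HBM holds twice the concurrent long-context sessions.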
Another lever: don't cache what you don't need. Two classes of technique bound KV size by architecture or eviction rather than compression:
- Sliding-window attention (Mistral 7B, Jiang et al. 2023) restricts each layer's attention to a fixed window — window W=4096 in Mistral, information propagated to tokens outside the window via stacked layers. KV cache per sliding layer is O(W × layers_local) rather than O(seq × layers). Modern mixed-attention designs alternate local and global: Gemma 2 interleaves local (W=4096) and global layers; GPT-OSS 120B's config specifies a mix of sliding_attention (W=128) and full_attention layers. The design trades off global-context fidelity for a major KV reduction.
- Attention sinks / streaming attention (Xiao et al., ICLR 2024) exploit the observation that the first few tokens of a sequence get outsized attention regardless of content — keeping those "sinks" and evicting middle tokens preserves generation quality far better than naïve windowing. Integrated in TensorRT-LLM and HuggingFace Transformers; vLLM declined to implement ("not planned" per their issue tracker). Whether frontier-lab production stacks use streaming-style eviction is not publicly disclosed.
PagedAttention works for the same reason DRAM controllers prefer reads from already-open rows, the same reason SSDs use super-pages striped across chips, and the same reason hard drives need defragmenting: at every layer of the memory hierarchy, the cost of an access is dominated by getting to the data, not reading it. Software that arranges related data spatially — same DRAM row, same flash page, same disk track — pays the move once and amortizes it across many reads. Most of what looks like architectural diversity below the application layer is one principle in different geometry.
Mixture of Experts — sparse activation and why it benchmarks differently
Every frontier model released since mid-2024 is a Mixture of Experts. DeepSeek V3 (December 2024) has 671B total parameters but only 37B activate on any given token. Llama 4 Maverick (April 2025) has 400B total / 17B active across 128 routed experts plus 1 shared expert. GPT-OSS 120B (OpenAI's open release) has ~117B total / 5.1B active across 128 experts. Qwen3-235B-A22B — the naming convention "A22B" = 22B activated — has 22B active over 128 experts. Only Anthropic hasn't confirmed Opus's architecture either way.
The mechanics: each transformer layer replaces its single feedforward MLP with N parallel experts plus a small router network. For every token, the router scores all experts, picks top-K, and only those K experts do the forward pass. DeepSeek-V3 uses top-8 of 256 routed experts plus 1 shared expert (9 total active per token). Llama 4 Maverick uses top-1 of 128 routed plus 1 shared (2 total). GPT-OSS 120B uses top-4. The Mixture of Experts Layer goes back to Shazeer et al. 2017 ("Outrageously Large Neural Networks," arXiv:1701.06538); Switch Transformer (Fedus, Zoph, Shazeer 2021, arXiv:2101.03961) simplified it to top-1 routing; Mixtral 8x7B (arXiv:2401.04088) brought the pattern to the open-weight frontier with top-2 of 8.
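The routing step is small enough to write out. What follows is a toy sketch in plain Python, not any lab's implementation: the gate weights are a softmax over the selected logits only, which is one common choice and an assumption here (real models differ in scoring and normalization). The expert functions and router scores are made up for illustration.

```python
import math


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and return [(index, gate_weight)]."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Assumption: softmax over the selected logits only, so gates sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, experts, router_logits, k, shared_expert=None):
    """Gate-weighted sum of the k chosen experts, plus an optional shared expert."""
    out = [0.0] * len(x)
    for idx, gate in route_top_k(router_logits, k):
        y = experts[idx](x)               # only selected experts ever run
        out = [o + gate * yi for o, yi in zip(out, y)]
    if shared_expert is not None:
        out = [o + yi for o, yi in zip(out, shared_expert(x))]
    return out


# Toy experts: expert i just scales its input by (i + 1).
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1, 2, 3, 4)]
logits = [0.0, 1.0, 3.0, 2.0]             # hypothetical router scores for one token
print(route_top_k(logits, k=2))
print(moe_layer([1.0, 2.0], experts, logits, k=2))
```

The unselected experts are never called, which is the whole compute saving. They still have to exist in memory, which is the cost.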
The asymmetry is the whole point and the whole problem. HBM capacity cost scales with total parameters — every expert has to be resident somewhere in the inference fleet because routing decisions are per-token and unpredictable. Compute cost scales with active parameters — only the K selected experts run per token. So a 671B DeepSeek V3 holds ~1.3 TB of weights in HBM at BF16 but does roughly the FLOP count of a 37B dense model per forward pass. This is great for tokens-per-second-per-watt and terrible for single-box deployability.
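A back-of-envelope version of that asymmetry, using the parameter counts quoted above. The assumptions are mine and deliberately crude: BF16 weights for every model (some ship in lower precision), 80 GB of HBM per GPU, weights only with no KV cache or activations, and about 2 FLOPs per active parameter per token.

```python
import math

# (total params, active params) in billions, as quoted in the text.
MODELS = {
    "DeepSeek V3": (671, 37),
    "Llama 4 Maverick": (400, 17),
    "GPT-OSS 120B": (117, 5.1),
    "Qwen3-235B-A22B": (235, 22),
}


def moe_footprint(total_b, active_b, bytes_per_param=2, hbm_gb=80):
    """Weights-only HBM need vs per-token compute, under rough assumptions."""
    weights_gb = total_b * bytes_per_param          # 1e9 params * bytes / 1e9
    min_gpus = math.ceil(weights_gb / hbm_gb)       # ignores KV cache and activations
    gflops_per_token = 2 * active_b                 # ~2 FLOPs per active parameter
    return weights_gb, min_gpus, gflops_per_token


for name, (total_b, active_b) in MODELS.items():
    gb, gpus, gflops = moe_footprint(total_b, active_b)
    print(f"{name:<18} {gb:6.0f} GB weights  >= {gpus:2d} x 80 GB GPUs  "
          f"~{gflops:5.1f} GFLOPs/token  ({total_b / active_b:.0f}x total/active)")
```

The last column is the ratio to watch: memory is provisioned for the total, compute is spent on the active slice.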
How experts actually get served: expert parallelism. You can't fit 256 experts on one GPU, so each GPU holds a subset of experts and every decode step involves an all-to-all communication — each token's router output tells the system which GPU has the expert it needs, and activations get shuffled across the pool. This is why NVL72 matters disproportionately for MoE: 72 GPUs inside a single NVLink domain mean the all-to-all traffic stays on 1.8 TB/s NVLink rather than crossing 400 Gbps InfiniBand. SGLang's large-scale EP deployment (12 nodes × 8 H100 = 96 H100s) serving DeepSeek-R1 reports 52.3K input tokens/sec and 22.3K output tokens/sec per node on 2,000-token prompts — a per-8-GPU-node figure, not a per-GPU figure. NVIDIA's own wide-EP benchmarks on GB300 NVL72 show SGLang on DeepSeek-R1 up to ~25× the H200 throughput at matched latency.
This is where the "~100× H100" InferenceMAX numbers come from. The headline wins SemiAnalysis publishes — Blackwell/Rubin-class racks multiplying H100 throughput on specific benchmarks — are mostly MoE scenarios where NVL72 scale-up lets the all-to-all stay inside copper. For dense non-MoE models (Llama-3 70B, Qwen 72B), the generational uplift on the same benchmark collapses to a much smaller range. When a vendor reports a headline ratio, the first question to ask is: which model, which interactivity point, which precision, MoE or dense?
Two downstream consequences worth noting. First, the embedding math from the previous section looks different on MoE: when the active body is 5.1B params (GPT-OSS 120B) and the vocab is 201,088, the lm_head projection becomes a larger fraction of per-step decode compute than in a dense 70B — not 2%, closer to double-digit percentages depending on shape. Second, DeepSeek's auxiliary-loss-free load balancing (a bias-adjusted routing score rather than a training-time balance penalty) is the current state-of-the-art for keeping experts evenly loaded at inference time; expert "hot spots" where one expert gets disproportionate traffic have been the standard production headache.
The investment-relevant collapse of this section: MoE is why hardware benchmarks require unpacking, why NVLink scale-up matters more than raw FLOPs, and why HBM capacity — not just bandwidth — is one of the binding constraints.
The serving stack — why tokens got cheaper
A large fraction of the token-price decline over the last three years is software, not silicon. The load-bearing techniques:
- Continuous batching / iteration-level scheduling (Orca, OSDI '22). The scheduling granularity is one decode step, not one request — every iteration, new requests admit into free batch slots, finished requests evict immediately, and prefill for newcomers interleaves with decode for everyone else. The non-trivial engineering is selective batching: attention has to be per-request (each holds its own KV), but MLP matmuls cross-request batch because they're all multiplying the same weight matrix by different activations. Orca reported 36.9× throughput at matched p50 latency versus pipeline-parallel FasterTransformer on GPT-3 175B.
- PagedAttention (vLLM, SOSP '23). KV cache as OS virtual memory — chunked into fixed-size blocks (16 tokens typical, ~80 KB each on Llama-70B), each request holds a block table mapping logical positions to physical blocks, sequences grow by grabbing from a global pool, shared prefixes point to the same physical blocks with refcounts and copy-on-write. External fragmentation goes to zero by construction. The vLLM paper reports KV waste falling from 60–80% to under 4% and 2–4× throughput over prior systems on realistic traces. The attention kernel has to do a block-table lookup per K/V read — FlashAttention-3 and FlashInfer now ship native paged-KV support with <5% overhead vs contiguous.
- Speculative decoding. A cheap draft model proposes k candidate tokens; the target model verifies them in one parallel forward pass (essentially the same cost as one decode step, because decode is bandwidth-bound). Expected tokens emitted per target pass is (1 − α^(k+1))/(1 − α), where α is the per-token acceptance rate: k=5, α=0.7 gives ~2.9 tokens per pass, which nets out to roughly 2.4× wall-clock if the five draft steps together add about 20% to each pass. Costs HBM (draft weights + parallel KV) and falls apart above ~batch 32 when the target's decode stops being bandwidth-bound. EAGLE-3 (NeurIPS '25) sidesteps the memory cost by reusing the target's own intermediate features; it's become the production default in stacks that ship spec-decoding.
- Disaggregated serving (DistServe, Splitwise, Mooncake). Run prefill and decode on separate pools; ship KV cache between them. The two SLAs this optimizes: TTFT (time to first token) — how long the user waits before streaming starts, dominated by prefill — and TPOT (time per output token) — the inter-token cadence the user experiences, dominated by decode. Chat SLAs want both small simultaneously; colocating prefill and decode on the same GPU pool forces a compromise. A 2K-token Llama-70B prefill produces ~620 MB of cache that has to move to the decode GPU — roughly 12 ms on 400 Gbps RDMA, mostly hidden behind the first decode step; roughly 50 ms on 100 Gbps, which does hurt TTFT. This is the concrete mechanism linking inference architecture to optical-networking demand, not just "bigger cluster more fiber." DistServe reports 2–3.4× goodput on chat and code-completion workloads, where goodput = throughput meeting both TTFT and TPOT SLAs, not raw tokens/sec.
- FP4 on Blackwell. Dropping from FP16 to FP4 halves the bytes per weight twice over, directly buying throughput on a memory-bound workload. The "up to 100×" InferenceMAX number is true but narrow: it's specifically DeepSeek-R1 MoE on GB300 NVL72 at 116 tok/s/user interactivity, against a best-configured H100 FP8 disagg baseline. Decomposed roughly: ~2× from FP8→FP4, ~3–4× from H100→B200 generational uplift, ~3× from 72-GPU NVLink scale-up letting MoE all-to-all stay off InfiniBand, ~2× from MTP, plus superlinear scaling once the whole model fits in one NVLink domain. For dense non-MoE models (Llama-3 70B, Qwen2-72B), the gen-over-gen win collapses to ~3–5×. "100×" is a specific-benchmark claim, not a generic uplift.
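Two pieces of arithmetic from the list above are worth running: the speculative-decoding yield and the KV transfer time in disaggregated serving. This is a sketch under my own simplifications. The draft-cost ratio of 4% per draft step is a hypothetical value, and the transfer time assumes full line rate with no protocol overhead.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens emitted per target pass; requires 0 <= alpha < 1."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def net_speedup(alpha, k, draft_cost_ratio):
    """Speedup when each draft step costs draft_cost_ratio of one target step."""
    return expected_tokens_per_pass(alpha, k) / (1 + k * draft_cost_ratio)


def kv_transfer_ms(kv_bytes, link_gbps):
    """Milliseconds to move a KV cache over a link at full line rate."""
    return kv_bytes * 8 / (link_gbps * 1e9) * 1e3


print(f"tokens per pass: {expected_tokens_per_pass(0.7, 5):.2f}")
print(f"net speedup at 4% draft cost: {net_speedup(0.7, 5, 0.04):.2f}x")
for gbps in (400, 100):
    print(f"620 MB over {gbps} Gbps: {kv_transfer_ms(620e6, gbps):.1f} ms")
```

The closed form on its own counts tokens per verification pass. The wall-clock gain is that figure divided by the extra time the draft steps take, which is why a cheaper draft (or one that reuses target features) matters so much.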
These headline numbers don't stack multiplicatively — Orca's 36.9× is against FasterTransformer's static-batching baseline; vLLM's 23× is against HuggingFace TGI v0.9 that already had basic continuous batching. But the cumulative effect is visible in actual pricing across three years, and it's the cleanest evidence that efficiency is a real phenomenon, not a narrative.
The quality-adjusted price curve is the single most rigorously documented number in this space:
- Stanford HAI's 2025 AI Index (Chapter 1) reports that the inference cost for GPT-3.5-level quality fell from $20 per million tokens (November 2022) to $0.07 per million tokens (October 2024, via Gemini 1.5 Flash-8B) — a ~280× drop in just under two years.
- Epoch AI's underlying dataset (the source Stanford draws from) characterizes the full deflation curve as "prices have fallen by between 9× and 900× per year depending on the performance milestone" — the wide range reflects that commoditizing an existing capability is much cheaper than extending the frontier.
- a16z's "LLMflation" post (Guido Appenzeller, November 2024) frames the same curve as "10× per year for 3 years, 1,000× cumulative" for GPT-3-quality output going from $60/M (2021) to $0.06/M (2024).
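The three sources quote different windows, so it helps to put them on one per-year basis. A minimal sketch, assuming a constant exponential decline between the two endpoints. The 36-month window for the a16z figures is my reading of "2021 to 2024".

```python
def annualized_factor(start_price, end_price, months):
    """Constant per-year price-drop factor implied by two endpoints."""
    return (start_price / end_price) ** (12 / months)


# Stanford HAI endpoints: Nov 2022 -> Oct 2024 is 23 calendar months.
stanford = annualized_factor(20.00, 0.07, months=23)
# a16z endpoints: 2021 -> 2024, taken here as 36 months (an assumption).
a16z = annualized_factor(60.00, 0.06, months=36)
print(f"Stanford endpoints: {stanford:.1f}x per year")
print(f"a16z endpoints:     {a16z:.1f}x per year")
```

The a16z endpoints give exactly 10× per year. The Stanford endpoints come out a bit under 20× per year, which sits inside Epoch's 9× to 900× band.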
Frontier model sticker prices tell a different story — they've held remarkably flat:
- Claude Sonnet has been $3 input / $15 output per million tokens since Claude 3 Sonnet launched March 2024 — through Sonnet 3.5, Sonnet 4, Sonnet 4.5, and Sonnet 4.6, a span of roughly 25 months with no headline price cut.
- Claude Opus dropped from $15 / $75 (Opus 3, March 2024, a price Opus 4 and 4.1 also held) to $5 / $25 (Opus 4.5 through 4.7, late 2025–April 2026) — a 67% cut at both ends. A subtlety flagged by trade-press analysis: Opus 4.7 shipped with a new tokenizer that can emit up to 35% more tokens for the same text, so "unchanged $/M" does not equal "unchanged $/request."
- OpenAI o3 launched April 2025 at $10 input / $40 output, then dropped 80% in mid-2025 to $2 / $8 — the steepest single cut in the o-series. GPT-5.4 sits around $2.50 input / $15 output per aggregator tracking.
- DeepSeek R1 anchors the floor at $0.55 input / $2.19 output — roughly 27× cheaper than o1's launch pricing by arithmetic, though only ~3–4× against the reduced o3. DeepSeek's off-peak 75% discount ran from February to September 2025 before being ended.
- Sam Altman has publicly stated (February 2025) that "the cost to use a given level of AI falls about 10× every 12 months" and that price-per-token fell ~150× from GPT-4 (early 2023) to GPT-4o (mid-2024). A vendor framing, but it matches the independent data.
The open-source tier is where the floor-breaking happens most aggressively. DeepInfra lists Llama 3.3 70B output at $0.15–$0.36/M as of April 2026, down from ~$0.88 in late 2024 across Together/Fireworks — roughly 2.5–6× at the commodity floor in 18 months, depending on which end of the range you take.
The pattern: quality-adjusted price for any given capability tier falls ~10× per year. Frontier list prices at the top of the stack held flat for Sonnet (25 months at $3/$15), cut ~67% for Opus over two years, and cut 80% for o-series reasoning on a single step. The efficiency is real; list prices hold because demand keeps absorbing it. That's the Jevons dynamic in one paragraph.
The demand side — five multipliers, one compounding curve
The 2023 version of inference was a single ChatGPT turn consuming on the order of a thousand tokens. The 2026 version is an agent. Five independent multipliers now stack on top of every task:
Demand Compounding
Five independent multipliers stack on every inference task. Each grew on its own curve from 2023–2026.
- Reasoning. o1/o3/R1/Gemini Thinking burn hidden 'thinking tokens' billed at output rates. Gemini 2.5 Flash's thinking toggle lifts output pricing from $0.60/M → $3.50/M — a 6× step from one flag. → o3-high on ARC-AGI: ~$30K per task
- Tool use. Every tool call is a round-trip. Claude Code payload is 45K tokens before the user types a word. Anthropic's own number: agents use ~4× more tokens than chat. → 30-min session ≈ 100K tokens
- Multi-agent. Parent spawns sub-agents, each with its own 200K context and tool loops. Anthropic published: multi-agent systems use ~15× more tokens than chat. Token usage explains 80% of BrowseComp variance. → One research query ≈ 400K tokens
- MCP context. Each MCP tool definition runs 280–320 tokens. Five servers with 50 tools consume 30–60K tokens — up to 30% of the context window — before any user prompt. → Fixed cost per session
- Long horizon. Claude Sonnet 4.5 ran 30+ hours autonomously, producing 11K lines of code in one session. A continuous loop at 25 calls/min × 4K tokens = ~180M tokens per 30-hour run. → Devin ACU pricing: $8–9/hour
- Stacked. ~1,000× per-task demand growth in 24 months. Jensen at GTC 2026: 10,000× workload × 100× usage = 1,000,000× total demand.
Reasoning tokens — the multiplier that justifies everything else
The top multiplier deserves its own walk-through because it's the demand shock that turned inference from a chat workload into a compute-intensive one.
Before September 2024, when you sent a prompt to an LLM, the model produced a response directly. No hidden computation. The output tokens you paid for were the tokens you saw. Then OpenAI announced o1-preview on September 12, 2024, and the o1 blog post made an explicit claim that reframed the whole economics: "o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute)."
Thinking in this sense is literal. The model generates a long internal reasoning chain — hidden from the user but counted against your token budget and billed at output rates. OpenAI's reasoning documentation states explicitly that reasoning tokens "are not visible via the API… they still occupy space in the model's context window and are billed as output tokens," and recommends reserving at least 25,000 tokens of headroom for reasoning plus outputs. Per the same guide, a simple query burns a few hundred reasoning tokens; a complex planning task burns "tens of thousands."
The pricing tells the rest. At o1's API launch on December 17, 2024: $15 / $60 per million tokens for input / output, the same price o1-preview had carried since September. Gemini 2.5 Flash makes the split visible in real time — the same model bills output at $0.60/M with thinking off and $3.50/M with thinking on, a ~6× step from one flag, with a configurable thinkingBudget from 0 to 24,576 tokens per turn. DeepSeek R1 (January 20, 2025, arXiv:2501.12948) was the cost-side answer: $0.55 input / $2.19 output per million tokens, with reasoning_content returned in an explicit <think> tag — the first major open-weight reasoning model. At launch that's roughly 27× cheaper input and 27× cheaper output than o1 by arithmetic, though OpenAI's mid-2025 o3 price cut to $2 / $8 narrowed the gap substantially.
The benchmark that made this concrete: ARC-AGI. In December 2024, ARC Prize reported OpenAI o3 achieving 75.7% at low-compute settings and 87.5% at high-compute settings — at a reported ~$20/task and ~$3,000/task respectively, with the high-compute run using 172× more tokens than the low. In April 2025, ARC Prize and Toby Ord revised the cost estimates upward by roughly 10×, to ~$200/task low-compute and ~$30,000/task high-compute. o3-high was subsequently dropped from the ARC Prize main chart because it exceeded the $10,000/task eligibility cap. Priced at o1's $60/M output rate, $3,000–$30,000 per task implies roughly 50M–500M tokens: a single hard reasoning task can burn tens to hundreds of millions of tokens of hidden thinking.
METR measured the trajectory. Their March 2025 paper introduced a metric — "50% time horizon," the length of task (measured in human-minutes) an AI completes with 50% success rate — and found it doubling roughly every 7 months from 2019 to 2025 across 11 frontier models. The January 2026 update ("Time Horizon 1.1") revised the post-2023 regime to 130.8 days — about 4.3 months, with the task suite expanded from 170 to 228 tasks and 8-hour-plus tasks doubled from 14 to 31. Horizon is roughly proportional to tokens per task — so task-level compute demand is doubling every ~4 months just from horizon extension, before any growth in users or sessions.
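Doubling times are easier to compare as per-year growth factors. A small sketch, assuming clean exponential growth and taking the text's premise that tokens per task scale roughly with horizon. The function name is mine.

```python
def annual_growth(doubling_days):
    """Growth factor over one year for a quantity with the given doubling time."""
    return 2 ** (365.25 / doubling_days)


old = annual_growth(7 * 365.25 / 12)   # ~7-month doubling from the March 2025 paper
new = annual_growth(130.8)             # revised post-2023 doubling time
print(f"7-month doubling:   {old:.1f}x per year")
print(f"130.8-day doubling: {new:.1f}x per year")
```

The revision moves per-task demand growth from a bit over 3× per year to nearly 7× per year under that proportionality assumption.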
Why this matters for hardware. Reasoning models don't produce more input tokens; they produce vastly more output tokens. Output-heavy workloads are decode-dominated, and decode is memory-bandwidth-bound — which means a reasoning-heavy inference mix shifts demand hard toward parts with high HBM bandwidth per dollar. The conventional wisdom that "reasoning = decode-bound" is widely repeated in inference-economics commentary but not rigorously quantified in any peer-reviewed source I can cite, so treat it as directional. The direct economic evidence is cleaner: Claude Opus 4.7 launched April 16, 2026 at $5/$25 (down from Opus 3's $15/$75) and GPT-5.4 lists around $2.50/$15 — reasoning-tier prices are falling and reasoning-tier token consumption per task is rising. The compounding is why inference is on track to dominate training compute within this decade.
The single most-quoted total-demand figure comes from Jensen Huang's GTC 2026 keynote: ~10,000× per-workload growth × ~100× usage = ~1,000,000× total AI compute demand in two years. This is a vendor CEO speaking at his own product event; treat the exact numbers as Nvidia's framing, not a neutral measurement.
The multipliers are real and several are company-published:
- Reasoning models invert the I/O ratio. Gemini 2.5 Flash exposes a "thinking budget" that, when enabled, moves output pricing from $0.60/M → $3.50/M — a material price step from a single flag.
- ARC-AGI-1 on o3-high was reported at roughly $3,000 per task in the initial ARC Prize breakdown, later revised upward by independent compute analyses. Estimates above $20K/task for o3-high have circulated; the exact number depends on revised compute counting.
- Anthropic's own multi-agent system uses roughly 15× the tokens of chat, with token count explaining ~80% of BrowseComp performance variance.
- MCP tool definitions run ~280–320 tokens each; a rack of five MCP servers with ~50 tools consumes 30–60K tokens before any user prompt.
- Long-horizon runtime: Anthropic's Claude Sonnet 4.5 launch post documented a ~30-hour continuous autonomous coding run producing ~11,000 lines of code in a single session.
Efficiency has not kept up. Over the same period:
- Anthropic's reported ARR has grown rapidly. End of 2025: ~$9B. March 2026 (Dylan on Dwarkesh): ~$20B. April 2026 (Dylan on Invest Like The Best): ~$35–40B with an asserted ~$10B of incremental monthly ARR. The $20B+ trajectory is corroborated by multiple trade outlets; the specific monthly-add figure is analyst assertion only.
- OpenAI API throughput: 6+ billion tokens per minute, per Sam Altman at DevDay 2025 — roughly 8.6 trillion tokens per day.
- Claude Code adoption: SemiAnalysis reports ~9.7% of daily GitHub commits contain Claude Code signatures as of March 2026, targeting 20% by EOY. SA publishes this in their free research; the methodology is theirs.
- H100 one-year rental contract price rose from ~$1.70/hr (Oct 2025) to ~$2.35/hr (March 2026) per SemiAnalysis's rental tracker.
Patel frames this with what he calls the Parkinson dynamic: each HBM capacity and bandwidth bump gets absorbed by designers expanding parameter counts, context lengths, and KV-cache footprints. Every efficiency win becomes next year's baseline. The broader economics history literature calls this Jevons' paradox.
Jevons is a regularity, not a law. It holds when three conditions stack: demand elasticity > 1 (cheaper means more total spend, not less), capability keeps compounding with compute (METR's task-horizon metric currently doubles every ~4 months), and the latent-use-case reservoir stays large (developer coding is maybe 5% saturated and is one of 20+ knowledge-work verticals). All three are intact right now — tokens got ~280× cheaper while aggregate spending rose ~100,000×, an implied log-log elasticity near 3. The dynamic breaks if horizon-doubling stalls past 12 months, if capex-to-ARR ratios invert, or if an efficiency gain large enough to outrun downstream capital accumulation ships all at once. All three are measurable; all three are on the catalyst list below.
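The elasticity figure follows from two numbers in that paragraph. A minimal sketch, assuming spend equals price times quantity and a constant-elasticity demand curve. The function name is mine.

```python
import math


def implied_elasticity(price_drop, spend_growth):
    """Log-log demand elasticity from a price drop and a spending increase."""
    quantity_growth = spend_growth * price_drop   # spend = price * quantity
    return math.log(quantity_growth) / math.log(price_drop)


e = implied_elasticity(price_drop=280, spend_growth=100_000)
print(f"implied elasticity: {e:.2f}")
# Break-even check: flat spending means quantity grew exactly as fast as price fell.
print(f"flat-spend elasticity: {implied_elasticity(280, 1):.2f}")
```

Anything above 1.0 means total spend rises as price falls. The inputs here land a little above 3.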
The lab math — this isn't projection, it's arithmetic
Patel's framing on Dwarkesh converts lab revenue into capacity using a rule-of-thumb conversion rate. His stated number: ~$10 billion per gigawatt per year of rental-equivalent compute spend. Against that:
- Anthropic current fleet: roughly 2.0–2.5 GW, targeting 5–6 GW by EOY 2026 across own capacity plus Bedrock/Vertex/Foundry, and roughly 10 GW by EOY 2027.
- OpenAI end-of-2026: roughly 6 GW on the same analyst framework; EOY 2027 ~10 GW.
- Global AI deployed today: ~20 GW per the Dwarkesh framing.
Even discounting each of these by 30% for analyst framing, the direction is unambiguous: on the figures above, the two top frontier labs alone are modeled at roughly 20 GW by end of 2027 (about 14 GW after the haircut), against a roughly 20 GW global AI fleet today. Other model providers (Google, xAI, Meta) and sovereign AI buildouts (G42, HUMAIN, IndiaAI, Scaleway, Nscale, Mistral) stack on top. Google's Gemini-related ARR in Q4 2025 was described by Patel as going from near-zero to ~$5B in a single quarter; Google has not independently disclosed a comparable number.
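The rule-of-thumb conversion can be run across the fleet figures listed above. This is a sketch of the arithmetic only: the $10B per GW-year rate and the 30% haircut come from the text, while the dictionary layout and function name are mine.

```python
SPEND_PER_GW_YEAR = 10e9   # USD, Patel's rule of thumb as quoted above

# (low, high) GW figures from the list above.
FLEETS = {
    "Anthropic today": (2.0, 2.5),
    "Anthropic EOY 2026": (5.0, 6.0),
    "Anthropic EOY 2027": (10.0, 10.0),
    "OpenAI EOY 2026": (6.0, 6.0),
    "OpenAI EOY 2027": (10.0, 10.0),
}


def rental_equivalent_spend(gw, haircut=0.0):
    """Annual compute spend in USD implied by a fleet size, after a haircut."""
    return gw * (1 - haircut) * SPEND_PER_GW_YEAR


for name, (low, high) in FLEETS.items():
    full = [rental_equivalent_spend(g) / 1e9 for g in (low, high)]
    cut = [rental_equivalent_spend(g, haircut=0.30) / 1e9 for g in (low, high)]
    print(f"{name:<20} ${full[0]:.0f}-{full[1]:.0f}B/yr   "
          f"after 30% haircut: ${cut[0]:.0f}-{cut[1]:.0f}B/yr")
```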
Patel's cleanest investable frame is what he calls the X − 1 problem: frontier labs plan capacity for the demand they model (X), while each layer of the supply chain plans conservatively (X − 1 or even X ÷ 2). The gap between what labs want and what suppliers are building is where margin accrues at the binding layer.
Six-week update — Patel on Invest Like The Best (April 23, 2026)
- Anthropic ARR: described as ~$35–40B, headed to $40–45B; monthly incremental ARR ~$10B. Up from $4–6B/month framing in the March episode.
- Anthropic gross margin: stated as a floor of ~72%, derived by Patel from the assumption that all of Anthropic's incremental compute went to inference. If some went to R&D (which Mythos and Opus 4.7 suggest is true), the actual inference GM is lower. Patel's opening anchor was a ~35% figure from a leaked early-2026 Anthropic funding doc.
- Hopper useful life: described as 7–8 years based on observed re-signings of 3–4-year-old H100 and A100 clusters. Economic life, not semiconductor life.
- Mythos: described by Patel as Anthropic's next-frontier model, internally available since February 2026, not publicly released as of April 23, and deployed selectively (notably to cybersecurity customers). Token cost described as 5–10× Opus 4.7. None of this is verifiable without Anthropic confirming; Anthropic has neither confirmed nor denied publicly as of this writing.
- Model "hoarding" — Patel's predicted trajectory of narrowing frontier-model distribution to high-value enterprise customers. This is a prediction, not an observation.
- CPUs sold out — Patel's assertion that (1) RL environments and (2) serving agent-generated code are creating a new CPU demand leg. Consistent with hyperscaler capex guidance rising but not independently quantified in this episode.
- TSMC 2028 capex ~$100B — Patel's own forecast, up from TSMC's 2026 guided ~$57B. TSMC has not issued 2028 capex guidance.
- "All tier labs sold out of tokens" — Patel's framing of demand exceeding supply even at the tier-2 and tier-3 model level. Not independently confirmed; consistent with API availability reports but not rigorous.
- Public-backlash prediction: large-scale protests within 3 months — Patel's explicit forecast. The specific Sam Altman home-incident claims ("Molotov twice in two weeks") are unverified against credible security reporting and should be treated as hearsay until confirmed.
- Phantom GDP — a framework Patel attributes to his own team's economist. Useful as a mental model but no external academic literature yet.
The pattern is unambiguous: a well-positioned analyst is revising his own numbers in one direction only. But "well-positioned" is not the same as "audited." Six weeks from now we will have more data.
Not everyone uses it this way
My Claude Code workflow is one of four fundamentally different ways organizations consume LLM inference. Each creates demand for the same scarce GPU hours — but the infrastructure between the user and the GPU looks radically different.
Four Ways to Consume Inference
Same scarce GPU hours — radically different infrastructure paths
- Consumer product (Claude.ai, ChatGPT). Inference: Anthropic / OpenAI cloud. Tools: none (chat only).
- Agentic developer tool (Claude Code, Codex). Inference: same cloud + local tool loop. Tools: file ops, bash, git, browser.
- Enterprise backend API (Stripe on Bedrock, Ramp product features). Inference: AWS VPC (never hits public internet). Tools: Lambda, SageMaker, custom.
- Internal productivity agent (Ramp Inspect — 30% of merged PRs). Inference: Modal sandbox + LLM backend. Tools: full dev env (DB, CI/CD, feature flags).
Every mode creates demand for the same scarce resources — GPU hours, HBM bandwidth, optical networking, fab capacity, EUV tools. The difference is how many layers of infrastructure sit between the user and the GPU.
Direct consumer products — Claude.ai, ChatGPT — are the simplest path. You type, the model responds.
Agentic developer tools — Claude Code, Codex, Cursor — add the tool-use loop. The LLM still runs in the cloud; tools execute locally.
Enterprise backend APIs — AWS Bedrock, Google Vertex, Azure OpenAI — are how products like Stripe and Ramp embed AI behind features. Requests flow through a VPC from app server to inference endpoint, never hitting the public internet.
Internal productivity agents are the most interesting category. Ramp built an agent called Inspect that reportedly writes a material share of their merged PRs (InfoQ, Jan 2026 reported ~30%). Engineers invoke it in Slack; it spins up a sandboxed dev environment on Modal, clones the repo, writes code, runs tests, and pushes a PR.
All four modes create demand for the same scarce physical resources — GPU hours, HBM bandwidth, optical networking, CoWoS capacity, EUV passes. The consumption surface is expanding in every direction simultaneously.
What about robots and voice?
Two questions every reader has at this point: don't the humanoid wave (Optimus, Figure, 1X) and the LLM-voice transition (Siri, Alexa+, ChatGPT Voice) compound on top of all this? Yes — but the bytes go to different places than the intuition suggests, and that matters for which layers in this stack catch the demand.
Robotics inference is overwhelmingly onboard, not in the datacenter. The four constraints — closed-loop control latency (10–100 ms), battery power, no-Wi-Fi reliability, and per-query OPEX at fleet scale — force System-1 motor control onto edge silicon, and 2026's production VLAs put System-2 reasoning there too. Figure's Helix runs ~7B + 80M onboard on embedded GPUs. NVIDIA's GR00T N1.7 targets Jetson AGX Thor (128 GB LPDDR5X, not HBM). Tesla AI5 carries up to 192 GB LPDDR5X for Optimus + FSD. A million Optimus units does not equal a million more H100s — it equals a million Tesla AI5 chips fabbed at TSMC + Samsung and a lot more LPDDR, which competes with HBM for the same big-3 wafer pool rather than directly extending HBM demand.
The HBM tailwind from robotics lives on the training side. Tesla's Cortex 2.0 ramps to 500 MW of NVIDIA GPUs through mid-2026; NVIDIA's published GR00T-N1-2B pretraining ran ~50K H100-hours per generation; Cosmos world-model training is comparable to frontier video-gen. Real, recurring, HBM-consuming — but it scales with R&D intensity and frontier model size, not 1:1 with deployed-robot count. A fleet of one million humanoids running the same shipped checkpoint adds ~zero training compute per unit.
Voice splits three ways, and only one of them routes to the original beneficiary stack. Alexa+ runs Anthropic Claude on >1M Trainium2 chips at AWS via Project Rainier — real cloud LLM demand, but it routes through Anthropic + Marvell + TSMC, not NVIDIA. ChatGPT Realtime is the cleanest extension of the original thesis: native audio-to-audio at ~16 audio tokens/sec/user, 5–10× the per-minute token volume of text chat, on a 900M-WAU base — directly NVIDIA/HBM/Azure-positive and growing fast. Apple Intelligence routes most queries to a 3B on-device model and escalates to Private Cloud Compute on Apple-Silicon servers (M5 Ultra, Houston) — a real net-new datacenter footprint, but on Apple's unified-LPDDR architecture, not NVIDIA HBM. So Apple is a TSMC-N3P/N2 tailwind and an HBM-neutral story; the ChatGPT-fallback path inside Siri is the only piece that touches the original thesis directly.
The cleanest 10× extension of the five-multiplier compounding model into a new modality isn't robots — it's always-on background voice agents: a Personal-Context Siri or Anthropic-Computer-Use Alexa that runs persistently rather than reactively, reproducing the same reasoning + tool-use + multi-agent + MCP-context + long-horizon stack on a billion phones. No assistant ships this at scale yet. Apple's iOS 27 (September 2026) is the live catalyst to watch; Alexa+ feature expansion is the secondary one. The robotics version of the same catalyst — VLAs scaling past 100B parameters and being forced to cloud System 2 — is plausibly 2027–2028. Until either ships, treat embodied AI as a TSMC + LPDDR + training-amortized HBM story, not a per-unit datacenter inference story.
The thesis
Every layer of the AI supply chain has a different lead time. As each shorter-lead-time problem gets solved, the constraint rotates down to whatever takes longer to scale. Margin flows to whoever owns the layer that cannot be scaled on demand-signal timelines.
Here are the numbers that matter right now:
Hyperscaler capex 2026
$600B+
Big-4 combined. ~$1T full supply chain. 30% of this now flows to memory — roughly $180B — with Nvidia margin stacked inside.
H100 rental 6-mo move
+40%
$1.70/hr (Oct 2025) → $2.35/hr (Mar 2026). Labs signing 2–3 yr deals at $2.40. GPUs are appreciating, not depreciating — direct refutation of the Burry short.
Demand compounding in 2 yrs
1,000,000×
Jensen, GTC 2026: per-workload demand up 10,000×; usage up 100×; total up 1M×. Efficiency at 10×/year loses the race.
ASML 2030 hard ceiling
~200 GW
~100 tools/year max, ~700 cumulative fleet, 3.5 tools per GW of Rubin. Altman's 52 GW/yr target = 25% of all global EUV capacity.
These are a mix of primary-disclosed and analyst-derived figures. Hyperscaler capex is committed capital (earnings disclosures). Published H100 rental prices come from SemiAnalysis's tracker. ASML's tool-production cadence is public per their disclosures, but the implied "200 GW ceiling" embeds analyst assumptions about how many tools you need per gigawatt.
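The four cards are simple arithmetic on their stated inputs, so they can be re-derived in a few lines. This is a sanity-check sketch rather than a model: every number is quoted from the cards above, the variable names are mine, and the last line sets the 52 GW figure against the ~200 GW ceiling the same way the card does.

```python
# Re-derive the headline cards from their stated inputs.
# Every number is quoted from the cards; only the variable names are assumptions.

capex_usd_b = 600                      # Big-4 hyperscaler capex 2026, in $B
memory_usd_b = capex_usd_b * 0.30      # analyst-estimated share flowing to memory

h100_oct_2025, h100_mar_2026 = 1.70, 2.35          # $/GPU-hour
rental_move = h100_mar_2026 / h100_oct_2025 - 1    # fractional 6-month change

demand_growth = 10_000 * 100           # per-workload multiplier x usage multiplier
efficiency_gain = 10 ** 2              # 10x per year, compounded over two years
net_demand_growth = demand_growth / efficiency_gain

gw_ceiling = 700 / 3.5                 # cumulative EUV fleet / tools per GW
altman_share = 52 / gw_ceiling         # 52 GW set against that ceiling

print(f"memory capex:       ${memory_usd_b:.0f}B")
print(f"H100 rental move:   {rental_move:+.0%}")
print(f"demand growth:      {demand_growth:,}x gross, {net_demand_growth:,.0f}x net of efficiency")
print(f"EUV-implied ceiling: {gw_ceiling:.0f} GW, 52 GW is {altman_share:.0%} of it")
```

The rental move comes out slightly under the rounded +40% on the card, and the "net of efficiency" line is the sense in which 10×/year efficiency loses the race: demand still outruns it by four orders of magnitude over the two years.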
The bottleneck keeps moving — and that's the investment thesis
The timeline isn't just describing what's scarce. It's describing where margin flows over time, and a serious investor rotates positions with it.
Bottleneck Rotation
The constraint moves up the stack each year, and each subsequent layer has a longer lead time. Capital rotates with it.
| Window | Binding constraint | Lead time | Who owns it | Status |
|---|---|---|---|---|
| 2023–24 | CoWoS packaging | Months | TSMC (captured) | Played |
| 2024–25 | Power + data centers | 1–2 years | VRT · CRWV · WULF / CIFR | Played |
| 2026–27 | HBM + N3 logic wafers | 2–3 years | SK Hynix · MU · TSMC · NVDA | Live |
| 2027–28 | HBM4 + N2 wafers + packaging | 2–3 years | SK Hynix · TSMC · NVDA · AVGO | Live |
| 2028–30 | ASML EUV tool throughput | 3–5 years | ASML · Zeiss · Cymer | Entry window |
The key move is matching position duration to bottleneck timing. Neoclouds and power plays are 2026 duration. Memory and foundry are 2026–28. ASML is the longest-hold position precisely because the constraint binds last. Its 2030 fleet ceiling is physics: ASML's disclosed EUV shipment cadence implies a cumulative fleet on the order of several hundred tools by 2030. The "3.5 tools per GW of Rubin" figure used to derive the implied ~200 GW ceiling (~700 tools ÷ 3.5 tools per GW) is Patel's own AI Accelerator Model output; the input (tool shipment cadence) is primary, the conversion is analyst.
Per Patel, the consumer-to-AI capacity shift in the semiconductor supply chain is now tapped out: Nvidia is the largest customer at both TSMC and SK Hynix. From here, scaling means new fab capacity rather than reallocating existing lines — which makes ASML's cadence the effective ceiling.
Following the hardware down
We've traced the software path — prefill to decode to tool loop and back. Now let's trace the physical path.
The data center — racks, optics, and the neocloud model
What an AI data center actually looks like
Power density. A traditional enterprise rack draws 5–10 kW. An AI GPU rack draws 40–100+ kW. An Nvidia DGX GB200 NVL72 cabinet (72 Blackwell GPUs in a single liquid-cooled rack) draws ~120 kW.
Cooling. Air cooling fails above ~30 kW per rack. AI data centers require direct liquid cooling. Nvidia's GB200 NVL72 is liquid-cooled by design.
Scale. Meta's announced Prometheus and Hyperion clusters target several-hundred-thousand GPU scale for next-generation training. At 700W per Blackwell, a 100K cluster draws 70 MW from GPUs alone; total facility power approaches 100–150 MW.
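The scale arithmetic can be sketched as below. The 700 W per GPU and the 100–150 MW facility range are the post's figures; the "overhead factor" (facility power divided by GPU power) is a label I am introducing for illustration, not a quoted metric.

```python
def gpu_power_mw(gpu_count: int, watts_per_gpu: float) -> float:
    """Aggregate GPU draw in megawatts (GPUs only, no CPUs, network, or cooling)."""
    return gpu_count * watts_per_gpu / 1e6


gpu_mw = gpu_power_mw(100_000, 700)

# Facility power quoted in the post, expressed as a multiple of GPU-only draw.
overhead_low = 100 / gpu_mw
overhead_high = 150 / gpu_mw

print(f"GPU draw: {gpu_mw:.0f} MW")
print(f"implied facility overhead factor: {overhead_low:.2f}x to {overhead_high:.2f}x")
```

Read this way, the 100–150 MW range implies that everything other than the GPUs (host CPUs, networking, cooling, power conversion) adds roughly 40–115% on top of the GPU draw.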
AI Data Center Architecture
The networking hierarchy — bandwidth drops, distance grows at each layer
| Link medium | Bandwidth | Reach |
|---|---|---|
| On-board traces / NVSwitch | 1.8 TB/s | < 1m |
| Copper DAC / Active optical cable | 400-800 Gbps/port | 1-5m |
| Active optical cables | 51.2 Tbps (spine) | 5-100m |
| Single-mode fiber + pluggable transceivers | 400-800 Gbps/fiber | 100m-80km |
| Fiber + amplifiers + coherent DSP | Tbps aggregate (WDM) | 80-10,000+ km |
Power delivery
Cooling evolution
- Air cooling: Hot/cold aisle, CRACs. Insufficient above ~30 kW/rack.
- Direct liquid cooling: Cold plates on GPUs/CPUs, liquid loops. Handles 40-100+ kW/rack. Required for H100/B200.
- Immersion cooling: Servers submerged in dielectric fluid. Best thermal performance but complex maintenance.
The networking hierarchy
Intra-node: NVLink. Within a cabinet, Nvidia's NVLink 5.0 connects GPUs at 1.8 TB/s bidirectional. NVSwitch chips act as crossbar switches.
Intra-cluster: InfiniBand. Between servers, InfiniBand runs 400–800 Gbps per port. Nvidia acquired Mellanox in 2020 for $6.9B specifically for InfiniBand.
Rack-to-rack and beyond: optical. Beyond ~5 meters, copper can't hold signal. Active optical cables convert electrical to light via VCSEL or edge-emitting lasers, through fiber, back to electrical.
Between clusters: coherent optics. For campus or metro-scale, coherent transceivers use DSP and advanced modulation to push 400–800 Gbps per fiber pair.
Why optical is a bottleneck. Cluster all-reduce generates O(N) traffic, but potential paths scale O(N²). Network architects use dragonfly and rail-optimized topologies to manage it, but optical demand grows superlinearly with cluster size. Patel described optical transceivers on Dwarkesh as "more unreliable than the GPUs" — replugged across cluster lifetime, effectively consumables. Coherent and Lumentum make the parts; Situational Awareness LP's reported 13F shows Lumentum and Coherent as top-10 positions.
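The O(N) versus O(N²) contrast can be made concrete with a small sketch. The post does not name an all-reduce algorithm; I am assuming a ring all-reduce here, where each GPU sends 2(N−1)/N times the payload, so the function names and the 1 GB payload are illustrative only.

```python
def pairwise_paths(n_gpus: int) -> int:
    """Distinct GPU pairs that could need a path: N(N-1)/2, which grows as O(N^2)."""
    return n_gpus * (n_gpus - 1) // 2


def ring_allreduce_total_bytes(n_gpus: int, payload_bytes: float) -> float:
    """Total bytes sent across the cluster by one ring all-reduce (assumed algorithm).

    Each GPU sends 2 * (N - 1) / N * payload, so the total is 2 * (N - 1) * payload,
    which grows as O(N).
    """
    per_gpu = 2 * (n_gpus - 1) / n_gpus * payload_bytes
    return n_gpus * per_gpu


PAYLOAD = 1e9  # hypothetical 1 GB gradient buffer
for n in (8, 1_000, 100_000):
    traffic_gb = ring_allreduce_total_bytes(n, PAYLOAD) / 1e9
    print(f"N={n:>7,}: traffic {traffic_gb:>12,.0f} GB, potential paths {pairwise_paths(n):>16,}")
```

Going from 8 to 100,000 GPUs multiplies traffic by about 14,000 but multiplies the number of potential paths by about 180 million, which is the gap the topology choices are there to manage.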
The neocloud business model
A "neocloud" is a cloud provider built specifically for GPU workloads — CoreWeave, Lambda, Together, Crusoe, Nebius. Acquire GPUs, build optimized data centers, rent to AI labs.
The economics Patel describes on Dwarkesh: H100 build-cost TCO around $1.40/hr amortized over five years, with some 2–3-year rental contracts signing at ~$2.40/hr with AI labs. That is a ~70% markup over cost, or a ~42% gross margin (($2.40 − $1.40) / $2.40), on hardware roughly two years into its five-year life. Both the build cost and the rental rate come from SemiAnalysis's paywalled AI Cloud TCO Model; we're trusting their aggregation.
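The spread between $1.40 of cost and $2.40 of price gives two different percentages depending on the denominator, so it helps to compute both. A minimal sketch using only the two rates quoted above; the 100%-utilization assumption behind the lifetime figure is mine.

```python
def gross_margin(price: float, cost: float) -> float:
    """Profit as a fraction of price."""
    return (price - cost) / price


def markup(price: float, cost: float) -> float:
    """Profit as a fraction of cost."""
    return (price - cost) / cost


COST_PER_HR, PRICE_PER_HR = 1.40, 2.40   # $/GPU-hour, from the post
HOURS_5Y = 5 * 365 * 24                  # five-year amortization window

# Assumption: the GPU is billed every hour of the five years (100% utilization).
lifetime_cost = COST_PER_HR * HOURS_5Y

print(f"gross margin: {gross_margin(PRICE_PER_HR, COST_PER_HR):.1%}")
print(f"markup:       {markup(PRICE_PER_HR, COST_PER_HR):.1%}")
print(f"implied five-year cost per GPU: ${lifetime_cost:,.0f}")
```

Any utilization below 100% raises the effective cost per billed hour, so both percentages are best read as upper bounds.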
GPUs that appreciate
The conventional bear case (associated with Michael Burry's summer 2024 semiconductor short position) says GPUs depreciate as better chips arrive. The counter-argument Patel and SemiAnalysis have articulated: in a supply-constrained market, price is set by value-extractable-today, not by replacement cost.
The data point that supports the appreciation frame: SA's GPU rental tracker shows H100 1-yr contract pricing rising ~40% between October 2025 and March 2026. If supply were overbuilt, prices would be falling.
Why the crunch favors frontier models
The tempting frame here is Alchian-Allen: adding a fixed per-unit cost to two goods narrows their ratio and shifts consumption toward the higher-quality variant (the "shipping oranges to New York" argument). It's directionally right but technically loose — the theorem requires the same absolute cost added to both goods, which isn't how compute works. Opus burns more HBM per token than Sonnet; when the H100 rental rate rises, Opus's per-token cost rises more in absolute terms, not equally. So the clean Alchian-Allen math breaks.
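Why the theorem needs the same absolute cost on both goods is easy to see numerically. The sketch below uses hypothetical per-million-token prices (15 and 3, chosen only so the ratios are readable, not real list prices) and compares a fixed add-on against a proportional increase.

```python
def relative_price(premium: float, standard: float,
                   fixed_add: float = 0.0, scale: float = 1.0) -> float:
    """Premium-to-standard price ratio after a cost shock.

    fixed_add: the same absolute amount added to both goods (the Alchian-Allen case)
    scale:     both prices multiplied by the same factor (the proportional case)
    """
    return (premium * scale + fixed_add) / (standard * scale + fixed_add)


# Hypothetical prices, not quoted from any price list.
PREMIUM, STANDARD = 15.0, 3.0

baseline = relative_price(PREMIUM, STANDARD)
fixed_shock = relative_price(PREMIUM, STANDARD, fixed_add=3.0)
proportional_shock = relative_price(PREMIUM, STANDARD, scale=1.4)

print(f"baseline ratio:           {baseline:.2f}")
print(f"after fixed +3 on both:   {fixed_shock:.2f}")
print(f"after +40% on both:       {proportional_shock:.2f}")
```

The fixed add-on narrows the ratio, which is the substitution effect the theorem describes. A proportional increase, which is closer to how a higher GPU rental rate feeds through to per-token cost, leaves the ratio where it started.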
The cleaner version is demand-side. As tasks shift toward reasoning, agents, and long-horizon runs, the marginal utility of a better model rises disproportionately — the difference between "task completes" and "30-hour job fails at hour 27" isn't linear in quality. Frontier-model price power rises not because Sonnet gets relatively more expensive, but because the workload mix is shifting toward tasks where Opus-level capability is load-bearing, and agent users will pay almost anything to avoid re-running a long job. That's complementary-capital economics (better model × more compute-per-task = exponential quality gains), not shipping costs.
The GPU — from triangles to tensors
My request was processed by specific hardware — likely an H100 or Blackwell inside one of those liquid-cooled racks. But how did a chip designed to render triangles become the engine of the AI revolution?
From pixel pushers to tensor engines
The GPU concept (1990s). Early 3D graphics cards were fixed-function pipelines — hardwired circuits for transforming triangles into pixels.
Programmable shaders (early 2000s). GPUs gained small programmable cores that could execute custom programs on each vertex and pixel. Researchers noticed these could be repurposed for non-graphics compute.
GPGPU and CUDA (2006–2007). Before CUDA, GPU compute meant disguising math as graphics. Nvidia's CUDA, launched with the Tesla architecture in 2006, exposed GPU parallel processors through a C-like language.
ATI (later AMD) bet on open standards (OpenCL). CUDA won on tools, documentation, libraries, and academic adoption. By 2012, when deep learning took off, the CUDA ecosystem was already six years deep.
The AlexNet moment
In September 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered ImageNet with a deep CNN trained on two Nvidia GTX 580s. AlexNet achieved a top-5 error rate of 15.3%, versus the second-place entry at 26.2%. The field pivoted. Every researcher who wanted to replicate needed Nvidia GPUs and CUDA.
The architecture generations
Nvidia GPU Architecture Timeline
From triangles to tensors — 20 years of GPU evolution for AI
| Architecture | CUDA cores | Memory | Notes |
|---|---|---|---|
| Tesla | 128 | GDDR3 | First CUDA architecture; GPGPU becomes possible |
| Fermi | 512 | GDDR5 | ECC memory, true IEEE 754 double-precision; HPC credibility |
| Kepler | 2,880 | GDDR5 | Dynamic parallelism, GPU Direct. AlexNet trained on GTX 580 (prior gen) ignites deep learning |
| Pascal | 3,840 | HBM2 / GDDR5X | NVLink debut, HBM2 support (GP100). First GPU explicitly targeting deep learning training |
| Volta | 5,120 | HBM2 | Tensor cores: 5.12× faster mixed-precision. V100 becomes the AI standard. The inflection point. |
| Ampere | 6,912 | HBM2E | 3rd-gen tensor cores, sparsity support (2×), MIG partitioning, BF16. A100 dominates training. |
| Hopper | 16,896 | HBM3 | Transformer Engine (FP8 automatic), NVLink 4.0 (900 GB/s). H100 becomes the AI currency. |
| Blackwell | ~21,000 | HBM3E | Two-die design, 2nd-gen Transformer Engine, NVLink 5.0 (1.8 TB/s), HBM3E. B200 doubles Hopper throughput. |
| Rubin | TBD | HBM4 | HBM4, next-gen NVLink, new architecture. Jensen's roadmap: annual cadence from here. |
Volta (2017) introduced Tensor Cores — specialized matrix multiplication units doing mixed-precision matmul at roughly 5× standard CUDA cores.
Ampere (2020) added third-gen tensor cores with bfloat16, TF32, and 2:4 structured sparsity. MIG (Multi-Instance GPU) partitioned a single A100 into up to seven instances.
Hopper (2022) brought the Transformer Engine with automatic FP8 management. NVLink 4 hit 900 GB/s bidirectional.
Blackwell (2024) is a two-die design connected by a 10 TB/s chip-to-chip link. FP4 support. HBM3E. SemiAnalysis's InferenceMAX v2 published figures for GB300 NVL72 inference throughput at specific interactivity points — the headline number ("up to ~100× a strong H100 baseline") is their benchmark, at their chosen workload; read the dashboard for the full scenario matrix.
Rubin (2026). HBM4 with doubled 2048-bit interface, next-gen NVLink, and a new architecture. Jensen has publicly committed to annual cadence.
The CUDA moat
CUDA is more than a language. It's cuBLAS, cuDNN, cuFFT, NCCL, TensorRT, Triton, two decades of tooling, and the majority of public ML research.
Custom silicon: hyperscaler ASICs
Google TPU is the most credible alternative. Ironwood (TPU v7) was announced at Google Cloud Next and reached GA in November 2025 — 9,216 chips per ICI superpod, 192 GB HBM per chip, 7.37 TB/s HBM bandwidth, 42.5 ExaFLOPS peak per pod. Google and Anthropic announced on October 23, 2025 that Anthropic would expand its TPU use to "up to 1 million TPUs" and "well over a gigawatt of compute" online in 2026, framed as "tens of billions of dollars." Subsequent reporting (April 2026) suggests the arrangement expanded to ~3.5 GW with Broadcom as co-designer; those later numbers are trade-press estimates rather than primary disclosures. SemiAnalysis's TCO math — Ironwood ~44% lower per chip versus GB200 on Google's internal cost basis, ~30–41% lower external rental — is their derivation.
AWS Trainium. Trainium2 went GA December 2024; the Trn2 UltraServer packs 64 chips at 83.2 PFLOPS / 6 TB HBM / 185 TB/s. Trainium3 launched at re:Invent December 2025 on TSMC 3nm — 2.52 PFLOPS FP8 per chip, 144 GB HBM3e, 4.9 TB/s, 4.4× perf and 4× perf-per-watt over a Trn2 UltraServer. Anthropic's "Project Rainier" deployment runs ~500K Trainium2 chips with AWS publicly projecting >1M by end of 2025.
Meta MTIA. Meta announced a rename and refresh in March 2026: the roadmap is now MTIA 300/400/450/500 on roughly a 6-month cadence, not "MTIA v3." MTIA 300 is already in production for ranking and recommendations; MTIA 400 is in lab testing for GenAI inference; MTIA 450/500 are planned. Meta's primary LLM inference serving still runs on Nvidia, with MTIA positioned for ranking/ads today and GenAI inference as a future migration.
Broadcom custom ASIC is the BOM-margin beneficiary behind Google TPU and Meta MTIA — they co-design the silicon and book the silicon revenue. For full-year FY2025 (ending November 2025), Broadcom reported $64B total revenue (+24% YoY) and $20B of AI semiconductor revenue (+65% YoY), with Q4 FY25 AI semi revenue at $6.5B (+74% YoY). Broadcom disclosed five named XPU customers — Google, Meta, OpenAI, Arm/SoftBank, ByteDance — and an AI backlog of ~$73B, including a $10B TPU rack order booked in Q3 and an $11B follow-on in Q4.
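The pod-level and server-level figures quoted above for Ironwood and Trainium2 can be divided back down to per-chip numbers as a consistency check. A sketch using only the quoted aggregates; the results inherit the rounding in those aggregates, and the precision or sparsity mode behind each FLOPS figure is not stated in the post.

```python
# Ironwood (TPU v7) superpod, as quoted above.
IRONWOOD_CHIPS = 9_216
IRONWOOD_POD_EXAFLOPS = 42.5
IRONWOOD_HBM_GB_PER_CHIP = 192

ironwood_pflops_per_chip = IRONWOOD_POD_EXAFLOPS * 1_000 / IRONWOOD_CHIPS
ironwood_pod_hbm_pb = IRONWOOD_CHIPS * IRONWOOD_HBM_GB_PER_CHIP / 1e6

# Trn2 UltraServer, as quoted above.
TRN2_CHIPS = 64
trn2_pflops_per_chip = 83.2 / TRN2_CHIPS
trn2_hbm_gb_per_chip = 6_000 / TRN2_CHIPS        # from the rounded 6 TB figure
trn2_bw_tbs_per_chip = 185 / TRN2_CHIPS

print(f"Ironwood: {ironwood_pflops_per_chip:.2f} PFLOPS/chip, {ironwood_pod_hbm_pb:.2f} PB HBM per pod")
print(f"Trn2:     {trn2_pflops_per_chip:.2f} PFLOPS/chip, "
      f"{trn2_hbm_gb_per_chip:.0f} GB HBM/chip, {trn2_bw_tbs_per_chip:.2f} TB/s/chip")
```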
Inference-specialist silicon — the challengers to HBM-centric GPUs
A separate class of silicon argues that inference should be its own category, not a subset of training hardware. The thesis: decode is so bandwidth-bound that eliminating HBM entirely in favor of on-die SRAM buys a step-change in tokens/second per user.
Groq (LPU). Language Processing Unit, SRAM-only architecture — 230 MB on-die SRAM (SRAM is 6 transistors per bit, ~25–50× less bit-dense than DRAM at the same node, which is why 230 MB and not 230 GB), ~80 TB/s on-die bandwidth, 188 TFLOPS FP16 / 750 TOPS INT8 per chip, with deterministic compilation (the compiler knows every memory move ahead of time). On Llama-3 70B in 2024 Groq held the public tokens/sec record. By April 2026 the leaderboard has shifted: Artificial Analysis measures Groq at ~276 tok/s on Llama-3.3 70B, with Cerebras and SambaNova now ahead. NVIDIA and Groq announced a $20B arrangement in late December 2025 — structurally an asset purchase plus non-exclusive IP license plus Jonathan Ross moving to NVIDIA, with Simon Edwards becoming Groq CEO. Groq continues to operate independently. Secondary reporting framed this as "NVIDIA acquires Groq" but the documented deal is closer to acqui-hire plus IP cross-license than a full acquisition.
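What 230 MB per chip means for deployment can be sketched with a weights-only count. The assumptions are mine and deliberately generous: a 70B-parameter model, 1 or 2 bytes per parameter, every byte of SRAM available for weights, and no allowance for KV cache or activations.

```python
import math


def chips_to_hold_weights(n_params: float, bytes_per_param: float,
                          sram_bytes_per_chip: float) -> int:
    """Minimum chip count whose combined on-die SRAM fits the weights alone."""
    return math.ceil(n_params * bytes_per_param / sram_bytes_per_chip)


GROQ_SRAM_BYTES = 230e6   # 230 MB per chip, from the post
PARAMS_70B = 70e9

chips_int8 = chips_to_hold_weights(PARAMS_70B, 1, GROQ_SRAM_BYTES)
chips_fp16 = chips_to_hold_weights(PARAMS_70B, 2, GROQ_SRAM_BYTES)

print(f"70B weights at 1 byte/param:  at least {chips_int8} chips")
print(f"70B weights at 2 bytes/param: at least {chips_fp16} chips")
```

These are lower bounds, not deployment sizes. The point is the order of magnitude: an SRAM-only design spreads one model across hundreds of chips, where an HBM-based accelerator holds it on a handful.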
Cerebras. WSE-3 is wafer-scale — 900,000 cores, 44 GB on-wafer SRAM, 21 PB/s memory bandwidth, 4 trillion transistors on TSMC 5nm. Published throughput numbers: Llama-3.1 405B at 969 tok/s; Llama-4 Scout at ~2,600 tok/s; Llama-4 Maverick >2,500 tok/s — third-party corroborated by Artificial Analysis. The OpenAI partnership: announced January 2026 at $10B for 750 MW through 2028; expanded in April 2026 to >$20B with OpenAI receiving equity warrants and a $1B cash commitment toward data-center build. Cerebras refiled its S-1 publicly on April 17, 2026 targeting Nasdaq ticker CBRS at a reported >$30B valuation; the IPO had been delayed since September 2024 over CFIUS review of G42's stake.
SambaNova. Reconfigurable dataflow with a three-tier memory hierarchy (on-chip SRAM + HBM + DDR), not pure SRAM. SN40L chip, primary customers in sovereign AI and regulated enterprise. Artificial Analysis measures Llama-4 Scout at ~697 tok/s and Llama-4 Maverick at ~655 tok/s. Still private; no recent primary valuation disclosure.
Etched, Tenstorrent — flagged as pre-production. Etched's Sohu transformer-ASIC was announced June 2024 with "1 server replaces 160 H100s" marketing; as of April 2026 no customer shipments, no independent benchmarks, no production deployments. Treat the headline numbers as marketing until corroborated. Tenstorrent (Jim Keller, CEO) ships Wormhole and Blackhole cards for order — Blackhole p150 was downgraded from 140 to 120 Tensix cores via firmware in early 2026. $3.2B valuation on the last round; design wins with LG, Hyundai, Samsung.
AMD. MI300X (192 GB HBM3, 5.3 TB/s) shipped to Microsoft Azure, Meta, and Oracle for production workloads. MI325X slipped its intended H200-timed launch and was largely skipped for B200 — SemiAnalysis attributed it to soft demand and reduced HBM. MI350X / MI355X (CDNA4, TSMC N3P, 288 GB HBM3E, 8 TB/s, native FP4/FP6) launched at AMD's Advancing AI event in June 2025 with H2 2025 ramp; the 1.4 kW TDP figure refers to the liquid-cooled MI355X specifically. The real moat gap, per SemiAnalysis's 2024–2025 inference reports: ROCm CI coverage has run at under 10% of NVIDIA's, with meaningful accuracy regressions catching users only because outside observers (SemiAnalysis) published them. Named hyperscaler production adoption remains thin relative to marketing.
NVIDIA Dynamo is NVIDIA's answer — an open-source (Apache-2.0) disaggregated-serving framework announced at GTC March 2025. Dynamo sits above inference engines (TensorRT-LLM, vLLM, SGLang, PyTorch are all supported backends), handling KV-cache routing, smart request routing, and async GPU data transfer. This is the counter-move against SGLang and vLLM becoming the default layer: NVIDIA wants the orchestration layer too, not just the hardware.
Investment takeaway. Nvidia holds the CUDA moat durably for this horizon but isn't unassailable. Three hedges matter: Broadcom as the "if hyperscaler ASICs scale" play (booking TPU/MTIA silicon revenue today); Cerebras and the specialist-silicon complex as the "if inference decouples from training hardware" play (OpenAI's >$20B commitment is the single cleanest external validation); AMD as the "if ROCm catches up" lottery ticket. The specialist challengers don't threaten training workloads — they threaten the inference decode pool specifically, which is the pool growing fastest and which is where Nvidia's HBM-centric margin has been highest.
The memory — where my KV cache lives
Those tens of GB of cached attention state live in HBM — High Bandwidth Memory — soldered right next to the GPU die.
Why memory is the hidden bottleneck
A modern AI accelerator does one thing overwhelmingly: move data. Arithmetic is fast — a Blackwell GPU delivers >20 petaFLOPS of FP4 per Nvidia's spec sheet — but compute units sit idle unless memory feeds them. The compute-to-bandwidth ratio has been worsening for years (the "memory wall").
The three mechanisms underneath
Three different physical mechanisms span the tiers, and the tradeoff axis is the same one across all of them: trade speed for density and persistence by adopting a less-direct way of holding the bit. SRAM (caches and on-die scratchpad) holds each bit in a 6-transistor cross-coupled flip-flop, fast (~1 ns) and refresh-free for as long as power is supplied — the bottleneck on its size is bit-line length, which is why L1 caches are kept tiny on purpose. DRAM stores each bit as charge on a single tiny capacitor next to a single access transistor (a "1T1C" cell); the capacitor leaks across the closed transistor, so every cell has to be refreshed every ~64 ms or it loses its bit, and reads are differential — the bit line is pre-charged to 0.5 V and a sense amplifier detects a swing of ±0.05 V to recover the value. NAND flash (SSDs) traps electrons inside an 8-nm-thick dielectric well, written in by quantum-tunneling them across the barrier under high voltage; one cell holds 3 bits as one of 8 distinct charge levels (TLC), and the tunneling barrier degrades each time you write, which is the physical source of the SSD endurance limit. HBM is DRAM stacked vertically; on-die SRAM in Groq and Cerebras is the same flip-flop scaled up by a couple orders of magnitude. The physics doesn't change between tiers, just the geometry.
The invention of HBM
HBM emerged from AMD / SK Hynix collaboration starting around 2008–2010. The solution: stack DRAM dies vertically, connected by Through-Silicon Vias. Instead of a narrow horizontal bus, HBM runs a 1024-bit bus vertically. JEDEC standardized HBM in 2013; SK Hynix shipped HBM1 in 2015. Nvidia's adoption in Pascal (P100, 2016) made HBM the AI-accelerator standard.
HBM Generation Comparison
Each generation stacks more dies, wider interfaces, exponentially more bandwidth
| Gen | Year | Stack | Capacity | Bandwidth | Interface |
|---|---|---|---|---|---|
| HBM1 | 2015 | 4-hi | 1 GB | 128 GB/s | 1024-bit |
| HBM2 | 2016 | 8-hi | 8 GB | 256 GB/s | 1024-bit |
| HBM2E | 2020 | 8-hi | 16 GB | 460 GB/s | 1024-bit |
| HBM3 | 2022 | 12-hi | 24 GB | 819 GB/s | 1024-bit |
| HBM3E | 2024 | 12-hi | 36 GB | 1.2 TB/s | 1024-bit |
| HBM4 | 2026 | 16-hi | 48 GB | 1.6+ TB/s | 2048-bit |
How stacking works
TSV — Through-Silicon Via
Copper-filled holes etched through each silicon die, creating vertical electrical pathways between stacked layers. Enables thousands of simultaneous connections.
Micro-bump — Solder interconnect
Tiny solder balls (~40μm) connecting adjacent dies in the stack. Each stack has tens of thousands of micro-bumps.
Base die — Logic / buffer die
Bottom die in the stack that interfaces with the processor. Contains I/O circuits, test logic, and the PHY layer that talks to the GPU over the interposer.
CoWoS — Chip-on-Wafer-on-Substrate
TSMC's advanced packaging: GPU + HBM stacks sit on a shared silicon interposer, then on an organic substrate. The interposer is now a bottleneck itself.
The generation leap
- HBM1 (2015): 4-high stack, 128 GB/s per stack
- HBM2 (2016): 8-high, 256 GB/s
- HBM2E (2020): 8-high, ~460 GB/s
- HBM3 (2022): 12-high, ~819 GB/s
- HBM3E (2024): 12-high, ~1.2 TB/s
- HBM4 (2026): 2048-bit interface, ~2.5 TB/s per stack
Those are sustained numbers, not theoretical peaks, and the mechanism behind that is bank-level parallelism. Every HBM stack is internally divided into many banks (16 channels × multiple banks per channel on HBM3E) running their refresh cycles independently. A bank in refresh is unreadable for a few hundred nanoseconds at a time, so without enough independent banks for the controller to always find a non-refreshing one, the aggregate-bandwidth number is fiction.
Patel's direct shoreline comparison on Dwarkesh: a given ~13mm edge on the die gets you ~2.5 TB/s with HBM4 versus 64–128 GB/s with DDR5, roughly a 20–40× bandwidth-per-edge gap. The shoreline constraint is a physical-design fact; the comparison is Patel's.
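The size of the shoreline gap depends on which end of the DDR5 range you divide by. A short sketch using only the quoted figures:

```python
HBM4_GBPS_PER_EDGE = 2_500          # ~2.5 TB/s over a ~13 mm die edge
DDR5_GBPS_PER_EDGE = (64, 128)      # quoted DDR5 range over the same edge

# Dividing by the high end of the DDR5 range gives the smallest gap, and vice versa.
gap_low = HBM4_GBPS_PER_EDGE / max(DDR5_GBPS_PER_EDGE)
gap_high = HBM4_GBPS_PER_EDGE / min(DDR5_GBPS_PER_EDGE)

print(f"bandwidth-per-edge gap: {gap_low:.1f}x to {gap_high:.1f}x")
```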
The wafer-area problem
HBM consumes ~3–4× the DRAM wafer area per gigabyte compared to standard DDR5 (Tom's Hardware pegs it at ~3×; SemiAnalysis at 3–4× gen-over-gen). Per-die capacity is comparable — a 24Gb HBM3E die is roughly the same bit density as a 24Gb DDR5 die — so the wafer tax is all overhead, and the five factors multiply:
- Larger die for the same Gb count (~1.7×) — HBM dies carry TSV landing pads, test structures for stack characterization, and power/signal distribution infrastructure DDR5 doesn't need. 24Gb HBM3E is ~110 mm² vs ~65 mm² for 24Gb DDR5 (SK Hynix data).
- Stack yield loss (~1.5×) — a 12-high HBM3E stack even with pre-binned known-good-dies lands at 50–70% realized yield during ramp vs 85–90% for DDR5 single-die.
- TSV keep-out zones (~1.07×) — each TSV has a ~20µm radial exclusion zone; modern HBM3E has tens of thousands of TSVs per die.
- Looser design rules + RDL overhead (~1.1×) — timing alignment across the stack demands more-conservative layout rules than pure DRAM.
- Logic die at stack base (~1.15×) — every HBM stack has a logic die for the 1024-bit interface, test, and PHY, fabricated on a separate logic process.
Multiplied: ~3.5×, right in the industry range. A fab line optimized for DRAM can print HBM or DDR5 with the same equipment — but shifting capacity HBM-ward reduces total Gb output by 3–4× on the same wafer starts, which is the mechanism behind consumer DRAM price pressure. HBM's share of total DRAM output has risen sharply since 2023 per TrendForce tracking. Micron discontinued its Crucial consumer brand in early 2026.
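The multiplication can be checked directly. The sketch below takes the five factors from the list above, multiplies them, and cross-checks the first two against the raw die-area and yield figures quoted alongside them; using the midpoints of the yield ranges is my simplification.

```python
import math

# The five overhead factors, as listed above.
FACTORS = {
    "larger die per Gb": 1.7,
    "stack yield loss": 1.5,
    "TSV keep-out zones": 1.07,
    "design rules + RDL": 1.1,
    "logic die at stack base": 1.15,
}

wafer_tax = math.prod(FACTORS.values())
bits_fraction = 1 / wafer_tax            # Gb output on the same wafer starts

# Cross-checks against the raw figures quoted with the first two factors.
die_area_ratio = 110 / 65                # mm^2, 24Gb HBM3E vs 24Gb DDR5
yield_ratio = 0.875 / 0.60               # midpoints of 85-90% and 50-70%

print(f"combined wafer tax: {wafer_tax:.2f}x")
print(f"bits per wafer start vs DDR5: {bits_fraction:.0%}")
print(f"die-area ratio {die_area_ratio:.2f}, yield ratio {yield_ratio:.2f}")
```

The two largest factors carry most of the result, and both check out against their source figures to within rounding.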
Patel's consumer-volume framing — iPhone DRAM cost rising from $50 to $150 per device, smartphone volumes falling from 1.4B toward 500–600M per year — is his forecast. No single primary source confirms all three data points; the directional thesis (HBM crowding out consumer DRAM supply) is corroborated by memory-vendor commentary.
The 2026 capex share
Patel's "30% of Big Tech's 2026 capex going to memory" figure is an analyst estimate on hyperscaler capex decomposition. The public hyperscaler capex guidance for 2026 totals ~$600B across the Big-4; the 30%-to-memory apportionment is Patel's.
Nvidia's disclosed figure of "$90B+ in long-term supply commitments" has appeared in multiple earnings calls.
The CoWoS bottleneck
TSMC's CoWoS advanced-packaging technology places GPU die and HBM stacks side-by-side on a large silicon interposer. CoWoS capacity has been tracked by SemiAnalysis, SemiWiki, and Digitimes. The commonly cited progression: ~15K wafers/month in 2023, ~40K by 2025, targeting 90–110K wpm by 2026. Nvidia is reported to have reserved >50% of TSMC's CoWoS allocation for 2026–27.
The competitive landscape
SK Hynix is the HBM leader. TrendForce's Q2 2025 HBM share estimate was ~62%. First to HBM3, first to HBM3E, leading HBM4 development. Primary Nvidia memory supplier.
Samsung has struggled on HBM3E qualification; yields reportedly lagged SK Hynix by ~6 months. Samsung is the only supplier with both DRAM manufacturing and advanced foundry — potentially material for HBM4's tighter logic-memory integration.
Micron is #2 at ~21% HBM share per TrendForce, ahead of Samsung (SK Hynix's ~62% leaves Samsung roughly 17%). Micron has reportedly leapfrogged Samsung on HBM3E via TSV power-delivery innovation. CHIPS Act–funded US capacity.
The hierarchy isn't anachronism, it's physics-fit
The cache tiers extend below DDR. The frontier-lab inference stack reportedly parks its hour-class KV cache on flash and spinning disk, which sounds anachronistic until you do the physics: an HDD's read/write head flies ~15 nm above a 7,200 rpm magnetic platter (a human hair is 50,000–100,000 nm wide), the magnetic grains hold their polarity for decades with no refresh tax, and mechanical wear in the actuator and spindle bearings is the only failure mode. When the cache tier's drain time is hours, ~10 ms of seek penalty is rounding error against a per-GB cost advantage that flash and DDR can't match. Each tier in the hierarchy is deployed where its physics suits it; HDDs survive in the 2026 inference stack because of that fit, not despite it.
The foundry — from sand to silicon
That GPU die and those HBM stacks were manufactured somewhere. The "somewhere" is almost certainly TSMC.
How a chip is actually made
A modern processor starts as silicon dioxide. Through ~1,000 process steps spanning 3–4 months, that sand becomes the most complex object humanity manufactures.
Step 1: Crystal growth. Silicon purified to 11-nines via the Siemens process, melted at 1,414°C. A seed crystal is pulled from the melt (Czochralski method). Over 24+ hours a cylindrical ingot up to 200kg forms. Sliced into 300mm wafers polished to atomic smoothness.
Step 2: The wafer fab. Each wafer cycles through deposit → photoresist → lithography → develop → etch → strip → inspect, dozens of times. Class 1 cleanrooms have fewer than one particle per cubic foot larger than 0.5μm.
Step 3: Back-end. Wafer probe → dicing → packaging → final test.
The process node naming game
Process node names stopped corresponding to physical dimensions around 2010. TSMC's N3 achieves ~290M transistors/mm² per published test-chip density figures.
Transistor architecture
Transistor Architecture Evolution
How the gate gained control — from one side to all sides
| Architecture | Gate wrap | How it works | Where used |
|---|---|---|---|
| Planar | 1 side | Gate sits on top of a flat channel. Current flows in a 2D plane beneath the gate. At ~22nm, leakage current made further scaling impractical: electrons tunneled through the gate oxide. | Every chip made before 2012 |
| FinFET | 3 sides | Channel rises as a vertical "fin" and the gate wraps around three sides. Invented by Chenming Hu (UC Berkeley, 1999). Intel shipped the first production FinFET at 22nm (Ivy Bridge, 2012). Dominated for over a decade. | Intel 22nm → TSMC 5nm/3nm |
| GAA nanosheet | 4 sides | Gate wraps completely around horizontal nanosheets (thin channels stacked vertically). Better electrostatic control means less leakage at smaller dimensions. Samsung shipped first production GAA (3nm, 2022). TSMC follows at N2 (2025). | Samsung 3nm GAA → TSMC N2 |
| CFET | 4 sides × 2 stacked | Complementary FET: stacks NMOS on top of PMOS vertically, two transistors in the footprint of one. Still in R&D. Could enable continued scaling beyond GAA limits. | Research / early development |
Planar MOSFETs (1960s–2011) worked until below ~22nm, where gate control broke.
FinFET (2012–2024) raises the channel into a vertical fin the gate wraps on three sides. Invented by Chenming Hu at UC Berkeley in 1999. Intel shipped first at 22nm (Ivy Bridge, 2012).
Gate-All-Around / Nanosheet (2024+) stacks horizontal nanosheets the gate wraps on all four sides. Samsung shipped first at 3nm in 2022 (poor yields initially). TSMC N2 in 2025.
CFET (~2028+) stacks NMOS directly on PMOS — two transistors in one footprint. R&D stage.
The foundry landscape
TSMC (~67% of pure-play foundry revenue) is the dominant force. Founded 1987 by Morris Chang.
The customer-mix shift matters: Patel's framing on Dwarkesh is that Nvidia becomes the majority customer on TSMC N3 by 2027, displacing Apple from its traditional first-customer position, and that A16 (a 2nm-family node) will have an AI customer, not Apple, as first customer. This is Patel's reading of TSMC's customer allocation; TSMC does not disclose customer mix by node.
Samsung Foundry has been struggling. SF2 yield figures (~40% as of Q1 2026) come from third-party trackers, not Samsung disclosures. Tesla's $16.5B Samsung foundry contract (announced July 2025) is public.
Intel Foundry is the wildcard. Intel has publicly disclosed 18A progress and target yield improvements; specific Q1 2026 yield figures are analyst estimates. PowerVia (backside power delivery) is Intel's approach for 18A.
The geopolitics you can't ignore
TSMC manufactures a large majority of the world's leading-edge chips on Taiwan. The exact share varies by node — "90% of most-advanced" is a commonly-cited framing, though less-advanced nodes are more distributed.
The CHIPS Act allocated $52.7B to incentivize domestic US fab capacity. TSMC is building fabs in Arizona. Samsung in Taylor, Texas. Intel in Ohio, Arizona, Oregon. Production timelines for the most advanced nodes extend into 2028–2030.
The Huawei / SMIC story: SemiAnalysis's Huawei Ascend deep dive (Sept 2025) reports that HBM, not logic, is the binding China constraint. SMIC 7nm via DUV multi-patterning is producing at reported 45K wpm (2025).
EUV lithography — the machine that made the machine
Every advanced chip in the stack — GPU, HBM, networking ASIC — was patterned using extreme ultraviolet light from an ASML machine.
A 60-year journey to print with light
Every advanced chip begins the same way: light is projected through a pattern onto silicon; exposed areas are chemically etched. The history of semiconductor scaling is largely the history of shorter wavelengths.
Early days (1960s–1980s): contact → proximity → projection lithography.
Mercury arc lamps through the late 1980s (g-line 436nm, i-line 365nm).
DUV revolution (1990s). KrF excimer lasers at 248nm, then ArF at 193nm.
Immersion lithography (2000s). Water between lens and wafer (refractive index 1.44 at 193nm) effectively shrinks wavelength to 134nm. TSMC's Lin Burn-jeng championed the technique. Combined with multiple patterning (2–4 exposures per layer), immersion 193nm carried the industry to 7nm — but by then a single metal layer needed four exposures, and mask sets cost $10–15M.
The 40-year EUV saga
The concept dates to the 1980s at Bell Labs and LLNL. At 13.5nm, you can print in one exposure what DUV needed four passes for. The problem is that 13.5nm light is absorbed by essentially any material — absorption length is under a micron in silicon, glass, or anything a conventional lens could be made of. You can't use lenses at all; the entire optical path has to be reflection-based, and "reflection" at 13.5nm isn't bulk reflection but Bragg diffraction off a periodic multilayer. Even the best multilayers reflect only ~70% per bounce. With 11 mirrors in the optical path, total throughput is roughly 0.7¹¹ ≈ 2% — 98% of source light is lost before it hits resist.
Mirrors. Each EUV mirror is a stack of roughly 50 bilayers of alternating Mo (~2.8 nm) and Si (~4.1 nm), each layer deposited to sub-Ångström tolerance. At 13.5 nm Mo has high refractive-index contrast relative to Si but low absorption; Si acts as a low-loss spacer. The Bragg condition 2d cos(θ) = mλ, with bilayer period d ≈ 7 nm and θ measured from the surface normal, matches 13.5 nm at near-normal incidence. The theoretical ceiling for Mo/Si at 13.5 nm is ~75% reflectivity, set by residual absorption and interdiffusion (Mo₂Si / MoSi₂ compounds form at interfaces during deposition); realized tools land at ~70% after B₄C diffusion barriers. The 70% isn't engineering laziness; it's near a physical ceiling. Beryllium-based alternatives would push reflectivity a few points higher but are toxic; nothing has shipped. The surfaces themselves are flat to ~0.02 nm RMS, smaller than a single atomic diameter. Carl Zeiss SMT in Oberkochen is the only supplier.
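The two mirror numbers above are easy to check by hand. A minimal sketch, assuming identical reflectivity at every bounce and ignoring refraction inside the stack (both simplifications, and the function names are just illustrative):

```python
import math


def optical_throughput(reflectivity: float, n_mirrors: int) -> float:
    """Fraction of light that survives n_mirrors reflections in series."""
    return reflectivity ** n_mirrors


def bragg_wavelength_nm(period_nm: float, angle_from_normal_deg: float = 0.0,
                        order: int = 1) -> float:
    """Wavelength satisfying 2 d cos(theta) = m * lambda.

    Refraction inside the multilayer is ignored, so this is a first-order estimate.
    """
    theta = math.radians(angle_from_normal_deg)
    return 2.0 * period_nm * math.cos(theta) / order


throughput_70 = optical_throughput(0.70, 11)   # realized mirrors
throughput_75 = optical_throughput(0.75, 11)   # theoretical Mo/Si ceiling
print(f"11 mirrors at 70%: {throughput_70:.1%}")
print(f"11 mirrors at 75%: {throughput_75:.1%}")
print(f"Bragg peak for d = 7.00 nm: {bragg_wavelength_nm(7.00):.2f} nm")
print(f"Bragg peak for d = 6.75 nm: {bragg_wavelength_nm(6.75):.2f} nm")
```

Even at the theoretical 75% ceiling, eleven bounces would still leave only about 4% of the light, which is why per-mirror reflectivity matters so much. The simple formula puts d = 7 nm at 14 nm, so treat "d ≈ 7 nm" as a rounded figure.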
Light source. Fire a high-power CO₂ laser at tin droplets falling through a vacuum chamber at ~50,000 drops/second. Each droplet is hit twice: a pre-pulse flattens the spherical droplet into a low-density disk, then a main pulse vaporizes the flat target into a ~500,000°C plasma that emits at 13.5 nm. Two questions the physics answers: why tin, and why the double pulse? Tin's 4d electrons in high ionization states (Sn⁶⁺ through Sn¹⁴⁺) produce a dense unresolved transition array centered on 13.5 nm; no other element has this property at this wavelength. A solid droplet hit with a single pulse forms a plasma dense enough to reabsorb its own EUV; the pre-pulse expansion into a pancake lets the plasma stay optically thin long enough to emit outward. Conversion efficiency from laser light to 13.5 nm EUV is ~3% in production (4–5% in best-case single-droplet experiments). End-to-end efficiency from wall plug to photons at the wafer is roughly 7 × 10⁻⁵: CO₂ laser wall-plug (~12%) × plasma conversion (~3%) × collector + 11 mirrors (~2%). On those factors, a 1 MW grid draw produces at most tens of watts of EUV at the wafer. The ASML/Trumpf "1000W source" milestone is specifically about the CO₂ drive laser crossing the threshold for 200+ wafers/hour throughput.
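As a sanity check on the energy chain, here is a small sketch that multiplies the three quoted stage efficiencies. The stage names are illustrative labels, and the factors are the rounded figures from the paragraph above, so read the result as an upper-bound estimate that ignores any loss not listed there:

```python
import math

# Rounded stage efficiencies quoted in the text above.
STAGES = {
    "co2_laser_wall_plug": 0.12,
    "plasma_conversion": 0.03,
    "collector_and_11_mirrors": 0.02,
}


def end_to_end_efficiency(stages: dict) -> float:
    """Product of all stage efficiencies, wall plug to wafer."""
    return math.prod(stages.values())


def euv_watts_at_wafer(grid_watts: float, stages: dict = STAGES) -> float:
    """EUV power reaching the wafer for a given grid draw."""
    return grid_watts * end_to_end_efficiency(stages)


efficiency = end_to_end_efficiency(STAGES)
print(f"end-to-end efficiency: {efficiency:.1e}")
print(f"EUV at wafer for 1 MW of grid power: {euv_watts_at_wafer(1_000_000):.0f} W")
```

The product lands in the 10⁻⁵ range: for every watt of EUV that reaches the resist, more than ten kilowatts are drawn from the grid.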
Masks. EUV masks are reflective (not transmissive). Single-particle contamination can print defects across every wafer.
Failed consortia and ASML's bet. Through the 1990s–2000s, multiple consortia (EUV LLC — Intel, AMD, Motorola, etc.) pursued EUV and failed to commercialize. Nikon and Canon attempted and abandoned their own programs. ASML bet the company on EUV. They acquired Cymer (DUV light-source leader) in 2013 for $3.7B. Intel, TSMC, and Samsung each invested $1–4B directly into ASML to fund development — an unusual customer-funds-supplier arrangement.
First production (2019). ASML shipped NXE:3400B after roughly $10B in total R&D. Tools cost ~$150M each at launch.
Today. ASML's NXE:3800E lists in the ~$380M range, processes ~200 wafers per hour, weighs ~180 tons, ships in ~40 containers, contains 100,000+ parts from ~800 suppliers.
High-NA EUV
ASML's EXE:5000 raises numerical aperture from 0.33 to 0.55. First tools shipped to Intel and TSMC for R&D in 2024. Minimum feature size reduces ~1.7×; field size halves. High-volume manufacturing ramps 2026–2027.
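The ~1.7× figure follows from the Rayleigh criterion, CD = k1 × λ / NA. A minimal sketch, where the k1 value is an illustrative assumption (it cancels out of the ratio):

```python
WAVELENGTH_NM = 13.5
K1 = 0.30  # assumed process factor, for illustration only


def min_feature_nm(numerical_aperture: float, k1: float = K1,
                   wavelength_nm: float = WAVELENGTH_NM) -> float:
    """Rayleigh criterion: smallest printable half-pitch."""
    return k1 * wavelength_nm / numerical_aperture


low_na = min_feature_nm(0.33)
high_na = min_feature_nm(0.55)
shrink = low_na / high_na  # equals 0.55 / 0.33 regardless of k1

print(f"0.33 NA: {low_na:.1f} nm, 0.55 NA: {high_na:.1f} nm, shrink: {shrink:.2f}x")
```

Only the ratio is grounded in the text. The absolute nanometer values depend entirely on the assumed k1.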
The trilateral monopoly
Three companies. One machine. Zero alternatives.

| Company | Role | What it contributes |
|---|---|---|
| Trumpf | EUV laser source | 50 kW CO2 laser fires 50,000 pulses/sec at tin droplets |
| Carl Zeiss SMT | EUV optics | Multilayer Mo/Si mirrors, the flattest surfaces ever made (< 1 atom deviation) |
| ASML | System integration | Assembles 100,000+ parts into a tool the size of a bus, ~$380M each |

| Tools/year | Cost per tool | Weight | Parts | Mirrors | Wavelength |
|---|---|---|---|---|---|
| ~70 | $350-400M | ~180 tons | 100,000+ | 11 (~2% total reflectivity) | 13.5 nm |
The hard ceiling math
ASML discloses annual EUV tool shipment cadence in its earnings materials and investor day presentations. The trajectory Patel cites on Dwarkesh — ~70 tools in 2026, climbing toward ~100 by end of decade under aggressive expansion, for a cumulative fleet on the order of several hundred by 2030 — is broadly consistent with ASML's own guidance but the specific numbers are his analyst extrapolation.
The implied ceiling ("~200 GW/yr of AI chips if 100% of EUV went to AI") uses his AI Accelerator Model's "3.5 EUV tools per GW of Rubin" figure. Working the forward math: each GW needs ~55K 3nm wafers + ~6K 5nm wafers + ~170K DRAM wafers. At ~15 EUV layers, the logic wafers alone are ~0.9M EUV layer-passes; adding DRAM's EUV layers brings the total to roughly 1.4–2.5M passes per GW, depending on how DRAM layers and multi-pass alignment steps are counted. Divided by tool productivity (~100 wph effective × 70% utilization × 8,000 hrs/year ≈ 560K passes/tool/year), that pencils at 2.5–4.5 tools/GW. Patel lands at 3.5, the midpoint. The downstream ceiling (700 tools / 3.5 tools/GW = 200 GW/year) is sensitive to that input: at 2.5 the ceiling is ~280 GW/year (+40%), at 4.5 it's ~155 (about −22%). All of the inputs are defensible individually (55K wafers, 15 EUV layers, 200 wph rated throughput derated to ~100 wph effective), but the multiplication compounds uncertainty. The direction is robust (leading-edge manufacturing throughput is ASML-gated), but "200 GW/year" is the single highest-sensitivity number in the whole thesis, and worth stress-testing against your own inputs before sizing a position off it.
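For stress-testing with your own inputs, the ceiling math reduces to a few lines. This sketch uses only the figures quoted above. The DRAM EUV layer count is not stated, so rather than guess it, the code computes the logic-only passes and the total pass count each tools/GW figure would imply:

```python
# Inputs quoted in the text above.
LOGIC_WAFERS_PER_GW = 55_000 + 6_000   # 3nm + 5nm wafers
EUV_LAYERS_LOGIC = 15
EFFECTIVE_WPH = 100
UTILIZATION = 0.70
HOURS_PER_YEAR = 8_000
FLEET_TOOLS = 700

PASSES_PER_TOOL_YEAR = EFFECTIVE_WPH * UTILIZATION * HOURS_PER_YEAR
logic_passes_per_gw = LOGIC_WAFERS_PER_GW * EUV_LAYERS_LOGIC


def ceiling_gw_per_year(tools_per_gw: float, fleet: int = FLEET_TOOLS) -> float:
    """GW/year of chips the EUV fleet could pattern at a given tools/GW."""
    return fleet / tools_per_gw


def implied_passes_per_gw(tools_per_gw: float) -> float:
    """Total EUV layer-passes per GW that a tools/GW figure assumes."""
    return tools_per_gw * PASSES_PER_TOOL_YEAR


print(f"passes per tool-year: {PASSES_PER_TOOL_YEAR:,.0f}")
print(f"logic-only passes per GW: {logic_passes_per_gw:,}")
for tools in (2.5, 3.5, 4.5):
    print(f"{tools} tools/GW -> {ceiling_gw_per_year(tools):.0f} GW/yr, "
          f"implies {implied_passes_per_gw(tools) / 1e6:.2f}M passes/GW")
```

Changing `EFFECTIVE_WPH` or `UTILIZATION` moves every downstream number proportionally, which is the compounding-uncertainty point in miniature.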
ASML trades at roughly 30× forward earnings depending on the day. The pricing-discipline framing — that ASML has historically not raised prices faster than it raised tool capability — is Patel's characterization; ASML's margin evolution is public in its financials. The durable insight is that ASML's production capacity is the hard ceiling on leading-edge chip manufacturing globally, with no substitute and no second source. That part is structural, not forecasted.
The margin stack — where the dollar actually goes
Every layer has someone taking a cut. Knowing where the cut gets taken is the whole investment thesis.
Margin Stack
Where the dollar goes at each layer. Tightest red at the top.
| Layer | Names | GM | Pricing power | Scale | Stance |
|---|---|---|---|---|---|
| EUV tools | ASML | ~50% | Self-restrained | Hard-capped | Terminal bottleneck 2028–30 |
| Leading-edge foundry | TSMC | 50%+ | Rising | Wafer-constrained | Core long, 2026–28 |
| HBM memory | SK Hynix · MU · Samsung | 50%+ rising | "Double or triple again" | 4× wafer-area drag | Core long |
| Advanced packaging | TSMC CoWoS · Amkor · SPIL | 40%+ | NVDA reserves >50% 26–27 | Capacity-bound | Embedded in NVDA / TSMC |
| GPU / accelerator silicon | NVIDIA | ~75% | $90B LT contracts | CoWoS-capped | Core long |
| Custom silicon | Broadcom TPU · MTIA | 60%+ | Rising | TSMC-capped | Core long |
| Optical transceivers | COHR · LITE · FN | 30–35% | Consumable (replug cycle) | Superlinear with clusters | Satellite long |
| Power / cooling | VRT · ETN | 30–35% | Liquid cooling mandatory | Solved-ish | Trim: best gains behind |
| Neoclouds (Platinum tier) | CoreWeave | 45–55% | Up 40% in 6 mo | Utilization-dependent | Selective |
| Frontier labs | OpenAI · Anthropic · GDM | 40–46% blended | Rising on Opus tier | Spending 2–3× revenue on capex | Private: access via partners |
| Token resellers | Together · Fireworks · Bedrock | 10–20% | Commoditizing | Easy entry | Avoid |
A common misconception worth correcting: the 70%+ gross-margin figures circulating for frontier labs are usually compute-only marginal margins on paid API, not fully-loaded. On a Sonnet-class token the marginal compute cost is ~$5/M blended against $15/M billed revenue, and ($15 − $5) / $15 is where "~67%, rounded to 70%" comes from. The fully-loaded number subtracts several costs the marginal calculation ignores: (1) the free tier, which has compute cost and no revenue; (2) consumer subscription economics, where ChatGPT/Claude.ai power users can cost 4–10× their $20/month subscription; (3) the ~30% revenue share Anthropic pays Bedrock/Vertex/Foundry for partner-channel capacity, which per Patel effectively looks like a 50% markup that Anthropic has to eat on compute bought late at spot; and (4) spot-compute premiums when growth spikes outrun reserved capacity. Subtract all of that and the blended GM lands in the 40–50% range per The Information's reporting on leaked Anthropic funding docs. Patel's April 2026 assertion of a 72% floor for Anthropic inference GM is upper-bound analyst math with specific assumptions (that all incremental compute went to inference, not research), not a direct disclosure.
Both numbers are real — they measure different scopes. The gap between them, from ~70% paid-API-marginal down to ~40–50% fully-loaded, is exactly how much margin is flowing down the stack to Nvidia, SK Hynix, TSMC, and the partner-channel clouds rather than staying at the lab. If labs aren't printing as much margin as SV pitch decks assume, margin is flowing further down the stack — which is exactly where capex is going.
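The gap between the two scopes can be put in cents per revenue dollar. A minimal sketch using only the figures above; it does not try to split the gap across the four cost items, since the text gives no per-item numbers:

```python
def gross_margin(revenue: float, cost: float) -> float:
    """Gross margin as a fraction of revenue."""
    return (revenue - cost) / revenue


# Paid-API marginal view: $15/M billed against ~$5/M compute.
marginal_gm = gross_margin(revenue=15.0, cost=5.0)


def margin_gap(fully_loaded_gm: float, marginal: float = marginal_gm) -> float:
    """Share of each revenue dollar absorbed between the two scopes."""
    return marginal - fully_loaded_gm


print(f"marginal GM: {marginal_gm:.1%}")
for loaded in (0.40, 0.50):
    print(f"fully-loaded {loaded:.0%} -> {margin_gap(loaded) * 100:.0f} cents "
          f"per revenue dollar absorbed by the four cost items")
```

On these inputs, roughly 17 to 27 cents of every revenue dollar separate the marginal view from the fully-loaded one.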
Catalysts to watch
These are the thresholds that confirm or break the thesis. I track them every quarter.
Bull confirmations
Size up when these trigger
- 01 Anthropic hits 5+ GW capacity by EOY 2026
- 02 TSMC CoWoS capacity reaches 90K+ wpm in 2026
- 03 ASML order book extends through 2028+ with High-NA bookings
- 04 SK Hynix / Micron HBM pricing doubles in 2026 contract cycle
- 05 H100 1-yr rental continues rising past $2.35/hr
- 06 Claude Code share of GitHub commits crosses 15% in 2026
- 07 METR task-horizon doubling stays under 6 months
- 08 Anthropic monthly ARR additions stay above $4B
Bear signals
Size down when these trigger
- 01 H100 rental prices fall for two consecutive quarters
- 02 Anthropic monthly ARR growth decelerates <10% MoM for 3 months
- 03 METR task-horizon doubling stalls past 12 months
- 04 TurboQuant-scale algo efficiency breakthrough ships in production
- 05 Huawei 920 / 930 gains material traction outside China
- 06 Hyperscaler 2027 capex guidance below $600B
- 07 TPU v8 benchmarks show >40% TCO advantage over Rubin at matched latency
- 08 Any Mag 7 CEO shifts public language toward 'overspending'
What's NOT an exit signal: quarter-to-quarter volatility on 3–5 year theses; one bearish research report; consensus agreeing with the trade. The failure mode is selling conviction to noise.
Live tracker
Here's the index in real time, weighted by the allocation thesis:
[Interactive chart: Compute Index live tracker]
The allocation
Here's how I'm thinking about deploying capital across the stack. Weighted by bottleneck tightness, asymmetry, and time horizon.
| Layer | Name | Weight | Rationale |
|---|---|---|---|
| EUV | ASML | 20% | Ultimate chokepoint by 2028. Only company on Earth that makes EUV tools. ~70 tools/yr is a hard production ceiling, and every advanced chip on the planet flows through it. |
| Foundry | TSM | 15% | 67% market share, $52-56B capex. The pure-play foundry model Morris Chang invented in 1987 is now the most critical node in the global supply chain. |
| Foundry | Samsung | 10% | Contrarian diversification bet. $73B investment. 2nm GAA yields reported anywhere from ~40% to 55-60% depending on the third-party tracker. Tesla $16.5B deal. Hyperscalers must diversify away from TSMC concentration. |
| Memory | MU / SNDK | 15% | HBM consumes 4x the wafer area per GB versus standard DRAM. The memory crunch is accelerating: prices doubling, multi-year contracts locked. SA's 816% SNDK increase is the signal. |
| Chip design | NVDA | 14% | 20 years of CUDA ecosystem lock-in. From gaming GPUs to tensor cores — Nvidia's architecture evolution (Volta→Hopper→Blackwell→Rubin) defines the AI compute frontier. |
| Chip design | AVGO | 7% | Every hyperscaler designing custom silicon (TPUs, Trainium, etc.) routes through Broadcom. Indirect play on the custom ASIC wave. |
| Infra | CRWV / ORCL | 9% | GPU cloud. CoreWeave 98% on 3+yr contracts — assets appreciate as inference demand grows. Oracle has massive OpenAI backend exposure. |
| Networking | COHR / LITE | 7% | Optical interconnect scales non-linearly with cluster size. Copper fails above rack-scale — coherent optics is the only path to GW-class data centers. |
| Optionality | INTC calls | 3% | Cheap optionality on 18A turnaround. If yields work, Intel becomes the third viable advanced foundry. Binary outcome — options structure defines the loss. |
Key allocation principles
Weight toward the moving bottleneck. ~45% in the semiconductor layer (ASML + TSMC + Samsung) because that's where the constraint is tightening. ~15% in memory. ~21% in chip designers. ~16% in infrastructure and networking. ~3% in optionality.
Match time horizon to bottleneck timing. Neocloud positions are shorter-duration (2026 plays). Memory and foundry are 2026–28 plays. ASML is the longest-duration hold — the constraint that tightens last but matters most.
Use options where conviction is binary. Intel calls are a turnaround-or-bust bet. Defined loss, asymmetric upside.
Accept correlation. This portfolio is ~0.6–0.7 correlated across names — all one macro bet on AI compute demand. More names ≠ more diversification. Size accordingly.
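The "more names ≠ more diversification" point has a simple closed form. A sketch under loudly simplified assumptions: equal weights, equal single-name volatility, and one uniform pairwise correlation. None of these hold exactly for the actual book, so the output is an illustration rather than a risk estimate:

```python
import math


def portfolio_vol_ratio(n_names: int, avg_corr: float) -> float:
    """Portfolio volatility as a fraction of single-name volatility.

    Assumes equal weights, equal vols, and a uniform pairwise correlation:
    var_p / var = 1/N + (1 - 1/N) * rho
    """
    return math.sqrt(1.0 / n_names + (1.0 - 1.0 / n_names) * avg_corr)


RHO = 0.65  # midpoint of the ~0.6-0.7 range quoted above

for n in (1, 5, 9, 20, 100):
    print(f"{n:>3} names: {portfolio_vol_ratio(n, RHO):.3f} x single-name vol")
print(f"floor as N grows: {math.sqrt(RHO):.3f}")
```

At this correlation, volatility floors at about 0.81× a single name no matter how many positions are added. Nine names already capture almost all of the available reduction, so gross sizing is the only lever left.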
Benchmarking against Situational Awareness LP
Leopold Aschenbrenner's fund's most recent 13F filing shows a concentrated AI-infra book.
| Position | $ Value | Type | Layer |
|---|---|---|---|
| CRWV | $1.21B | Calls + equity | Infra |
| BE | $911M | Equity + calls | Power |
| INTC | $747M | Calls only | Foundry |
| LITE | $479M | Equity | Networking |
| CORZ | $419M | Equity (9.4% stake) | BTC→AI |
| IREN | $329M | Equity | BTC→AI |
| APLD | $278M | Equity | Neocloud |
| SNDK | $250M | Equity (+816%) | Memory |
| EQT | $171M | Equity + calls | Power |
| CIFR | $155M | Equity | BTC→AI |
| COHR | $89M | Equity | Networking |
| + 18 others | ~$500M | Mixed | Various |
Where we agree
Physical infrastructure over algorithms. SA's disclosed 13F shows no model-company equity and no pure software. Every dollar is in the physical substrate.
Memory is accelerating. SA's disclosed SanDisk position change (a large increase) is consistent with the memory-tightening thesis.
Optical networking is underappreciated. Lumentum and Coherent are material SA positions.
The Intel structure is instructive. SA's disclosed Intel position is held through call options (calls only, per the table above), a capital-efficient binary-bet structure.
Where we diverge
SA has no ASML, no TSMC, no Samsung Foundry. The biggest gap. Our framework allocates ~45% to the semiconductor layer because the 2028–30 ASML window is specifically where the constraint binds hardest.
SA has no Nvidia. Our read: Nvidia's disclosed $90B+ long-term contracts ring-fence near-term margin. Multiple compresses before revenue does.
SA is heavily weighted to neoclouds and BTC-miner AI pivots. CoreWeave, Core Scientific, IREN, Applied Digital, Cipher.
The leverage. SA's gross exposure relative to its capital implies substantial leverage; 13F data plus public fund disclosures support the commonly cited "~$5B gross on ~$380M AUM" framing, which works out to roughly 13× gross exposure per dollar of capital. That makes the book structurally fragile to timing errors. The unlevered version of the same thesis trades maximum upside for structural durability.
Updated adjustments based on SA crossover
Memory: added SNDK alongside SK Hynix / MU. SA's SanDisk increase is the signal.
Networking: increased LITE / COHR to 7%. SA positioning + the superlinear cluster-scaling math.
No power exposure in core. Patel's March 2026 Dwarkesh framing is that US power is no longer the binding constraint — behind-the-meter gas + peaker-displacement can unlock a non-trivial share of US grid capacity. Power moved from bottleneck to cost.
Questions I keep asking
Adapted from Thiel's four monopoly sources. I track these quarterly.
What breaks the thesis
Demand collapse. If AI models stop improving or adoption stalls, the capex cycle looks like overbuilding. The strongest counter today: H100 rental prices rose rather than fell over the last two quarters (SemiAnalysis tracker), and Anthropic's reported ARR trajectory continues to accelerate per multiple trade-press outlets. If supply were overbuilt, prices would be falling. That said, the dot-com 2.0 scenario is always possible. Base-rate estimate: ~10–15% probability over 5 years.
Algorithmic efficiency. The specific risk: breakthroughs that reduce HBM demand without being absorbed by larger models. Google's TurboQuant (Feb 2026) demonstrated 2–3× KV compression via random rotation + Gaussian codebooks, and memory stocks reacted on publication. DeepSeek's MLA already compresses KV ~14× structurally per the DeepSeek-V2 paper. Mamba and other state-space models keep a fixed-size state, so inference memory is O(1) in sequence length instead of growing with every token as a KV cache does. Counter-argument: every efficiency gain of 2022–2026 (PagedAttention, continuous batching, FP8, FP4, MLA, MTP, prompt caching, disaggregated serving) has been absorbed by demand growth. Result: ~1000× tokens-per-dollar improvement and rising rental prices. Leading signal to watch: HBM3E production guidance. If SK Hynix, Micron, and Samsung start cutting, the efficiency-absorption pattern has broken.
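To see why KV compression maps directly onto HBM demand, here is a rough sizing sketch. The model dimensions are hypothetical, chosen only for illustration, and do not describe any specific production model. The compression ratios from the paragraph above are applied uniformly for comparison:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """KV-cache bytes stored per token.

    Factor of 2: one key vector and one value vector per layer per KV head.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Hypothetical model shape (assumption, not a real model's config).
HYPOTHETICAL_MODEL = dict(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)
CONTEXT_TOKENS = 50_000  # same order as the request traced at the top of the post

baseline_bytes = kv_bytes_per_token(**HYPOTHETICAL_MODEL) * CONTEXT_TOKENS

for label, ratio in [("uncompressed", 1), ("2x", 2), ("3x", 3), ("14x", 14)]:
    print(f"{label:>12}: {baseline_bytes / ratio / 1e9:.2f} GB of HBM per request")
```

The cache grows linearly with context and with concurrent requests, so a 2–3× compression frees HBM that, per the counter-argument above, has so far been refilled by longer contexts and larger batches.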
Taiwan. A blockade or attack on TSMC would be catastrophic. Patel's Dwarkesh estimate of the post-event capacity world (on the order of "10–20 GW across Intel and Samsung") is a scenario, not a forecast. TSMC Arizona is a partial hedge but won't scale this decade. The only real protection is position sizing: size TSMC to what you can afford to lose.
Regulatory. Antitrust on Nvidia (low but not zero); export controls expanding to allies (moderate probability); CHIPS Act reversal (low, bipartisan). US government currently encourages the buildout rather than constraining it.
Public backlash. Patel flagged this explicitly on Invest Like The Best (April 2026), predicting "large-scale protests against AI" within three months. His characterization of specific incidents (Sam Altman's home being targeted) should not be repeated as fact without independent corroboration; I have not verified those incidents against credible security reporting. The broader polling point — that AI adoption sentiment has declined — is corroborated in Pew and Edelman trust-barometer data. The mechanism to watch: a major political figure weaponizing AI backlash ahead of the 2026 midterms. If that happens, the regulatory risk node above goes from low-probability to live.
Concentration risk — the real one. This portfolio is estimated at ~0.6–0.7 correlated across names. All one macro bet. Adding more AI names ≠ diversification. Size the portfolio itself, not just the individual positions. My weighted-bear-probability estimate stacking the risks above is ~35% over 5 years. Base-case weighted expected return ~30% after risk adjustment. Real alpha comes from timing the rotation and the asymmetric bets, not the core positions.
This is a research synthesis, not financial advice. Size positions to what you can afford to lose.