A visual journey through how large language models process your words — from text to tokens to GPU cores and back.
You type a prompt and hit enter. Your text travels as UTF-8 encoded bytes through an API to a server rack housing GPUs. The raw string is just the beginning — it's about to be decomposed into something the model can actually process.
A tokenizer (like BPE) splits your text into subword pieces. "networks" becomes "net" + "works". Each token maps to an integer ID from a fixed vocabulary of ~100,000 entries. This is the model's alphabet — not letters, but meaning-carrying fragments.
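A minimal sketch of the idea, using a made-up five-entry vocabulary and greedy longest-match lookup (real BPE learns its merge rules from data, and the token IDs here are invented):

```python
# Toy greedy longest-match tokenizer over a tiny hypothetical vocabulary.
# Real BPE tokenizers learn merges from a corpus; these entries are made up.
VOCAB = {"net": 1001, "works": 1002, "work": 1003, "s": 1004, "neural": 1005}

def tokenize(text: str) -> list[int]:
    ids = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids

print(tokenize("networks"))  # "net" + "works" → [1001, 1002]
```

Production tokenizers are far more involved (byte fallback, learned merge ranks), but the output is the same shape: a list of integer IDs.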
Each token ID is looked up in an embedding table — a massive matrix stored in VRAM. Token 2585 ("How") becomes a vector of 4,096 floating-point numbers. These vectors encode semantic meaning: similar words cluster together in this high-dimensional space.
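The lookup itself is just row indexing, which a toy table makes concrete (sizes shrunk from 100,000 × 4,096 to something printable; the values are random stand-ins, not real embeddings):

```python
# Embedding lookup is a row index into a (vocab_size, d_model) matrix.
# Tiny illustrative numbers; a real model uses ~100,000 × 4,096 floats.
import random

random.seed(0)
vocab_size, d_model = 8, 4           # stand-ins for 100,000 and 4,096
embedding_table = [[random.uniform(-1, 1) for _ in range(d_model)]
                   for _ in range(vocab_size)]

def embed(token_ids):
    # Each ID selects one row: no arithmetic, just a memory read from the table.
    return [embedding_table[t] for t in token_ids]

vectors = embed([2, 5, 2])
print(len(vectors), len(vectors[0]))  # 3 4 — three tokens, four dims each
```

Note that the same ID always retrieves the same row; context only enters later, in the attention layers.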
A 70B parameter model needs ~140 GB at 16-bit (fp16) precision, and double that at full fp32. VRAM holds the entire model on-chip, right next to the compute cores. If we used system RAM, every calculation would wait for data to crawl across the PCIe bus, a 10-50x bottleneck.
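The arithmetic behind that figure is simply parameters times bytes per parameter:

```python
# Back-of-envelope VRAM footprint: parameter count × bytes per parameter.
def model_bytes(n_params: float, bytes_per_param: int) -> float:
    return n_params * bytes_per_param

gb = 1e9
print(model_bytes(70e9, 2) / gb)  # 70B params at fp16 (2 bytes) → 140.0 GB
print(model_bytes(70e9, 4) / gb)  # same model at fp32 (4 bytes) → 280.0 GB
```

This is also why quantization (1 byte at int8, ~0.5 at 4-bit) is the standard trick for fitting large models on smaller GPUs.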
This is the transformer's superpower. Each token computes Query, Key, and Value vectors, then calculates attention scores with every other token simultaneously. "learn" attends strongly to "neural" and "networks" — discovering relationships the model needs.
This is an O(n²) operation — and it happens in parallel across thousands of GPU cores. A CPU would compute these one pair at a time. The GPU does them all at once.
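The computation above can be sketched in a few lines of plain Python (single head, no batching, no masking; real kernels fuse and parallelize all of this on the GPU):

```python
# Minimal scaled dot-product attention in pure Python.
# Each query scores against every key: the n×n, O(n²) part.
import math

def attention(Q, K, V):
    d = len(Q[0])
    out = []
    for q in Q:                                   # one row of scores per query
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        m = max(scores)                           # numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]       # attention weights sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))  # each output row is a weighted mix of V's rows
```

Each output vector is a probability-weighted blend of the value vectors, which is exactly how "learn" ends up carrying information from "neural" and "networks".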
Every forward pass involves thousands of matrix multiplications. Each one multiplies matrices with millions of elements. GPUs have Tensor Cores — specialized hardware that multiplies 4×4 matrices in a single clock cycle.
After attention, each token passes through a feed-forward network: giant matrix multiplications with a nonlinearity between them (modern models typically use SwiGLU, which adds a third, gating projection). The hidden dimension expands to roughly 4x the model dimension (~2.7x in SwiGLU variants), then compresses back. This is where much of the model's factual knowledge appears to be stored.
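A toy sketch of a SwiGLU-style block, with random stand-in weights and shrunken dimensions (a real model learns these matrices; nothing here is pretrained):

```python
# SwiGLU feed-forward block in pure Python, at toy sizes.
import math, random

random.seed(1)
d_model, d_hidden = 4, 11            # real models: e.g. 4096 → ~11008

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)]
            for _ in range(rows)]

W_gate = rand_matrix(d_hidden, d_model)
W_up   = rand_matrix(d_hidden, d_model)
W_down = rand_matrix(d_model, d_hidden)

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def silu(x):
    return x / (1.0 + math.exp(-x))  # x * sigmoid(x)

def ffn(x):
    gate = [silu(g) for g in matvec(W_gate, x)]    # expand + nonlinearity
    up = [u for u in matvec(W_up, x)]              # expand
    hidden = [g * u for g, u in zip(gate, up)]     # elementwise gating
    return matvec(W_down, hidden)                  # compress back to d_model

print(ffn([1.0, -0.5, 0.25, 0.0]))  # a d_model-sized output vector
```

The expand-gate-compress shape is the whole structure; scale the dimensions up by three orders of magnitude and you have the layer that dominates a transformer's parameter count.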
Attention + FFN = one transformer layer. Modern LLMs stack 80+ layers deep. Your 7 tokens traverse all of them, every single inference.
HBM3 VRAM achieves 65x the bandwidth of DDR5 RAM. The secret is the bus width: VRAM uses an 8,192-bit wide bus versus RAM's 64-bit bus. It's like comparing a 128-lane highway to a single-lane road.
LLMs are memory-bandwidth bound during token generation. Every output token requires reading the entire model's weights from memory. VRAM's massive bandwidth is what makes real-time generation possible.
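That bound is easy to compute: peak generation speed is roughly memory bandwidth divided by the bytes of weights read per token (ignoring compute time and KV-cache traffic):

```python
# If every token requires streaming all weights from VRAM, peak generation
# speed is capped at bandwidth / model size (compute and KV reads ignored).
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

# A100 HBM2e ≈ 2,000 GB/s; 70B model at fp16 ≈ 140 GB of weights.
print(max_tokens_per_second(2000, 140))  # ~14.3 tokens/s per GPU
```

This is why batching and multi-GPU tensor parallelism matter: they amortize or multiply that one fixed bandwidth budget.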
The final layer projects back to vocabulary size — 100,000+ logits. Softmax converts these raw scores into a probability distribution. Temperature controls randomness: low temperature makes the model confident, high temperature makes it creative.
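Temperature is just a divisor applied to the logits before softmax, as a small example over a toy four-word vocabulary shows:

```python
# Temperature-scaled softmax over raw logits (toy 4-entry vocabulary).
import math

def softmax_with_temperature(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]
print(softmax_with_temperature(logits, 0.5))  # sharper: top logit dominates
print(softmax_with_temperature(logits, 2.0))  # flatter: probability spreads out
```

Dividing by a small temperature stretches the gaps between logits (confident), while a large temperature shrinks them toward uniform (creative).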
One token is sampled from this distribution. Then the entire process repeats — for every single output token.
Each generated token feeds back into the model as input for the next. The KV cache in VRAM stores previously computed attention states so the model doesn't recompute them — a critical VRAM optimization that trades memory for speed.
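A minimal sketch of the data structure (the real cache holds per-head tensors in VRAM, but the append-once, read-many pattern is the same):

```python
# Toy KV cache: per-layer lists that grow by one entry per decoding step,
# so past tokens' K/V vectors are read back, never recomputed.
class KVCache:
    def __init__(self, n_layers: int):
        self.keys = [[] for _ in range(n_layers)]
        self.values = [[] for _ in range(n_layers)]

    def append(self, layer: int, k, v):
        self.keys[layer].append(k)
        self.values[layer].append(v)

    def get(self, layer: int):
        # Attention for the newest token reads all cached K/V for this layer.
        return self.keys[layer], self.values[layer]

cache = KVCache(n_layers=2)
for step in range(3):                        # three decoding steps
    k, v = [float(step)], [float(step) * 2]  # pretend K/V projections
    cache.append(0, k, v)

ks, vs = cache.get(0)
print(len(ks))  # 3 cached entries after 3 steps
```

The trade is explicit: the cache grows linearly with sequence length (and with layers and heads), but each new token's attention becomes a read plus one append instead of a full recompute.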
At ~50-100 tokens/second, the model reads ~140 GB of weights from VRAM for each token: an aggregate 7-14 TB/s of memory traffic. A single A100 delivers roughly 2 TB/s of HBM bandwidth, so rates like this come from splitting the model across several GPUs, and they're only possible because the data lives in VRAM, millimetres from the compute cores.
Text → Tokens → Embeddings → VRAM → Attention → MatMul → FFN → Softmax → Sample → Output
Every step happens inside GPU memory. VRAM isn't just faster RAM — it's a fundamentally different memory architecture with 128x wider buses and 65x higher bandwidth, purpose-built for the parallel, uniform access patterns that define neural network inference.
Built for kevinpaul.au