A visual journey through how large language models process your words — from text to tokens to GPU cores and back.
You type a prompt and hit enter. Your text travels as UTF-8 encoded bytes through an API to a server rack housing GPUs. The raw string is just the beginning — it's about to be decomposed into something the model can actually process.
A tokenizer (like BPE) splits your text into subword pieces. "networks" becomes "net" + "works". Each token maps to an integer ID from a fixed vocabulary of ~100,000 entries. This is the model's alphabet — not letters, but meaning-carrying fragments.
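A minimal sketch of the idea, using a made-up five-entry vocabulary and greedy longest-match lookup (real BPE learns its merge rules from data, and the token IDs here are invented):

```python
# Toy greedy longest-match tokenizer over a tiny hypothetical vocabulary.
# Real BPE tokenizers learn merges from a corpus; these entries are made up.
VOCAB = {"net": 1001, "works": 1002, "work": 1003, "s": 1004, "neural": 1005}

def tokenize(text: str) -> list[int]:
    ids = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids

print(tokenize("networks"))  # "net" + "works" → [1001, 1002]
```

Production tokenizers are far more involved (byte fallback, learned merge ranks), but the output is the same shape: a list of integer IDs.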
Each token ID is looked up in an embedding table — a massive matrix stored in VRAM. Token 2585 ("How") becomes a vector of 4,096 floating-point numbers. These vectors encode semantic meaning: similar words cluster together in this high-dimensional space.
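The lookup itself is just row indexing, which a toy table makes concrete (sizes shrunk from 100,000 × 4,096 to something printable; the values are random stand-ins, not real embeddings):

```python
# Embedding lookup is a row index into a (vocab_size, d_model) matrix.
# Tiny illustrative numbers; a real model uses ~100,000 × 4,096 floats.
import random

random.seed(0)
vocab_size, d_model = 8, 4           # stand-ins for 100,000 and 4,096
embedding_table = [[random.uniform(-1, 1) for _ in range(d_model)]
                   for _ in range(vocab_size)]

def embed(token_ids):
    # Each ID selects one row: no arithmetic, just a memory read from the table.
    return [embedding_table[t] for t in token_ids]

vectors = embed([2, 5, 2])
print(len(vectors), len(vectors[0]))  # 3 4 — three tokens, four dims each
```

Note that the same ID always retrieves the same row; context only enters later, in the attention layers.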
A 70B parameter model needs ~140 GB at 16-bit (fp16) precision, and double that at full fp32. VRAM holds the entire model on-chip, right next to the compute cores. If we used system RAM, every calculation would wait for data to crawl across the PCIe bus, a 10-50x bottleneck.
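The arithmetic behind that figure is simply parameters times bytes per parameter:

```python
# Back-of-envelope VRAM footprint: parameter count × bytes per parameter.
def model_bytes(n_params: float, bytes_per_param: int) -> float:
    return n_params * bytes_per_param

gb = 1e9
print(model_bytes(70e9, 2) / gb)  # 70B params at fp16 (2 bytes) → 140.0 GB
print(model_bytes(70e9, 4) / gb)  # same model at fp32 (4 bytes) → 280.0 GB
```

This is also why quantization (1 byte at int8, ~0.5 at 4-bit) is the standard trick for fitting large models on smaller GPUs.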
This is the transformer's superpower. Each token computes Query, Key, and Value vectors, then calculates attention scores with every other token simultaneously. "learn" attends strongly to "neural" and "networks" — discovering relationships the model needs.
This is an O(n²) operation — and it happens in parallel across thousands of GPU cores. A CPU would compute these one pair at a time. The GPU does them all at once.
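The computation above can be sketched in a few lines of plain Python (single head, no batching, no masking; real kernels fuse and parallelize all of this on the GPU):

```python
# Minimal scaled dot-product attention in pure Python.
# Each query scores against every key: the n×n, O(n²) part.
import math

def attention(Q, K, V):
    d = len(Q[0])
    out = []
    for q in Q:                                   # one row of scores per query
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        m = max(scores)                           # numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]       # attention weights sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))  # each output row is a weighted mix of V's rows
```

Each output vector is a probability-weighted blend of the value vectors, which is exactly how "learn" ends up carrying information from "neural" and "networks".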
Every forward pass involves thousands of matrix multiplications. Each one multiplies matrices with millions of elements. GPUs have Tensor Cores — specialized hardware that multiplies 4×4 matrices in a single clock cycle.
After attention, each token passes through a feed-forward network: giant matrix multiplications with a nonlinearity between them (modern models typically use SwiGLU, which adds a third, gating projection). The hidden dimension expands to roughly 4x the model dimension (~2.7x in SwiGLU variants), then compresses back. This is where much of the model's factual knowledge appears to be stored.
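A toy sketch of a SwiGLU-style block, with random stand-in weights and shrunken dimensions (a real model learns these matrices; nothing here is pretrained):

```python
# SwiGLU feed-forward block in pure Python, at toy sizes.
import math, random

random.seed(1)
d_model, d_hidden = 4, 11            # real models: e.g. 4096 → ~11008

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)]
            for _ in range(rows)]

W_gate = rand_matrix(d_hidden, d_model)
W_up   = rand_matrix(d_hidden, d_model)
W_down = rand_matrix(d_model, d_hidden)

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def silu(x):
    return x / (1.0 + math.exp(-x))  # x * sigmoid(x)

def ffn(x):
    gate = [silu(g) for g in matvec(W_gate, x)]    # expand + nonlinearity
    up = [u for u in matvec(W_up, x)]              # expand
    hidden = [g * u for g, u in zip(gate, up)]     # elementwise gating
    return matvec(W_down, hidden)                  # compress back to d_model

print(ffn([1.0, -0.5, 0.25, 0.0]))  # a d_model-sized output vector
```

The expand-gate-compress shape is the whole structure; scale the dimensions up by three orders of magnitude and you have the layer that dominates a transformer's parameter count.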
Attention + FFN = one transformer layer. Modern LLMs stack 80+ layers deep. Your 7 tokens traverse all of them, every single inference.
HBM3 VRAM achieves 65x the bandwidth of DDR5 RAM. The secret is the bus width: VRAM uses an 8,192-bit wide bus versus RAM's 64-bit bus. It's like comparing a 128-lane highway to a single-lane road.
LLMs are memory-bandwidth bound during token generation. Every output token requires reading the entire model's weights from memory. VRAM's massive bandwidth is what makes real-time generation possible.
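That bound is easy to compute: peak generation speed is roughly memory bandwidth divided by the bytes of weights read per token (ignoring compute time and KV-cache traffic):

```python
# If every token requires streaming all weights from VRAM, peak generation
# speed is capped at bandwidth / model size (compute and KV reads ignored).
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

# A100 HBM2e ≈ 2,000 GB/s; 70B model at fp16 ≈ 140 GB of weights.
print(max_tokens_per_second(2000, 140))  # ~14.3 tokens/s per GPU
```

This is why batching and multi-GPU tensor parallelism matter: they amortize or multiply that one fixed bandwidth budget.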
The final layer projects back to vocabulary size — 100,000+ logits. Softmax converts these raw scores into a probability distribution. Temperature controls randomness: low temperature makes the model confident, high temperature makes it creative.
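Temperature is just a divisor applied to the logits before softmax, as a small example over a toy four-word vocabulary shows:

```python
# Temperature-scaled softmax over raw logits (toy 4-entry vocabulary).
import math

def softmax_with_temperature(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]
print(softmax_with_temperature(logits, 0.5))  # sharper: top logit dominates
print(softmax_with_temperature(logits, 2.0))  # flatter: probability spreads out
```

Dividing by a small temperature stretches the gaps between logits (confident), while a large temperature shrinks them toward uniform (creative).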
One token is sampled from this distribution. Then the entire process repeats — for every single output token.
Each generated token feeds back into the model as input for the next. The KV cache in VRAM stores previously computed attention states so the model doesn't recompute them — a critical VRAM optimization that trades memory for speed.
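A minimal sketch of the data structure (the real cache holds per-head tensors in VRAM, but the append-once, read-many pattern is the same):

```python
# Toy KV cache: per-layer lists that grow by one entry per decoding step,
# so past tokens' K/V vectors are read back, never recomputed.
class KVCache:
    def __init__(self, n_layers: int):
        self.keys = [[] for _ in range(n_layers)]
        self.values = [[] for _ in range(n_layers)]

    def append(self, layer: int, k, v):
        self.keys[layer].append(k)
        self.values[layer].append(v)

    def get(self, layer: int):
        # Attention for the newest token reads all cached K/V for this layer.
        return self.keys[layer], self.values[layer]

cache = KVCache(n_layers=2)
for step in range(3):                        # three decoding steps
    k, v = [float(step)], [float(step) * 2]  # pretend K/V projections
    cache.append(0, k, v)

ks, vs = cache.get(0)
print(len(ks))  # 3 cached entries after 3 steps
```

The trade is explicit: the cache grows linearly with sequence length (and with layers and heads), but each new token's attention becomes a read plus one append instead of a full recompute.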
At ~50-100 tokens/second, the model reads ~140 GB of weights from VRAM for each token: an aggregate 7-14 TB/s of memory traffic. A single A100 delivers roughly 2 TB/s of HBM bandwidth, so rates like this come from splitting the model across several GPUs, and they're only possible because the data lives in VRAM, millimetres from the compute cores.
Text → Tokens → Embeddings → VRAM → Attention → MatMul → FFN → Softmax → Sample → Output
Every step happens inside GPU memory. VRAM isn't just faster RAM — it's a fundamentally different memory architecture with 128x wider buses and 65x higher bandwidth, purpose-built for the parallel, uniform access patterns that define neural network inference.
Built for kevinpaul.au