What Building an LLM From Scratch Actually Teaches You

Reading about transformers is not the same as implementing one. Here are the insights that only come from writing the code, watching it fail, and understanding…

2026-04-07·9 min read·llm, transformer, architecture, training, deep-learning


There is a class of understanding you cannot acquire by reading papers or calling APIs. You get it by implementing something, watching it fail, and figuring out why. Building a language model from scratch is the most direct version of this for LLM practitioners.

This is not a tutorial. Sebastian Raschka's book does that job well. This is about what the implementation teaches you - the things that become obvious only after you've written the code and observed the behavior firsthand.

Tokenization Is Not a Preprocessing Detail

When you implement a tokenizer from scratch - BPE, WordPiece, or the unigram model that SentencePiece implements - the first thing you notice is how many decisions are baked into it before training begins. Vocabulary size is one of them. A vocabulary of 30k tokens handles most common words as single tokens. A vocabulary of 50k handles more, but the embedding table grows in proportion to vocabulary size, and the final projection layer over the vocabulary becomes a non-trivial fraction of total parameters, especially in small models.
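To put rough numbers on the vocabulary cost, here is a back-of-the-envelope sketch. The model shape (d_model of 768, 12 layers) and the estimate of about 12 x d_model^2 parameters per transformer block are illustrative assumptions, not figures from any specific model.

```python
def vocab_parameter_share(vocab_size, d_model, other_params, tied=False):
    """Return (vocab_params, share_of_total) for embedding + output projection."""
    embedding = vocab_size * d_model
    # An untied model keeps a separate d_model x vocab_size output projection.
    output_proj = 0 if tied else vocab_size * d_model
    vocab_params = embedding + output_proj
    return vocab_params, vocab_params / (vocab_params + other_params)

# Hypothetical small model (assumption): 12 blocks, ~12 * d_model^2 params each
# (4 * d_model^2 for attention, 8 * d_model^2 for a 4x-wide MLP).
d_model = 768
n_layers = 12
block_params = n_layers * 12 * d_model * d_model

for vocab_size in (30_000, 50_000):
    params, share = vocab_parameter_share(vocab_size, d_model, block_params)
    print(f"{vocab_size:>6} tokens: {params / 1e6:.1f}M vocab params, {share:.0%} of total")
```

At this scale the vocabulary-dependent layers are a large slice of the model. Passing tied=True shows how sharing the embedding and output matrices halves that cost.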

The downstream effects are real. A tokenizer trained on English text will split common French or Spanish words into multiple subword tokens. For the same semantic content, a French document requires more tokens than an English one. Your context window fills faster. Your model sees less per forward pass. This is not an edge case - it's a structural property of the vocabulary that shapes everything downstream.

Out-of-vocabulary handling reveals something else: the model's behavior at the edges of its training distribution is a tokenizer property as much as a weights property. When a term isn't in the vocabulary, it fragments into subwords. Whether those subwords are meaningful - whether "transformer" and "trans" and "former" share useful representations - depends on what the tokenizer saw at training time.
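The fragmentation is easy to reproduce with a toy BPE trainer. This is a minimal sketch under simplifying assumptions: character-level start symbols, no byte fallback, no end-of-word marker, and a made-up three-word corpus. The function names are hypothetical, not from any library.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of training words."""
    vocab = Counter(tuple(w) for w in words)  # word as symbols -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every training word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(symbols, best))] += freq
        vocab = new_vocab
    return merges

def encode(word, merges):
    """Apply the learned merges, in training order, to a new word."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

# Toy corpus (assumption): the tokenizer never sees "transformer" as a whole word.
corpus = ["trans"] * 4 + ["former"] * 4 + ["transform"] * 3
merges = train_bpe(corpus, num_merges=50)
print(encode("transform", merges))    # seen in training: a single token
print(encode("transformer", merges))  # unseen: falls apart into several pieces
```

Which pieces the unseen word falls into depends entirely on the merge order learned from the corpus, which is the point: the split is decided before the model's weights are involved at all.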

You only fully grasp this when you see how vocabulary choices affect the model's behavior on novel inputs. In production, when your deployed model handles domain-specific terminology poorly, the fix is rarely fine-tuning alone. The tokenizer is often the root cause.

The Quadratic Cost Becomes Physical

Reading that self-attention scales quadratically with sequence length is one thing. Implementing multi-head attention and then doubling your sequence length is another.

When you run it, you watch your memory usage grow. You see why 4k context and 128k context are not just a slider - they're different memory budgets, different inference costs, different hardware requirements. The quadratic relationship goes from an abstract notation in a paper to a number you're staring at in your CUDA memory profiler.
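The arithmetic behind that profiler number is short. This sketch assumes a naive implementation that materializes the full attention score matrix for one layer in 16-bit precision, with a hypothetical 32 heads and batch size 1.

```python
def attention_scores_bytes(seq_len, n_heads, batch_size=1, bytes_per_value=2):
    """Memory for one layer's attention score matrix if fully materialized."""
    return batch_size * n_heads * seq_len * seq_len * bytes_per_value

for seq_len in (4_096, 8_192, 131_072):
    gib = attention_scores_bytes(seq_len, n_heads=32) / 2**30
    print(f"{seq_len:>7} tokens: {gib:,.1f} GiB per layer")
```

Under these assumptions, 4k tokens costs 1 GiB per layer, 8k costs 4 GiB, and 128k costs 1,024 GiB. Fused kernels such as FlashAttention avoid storing the full matrix, but the amount of computation still grows with the square of the sequence length.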

This also makes the architectural alternatives legible. State space models, linear attention variants, sliding window attention - these are not arbitrary alternatives to the standard transformer. They are specific responses to this specific bottleneck. When you understand the bottleneck as a cost you've paid, you understand what each alternative is actually trading off and why the tradeoffs matter for different use cases.

Multi-Head Attention Mechanics

Implementing multi-head attention also clarifies what the multiple heads are doing. Each head projects keys, queries, and values into a lower-dimensional subspace and computes attention there. The intuition is that different heads can attend to different kinds of relationships - syntactic structure in one, semantic content in another, positional proximity in a third. You see this empirically when you visualize attention weights: heads do specialize in practice, though not in ways you can fully predict before training.
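For reference, the per-head mechanics fit in a few dozen lines of plain Python. This is a deliberately slow, dependency-free sketch with assumptions worth naming: no causal mask, no biases, random weights, and heads taken as contiguous slices of the projected features. It is not how a production implementation is written.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """x is seq_len x d_model; every weight matrix is d_model x d_model."""
    d_head = len(x[0]) // n_heads
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    head_outputs, head_weights = [], []
    for h in range(n_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        qh = [row[cols] for row in q]
        kh = [row[cols] for row in k]
        vh = [row[cols] for row in v]
        # Scaled dot-product attention inside this head's subspace.
        scores = matmul(qh, [list(c) for c in zip(*kh)])
        weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        head_outputs.append(matmul(weights, vh))
        head_weights.append(weights)
    # Concatenate heads along the feature axis, then mix them with w_o.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(x))]
    return matmul(concat, w_o), head_weights

def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
x = random_matrix(3, 4)  # 3 tokens, d_model = 4
out, weights = multi_head_attention(
    x, random_matrix(4, 4), random_matrix(4, 4),
    random_matrix(4, 4), random_matrix(4, 4), n_heads=2,
)
print(len(out), len(out[0]))  # output keeps the input shape
print(weights[0][0])          # head 0: how token 0 distributes its attention
```

Returning the per-head weights alongside the output is what makes heatmap visualizations possible. Each head produces its own seq_len x seq_len matrix, and those matrices are computed in different subspaces.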

Understanding this changes how you interpret attention visualization tools in production. Attention weight heatmaps are not a clean window into model reasoning. They show what positions the model attends to, not why, and the relationship between attention weights and model behavior is indirect. This is a useful corrective against over-interpreting interpretability tools.

Layer Norm Placement Has Empirical Consequences

The original transformer paper uses post-layer normalization: compute the sublayer output, add the residual, then normalize. Most modern architectures use pre-layer normalization: normalize the input before the sublayer, then add the residual. The difference is a few lines of code. The effect on training stability is not small.
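Side by side, the two placements look like this. The sketch operates on a single token vector, omits the learnable gain and bias of a real layer norm, and treats `sublayer` as any function from vector to vector. All of these are simplifying assumptions.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm_block(x, sublayer):
    # Original transformer: norm(x + sublayer(x)).
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer):
    # Pre-norm: x + sublayer(norm(x)). The residual path is never normalized.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def zero_sublayer(v):
    return [0.0] * len(v)

x = [1.0, 2.0, 4.0]
print(pre_norm_block(x, zero_sublayer))   # input passes through unchanged
print(post_norm_block(x, zero_sublayer))  # input is rescaled even with a no-op sublayer
```

With a sublayer that contributes nothing, the pre-norm block is an exact identity, while the post-norm block still rescales the residual stream at every layer. That untouched identity path is one common way to picture why pre-norm is easier to optimize at depth.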

When you implement both and train with the same hyperparameters, post-norm models diverge more readily, especially at higher learning rates. Pre-norm models tolerate a wider range of learning rates and tend to have smoother loss curves in the early part of training. The theory is that pre-norm keeps the gradient signal more stable through depth - you can derive it, but watching it happen in your own training run makes the mechanism intuitive rather than theoretical.

This matters for a specific production scenario: when you're fine-tuning a pre-trained model and deciding how aggressively to set your learning rate. Models with pre-norm architectures (GPT-2 style, LLaMA) are more forgiving. Models with post-norm architectures require more careful scheduling and lower peak learning rates. Knowing which architecture you're working with, and why it affects training dynamics, lets you make that decision from first principles rather than trial and error.

Training Instability Is Diagnostic, Not Random

The standard training recipe for transformers includes gradient clipping, learning rate warmup, and weight decay. These are often presented as "best practices" without explanation. Implementing a training loop and removing each one shows you what failure mode each addresses.

Remove gradient clipping: gradients explode during backward passes on batches that produce unusually large attention logits. You see the loss spike and then diverge. The clip threshold is a bound on how much a single bad batch can damage the model.
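A minimal sketch of clipping by global norm, assuming the gradients have been flattened into one list of floats. The function name is hypothetical; frameworks ship their own version, for example `torch.nn.utils.clip_grad_norm_` in PyTorch.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients together so their combined L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before)  # norm before clipping
print(clipped)      # same direction, norm scaled down to max_norm
```

Logging the pre-clip norm on every step is cheap and turns "the loss spiked" into "the gradient norm jumped well above the threshold at this step", which is much easier to act on.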

Remove learning rate warmup: training becomes more sensitive to initialization. The initial parameter values, especially in the attention projection layers, are random, so early gradients are large and noisy, and an adaptive optimizer's running statistics are not yet reliable. Full-size steps at that point push the model into poor regions of the loss landscape. Warmup keeps the steps small until the parameters and the optimizer state settle into a reasonable configuration.
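One common shape for the schedule is linear warmup followed by cosine decay. The cosine part and all the numbers below are assumptions for illustration; the text above only requires that the learning rate start small and ramp up.

```python
import math

def lr_at_step(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in (0, 50, 99, 100, 550, 1000):
    print(step, f"{lr_at_step(step, peak_lr=1e-3, warmup_steps=100, total_steps=1000):.2e}")
```

Setting warmup_steps to 1 is a quick way to rerun the "remove warmup" experiment without touching the rest of the loop.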

Remove weight decay: the model overfits more aggressively. On small datasets, you see test loss diverge from training loss earlier. Weight decay is an explicit regularization term that prevents individual parameters from growing arbitrarily large - in practical terms, it prevents the model from memorizing training examples at the expense of generalization.
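Mechanically, weight decay is one extra term in the update. This sketch uses plain SGD with the decay applied directly to the weights, which is an assumption chosen for brevity; with adaptive optimizers the decoupled form (as in AdamW) and an L2 penalty added to the loss are no longer equivalent.

```python
def sgd_step(weights, grads, lr, weight_decay=0.0):
    """One SGD update that also shrinks each weight toward zero."""
    return [w - lr * g - lr * weight_decay * w for w, g in zip(weights, grads)]

# With zero gradient, the only force acting on the weight is the decay.
w = [1.0]
for _ in range(3):
    w = sgd_step(w, grads=[0.0], lr=0.1, weight_decay=0.1)
    print(w)
```

A weight that the data gives no reason to keep large shrinks geometrically toward zero, and a weight only stays large if the gradient keeps pushing it there.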

Each of these failure modes is observable. Each corresponds to a specific pathology in the optimization landscape. When your production fine-tuning runs fail, recognizing these patterns in the loss curve is what separates a fast diagnosis from a multi-day debugging session.

The Loss Curve as a Diagnostic Instrument

You learn to read the loss curve the way a clinician reads a vital sign: not just whether the number is going down, but what the shape tells you about the system's state.

A smooth, steady descent with training and validation loss tracking closely: the model is learning genuine structure from the data. A plateau early in training: typically a learning rate issue - either too low to make progress or too high, so the optimizer bounces around instead of descending. Spikes in training loss that recover: transient gradient blow-ups that clipping contained; if they recur, lower your clip threshold or reduce the learning rate. Training loss descending while validation loss stagnates or rises: overfitting, often visible much earlier than people expect on small datasets.

The specific pattern that surprises most practitioners: validation loss often starts rising well before training loss stops improving. If you're running long fine-tuning jobs and checking validation loss only at the end, you're likely deploying an overfit model without realizing it. The divergence is often visible within the first few epochs if you log validation loss throughout the run.
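If validation loss is logged at every evaluation point, picking the checkpoint to deploy can be automated. This is a minimal early-stopping sketch: the function name, the patience value, and the loss numbers are all made up for illustration.

```python
def best_checkpoint(val_losses, patience=2, min_delta=0.0):
    """Return (best_index, stop_index) for a list of validation losses.

    stop_index is the evaluation at which training would be halted, or None
    if validation loss never went `patience` evaluations without improving.
    """
    best_idx, best = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best_idx, best = i, loss
        elif i - best_idx >= patience:
            return best_idx, i
    return best_idx, None

# Hypothetical run: training loss keeps falling while validation loss turns around.
train_losses = [2.00, 1.50, 1.20, 1.00, 0.80, 0.60]
val_losses = [2.10, 1.70, 1.60, 1.65, 1.75, 1.90]
print(best_checkpoint(val_losses))  # → (2, 4)
```

In this made-up run the best checkpoint is the third evaluation, even though training loss improves for three more. Looking at only the final numbers would hide that.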

Emergent Capabilities Are Threshold Effects

At small parameter counts, a model trained on next-token prediction learns to reproduce local statistical patterns in text. It can complete sentences in a plausible style. It cannot answer questions, follow instructions, or reason through multi-step problems.

As you scale - more parameters, more data, longer training - new behaviors appear that weren't present at smaller scale. This is what's called emergence, and the framing in popular discourse makes it sound more mysterious than it is.

Building from scratch and training at different scales makes the mechanism clearer. The model is learning a function approximation over a very high-dimensional space. Certain capabilities require the model to have learned enough of the underlying structure of language that it can generalize to new task types. Below a threshold of representational capacity, that generalization doesn't happen. Above it, it does. The threshold is not sharp - it appears sharp when you're looking at capability benchmarks that have discrete pass/fail criteria, but the underlying model quality is improving continuously.
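The "continuous underneath, sharp on the benchmark" effect can be shown with one line of arithmetic. The sketch assumes an answer counts as correct only if every one of its tokens is right, and that token errors are independent. Both are simplifications, and the accuracy values are made up.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that all tokens of an answer are correct (independent errors)."""
    return per_token_accuracy ** answer_length

# Per-token accuracy improves smoothly; the pass/fail metric does not.
for acc in (0.80, 0.90, 0.95, 0.99):
    print(f"per-token {acc:.2f} -> exact match on 20 tokens: {exact_match_rate(acc, 20):.3f}")
```

A model that moves from 80% to 99% per-token accuracy goes from almost never passing a 20-token exact-match test to passing most of the time. On a benchmark chart that looks like a capability switching on, but the underlying quantity improved gradually.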

The practical implication: emergence is not a property of a specific model size. It's a function of model capacity relative to the complexity of the task and the quality and quantity of training data. When you're evaluating whether a smaller model can handle your use case, the question is whether your task's threshold is below or above that model's representational capacity. Building from scratch gives you the intuition to estimate that.

The Production Payoff

Every insight above connects to a concrete failure mode in production LLM systems.

Tokenization understanding explains why domain-specific terminology degrades model performance and why vocabulary coverage is part of model selection, not just an afterthought. Attention mechanics explain why long-context performance isn't uniformly good across all positions in the context window - the model attends to some positions more than others, and position matters. Training stability knowledge explains fine-tuning failures that look like hardware issues or data issues but are actually optimization issues. Loss curve literacy means faster debugging cycles when a fine-tuning run goes wrong. Emergent capability framing gives you a realistic framework for evaluating whether a given model can handle a given task, rather than relying on benchmark marketing.

None of this requires that you implement an LLM from scratch before every deployment. It does suggest that working through one implementation - at small scale, for educational purposes - pays dividends over time in the quality of your production decisions.

The gap between "knows how to call an LLM API" and "understands what the LLM is doing" is where most production failures live.
