My LLM Learning Roadmap for 2026
A structured, opinionated path through the resources I've found genuinely useful for learning LLMs — from attention to production.

Every few months someone asks me where to start with LLMs. I've given the same answer enough times that it's worth writing down — not as a list of "top resources," but as a path I'd actually walk again if I were starting from zero in 2026.
This is based on what I've used while building and deploying language models in production contexts. Some of these resources I've returned to repeatedly. Others I burned through once and moved past. I'll tell you which is which.
The foundation you can't skip
Before anything LLM-specific, you need three things to be solid: linear algebra (particularly eigenvalues, matrix decompositions), probability (Bayes, maximum likelihood, KL divergence — these will come back constantly), and PyTorch fundamentals. I'm not saying you need a PhD in each. I'm saying that if you hit a paper on LoRA or quantization and the math stops you cold, you'll keep stalling at the same wall.
For linear algebra, 3Blue1Brown's Essence of Linear Algebra series is genuinely excellent, not because it goes deep, but because it builds the right geometric intuition fast. For probability and statistics, StatQuest by Josh Starmer is my go-to when I need to rebuild intuition on something I learned once and half-forgot. For PyTorch, Patrick Loeber's beginner series on YouTube covers enough to be functional. The PyTorch docs will eventually be your primary reference, but they're dense for day one.
Understanding how transformers actually work
Once the foundation is in place, the first serious investment is in understanding the transformer architecture — not just as "the thing behind ChatGPT" but well enough to reason about its failure modes.
Jay Alammar's Illustrated Transformer and Illustrated GPT-2 are the best visual explanations of these models I've found. They don't replace reading the original attention paper (Vaswani et al., Attention Is All You Need), but they make the paper readable. I'd do both: Alammar first, then the paper, then Alammar again.
For the implementation side, Andrej Karpathy's Let's build GPT video (the lecture that goes with his nanoGPT repo) is about two hours of dense, useful material. He builds a working GPT from scratch in PyTorch while explaining every decision. If you can follow this without pausing constantly, your fundamentals are in reasonable shape.
After this, Lilian Weng's Attention? Attention! fills in the theoretical gaps. She previously led safety research at OpenAI, and her blog posts are among the best technical writing in this field: rigorous but readable.
What you get from this phase: a working mental model of tokenization, self-attention, positional encoding, and how decoder-only generation works. That architecture underlies nearly every LLM you'll interact with in production.
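To make the self-attention part of that mental model concrete, here is a minimal single-head sketch in plain Python. It is a simplification under stated assumptions: the inputs are used directly as queries, keys, and values (a real transformer applies learned projection matrices first), and there is no batching or multi-head split. The function names are mine, not from any library.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(q, k, v):
    """q, k, v: lists of T vectors (lists of floats). Returns T output vectors."""
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # Causal mask: position t only attends to positions 0..t.
        scores = [
            sum(qi * ki for qi, ki in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors.
        out.append([
            sum(w * v[s][j] for s, w in enumerate(weights))
            for j in range(len(v[0]))
        ])
    return out

# Toy sequence of three 2-dimensional token vectors (illustrative values only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = causal_self_attention(x, x, x)
```

The first token can only see itself, so its output equals its own value vector. Changing a later token never changes earlier outputs, which is the property that makes left-to-right generation work.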
Going deeper: Sebastian Raschka's book
Build a Large Language Model (From Scratch) by Sebastian Raschka is the resource I recommend most often to practitioners who already have some Python fluency. It walks from raw text data through attention mechanisms to a functioning GPT-style model, chapter by chapter, with complete code.
What separates it from tutorials is that Raschka explains why each design choice was made, not just what it is. The chapter on coding attention mechanisms alone is worth the price. The companion notebooks are clean and well-maintained on GitHub.
This is not a weekend read. Plan two to four weeks if you're treating it seriously alongside other work. After finishing it, pretraining should stop feeling like magic and start feeling like an engineering problem with specific costs and tradeoffs.
Fine-tuning: where most practitioners spend their time
In my experience, the majority of practical LLM work isn't pretraining — it's fine-tuning and evaluation. Pretraining a model from scratch requires compute at a scale (Llama 2 was trained on 2 trillion tokens) that is not available to most practitioners or engineering teams. What is available is fine-tuning.
The most practically important techniques are LoRA (Hu et al., 2021) and QLoRA (Dettmers et al., 2023), which trains the same kind of adapters on top of a base model quantized to 4-bit. LoRA works by freezing the pretrained weights and adding a pair of low-rank adapter matrices to selected layers. The trainable parameter count typically drops by 99% or more compared to full fine-tuning, and the results are close enough for nearly every production use case I've seen.
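A rough sketch of why the adapter is so cheap, assuming a single square weight matrix with hidden size 4096 and rank 8 (both illustrative choices, not values from any particular model). The forward pass follows the LoRA formulation h = Wx + (alpha / r) * B(Ax), with B initialized to zero so the adapter starts as a no-op. The helper names are hypothetical.

```python
def lora_trainable_params(d_in, d_out, rank):
    # A has shape (rank x d_in), B has shape (d_out x rank).
    # The frozen weight W has shape (d_out x d_in) and is not trained.
    return rank * d_in + d_out * rank

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def lora_forward(w, a, b, x, alpha, rank):
    # Frozen path plus the scaled low-rank update.
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [h + scale * u for h, u in zip(base, update)]

d, r = 4096, 8                          # illustrative sizes
full = d * d                            # parameters in the frozen matrix
lora = lora_trainable_params(d, d, r)   # parameters in the adapter
reduction = 1 - lora / full             # fraction of parameters no longer trained

# Tiny forward-pass check: with B = 0 the output matches the frozen layer.
w = [[1.0, 2.0], [3.0, 4.0]]
a = [[0.5, -0.5]]      # rank 1
b = [[0.0], [0.0]]     # zero init
h = lora_forward(w, a, b, [1.0, 1.0], alpha=2.0, rank=1)
```

For these sizes the adapter holds 65,536 trainable parameters against 16,777,216 in the frozen matrix, a reduction of about 99.6% for that layer. The overall figure for a real model depends on which layers get adapters.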
For hands-on fine-tuning, the mlabonne LLM course on GitHub (over 9,000 stars) has the best collection of Colab notebooks I've used: fine-tuning Llama 2, DPO on Mistral-7B, quantization comparisons. Each notebook is self-contained and runs on free-tier Colab. The three-part structure — Fundamentals, LLM Scientist, LLM Engineer — means you can jump in at whatever level is appropriate.
The tooling that makes production fine-tuning practical is Unsloth, which advertises roughly 2-5x faster fine-tuning with up to 70% less memory than a vanilla Hugging Face setup. For larger-scale work, Axolotl handles multi-GPU setups and a range of training configurations cleanly.
One thing I'd add that most tutorials skip: dataset quality matters more than training duration. A well-curated 5,000-example instruction dataset will outperform a noisy 100,000-example one. Spend the time on filtering.
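As a starting point for that filtering, here is a deliberately simple sketch: drop examples with an empty instruction or a very short response, and drop duplicates after normalizing case and whitespace. The field names (`instruction`, `response`) and the 20-character threshold are assumptions for illustration. Real pipelines usually add near-duplicate detection and quality scoring on top.

```python
def normalize(text):
    # Lowercase and collapse whitespace so trivial variants compare equal.
    return " ".join(text.lower().split())

def filter_examples(examples, min_response_chars=20):
    seen = set()
    kept = []
    for ex in examples:
        instruction = normalize(ex.get("instruction", ""))
        response = ex.get("response", "").strip()
        # Drop examples with no instruction or a too-short response.
        if not instruction or len(response) < min_response_chars:
            continue
        # Drop exact duplicates of an instruction we've already kept.
        if instruction in seen:
            continue
        seen.add(instruction)
        kept.append(ex)
    return kept

raw = [
    {"instruction": "Summarize the report.",
     "response": "The report covers Q3 revenue and churn trends."},
    {"instruction": "summarize  the report.",
     "response": "A different answer to a duplicate prompt here."},
    {"instruction": "Translate to French: hello", "response": "bonjour"},
    {"instruction": "", "response": "An answer with no instruction attached."},
]
kept = filter_examples(raw)
```

Only the first example survives here. The second is a duplicate prompt, the third has a response under the length threshold, and the fourth has no instruction.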
Evaluation: the part that actually determines production quality
This is where I see the most underinvestment. Teams will spend weeks on fine-tuning and two days on evaluation, then be surprised when the model behaves inconsistently in production.
Goodhart's Law applies directly: once a benchmark becomes a target, it stops being a useful measure. BLEU and perplexity are both problematic in most generation contexts — good to understand, but not primary signals. The Chatbot Arena leaderboard is a useful general reference, but it doesn't tell you how your model performs on your tasks.
For production evaluation, Hamel Husain's approach is the most practical I've encountered: treat the agent or model as a black box first (did it satisfy the user's goal?), then move to step-level diagnostics once error analysis identifies which workflows fail most often. His Ultimate Evals FAQ is one of the best free resources in this space, covering LLM-as-judge scoring, trace analysis, and transition matrices for agentic workflows.
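One way to read "transition matrices" in practice is to count how often each step in an agent trace is followed by each other step, then look at where failures concentrate. This is my own minimal interpretation, not code from the FAQ. The step names and traces are made up for illustration.

```python
from collections import Counter

def transition_counts(traces):
    # Count (current_step, next_step) pairs across all traces.
    counts = Counter()
    for steps in traces:
        for current, nxt in zip(steps, steps[1:]):
            counts[(current, nxt)] += 1
    return counts

def share_into(counts, source, target):
    # Fraction of transitions leaving `source` that land on `target`.
    total = sum(n for (src, _), n in counts.items() if src == source)
    return counts[(source, target)] / total if total else 0.0

# Hypothetical traces: each is the ordered list of steps one run went through.
traces = [
    ["plan", "search", "answer"],
    ["plan", "search", "error"],
    ["plan", "search", "search", "error"],
]
counts = transition_counts(traces)
search_error_share = share_into(counts, "search", "error")
```

In this toy data, half of the transitions out of `search` end in `error`, which points at the search step as the first place to do step-level diagnostics.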
The other resource worth reading is the Survey on Evaluation of LLMs by Chang et al. — comprehensive coverage of what to evaluate, where, and how.
Quantization and inference: making models run where you need them
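Running a 7B parameter model in FP32 requires roughly 28GB of VRAM for the weights alone (7 billion parameters at 4 bytes each). Quantizing to 4-bit brings the weights down to about 3.5GB, or around 4-5GB in practice once quantization metadata and runtime overhead are included. That is the difference between requiring an A100 and running on a consumer GPU. This matters both for local deployment and for inference cost at scale.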
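The arithmetic behind those numbers is short enough to write down. This sketch counts weight storage only, using 1 GB = 10^9 bytes. It ignores the KV cache, activations, and the per-group scales that quantized formats store, which is why real 4-bit footprints land above the raw figure.

```python
def weight_memory_gb(n_params, bits_per_param):
    # bits -> bytes -> gigabytes (decimal GB).
    return n_params * bits_per_param / 8 / 1e9

n = 7e9  # 7B parameters

fp32 = weight_memory_gb(n, 32)
fp16 = weight_memory_gb(n, 16)
int4 = weight_memory_gb(n, 4)

for name, gb in [("FP32", fp32), ("FP16", fp16), ("4-bit", int4)]:
    print(f"{name}: {gb:.1f} GB")
```

That works out to 28 GB at FP32, 14 GB at FP16, and 3.5 GB at 4-bit before overhead.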
The practical options in 2026: GGUF/llama.cpp for CPU-friendly deployment (well-maintained, runs anywhere), GPTQ for GPU-only inference with strong speed, and AWQ if you need lower perplexity at the cost of more VRAM. For most use cases, GGUF with llama.cpp is the right starting point. The mlabonne quantization notebooks above cover all three with working code.
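For serving at scale, the HuggingFace docs on GPU inference optimization and LLM speed/memory optimization are the canonical references. Flash Attention (Dao et al.) is now standard in most model implementations. It still computes exact attention with O(n²) arithmetic, but it avoids materializing the full attention matrix, which cuts the extra memory from O(n²) to O(n) in sequence length and sharply reduces reads and writes to GPU memory. Understanding why that works is worth the time.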
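The core trick is an online softmax: process keys and values in blocks, keep a running maximum and running normalizer, and rescale the partial result whenever the maximum changes. The sketch below shows that idea for a single query in plain Python. It is not the Flash Attention kernel, which also tiles over queries and is written to keep blocks in fast on-chip memory. The 1/sqrt(d) scaling is omitted for brevity, and the function names are mine.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_row_naive(q, keys, values):
    # Reference: materialize every score for this query, then softmax.
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z
            for j in range(len(values[0]))]

def attention_row_blockwise(q, keys, values, block_size=2):
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what we've accumulated so far to the new maximum.
        # On the first block this is exp(-inf) = 0.0, which is harmless.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
naive = attention_row_naive(q, keys, values)
blockwise = attention_row_blockwise(q, keys, values)
```

Both functions do the same amount of arithmetic and agree up to floating-point rounding. The blockwise version just never holds more than one block of scores at a time.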
The production layer
Once you can train and serve a model, the next gap is system design: how do you build applications around LLMs that behave reliably?
Chip Huyen's Agents deep dive (roughly 8,000 words) is the most comprehensive framework I've found for thinking about agentic system design — environment, action space, planning, observability, guardrails. Her book AI Engineering (O'Reilly, 2024) covers the full production stack for teams deploying these systems at scale.
Anthropic's Building Effective Agents post is required reading for the orchestrator-workers and evaluator-optimizer patterns. These patterns come up constantly in real deployments. The multi-agent research system post on the same site covers actual engineering tradeoffs from production — not theory.
For teams working in regulated industries specifically, evaluation and observability aren't optional. A model that performs well on average but fails unpredictably on a small class of inputs is an operational liability. Build the eval harness before you build the feature.
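A minimal sketch of what such a harness can look like, assuming each test case carries a slice label and a check function. The case format and the `toy_model` stand-in are hypothetical. The point is reporting pass rates per slice rather than one average, so a failing class of inputs can't hide.

```python
from collections import defaultdict

def run_evals(model_fn, cases):
    # Returns pass rate per slice, not a single overall number.
    totals = defaultdict(int)
    passed = defaultdict(int)
    for case in cases:
        output = model_fn(case["input"])
        totals[case["slice"]] += 1
        if case["check"](output):
            passed[case["slice"]] += 1
    return {name: passed[name] / totals[name] for name in totals}

def toy_model(text):
    # Stand-in for a real model call: uppercases ASCII input,
    # but silently passes non-ASCII input through unchanged.
    return text.upper() if text.isascii() else text

cases = [
    {"slice": "ascii", "input": "hello", "check": lambda out: out == "HELLO"},
    {"slice": "ascii", "input": "ok", "check": lambda out: out == "OK"},
    {"slice": "ascii", "input": "abc", "check": lambda out: out == "ABC"},
    {"slice": "non_ascii", "input": "café", "check": lambda out: out == "CAFÉ"},
]
report = run_evals(toy_model, cases)
```

Overall this toy model passes three of four cases, which looks acceptable on average. The per-slice report shows that it passes every ASCII case and fails every non-ASCII one.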
How I'd sequence this
Starting from Python fluency but no ML background, I'd run this in roughly this order:
- Math foundations (3Blue1Brown linear algebra, Khan Academy or StatQuest for probability): two to three weeks
- Karpathy's nanoGPT + Alammar's illustrated posts — two weeks
- Raschka's book alongside the mlabonne fundamentals notebooks — four to six weeks
- Fine-tuning track: LoRA/QLoRA in Colab, then Axolotl for anything multi-GPU
- Evaluation setup before any production deployment
- Quantization as needed for your inference constraints
- Huyen's agents framework when you're ready to build systems
The temptation is to move fast through foundations and spend most of your time on the tooling layer. I'd resist that. The practitioners I've seen struggle most are the ones who can run a fine-tuning script but can't reason about why the model is failing.