Pedram Agand

LLM Fine-Tuning Has Three Phases — Here Is How to Pick Yours

QLoRA made fine-tuning accessible. FSDP made it scalable. GRPO made SFT look expensive. The right approach depends on constraints your benchmark score won't tell you.

2026-04-07·10 min read·fine-tuning, qlora, fsdp, grpo, llm, training

The conversation about fine-tuning LLMs has shifted three times in roughly two years. First it was "fine-tuning is too expensive for most teams." Then it was "QLoRA changed that — any team with a GPU can fine-tune." Then it became "wait, maybe you don't need SFT at all."

Each shift reflects a real change in what's possible and what's efficient. The problem is that each wave of tooling and technique gets discussed in isolation, as if it replaced everything before it. It didn't. QLoRA is still the right answer for many problems. FSDP is necessary in specific scenarios. GRPO-based post-training is compelling but not universally applicable.

What a practitioner actually needs is a map of the tradeoff space — and the decision criteria to navigate it.

Phase One: QLoRA Made Fine-Tuning Accessible

Before QLoRA, fine-tuning a 7B parameter model required roughly 14GB of GPU memory in fp16 (float16) — just for the weights, before accounting for activations, optimizer states, and gradients. For a 70B model, you were looking at 140GB or more. That put meaningful fine-tuning out of reach for most teams.

QLoRA changed the arithmetic in two steps.

Step one: quantize the base model to 4-bit. A 7B model stored in 4-bit precision occupies roughly 3.5GB — a 4x reduction from fp16. The base model weights are frozen. You're not training them. You're loading them as a compressed reference.
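The weight-memory figures above follow from simple arithmetic. A minimal sketch, assuming 1 GB = 1e9 bytes and ignoring the small overhead of quantization constants (the function name `weight_memory_gb` is illustrative, not from any library):

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Ignores activations, gradients, optimizer states, and the small
    per-block constants that 4-bit quantization schemes store.
    """
    return num_params * bits_per_param / 8 / 1e9


for name, n in [("7B", 7e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(n, 16)
    four_bit = weight_memory_gb(n, 4)
    print(f"{name}: fp16 {fp16:.1f} GB, 4-bit {four_bit:.1f} GB, ratio {fp16 / four_bit:.0f}x")
```

This covers weights only; the real footprint during training is higher once activations and adapter state are included.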

Step two: attach low-rank adapter layers. LoRA (Low-Rank Adaptation) works by inserting small trainable matrices alongside the frozen weight matrices. Instead of updating a weight matrix W of dimensions (d × k), you learn two smaller matrices A (d × r) and B (r × k), where r is the rank — typically 8, 16, or 64. The update becomes W + AB. At rank 16, a 4096×4096 weight matrix (16.7M parameters) becomes two matrices of 4096×16 and 16×4096 — about 131K trainable parameters. That's a 99.2% reduction in the parameters you're actually updating.
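The parameter count and the W + AB update can be checked with a short sketch. It uses plain Python lists rather than a tensor library, and the zero initialization of B is a common convention assumed here so that training starts from the unmodified base model; the helper names are illustrative.

```python
import random


def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_param_counts(d: int, k: int, r: int) -> tuple:
    """Return (full, trainable) parameter counts for one d x k weight matrix."""
    return d * k, d * r + r * k


full, trainable = lora_param_counts(4096, 4096, 16)
print(full, trainable, f"{100 * (1 - trainable / full):.1f}% fewer")

# Tiny example of the effective weight W + AB, with W frozen.
d, k, r = 6, 4, 2
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]      # frozen base weights
A = [[rng.gauss(0, 0.01) for _ in range(r)] for _ in range(d)]   # trainable, d x r
B = [[0.0] * k for _ in range(r)]                                # trainable, r x k, zero at start

delta = matmul(A, B)                                             # d x k update, all zeros initially
W_eff = [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]
```

Only A and B would receive gradients; W is never updated.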

The combination means you can fine-tune a 7B model on a single 24GB GPU such as an RTX 3090 or 4090. You're keeping the base model frozen in compressed form and training only 0.1–1% of the parameter count, held as 16-bit (bf16 or fp16) adapters.

The quality tradeoff is real but bounded. 4-bit quantization introduces noise into the base model's representations. For most task-specific fine-tuning — document classification, instruction following, structured extraction — this noise is small relative to the signal gained from domain adaptation. For tasks that require precise numerical computation or very tight distribution matching, you can see measurable regression versus full fp16 fine-tuning.

When QLoRA Is the Right Choice

QLoRA fits best when: your dataset is under ~100K examples, your task is classification or instruction following rather than complex reasoning, you have one or two GPUs available, and inference latency isn't critically constrained. The adapter weights can be merged into the base model for inference, which removes the runtime overhead of keeping adapters separate. With a 4-bit base, merging usually means dequantizing to 16-bit first, so the merged model is no longer 4-bit unless you re-quantize it and re-check quality.

It fits poorly when: you need to fine-tune across a large diverse dataset where the low-rank bottleneck under-fits, the base model quantization causes quality regressions you can measure, or your dataset is large enough that the single-GPU memory advantage is offset by training time.

Phase Two: FSDP Made Fine-Tuning Scalable

Single-GPU QLoRA has a ceiling. Even with quantization, there are scenarios where you need to scale: very large datasets that require many training steps, larger base models (30B+), or cases where the quality loss from 4-bit quantization is unacceptable and you need fp16 or bf16 base weights.

PyTorch FSDP (Fully Sharded Data Parallel) is the mechanism for scaling across multiple GPUs. Understanding what it actually does — versus what naive data parallelism does — is prerequisite to knowing when you need it.

Naive data parallelism copies the full model to each GPU and sends different batches to each copy. Each GPU holds the complete model in memory. Memory per GPU is unchanged; you just process more data per step.

FSDP shards the model itself. Each GPU holds only a fraction of the model's parameters, optimizer states, and gradients. Before a forward or backward pass through a layer, FSDP gathers the full parameters for that layer across GPUs via an all-gather collective, computes, then discards the gathered parameters again. Each GPU never holds more than its shard plus one layer's worth of gathered parameters at a time.
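The gather-compute-discard cycle can be illustrated with a toy, single-process simulation. This is not the PyTorch FSDP API; `shard` and `all_gather` are hypothetical stand-ins that only track how many parameters a rank would hold at each point.

```python
def shard(params, world_size):
    """Split a flat parameter list into world_size contiguous shards."""
    n = -(-len(params) // world_size)  # ceiling division
    return [params[i * n:(i + 1) * n] for i in range(world_size)]


def all_gather(shards):
    """Stand-in for the all-gather collective: returns the full parameter list."""
    return [p for s in shards for p in s]


world_size = 4
layers = {"layer0": [0.1] * 8, "layer1": [0.2] * 8, "layer2": [0.3] * 8}

# Persistent state: each rank keeps only its shard of every layer.
sharded = {name: shard(p, world_size) for name, p in layers.items()}
resident_per_rank = sum(len(s[0]) for s in sharded.values())

peak_per_rank = 0
for name, shards in sharded.items():
    full = all_gather(shards)   # materialise one layer's full parameters
    _ = sum(full)               # placeholder for the layer's forward compute
    # Resident shards plus the temporary gathered copy (counted conservatively).
    peak_per_rank = max(peak_per_rank, resident_per_rank + len(full))
    del full                    # discard the gathered copy before the next layer

total = sum(len(p) for p in layers.values())
print(f"total={total} resident_per_rank={resident_per_rank} peak_per_rank={peak_per_rank}")
```

In a real run the sharding unit is whatever module FSDP wraps, and prefetching can keep more than one gathered unit alive at once, so treat the peak here as an idealized lower bound.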

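This changes the memory model entirely. With 8 GPUs and FSDP, a 70B model in fp16 (140GB of weights in total) requires roughly 17.5GB per GPU for the weights, which is within reach of a node of H100s or A100s. Gradients and optimizer states are sharded the same way and add to that figure for whatever parameters you actually train. Without sharding, each GPU would need the full 140GB of weights, more than any single consumer or entry-level GPU offers.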
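A rough per-GPU estimate makes the sharding arithmetic concrete. The 2 bytes per parameter matches fp16 weights; the 16 bytes per parameter for full fine-tuning is an assumption based on the usual mixed-precision Adam accounting (fp16 weights and gradients plus fp32 master weights and two Adam moments), and activations are ignored.

```python
def per_gpu_gb(num_params: float, num_gpus: int, bytes_per_param: float) -> float:
    """Evenly sharded memory per GPU in GB (1 GB = 1e9 bytes), activations excluded."""
    return num_params * bytes_per_param / num_gpus / 1e9


weights_only = per_gpu_gb(70e9, 8, 2)    # fp16 weights sharded over 8 GPUs
full_adam = per_gpu_gb(70e9, 8, 16)      # assumed: weights + grads + Adam states

print(f"weights only: {weights_only:.1f} GB per GPU")
print(f"full fine-tune with Adam (assumed 16 B/param): {full_adam:.1f} GB per GPU")
```

The gap between the two numbers is one reason sharded training of large models is usually paired with adapters rather than full-parameter updates.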

FSDP and QLoRA compose. You can shard a quantized model across GPUs, training LoRA adapters in full precision while the base model stays in 4-bit across shards. This combination — sometimes called QLoRA+FSDP — is what enables fine-tuning 70B models on multi-GPU setups without a datacenter budget.

What FSDP Changes About Training Complexity

FSDP is not a drop-in replacement for single-GPU training. The all-gather collectives add communication overhead. The performance depends heavily on interconnect bandwidth — NVLink within a node is fast; cross-node via InfiniBand is slower and introduces more tuning surface. Batch size, micro-batch size, gradient accumulation steps, and sharding strategy all interact.

The practical implication: if single-GPU QLoRA fits your model and dataset, use it. FSDP adds operational complexity that only pays off when you actually need the memory or throughput. The break-even point is roughly: base model larger than 13B parameters in fp16, or base model larger than 30B in 4-bit, or a training dataset large enough that single-GPU throughput becomes a bottleneck (typically over 500K examples with multi-epoch training).
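The break-even rule above can be written as a small check. The thresholds are copied from the paragraph and are rough guides rather than hard limits; the function name `fsdp_worth_it` is hypothetical.

```python
def fsdp_worth_it(params_billions: float, four_bit: bool,
                  num_examples: int, epochs: int = 1) -> bool:
    """Apply the rough break-even thresholds from the text."""
    size_limit = 30 if four_bit else 13          # billions of parameters
    too_big = params_billions > size_limit
    too_slow = num_examples > 500_000 and epochs > 1
    return too_big or too_slow


print(fsdp_worth_it(7, four_bit=True, num_examples=50_000))              # small model, small data
print(fsdp_worth_it(70, four_bit=True, num_examples=50_000))             # model too large
print(fsdp_worth_it(7, four_bit=True, num_examples=600_000, epochs=3))   # throughput-bound
```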

Phase Three: The Case Against SFT for Reasoning Tasks

Supervised fine-tuning — whether full fine-tuning, LoRA, or QLoRA — trains the model to imitate demonstrations. You provide (input, target output) pairs. The model learns to produce outputs that resemble the targets in your training set.

This works well when your target behavior is well-represented by demonstrations. It works poorly when the capability you want is reasoning that requires exploration — working through a problem step by step, verifying intermediate results, backtracking when an approach fails.

The problem with SFT for reasoning is called mode covering. A model trained to maximize likelihood over demonstration data learns to produce the average of what good responses look like. It doesn't learn the process of arriving at correct answers — it learns the surface form of correct answers. For tasks with a single clear correct answer (classification, extraction, summarization), this distinction doesn't matter much. For math, code generation, or multi-step logical inference, it does.

What Meta's Post-Training Research Shows

Meta's work on training reasoning capabilities in LLMs, including what underpins models like Llama 3's reasoning variants, demonstrated that you can induce reasoning behavior by optimizing for outcome correctness rather than demonstration imitation. The relevant technique is Group Relative Policy Optimization (GRPO), which was introduced by DeepSeek in the DeepSeekMath paper and later used to train DeepSeek-R1. It is a variant of PPO that replaces the learned value (critic) model with a baseline computed from a group of sampled responses; paired with a verifiable reward, it also removes the need for a separate learned reward model.

GRPO works by generating multiple candidate responses for each prompt, scoring them against a verifiable reward signal (correct vs incorrect for math problems, passing vs failing tests for code), and using the relative scores within the group to compute a policy gradient update. The model learns which types of responses tend to be correct — not by imitating demonstrations, but by experiencing the consequences of different response strategies.
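The "relative scores within the group" step is the core of the method and is easy to sketch. This assumes the commonly described formulation, where each reward is normalized by the group mean and standard deviation; the use of the population standard deviation and the `eps` term are choices made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight sampled responses to one math prompt: 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
adv = group_relative_advantages(rewards)
print([round(a, 2) for a in adv])

# If every sample gets the same reward, all advantages are zero
# and the prompt contributes no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Each advantage then weights the log-probability of its response in the policy gradient update, so correct responses are pushed up and incorrect ones pushed down relative to the group.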

The striking finding from Meta's research: reasoning capabilities could be induced by modifying a small fraction of the model's parameters — on the order of 1.3% — using GRPO with a curriculum of verifiable problems. The base model already contains the latent capability; post-training is steering it toward expressing that capability consistently.

This has a direct practical implication: for reasoning tasks, gathering high-quality demonstration data for SFT may be doing less work than it appears. A carefully constructed set of problems with verifiable outcomes — which is often cheaper to assemble than expert-annotated demonstrations — can be more effective training signal.

SFT Is Not Dead

GRPO and outcome-based optimization work when you have a reliable reward signal. That requires a verifiable criterion: the answer is correct or it isn't. Math problems have this property. Code has this property (tests pass or fail). Most business tasks do not.
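A verifiable reward is just a function that returns a score without human judgment. A minimal sketch of the two cases mentioned above, assuming exact-match grading for math and a hypothetical convention that candidate code defines a function named `solve`:

```python
def math_reward(model_answer: str, reference: str) -> float:
    """1.0 if the final answer matches the reference after light normalization."""
    def norm(s: str) -> str:
        return s.strip().rstrip(".").replace(",", "")
    return 1.0 if norm(model_answer) == norm(reference) else 0.0


def code_reward(source: str, tests: list) -> float:
    """1.0 if the candidate defines `solve` and passes every (args, expected) pair."""
    namespace = {}
    try:
        # WARNING: only execute untrusted model output inside a sandbox.
        exec(source, namespace)
        passed = all(namespace["solve"](*args) == expected for args, expected in tests)
        return 1.0 if passed else 0.0
    except Exception:
        return 0.0


good = "def solve(a, b):\n    return a + b"
bad = "def solve(a, b):\n    return a - b"
tests = [((1, 2), 3), ((0, 0), 0)]
print(math_reward(" 1,234. ", "1234"), code_reward(good, tests), code_reward(bad, tests))
```

Tasks like summarization have no equivalent of these functions, which is why they remain SFT territory.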

For a document classifier in a regulated financial services workflow, you don't have a verifiable reward signal — you have labeled examples. SFT is appropriate. For a contract clause extractor, you have structured targets — SFT is appropriate. For a document summarizer, verification is expensive and subjective — SFT from demonstrations is likely the better path.

GRPO becomes relevant when: the capability you want is reasoning or multi-step problem solving, you can construct or synthesize a large set of problems with automated verification, and you have the compute for multiple rollouts per prompt. GRPO typically generates 4–16 candidates per training prompt, which is roughly 4–16x the forward-pass compute of SFT, and usually more in wall-clock time because each candidate is generated sequentially, token by token.
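The rollout cost scales linearly with group size, which a back-of-the-envelope count shows. The prompt count and average completion length below are made-up illustrative values, not measurements.

```python
def rollout_tokens(num_prompts: int, group_size: int, avg_completion_tokens: int) -> int:
    """Tokens that must be sampled for one pass over the prompt set."""
    return num_prompts * group_size * avg_completion_tokens


for group_size in (4, 16):
    tokens = rollout_tokens(num_prompts=10_000, group_size=group_size, avg_completion_tokens=512)
    print(f"group size {group_size}: {tokens:,} generated tokens per pass")
```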

The Decision Framework

Given a fine-tuning task, the right approach follows from three constraints: your data, your compute, and the nature of the target capability.

Data constraint. Under 10K labeled examples: QLoRA with careful regularization. 10K–500K labeled examples: QLoRA, or LoRA on an unquantized 16-bit base, depending on quality requirements. Over 500K examples or requiring GRPO rollouts: you need multi-GPU infrastructure.

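Compute constraint. Single GPU (24GB): QLoRA on models up to 13B in 4-bit, or LoRA on models up to 7B in fp16. Multi-GPU single node (8×80GB): FSDP with LoRA adapters covers base models up to 70B in fp16 or 405B in 4-bit; updating every parameter of a 70B model with a standard Adam setup needs more than this once gradients and optimizer states are counted. Multi-node: reserved for very large models or very large datasets; the communication overhead needs to be justified by scale.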

Capability type. Classification, extraction, instruction following, domain adaptation: SFT (QLoRA or full, depending on scale). Reasoning, multi-step problem solving, code generation with automated tests: evaluate GRPO or STaR-style self-improvement. Mixed tasks with a combination of both: fine-tune for domain knowledge with SFT first, then apply preference optimization on top.

The Regulated Environment Consideration

If you're deploying in a regulated environment — financial services, legal, healthcare — there's a fourth constraint that overrides the others: auditability of training data.

GRPO-trained models are harder to audit. The training signal is the model's own outputs scored against a reward function, not a curated set of human-approved demonstrations. When a regulator asks "what data did you train this model on," the honest answer for a GRPO-trained model includes "synthetic rollouts generated during training." Some regulators are comfortable with this; many are not yet.

For regulated deployment, SFT from a controlled, documented dataset remains the defensible choice even when GRPO might produce better benchmark performance. The auditability cost of GRPO is real and organization-specific. Evaluate it against the capability benefit before choosing.
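The framework, including the auditability constraint, can be condensed into a small function. This is a sketch of the article's rules of thumb, not a substitute for evaluation; the `Task` fields and the returned strings are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Task:
    num_examples: int
    needs_reasoning: bool
    has_verifiable_reward: bool
    regulated: bool = False


def recommend(task: Task) -> str:
    """Map the data, capability, and auditability constraints to a starting point."""
    # Auditability overrides the rest: regulated deployments stay with documented SFT data.
    if task.needs_reasoning and task.has_verifiable_reward and not task.regulated:
        return "evaluate GRPO (multi-GPU for rollouts)"
    if task.num_examples < 10_000:
        return "QLoRA with careful regularization"
    if task.num_examples <= 500_000:
        return "QLoRA or LoRA, depending on quality requirements"
    return "SFT on multi-GPU infrastructure (FSDP)"


print(recommend(Task(5_000, needs_reasoning=False, has_verifiable_reward=False)))
print(recommend(Task(50_000, needs_reasoning=True, has_verifiable_reward=True)))
print(recommend(Task(50_000, needs_reasoning=True, has_verifiable_reward=True, regulated=True)))
```

The compute constraint is left out on purpose: it decides how you run the chosen method, not which method you choose.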

Putting It Together

Fine-tuning has three phases because each phase addressed a different constraint. QLoRA addressed memory — making large model adaptation accessible on commodity hardware. FSDP addressed scale — making it practical to train models that exceed single-GPU memory. GRPO addressed the SFT ceiling for reasoning — replacing imitation with outcome optimization.

These phases are cumulative, not sequential. A current production system might use QLoRA for a classification head, FSDP for training a larger generation model, and GRPO for a reasoning component — all within the same product.

The mistake is treating each technique as universally superior to its predecessors. The right question is not "which fine-tuning method is best" but "given my data volume, compute budget, target capability, and auditability requirements, which approach fits my constraints." Each of the three phases gives you a tool. The practitioner's job is knowing which tool the problem calls for.
