Pedram Agand
ML / AI

LLM Architecture Is a Decision, Not a Default

Decoder-only, encoder-only, MoE, SSMs: each architecture solves different constraints. Here is the decision framework for choosing based on your use case, not co

2026-04-07·10 min read·llm, architecture, transformer, moe, ssm, deployment, decision

Most teams deploying LLMs use whatever GPT-4 or LLaMA variant is available without thinking carefully about whether that architecture is the right fit for the task. For many tasks, it is — decoder-only models are genuinely good at a wide range of things. But "decoder-only is the default" has become a habit rather than a decision, and it leads to real inefficiencies in production.

Different architectures were designed to solve different problems. Understanding the design constraints behind each family lets you choose based on your requirements rather than what's familiar. This post is that decision framework.

The Fundamental Split: Comprehension vs. Generation

Every major LLM architecture can be placed on a spectrum between two objectives.

Comprehension tasks — classification, entity recognition, similarity scoring, dense retrieval — require the model to produce a representation of the input that captures its meaning in full. These tasks benefit from bidirectional context: knowing what comes after a word is as useful as knowing what came before.

Generation tasks — text completion, instruction-following, summarization, agentic workflows — require the model to produce new content token by token. These tasks use causal (left-to-right) attention: each token can only attend to previous tokens, because future tokens don't exist yet when the current one is generated.

This split explains the first and most important architectural choice. If your task is discriminative, encoder-only models have a structural advantage. If your task is generative, decoder-only models are the right starting point. Encoder-decoder models try to do both and pay a cost in complexity.
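The difference between the two attention patterns can be sketched as a pair of masks. This is a toy illustration in plain Python, not code from any particular model; the function names and the example sentence are made up for the sketch.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every position, before or after it."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask, i):
    """Indices that position i is allowed to attend to under the given mask."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


tokens = ["the", "bank", "raised", "rates"]
n = len(tokens)

# Context available when building the representation of "bank" (index 1).
causal_ctx = [tokens[j] for j in visible_positions(causal_mask(n), 1)]
bidir_ctx = [tokens[j] for j in visible_positions(bidirectional_mask(n), 1)]

print(causal_ctx)  # ['the', 'bank']
print(bidir_ctx)   # ['the', 'bank', 'raised', 'rates']
```

Under the causal mask, "bank" is encoded without seeing "raised rates", which is exactly the context that disambiguates it.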

Decoder-Only: GPT Family

Causal attention means each token only attends to the tokens before it. The training objective is next-token prediction: given the sequence so far, predict the next token. This is a simple, scalable self-supervised objective that works on any text data without human labeling.
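As a rough sketch of why no labels are needed, any token sequence can be turned into training pairs by treating each prefix as the input and the following token as the target. The helper name `next_token_pairs` is invented for this illustration, and real training computes all positions in one batched pass.

```python
def next_token_pairs(token_ids):
    """Return (context, target) pairs: each prefix predicts the token that follows it."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


sequence = ["The", "cat", "sat", "down"]
pairs = next_token_pairs(sequence)

for context, target in pairs:
    print(context, "->", target)
```

A sequence of n tokens yields n - 1 supervised examples with no annotation step.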

The reason decoder-only models dominate generative tasks is that this training objective naturally produces in-context learning. By learning to continue text, the model learns to recognize patterns in the input and extend them — which is functionally what "following instructions" is. When you write a system prompt + user query, you're providing a context that the model completes. Few-shot examples work because they fit the same pattern.

When to use: generation, instruction-following, code completion, agentic workflows, multi-turn conversation. Any task where the output is an open-ended continuation of the input.

Weakness: inefficient for tasks that benefit from bidirectional context. A decoder-only model can technically do classification by generating a label, but it's processing the input with a structural handicap — it can't attend forward in the sequence. The representations it builds are optimized for prediction, not for capturing global input semantics.

Representative models: GPT-4, LLaMA 3, Mistral, Claude, Gemini.

Encoder-Only: BERT Family

Bidirectional attention means every token attends to every other token in the sequence simultaneously. The training objective is masked language modeling: randomly mask tokens and predict the masked tokens from context on both sides. This produces representations that capture full bidirectional context for each token.
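A minimal sketch of the masking step, assuming a whitespace-tokenized input and a simplified rule. The original BERT recipe selects about 15% of positions and replaces only 80% of those with the mask token (the rest are swapped for a random token or left unchanged); the version below always uses `[MASK]`, and `mask_tokens` is a hypothetical helper name.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with MASK and keep the originals as labels."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)    # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)   # position is not scored in the loss
    return masked, labels


tokens = "the bank raised interest rates".split()
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```

Because a masked position can sit anywhere in the sequence, the model has to use context on both sides to fill it in.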

For tasks where you need a fixed-size representation of the whole input — classification, semantic similarity, dense retrieval — encoder-only models produce better representations at lower inference cost than decoder-only models. The final hidden state or a pooled representation over all tokens captures global semantics more efficiently than a causal model can.
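To make "pooled representation" concrete, here is a small sketch of mean pooling followed by cosine similarity. The two-dimensional token vectors are invented for illustration; a real encoder would supply hidden states with hundreds of dimensions.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size vector for the whole input."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical per-token hidden states for two short inputs.
doc_a = [[1.0, 0.0], [0.8, 0.2]]
doc_b = [[0.9, 0.1], [1.0, 0.0], [0.7, 0.3]]

similarity = cosine(mean_pool(doc_a), mean_pool(doc_b))
print(round(similarity, 3))
```

The pooled vector has the same size regardless of input length, which is what makes it usable as an index key for retrieval.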

When to use: text classification, named entity recognition, semantic similarity scoring, sentence embeddings, dense passage retrieval (bi-encoders in RAG systems). Any task where the output is a label or vector derived from understanding the whole input.

Weakness: cannot generate open-ended text. An encoder-only model can fill in masked positions, but it has no autoregressive mechanism for producing a new sequence token by token; what it produces is representations. If your task requires any output beyond a fixed label or embedding vector, you need a different architecture.

Inference advantage: encoder-only models are typically much smaller (BERT-base is 110M parameters; BERT-large is 340M) and faster at inference than decoder-only models. For high-throughput classification pipelines, this difference compounds significantly.

Representative models: BERT, RoBERTa, DeBERTa, E5, BGE (the latter two are encoder-only models fine-tuned specifically for embeddings and retrieval).

Encoder-Decoder: T5 and BART

Encoder-decoder architectures separate input comprehension from output generation. The encoder processes the full input with bidirectional attention and produces a representation. The decoder, using causal attention + cross-attention into the encoder's representation, generates the output sequence.

The design premise is that some tasks have a natural input→output structure where full comprehension of the input should precede generation of the output. Translation is the prototypical case: understand the source sentence in full, then generate the target sentence. Summarization follows the same structure.

When to use: neural machine translation, abstractive summarization, structured extraction where the input is long and complex (e.g., extract specific fields from a lengthy document), question answering with long passages. Tasks where separating comprehension from generation is the natural framing.

Tradeoff: more complex than either encoder-only or decoder-only models. Requires running two stacks. In practice, decoder-only models with sufficient scale have largely closed the quality gap on summarization and structured extraction tasks, which has reduced encoder-decoder adoption for new deployments. T5-family models still have strong utility for tasks where you have abundant parallel data (input-output pairs) and want to fine-tune a smaller model efficiently.

Representative models: T5, FLAN-T5, BART, mT5.

Mixture of Experts: Sparse Activation

Mixture of Experts (MoE) is less an entirely different architecture than a modification to the feed-forward layers inside a transformer. Standard transformers pass every token through every parameter in every layer. MoE replaces dense feed-forward layers with a set of "expert" feed-forward networks and a router that selects which experts handle each token.
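The routing step can be sketched in a few lines. This is a simplified illustration with scalar "experts" and a hand-written logit vector; the choice to softmax over only the selected logits is one common option, and the function names are assumptions made for the sketch.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and normalize their weights to sum to 1."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Toy experts: expert s multiplies its input by s. Real experts are feed-forward networks.
experts = [lambda x, s=s: s * x for s in range(4)]
router_logits = [0.1, 2.0, -1.0, 1.0]  # would come from a learned linear layer

routed = route_top_k(router_logits)
print(routed)
print(moe_layer(1.0, router_logits, experts))
```

Experts 0 and 2 are never evaluated for this token, which is where the compute saving comes from.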

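The result: total parameter count scales independently from inference compute. A 100B-parameter model whose weights sit mostly in expert layers activates a little over 25B per forward pass if each token routes to 2 out of 8 experts per layer: a quarter of the expert weights, plus the attention and embedding weights that every token shares. Mixtral 8x7B has ~47B total parameters but uses ~13B per token in inference, roughly matching a 13B dense model's compute cost while having access to the representational capacity of a much larger model. All ~47B parameters still have to be held in memory, so the saving is in compute per token, not in memory footprint.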
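A back-of-envelope way to see where the Mixtral numbers come from is to split parameters into a shared part (attention, embeddings) and a per-expert part. The split below is inferred from the approximate totals quoted above under that simplifying assumption; it is not an official parameter breakdown, and both function names are invented for the sketch.

```python
def moe_param_counts(shared, per_expert, n_experts, top_k):
    """Total and per-token active parameters (in billions) for a simple MoE model."""
    total = shared + n_experts * per_expert
    active = shared + top_k * per_expert
    return total, active


def solve_split(total, active, n_experts, top_k):
    """Infer the shared and per-expert sizes implied by published total/active counts."""
    per_expert = (total - active) / (n_experts - top_k)
    shared = total - n_experts * per_expert
    return shared, per_expert


# Approximate figures from the text: ~47B total, ~13B active, 2 of 8 experts per token.
shared, per_expert = solve_split(47.0, 13.0, n_experts=8, top_k=2)
print(round(shared, 2), round(per_expert, 2))
print(moe_param_counts(shared, per_expert, n_experts=8, top_k=2))
```

Under this assumption each expert accounts for roughly 5.7B parameters rather than 7B, which is why "8x7B" does not add up to 56B: the attention and embedding weights are shared across experts rather than duplicated.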

When to use: when you need the quality of a large-parameter model but have inference budget constraints. MoE is particularly useful for serving a model across many different task types, since different experts can specialize on different kinds of input (published routing analyses for Mixtral suggest this specialization tends to follow token-level patterns more than clean topic or domain boundaries). GPT-4 is widely rumored (but not confirmed) to use an MoE architecture for this reason.

Tradeoff: expert routing adds communication overhead in distributed settings. If experts live on different GPUs or nodes, token dispatch requires inter-device communication per layer. For single-GPU inference, this overhead is small. For large-scale distributed deployments, routing overhead can be significant and requires careful systems engineering.

Load balancing is a training challenge: without explicit load-balancing loss terms, the router learns to send most tokens to a small subset of experts, defeating the purpose of MoE. Training MoE models reliably requires additional loss terms that encourage uniform expert utilization.
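One widely used form of such a term is the auxiliary loss from the Switch Transformer line of work: the number of experts times the sum over experts of (fraction of tokens sent to the expert) times (mean router probability for the expert). The sketch below assumes top-1 routing and omits the scaling coefficient; it illustrates that formula and is not the loss of any specific model named in this post.

```python
def load_balance_loss(router_probs, assignments, n_experts):
    """Auxiliary loss n_experts * sum_i f_i * P_i, assuming top-1 routing.

    router_probs: per-token probability distributions over experts.
    assignments:  the expert index each token was actually dispatched to.
    """
    n_tokens = len(router_probs)
    # f_i: fraction of tokens dispatched to expert i.
    f = [assignments.count(i) / n_tokens for i in range(n_experts)]
    # P_i: mean router probability assigned to expert i.
    p = [sum(probs[i] for probs in router_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))


balanced = load_balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], n_experts=2)
collapsed = load_balance_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], n_experts=2)
print(balanced, collapsed)
```

The loss is 1.0 when routing is uniform and grows as tokens pile onto one expert, so adding it to the training objective pushes the router back toward even utilization.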

Representative models: Mixtral 8x7B and 8x22B, DeepSeek-V2, Grok-1 (released with open weights as an MoE model), GPT-4 (rumored).

State Space Models: Linear Sequence Modeling

The transformer's core weakness — quadratic attention cost with sequence length — is what state space models (SSMs) address directly. Models like Mamba replace attention with a recurrent computation that scales linearly with sequence length. Each token updates a fixed-size hidden state, and that hidden state is used to process the next token.
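The recurrence can be sketched with a scalar state. This is a deliberately simplified linear state space model with fixed parameters `a`, `b`, `c`; Mamba uses a vector-valued state and makes the parameters depend on the input, so treat this as an illustration of the fixed-size-state idea only.

```python
def ssm_scan(inputs, a, b, c, h0=0.0):
    """Scalar linear recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h = h0
    outputs = []
    for x in inputs:
        h = a * h + b * x      # the state is overwritten, never appended to
        outputs.append(c * h)
    return outputs


# An impulse at t=0 decays through the state at rate a.
print(ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0))  # [2.0, 1.0, 0.5]
```

Each step does a constant amount of work and stores a single value `h`, however many tokens came before.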

This changes what sequence lengths are feasible. Because attention cost grows quadratically, the attention computation for a 32k-token sequence costs roughly 64x that of a 4k-token sequence (8x longer, so 64x more attention operations); the rest of the transformer still scales linearly. Linear SSMs reduce the sequence-mixing cost to 8x, the sequence-length multiplier itself. For tasks involving very long documents, genomic sequences, audio, or any domain where sequences are measured in tens or hundreds of thousands of tokens, this is not a marginal improvement.
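The figures in that comparison follow directly from the scaling exponents. A minimal sketch of the arithmetic, counting only the term that depends on sequence length and ignoring constants and all other layers:

```python
def attention_cost_ratio(short_len, long_len):
    """Relative cost of quadratic attention when the sequence grows."""
    return (long_len / short_len) ** 2


def linear_cost_ratio(short_len, long_len):
    """Relative cost of a linear-time sequence mixer when the sequence grows."""
    return long_len / short_len


print(attention_cost_ratio(4096, 32768))  # 64.0
print(linear_cost_ratio(4096, 32768))     # 8.0
```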

When to use: long sequences where transformer attention is computationally prohibitive on the available hardware budget; real-time inference with strict latency requirements, because the cost of generating each token is O(1) in sequence length (SSMs maintain a fixed hidden state, not a growing KV cache); and applications that need a constant memory footprint during generation.
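To put rough numbers on "growing KV cache" versus "fixed hidden state", the sketch below estimates both. The model dimensions are hypothetical (32 layers, 8 key/value heads of size 128, 16-bit values, and an arbitrary recurrent state size), chosen only to make the arithmetic easy to follow.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Memory for cached keys and values; grows linearly with sequence length."""
    # Factor 2: one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def fixed_state_bytes(n_layers, state_values_per_layer, bytes_per_value=2):
    """Memory for a recurrent state; independent of sequence length."""
    return n_layers * state_values_per_layer * bytes_per_value


GIB = 1024 ** 3
for seq_len in (4096, 32768):
    kv = kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128)
    print(seq_len, kv / GIB, "GiB of KV cache")

state = fixed_state_bytes(n_layers=32, state_values_per_layer=4096 * 16)
print(state / GIB, "GiB of recurrent state at any sequence length")
```

With these assumed dimensions the cache costs 128 KiB per token per sequence, so long contexts and large batches are bounded by memory well before they are bounded by compute.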

Tradeoff: SSMs are still catching up to transformers on complex reasoning tasks. The hidden state in an SSM is a fixed-size compression of all previous context. Transformers maintain explicit access to all previous tokens via the KV cache. For tasks that require precise recall of specific information from long context — legal document review, code analysis, multi-document synthesis — transformers with long context windows still tend to outperform SSMs of similar scale. The quality gap is narrowing but not closed as of early 2026.

Representative models: Mamba, Mamba-2, RWKV, Jamba (a hybrid attention + SSM architecture).

Retrieval-Augmented Architectures

RAG is typically presented as a pipeline technique rather than an architectural choice, but treating retrieval as an architecture decision rather than a preprocessing step changes how you design the system.

In a RAG setup, the model's parametric knowledge (baked into weights at training time) is supplemented by retrieved documents at inference time. This matters for two distinct reasons. First, knowledge can be updated without retraining — add new documents to the retrieval index. Second, the model can cite sources, making its reasoning auditable in regulated environments.

When to use: knowledge-intensive tasks where the base model's training data is insufficient (domain-specific knowledge), outdated (anything requiring current information), or where auditability requires traceable sources. Financial compliance, legal research, and medical applications are primary use cases — not because RAG is "safer," but because the retrieved context is visible and verifiable.

The architectural distinction is that retrieval happens inside the inference loop, not before it. In multi-step workflows, retrieval can be triggered multiple times with different queries based on what the model has already established. This is architecturally different from naively prepending a retrieved passage to a prompt.
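A toy sketch of retrieval inside the loop, under loud assumptions: the retriever ranks documents by word overlap, and `toy_planner` is a hard-coded stand-in for the model deciding what to look up next. Every name and document here is hypothetical; a real system would use an embedding index and an LLM call in those two places.

```python
def words(text):
    """Lowercased word set with trailing punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}


def retrieve(query, index, k=1):
    """Toy retriever: rank documents by the number of words shared with the query."""
    q = words(query)
    ranked = sorted(index, key=lambda doc: len(q & words(doc)), reverse=True)
    return ranked[:k]


def gather_context(question, index, plan_next_query, max_steps=3):
    """Retrieve, let the planner inspect what was found, and retrieve again if needed."""
    context, query = [], question
    for _ in range(max_steps):
        for doc in retrieve(query, index):
            if doc not in context:
                context.append(doc)
        query = plan_next_query(question, context)
        if query is None:      # the planner has what it needs
            break
    return context


def toy_planner(question, context):
    """Stand-in for the model: follow up on Oslo once it appears in the context."""
    has_city = any("Oslo" in doc for doc in context)
    has_country = any("capital" in doc for doc in context)
    return "Oslo capital country" if has_city and not has_country else None


index = [
    "Acme Corp is headquartered in Oslo.",
    "Oslo is the capital of Norway.",
    "Bananas are yellow.",
]
context = gather_context("Which country is Acme Corp headquartered in?", index, toy_planner)
print(context)
```

The second query only exists because of what the first retrieval returned, which a single prepended passage cannot reproduce.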

The Decision Framework

Given your use case, here is where to start:

Your task is classification, entity recognition, or producing embeddings: start with an encoder-only model (DeBERTa, E5, BGE). Only move to decoder-only if the task requires understanding instructions or nuance that an encoder model handles poorly after fine-tuning.

Your task is generation, instruction-following, or agentic: start with decoder-only. Choose model size based on your latency and cost constraints. If you need larger capacity at fixed compute, evaluate MoE options.

Your task involves long documents (>32k tokens) and you're compute-constrained: evaluate SSMs or hybrid architectures. If you need precise recall of specific facts within long documents, prefer long-context transformers. If you need pattern recognition or summarization over long sequences, SSMs are worth benchmarking.

Your task requires structured output from complex inputs (translation, structured extraction with long inputs): T5-family models remain competitive for tasks where you have fine-tuning data and want a lightweight deployed model. For zero-shot or few-shot tasks, modern decoder-only models have largely closed the gap.

Your knowledge base changes frequently or auditability is required: treat retrieval as an architectural component. Model selection is then about the generator, not the complete solution.
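The rules above can be written down as a small function, which is sometimes a useful way to check that a team agrees on them. This is only a restatement of the framework in code; the task labels, parameter names, and returned strings are all invented for the sketch.

```python
def recommend_architecture(task, input_tokens=0, compute_constrained=False,
                           needs_exact_recall=False, has_finetuning_data=False,
                           knowledge_changes_or_audited=False):
    """Return starting points for evaluation, mirroring the rules in the text."""
    picks = []
    if task in {"classification", "entity-recognition", "embeddings"}:
        picks.append("encoder-only")
    elif task in {"translation", "structured-extraction"}:
        picks.append("encoder-decoder" if has_finetuning_data else "decoder-only")
    else:  # generation, instruction-following, agentic
        picks.append("decoder-only")

    # Long inputs under a compute budget: the choice depends on the recall requirement.
    if input_tokens > 32_000 and compute_constrained:
        picks.append("long-context transformer" if needs_exact_recall
                     else "benchmark SSM or hybrid")

    if knowledge_changes_or_audited:
        picks.append("add retrieval")
    return picks


print(recommend_architecture("classification"))
print(recommend_architecture("agentic", knowledge_changes_or_audited=True))
print(recommend_architecture("generation", input_tokens=100_000, compute_constrained=True))
```

The output is a list because the choices compose: retrieval, for example, is added on top of whichever generator is selected.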

One last note on MoE: if you're choosing between a dense model and an MoE model at a similar per-token inference compute, the MoE model usually wins on quality. The additional parameter capacity is close to free in compute per token, but not in memory: every expert has to be loaded, so the hosting footprint tracks total parameters. The tradeoffs to watch are training stability, memory, and hosting complexity, not quality.

Architecture selection is not a one-time decision. As your use case evolves — longer context requirements, new modalities, auditability requirements — your architecture choices should evolve with it. The goal is to match the design constraints of the architecture to the requirements of the task, not to use the most familiar or most recently hyped option.
