Pedram Agand

LLMs Are Stateless. Here Is How to Build Systems That Are Not.

MemGPT and AutoGen exposed the same problem from different angles: long-horizon AI workflows need persistent state and coordination. Here are the patterns that h

2026-04-07 · 10 min read · agents, memory, architecture, multi-agent, production, llm

Every LLM inference call starts from scratch. The model has no memory of previous calls, no awareness that a workflow is in progress, and no concept of state that persists between requests. This is a feature for isolated tasks and a serious constraint for everything else.

One-shot classification, single-document summarization, and simple Q&A work fine with stateless models. The workflows that matter most in regulated industries — multi-step compliance review, iterative document processing, long-running agent pipelines — do not. They require state: what was established in step 1 needs to be available at step 7. Decisions made in one turn need to constrain the next.

Two projects tackled this problem and produced patterns worth understanding. MemGPT (now Letta) approached it from the memory side. AutoGen approached it from the coordination side. Neither tool is the point. The problems they exposed and the patterns they introduced are.

The Constraint Is Not Context Length

The instinctive response to "LLMs forget" is "use a bigger context window." The reasoning: if the model can see more tokens, it can hold more state. This is technically true and practically insufficient.

Standard self-attention is O(n²) with respect to sequence length. Doubling your context window roughly quadruples the attention compute for a full pass over that context, which works out to about double the attention cost per token. At 128k tokens, even modest usage patterns become expensive. At 1M tokens, cost-per-call for a real workflow becomes a budget line item.
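To make the scaling concrete, here is a back-of-the-envelope sketch. It assumes dense self-attention whose compute grows with the square of the token count, and it ignores constant factors and every non-attention cost, so treat the numbers as ratios rather than prices. The function name and the 4k baseline are illustrative choices, not something taken from a specific model.

```python
def relative_attention_cost(n_tokens: int, baseline_tokens: int = 4_000) -> float:
    """Attention compute relative to a baseline context, assuming cost grows as n^2."""
    return (n_tokens / baseline_tokens) ** 2


for n in (4_000, 8_000, 128_000, 1_000_000):
    ratio = relative_attention_cost(n)
    print(f"{n:>9,} tokens -> {ratio:>8,.0f}x the attention compute of a 4k context")
```

Under these assumptions a 128k context costs about 1,024 times the attention compute of a 4k context, and a 1M context about 62,500 times.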

Cost aside, context quality degrades with length. Models tend to attend disproportionately to tokens near the beginning and the end of the context. Information buried in the middle of a 200k-token context is attended to less reliably than the same information in a focused 4k context. You can observe this empirically: retrieval accuracy on long-context tasks drops measurably as context length grows beyond the model's effective attention range, which is often shorter than its technical maximum.

The conclusion: you cannot solve the stateless problem by stuffing everything into the context window. You need a different architecture.

The Memory Management Pattern

MemGPT's central insight was to treat context as a managed resource rather than a fixed buffer. The analogy to operating system virtual memory is precise: a process cannot hold all data in RAM simultaneously, so the OS pages data in and out of physical memory on demand. MemGPT applied the same structure to LLM context.

The architecture has two memory tiers:

Main memory (in-context): A structured block that always occupies part of the context window. It holds the model's current working state — a persona definition, a summary of the ongoing task, recent conversation turns, and a small set of facts the model needs for immediate reasoning. This block is bounded and always present.

Archival memory (external store): A persistent, searchable store outside the context window. It holds everything the workflow has produced or retrieved that might be needed later — past conversation summaries, retrieved documents, intermediate outputs, established facts from earlier steps. The model queries this store explicitly when it needs information that isn't in main memory.

The critical mechanism: the model itself decides what to move between tiers. It is given tools to write to archival memory, search archival memory, and update its main memory block. The LLM acts as its own memory manager. When the current task requires information from a previous step, the model issues a retrieval call, reads the result into main memory, and continues reasoning.

This is the pattern, stripped of the specific tool: a bounded working memory block + an external persistent store + explicit retrieval operations that the model controls.
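A minimal sketch of that pattern follows. It is not MemGPT's or Letta's actual API: the class name, the tool names (`working_set`, `archival_insert`, `archival_search`), and the tool-call shape are all hypothetical. A Python list with keyword matching stands in for a real persistent store and search index.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryTools:
    """Hypothetical tool surface that the model is allowed to call."""
    max_working_items: int = 4
    working: dict[str, str] = field(default_factory=dict)  # bounded, always in context
    archive: list[str] = field(default_factory=list)       # unbounded, outside context

    def working_set(self, key: str, value: str) -> str:
        # Refuse to grow past the bound; the caller must overwrite or archive first.
        if key not in self.working and len(self.working) >= self.max_working_items:
            return "error: working memory full; archive or overwrite an entry first"
        self.working[key] = value
        return "ok"

    def archival_insert(self, text: str) -> str:
        self.archive.append(text)
        return "ok"

    def archival_search(self, query: str, limit: int = 3) -> list[str]:
        # Naive keyword overlap stands in for embedding or full-text search.
        terms = set(query.lower().split())
        scored = [(len(terms & set(rec.lower().split())), rec) for rec in self.archive]
        ranked = sorted((p for p in scored if p[0] > 0), key=lambda p: p[0], reverse=True)
        return [rec for _, rec in ranked[:limit]]

    def dispatch(self, call: dict):
        """Route a model-issued tool call shaped like {"tool": name, "args": {...}}."""
        handlers = {
            "working_set": self.working_set,
            "archival_insert": self.archival_insert,
            "archival_search": self.archival_search,
        }
        return handlers[call["tool"]](**call["args"])

    def render_working_memory(self) -> str:
        """Text block that would be placed in every prompt."""
        return "\n".join(f"{k}: {v}" for k, v in self.working.items())


memory = MemoryTools(max_working_items=2)
memory.dispatch({"tool": "archival_insert", "args": {"text": "Step 1: vendor contract expires 2026-12-31"}})
memory.dispatch({"tool": "archival_insert", "args": {"text": "Step 2: data retention policy is 7 years"}})
hits = memory.dispatch({"tool": "archival_search", "args": {"query": "retention policy"}})
memory.dispatch({"tool": "working_set", "args": {"key": "retention", "value": hits[0]}})
print(memory.render_working_memory())
```

The working block refuses to grow past its bound instead of evicting silently, so whatever drives the tools has to decide explicitly what to archive or overwrite.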

What This Enables

A workflow that runs 50 steps does not accumulate 50 steps of history in the context window. It maintains a compact working memory that represents the current state of the task, and retrieves historical information on demand. The context window stays bounded regardless of workflow length.

For compliance workflows specifically, this maps naturally to how human reviewers work. A compliance analyst reviewing a 200-page document does not hold the entire document in active attention. They maintain a mental model of what they've found, refer back to specific sections when needed, and build up findings incrementally. The memory management pattern implements this operationally.

The Coordination Pattern

AutoGen's insight was different. Rather than solving how one model manages state across time, it addressed how multiple specialized models coordinate to solve problems that require different capabilities.

The core abstraction: agents as actors in a conversation. Each agent has a defined role — one generates code, one reviews it, one executes it and reports results. They communicate through a structured conversation that each agent can read and contribute to. The conversation history is the shared state.

Three coordination patterns emerge from this architecture:

Sequential pipeline: Agent A produces output, which becomes input for Agent B. Simple, predictable, testable. Works well when tasks decompose cleanly into stages. Fails when a downstream agent needs to send information back upstream — sequential pipelines have no feedback path.

Critic-actor loop: One agent generates (the actor), another evaluates (the critic). The critic's evaluation goes back to the actor, which revises. This continues for a fixed number of rounds or until the critic approves. The pattern implements iterative refinement without human involvement at each step.

Human-in-the-loop as a first-class agent: AutoGen treats human input as one participant in the conversation, not an external override. A UserProxyAgent can represent a human who approves decisions, provides missing information, or redirects the workflow. This is not a fallback — it is a design choice about where in the coordination flow human judgment belongs.
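The critic-actor loop described above can be sketched without any framework. This is not AutoGen code: `critic_actor_loop`, `Review`, and the two stub functions are hypothetical names, and the stubs stand in for real model calls so the control flow runs deterministically.

```python
from typing import Callable, NamedTuple


class Review(NamedTuple):
    approved: bool
    feedback: str


def critic_actor_loop(
    task: str,
    actor: Callable[[str, str], str],      # (task, feedback) -> draft
    critic: Callable[[str, str], Review],  # (task, draft) -> review
    max_rounds: int = 3,
) -> tuple[str, bool, int]:
    """Return (final draft, approved?, rounds used)."""
    draft, feedback = "", ""
    for round_number in range(1, max_rounds + 1):
        draft = actor(task, feedback)
        review = critic(task, draft)
        if review.approved:
            return draft, True, round_number
        feedback = review.feedback  # the critic's evaluation goes back to the actor
    return draft, False, max_rounds


def stub_actor(task: str, feedback: str) -> str:
    # Stand-in for a generation model: it adds a citation only when asked.
    draft = f"Summary of: {task}"
    if "citation" in feedback:
        draft += " [source: section 4.2]"
    return draft


def stub_critic(task: str, draft: str) -> Review:
    # Stand-in for an evaluation model with one strict rule.
    if "[source:" in draft:
        return Review(True, "")
    return Review(False, "Add a citation for the claim.")


result = critic_actor_loop("retention policy review", stub_actor, stub_critic)
print(result)
```

The loop always has a round limit. If the critic never approves, the caller gets back the last draft with `approved` set to `False` and can escalate, for example to a human participant.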

Why Role Separation Matters

Giving a single model both generation and evaluation responsibility tends to produce lower-quality results than separating the two roles. The evaluation model can apply stricter criteria because it has no investment in the output it is reviewing. The generation model can be less conservative because its output will be checked.

This mirrors how regulated industries already structure work. A loan underwriter generates a recommendation. A separate compliance officer reviews it. The roles are separated specifically because the reviewer's incentives and perspective differ from the generator's. Agent role separation implements the same structure.

Where Both Patterns Break

Neither pattern is free. Understanding the failure modes is as important as understanding the mechanisms.

Latency compounds. A single LLM call might take 500ms. A 10-step agent workflow with retrieval operations at each step can take 30–60 seconds end-to-end, because a step is rarely a single call: retrieval adds round-trips and agent coordination adds turns. For any workflow with latency requirements, measure the full chain, not individual calls, and budget for it explicitly.

Cost scales with coordination. Each agent turn is a separate inference call. A critic-actor loop that runs 5 rounds before approval makes at least 10 inference calls (one actor call and one critic call per round) for what appears to be one task. Archival memory retrieval adds embedding calls on top. At scale, these patterns can cost an order of magnitude more than a naive single-call approach.
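The call count is simple arithmetic, and it is worth writing down before building anything. The helper below is a hypothetical sketch. It assumes two agents per round (actor and critic) and, optionally, one embedding call per retrieval. It counts calls only and says nothing about tokens or prices.

```python
def loop_call_budget(
    rounds: int,
    agents_per_round: int = 2,
    retrievals_per_round: int = 0,
) -> dict[str, int]:
    """Count the calls behind what looks like a single task."""
    llm_calls = rounds * agents_per_round
    embedding_calls = rounds * retrievals_per_round
    return {
        "llm_calls": llm_calls,
        "embedding_calls": embedding_calls,
        "total_calls": llm_calls + embedding_calls,
    }


budget = loop_call_budget(rounds=5, retrievals_per_round=1)
print(budget)
```

With 5 rounds and one retrieval per round, this counts 10 LLM calls plus 5 embedding calls, against 1 call for the naive approach.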

Failure propagates and amplifies. In a sequential pipeline, an error in step 3 contaminates every downstream step. The model at step 7 receives corrupted input and reasons confidently from it. Without explicit validation at each handoff point, the first error produces cascading failures that can be hard to trace back to their origin.

Retrieval is not perfect. The memory management pattern depends on the model retrieving the right information from archival memory when it needs it. If the retrieval system returns the wrong document, the model reasons from wrong context with no awareness that it has the wrong context. Silent retrieval failures are more dangerous than visible exceptions.
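One way to turn a silent miss into a visible exception is to refuse low-scoring matches. The sketch below assumes the retrieval backend returns `(score, text)` pairs with higher scores meaning better matches. The names and the `0.75` threshold are hypothetical, and a threshold only catches low-confidence matches, not confidently wrong ones.

```python
from typing import Callable


class RetrievalMiss(Exception):
    """Raised when no retrieved record is a confident enough match."""


def retrieve_or_raise(
    query: str,
    search: Callable[[str], list[tuple[float, str]]],
    min_score: float = 0.75,
) -> str:
    results = search(query)
    if not results:
        raise RetrievalMiss(f"no results for {query!r}")
    best_score, best_text = max(results, key=lambda pair: pair[0])
    if best_score < min_score:
        raise RetrievalMiss(
            f"best match for {query!r} scored {best_score:.2f}, below {min_score:.2f}"
        )
    return best_text


def fake_search(query: str) -> list[tuple[float, str]]:
    # Stand-in for a real index; the scores are made up for the example.
    if "retention" in query:
        return [(0.41, "Vendor contract terms"), (0.92, "Data retention policy: 7 years")]
    return [(0.30, "Vendor contract terms")]


print(retrieve_or_raise("retention period", fake_search))
try:
    retrieve_or_raise("sanctions list", fake_search)
except RetrievalMiss as miss:
    print(f"retrieval failed loudly: {miss}")
```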

What Practitioners Should Build

The takeaway from these patterns is not "use MemGPT" or "use AutoGen." Both tools have evolved significantly — MemGPT is now Letta, AutoGen has been rewritten multiple times, and the landscape will look different again in 18 months. The patterns, however, are durable because the problems are durable.

For long-horizon workflows, implement explicit memory tiers. Define what belongs in your bounded working memory (current task state, active constraints, immediate context) and what belongs in external storage (history, retrieved documents, intermediate outputs). Give your system explicit operations for moving information between tiers. Do not rely on context window size to solve state management.

Use structured schemas for working memory. An unstructured accumulation of text in your "working memory" block degrades quickly. Define a schema: what fields does the current task state have, what types do they hold, how do they get updated. A structured representation is more reliable to read, update, and reason over than a free-form narrative.
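As an illustration, a working-memory schema for a document review task might look like the sketch below. The field names, the severity levels, and the `TaskState` and `Finding` classes are assumptions made for the example, not a prescribed format.

```python
import json
from dataclasses import asdict, dataclass, field

ALLOWED_SEVERITIES = {"low", "medium", "high"}


@dataclass
class Finding:
    document_id: str
    clause: str
    severity: str


@dataclass
class TaskState:
    task_id: str
    current_step: int
    total_steps: int
    active_constraints: list[str] = field(default_factory=list)
    findings: list[Finding] = field(default_factory=list)

    def add_finding(self, finding: Finding) -> None:
        # Updates go through methods so invalid values never reach the prompt.
        if finding.severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"unknown severity: {finding.severity!r}")
        self.findings.append(finding)

    def advance(self) -> None:
        if self.current_step >= self.total_steps:
            raise ValueError("task is already at its final step")
        self.current_step += 1

    def to_prompt_block(self) -> str:
        """Serialize the state for inclusion in the context window."""
        return json.dumps(asdict(self), indent=2)


state = TaskState(
    task_id="review-001",
    current_step=1,
    total_steps=3,
    active_constraints=["flag any retention period under 5 years"],
)
state.add_finding(Finding(document_id="doc-07", clause="4.2", severity="high"))
state.advance()
print(state.to_prompt_block())
```

Because every update goes through a method, a malformed value fails at write time instead of degrading the prompt several steps later.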

Separate generation from evaluation for high-stakes decisions. If your workflow produces a recommendation, a classification, or a decision that triggers downstream action, add a separate evaluation step. Make the evaluation model's prompt explicitly different from the generation prompt: it should be looking for errors, missing information, and violated constraints, not confirming the output. For compliance review pipelines, this separation is directly analogous to the four-eyes principle that many regulatory frameworks already require.

Design human-in-the-loop positions before you need them. The mistake is treating human review as an afterthought — a manual override that operators trigger when something goes wrong. The better design: identify in advance which decision gates require human judgment, and build explicit pause points into the workflow where the agent surfaces its state and waits for approval before continuing. In document processing pipelines, this typically means: before any external action (sending a notification, filing a report, triggering a downstream system), surface a summary and require explicit approval.
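A pause point can be as small as a function that refuses to run an external action without an explicit approval. In this sketch `PendingAction`, `run_with_approval`, and the `approve` callback are hypothetical names. In a real system the callback might block on a review queue or a UI rather than return immediately.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PendingAction:
    name: str
    summary: str                 # what the human sees before deciding
    execute: Callable[[], str]   # the external side effect, deferred until approval


def run_with_approval(
    action: PendingAction,
    approve: Callable[[PendingAction], bool],
) -> str:
    """Surface the action, wait for a decision, and only then execute."""
    if not approve(action):
        return f"blocked: {action.name} was not approved"
    return action.execute()


filed: list[str] = []


def file_report() -> str:
    filed.append("quarterly-report")
    return "report filed"


action = PendingAction("file_report", "File the quarterly report with 3 findings", file_report)
print(run_with_approval(action, approve=lambda a: False))
print(run_with_approval(action, approve=lambda a: True))
```

The side effect lives behind `execute`, so a rejected action never runs at all. That is the property an after-the-fact manual override cannot give you.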

Add validation at handoff points. Every time one agent (or one step) passes output to the next, validate the output against a schema before passing it. A structured validation step — does this output have the required fields, do the values fall within expected ranges, are there explicit error flags — catches failures before they propagate. This is the difference between a pipeline that fails loudly at step 3 and one that silently produces wrong answers at step 7.
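The three checks named above (required fields, value ranges, explicit error flags) fit in a short standard-library function. The field names, the `error` key convention, and `HandoffError` are assumptions made for this sketch. A schema library would serve the same purpose.

```python
class HandoffError(Exception):
    """Raised when one step's output is not safe to pass to the next step."""


def validate_handoff(
    output: dict,
    required: dict[str, type],
    ranges: dict[str, tuple[float, float]],
) -> dict:
    errors = []
    for name, expected_type in required.items():
        if name not in output:
            errors.append(f"missing field: {name}")
        elif not isinstance(output[name], expected_type):
            errors.append(f"wrong type for {name}: {type(output[name]).__name__}")
    for name, (low, high) in ranges.items():
        value = output.get(name)
        if isinstance(value, (int, float)) and not low <= value <= high:
            errors.append(f"{name}={value} outside [{low}, {high}]")
    if output.get("error"):
        errors.append(f"upstream error flag: {output['error']}")
    if errors:
        raise HandoffError("; ".join(errors))
    return output


REQUIRED = {"document_id": str, "risk_score": float}
RANGES = {"risk_score": (0.0, 1.0)}

ok = validate_handoff({"document_id": "doc-07", "risk_score": 0.82}, REQUIRED, RANGES)
print(ok)

try:
    validate_handoff({"document_id": "doc-08", "risk_score": 1.7}, REQUIRED, RANGES)
except HandoffError as err:
    print(f"step 3 failed loudly: {err}")
```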

The Regulated-Industry Connection

Compliance review, audit trail generation, multi-step document analysis — these workflows share three properties: they span many steps, they require persistent state across those steps, and the cost of a silent failure is high.

The memory management pattern directly addresses the span problem: a compliance workflow that processes 50 documents in sequence cannot hold all 50 in context simultaneously. It needs a state store that persists findings, a retrieval mechanism to access previous decisions when they're relevant to the current document, and a bounded working memory that represents what the reviewer currently knows.

The coordination pattern directly addresses the quality problem: a single model generating and self-reviewing a compliance determination produces worse results than a generation model and a separate evaluation model with different instructions. The separation is not cosmetic — it changes what the evaluation model attends to.

These patterns are not specific to any tool. They are the architecture of AI systems that handle real work reliably. The tools implementing them will change. The problems they solve will not.
