Beyond CI/CD: A Technical Guide to Continuous Calibration for LLMs in Production

2026-01-26·4 min read·llm

In traditional software development, the "Input" is predictable, and the "Output" is deterministic. But in LLM-based applications, we face a dual-uncertainty problem:

User Uncertainty: Users interact with natural language interfaces in unpredictable ways (unlike clicking a fixed button).
Model Uncertainty: The model's output is probabilistic, so the same prompt can produce different responses from one call to the next.

If you deploy full automation on Day 1, you create a "Trust Debt." When the model hallucinates or fails, you lose the user immediately.

To solve this, we need to move from CI/CD (Continuous Integration/Deployment) to CC/CD (Continuous Calibration/Continuous Development).

What is CC/CD?

Continuous Development (CD): The process of scoping data and setting up the infrastructure.
Continuous Calibration (CC): The process of logging real interactions, analyzing semantic drift, and applying fixes to the prompt or RAG pipeline rather than just the code.

But how do you actually implement this? It requires a fundamental shift in how we view Agency.

The Agency-Control Lifecycle (The V1-V3 Framework)

In my experience building evaluation systems, the biggest trap is treating "Autonomy" as a default setting. Autonomy must be earned. We use a three-stage lifecycle to safely graduate a model from a toy to a tool.

V1: High Control, Low Agency (The Copilot)

Goal: Data Collection & Calibration.
Mechanism: The model provides suggestions (drafts), but the human must "Tab" to accept or edit.
Metric: We measure Acceptance Rate. If a user accepts the draft without editing, that is a positive signal. If they edit it heavily, we log the delta between the model's output and the user's final version. This delta becomes our training data for V2.
Example: A Marketing Assistant that only drafts email subject lines.
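
As a rough illustration of the V1 metric, the sketch below computes an Acceptance Rate from logged draft/final pairs and keeps the edited pairs for later analysis. The `DraftEvent` record and both function names are hypothetical, and "accepted" is assumed to mean the user kept the draft verbatim apart from surrounding whitespace.

```python
from dataclasses import dataclass


@dataclass
class DraftEvent:
    """One V1 interaction (hypothetical schema): the model's draft and what the user kept."""
    model_draft: str
    user_final: str

    @property
    def accepted(self) -> bool:
        # Assumption: "accepted" means unchanged apart from surrounding whitespace.
        return self.model_draft.strip() == self.user_final.strip()


def acceptance_rate(events: list[DraftEvent]) -> float:
    """Share of drafts the user accepted without editing."""
    if not events:
        return 0.0
    return sum(e.accepted for e in events) / len(events)


def edited_pairs(events: list[DraftEvent]) -> list[DraftEvent]:
    """Drafts the user changed; these (draft, final) pairs are the logged delta."""
    return [e for e in events if not e.accepted]


events = [
    DraftEvent("Save 20% this week", "Save 20% this week"),
    DraftEvent("Big news inside", "Your March invoice is ready"),
]
print(acceptance_rate(events))    # 0.5
print(len(edited_pairs(events)))  # 1
```

In a real system these events would come from the interaction log rather than an in-memory list.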

V2: Balanced (The Assistant)

Goal: Efficiency & Review.
Mechanism: The model generates larger blocks of work (e.g., a full email body, or unit tests for an entire function). The human reviews and approves.
Metric: Correction Density. How many edits are required per 100 tokens generated?
Example: A Coding Assistant that generates unit tests for your code, but waits for you to run them.
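
One way to approximate Correction Density is sketched below. It assumes whitespace-separated words stand in for tokens and counts an "edit" as any token the reviewer inserted, deleted, or replaced; both choices are simplifications for illustration, and a production version would likely use the model's own tokenizer.

```python
from difflib import SequenceMatcher


def correction_density(model_output: str, reviewed_output: str) -> float:
    """Edited tokens per 100 generated tokens (whitespace tokens, an assumption)."""
    generated = model_output.split()
    reviewed = reviewed_output.split()
    if not generated:
        return 0.0

    matcher = SequenceMatcher(None, generated, reviewed, autojunk=False)
    edited_tokens = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Count the larger side of each changed span as the number of edits.
            edited_tokens += max(i2 - i1, j2 - j1)

    return 100.0 * edited_tokens / len(generated)


draft = "a b c d e f g h i j"
final = "a b X d e f g h i j"
print(correction_density(draft, final))  # 10.0
```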

V3: High Agency (The Agent)

Goal: Scale & Speed.
Mechanism: The system acts autonomously. It launches the campaign or opens the Pull Request.
Metric: Outcome Success. Did the ad convert? Did the PR pass the build?
Note: You only reach V3 once V2 metrics stabilize within a safe statistical threshold.
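
The article does not specify what "stabilize within a safe statistical threshold" means numerically, so the following is only a sketch of one possible gate. It assumes you track Correction Density per week and graduate to V3 when the most recent weeks are both low on average and tightly clustered. The function name, window size, and limits are all hypothetical placeholders to be tuned per product.

```python
from statistics import mean, pstdev


def ready_for_v3(
    weekly_density: list[float],
    window: int = 4,          # assumption: look at the last 4 weeks
    max_mean: float = 2.0,    # assumption: at most 2 edits per 100 tokens on average
    max_spread: float = 0.5,  # assumption: standard deviation limit for "stable"
) -> bool:
    """Return True when recent V2 metrics are both low and stable."""
    if len(weekly_density) < window:
        return False  # not enough history to judge stability
    recent = weekly_density[-window:]
    return mean(recent) <= max_mean and pstdev(recent) <= max_spread


history = [9.0, 6.0, 3.0, 1.8, 1.6, 1.7, 1.5]
print(ready_for_v3(history))      # True
print(ready_for_v3(history[:4]))  # False
```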

The Technical Deep Dive: From Unit Tests to Statistical Evals

In standard engineering, a unit test checks: assert result == expected.

In AI engineering, this breaks down because the "expected" result is a distribution of acceptable answers, not a single string.

We must shift to Statistical Evals.
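
To make the contrast with `assert result == expected` concrete, here is a minimal sketch of a statistical eval: sample the model several times, apply a property check to each response, and assert on the pass rate instead of on one exact string. `fake_model` is a stand-in for a real model call, and the check and the 0.7 threshold are illustrative assumptions.

```python
from itertools import cycle
from typing import Callable


def pass_rate(
    generate: Callable[[str], str],
    check: Callable[[str], bool],
    prompt: str,
    n_samples: int = 20,
) -> float:
    """Fraction of sampled responses that satisfy the check."""
    passes = sum(check(generate(prompt)) for _ in range(n_samples))
    return passes / n_samples


# Stand-in for a non-deterministic model: cycles through canned responses.
_canned = cycle([
    "Refund issued to your card.",
    "Your refund is on its way.",
    "We sent the refund today.",
    "Please contact support.",
])


def fake_model(prompt: str) -> str:
    return next(_canned)


def mentions_refund(response: str) -> bool:
    return "refund" in response.lower()


rate = pass_rate(fake_model, mentions_refund, "Where is my refund?")
print(rate)  # 0.75
assert rate >= 0.7  # the eval passes on a threshold, not on an exact string
```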

1. Defining "Helpfulness" (The Rubric Strategy)

"Helpfulness" is too vague to measure. You must break it down into measurable dimensions using Rubrics. When designing these at Amazon or for clients, I always collaborate with SMEs (Subject Matter Experts) to define the specific dimensions of success.

Example Rubric for a Customer Support Bot:

Instead of a single 1-5 score, we evaluate three dimensions:

Empathy (0/1): Did the model acknowledge the user's frustration?
Factuality (0/1): Is the solution grounded in the retrieved context?
Resolution Step (0/1): Did it provide a clear next step?

Composite Score = (Empathy + Factuality + Resolution) / 3
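
The rubric above maps directly onto a small data structure. In this sketch the class and function names are hypothetical; the binary dimensions and the composite formula come from the text. Per-dimension averages are included because they show which dimension is dragging the composite down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricScore:
    """Binary rubric for one support-bot response (each field is 0 or 1)."""
    empathy: int
    factuality: int
    resolution: int

    def __post_init__(self) -> None:
        for name in ("empathy", "factuality", "resolution"):
            if getattr(self, name) not in (0, 1):
                raise ValueError(f"{name} must be 0 or 1")

    @property
    def composite(self) -> float:
        # Composite Score = (Empathy + Factuality + Resolution) / 3
        return (self.empathy + self.factuality + self.resolution) / 3


def dimension_means(scores: list[RubricScore]) -> dict[str, float]:
    """Average of each dimension across a batch of graded responses."""
    n = len(scores)
    return {
        "empathy": sum(s.empathy for s in scores) / n,
        "factuality": sum(s.factuality for s in scores) / n,
        "resolution": sum(s.resolution for s in scores) / n,
    }


batch = [RubricScore(1, 1, 0), RubricScore(0, 1, 1)]
print(dimension_means(batch))
# {'empathy': 0.5, 'factuality': 1.0, 'resolution': 0.5}
```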

2. Measuring The "Delta" (Semantic Similarity)

When a user rejects a model's output in V1 and writes their own, that is the most valuable signal you possess.

The Old Way: Compare the raw strings (exact match or Levenshtein distance). This penalizes paraphrases that mean the same thing, so it is a poor signal for semantic tasks.
The CC/CD Way: Use embeddings (e.g., OpenAI text-embedding-3-small or similar) and calculate the cosine similarity between the Model Draft and the User Final Edit.
If Similarity > 0.9: The model is calibrated.
If Similarity
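
The cosine similarity check can be sketched in a few lines. This example assumes the two embedding vectors have already been obtained from whichever embedding model you use (no API call is made here), and it applies the 0.9 threshold mentioned above. The function names are hypothetical, and the toy two-dimensional vectors are for illustration only.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_calibrated(
    draft_vec: Sequence[float],
    final_vec: Sequence[float],
    threshold: float = 0.9,  # threshold taken from the text above
) -> bool:
    """True when the model draft and the user's final edit are semantically close."""
    return cosine_similarity(draft_vec, final_vec) > threshold


# Toy vectors standing in for real embeddings of the draft and the final edit.
print(is_calibrated([1.0, 0.1], [1.0, 0.0]))  # True
print(is_calibrated([1.0, 0.0], [0.0, 1.0]))  # False
```

A fixed threshold is a starting point; similarity scores are distributed differently across embedding models, so it is worth checking the value against a sample of human-judged pairs.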

Conclusion

Building LLM applications is not just about prompt engineering; it is about system engineering.

By adopting a CC/CD mindset, you stop treating the model as a black box that either "works" or "doesn't." You start treating it as a probabilistic component that requires constant calibration. You don't guess whether your system is ready for autonomy; you let the data tell you when it has earned it.
