Pedram Agand

Google's LangExtract Just Solved LLM Hallucinations

LangExtract is an open-source Python library from Google (~10,000 lines) that uses LLMs to extract structured, source-grounded information from unstructured text.

I spent a week reading through LangExtract's 10,000-line codebase because the paper didn't explain the alignment trick — and the alignment trick is the whole point. Most LLM extraction pipelines produce entities with no provenance: you get a result, but you can't verify where it came from in the source document. LangExtract solves this at the character level, and the engineering behind it is worth understanding if you're building anything that requires auditable AI output.

LangExtract on GitHub

What It Does

LangExtract extracts user-defined structured entities from unstructured text and maps every extraction back to its exact character position in the source. It's not limited to classic NER categories (PERSON, ORG, LOCATION). You define what to extract via natural language instructions and a few annotated examples — the library handles chunking, batched inference, output parsing, and alignment.

Every extraction produces:

  • extraction_class — a user-defined type ("medication", "character", "urgency_indicator")
  • extraction_text — the verbatim span from the source document
  • attributes — key-value metadata ({"dosage": "10mg", "frequency": "daily"})
  • char_interval — exact start/end character positions in the original text
  • alignment_status — confidence of grounding: exact, fuzzy, partial, or unaligned

The alignment_status field is what makes this useful for regulated contexts. An unaligned result tells you the LLM hallucinated something not present in the source. You can filter those out programmatically or flag them for human review.
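
Here's a minimal sketch of that filtering step. The Extraction dataclass below is a hypothetical stand-in that only mirrors the five fields listed above, not the library's own class, and the status strings are the four names used in this post.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class Extraction:
    """Hypothetical stand-in with the five fields described above."""
    extraction_class: str
    extraction_text: str
    attributes: dict = field(default_factory=dict)
    char_interval: Optional[Tuple[int, int]] = None  # None when unaligned
    alignment_status: str = "unaligned"


def split_by_grounding(extractions):
    """Separate grounded extractions from those that need human review."""
    grounded, review = [], []
    for item in extractions:
        if item.alignment_status == "unaligned":
            review.append(item)
        else:
            grounded.append(item)
    return grounded, review


source = "Start Lisinopril 10mg once daily for hypertension."
results = [
    Extraction("medication", "Lisinopril", {"dosage": "10mg"}, (6, 16), "exact"),
    Extraction("medication", "Metoprolol"),  # not in the source text
]
grounded, review = split_by_grounding(results)
```

Anything that lands in review has no span in the source, so it can go to a reviewer or be dropped, depending on how strict the pipeline needs to be.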

Basic Usage

import langextract as lx

result = lx.extract(
    text_or_documents="Start Lisinopril 10mg once daily for hypertension.",
    prompt_description="Extract medications, dosages, and frequencies.",
    examples=[...],
    model_id="gemini-2.5-flash",
)

The examples parameter is how you define the schema — no JSON schema files required. You provide annotated examples and LangExtract bootstraps the extraction schema from them. This is significantly lower friction than defining formal schemas upfront.
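
To make "bootstrapping a schema from examples" concrete, here's a rough illustration of the idea. It is not how LangExtract does it internally: the plain-dict example format and the infer_schema helper are assumptions made up for this sketch.

```python
from collections import defaultdict

# Hypothetical annotated example, written as plain dicts for illustration.
examples = [
    {
        "text": "Take Aspirin 81mg daily.",
        "extractions": [
            {
                "extraction_class": "medication",
                "extraction_text": "Aspirin",
                "attributes": {"dosage": "81mg", "frequency": "daily"},
            },
        ],
    },
]


def infer_schema(examples):
    """Collect each extraction class and the attribute keys seen for it."""
    schema = defaultdict(set)
    for example in examples:
        for extraction in example["extractions"]:
            schema[extraction["extraction_class"]].update(extraction["attributes"])
    return {cls: sorted(keys) for cls, keys in schema.items()}


schema = infer_schema(examples)
```

The point is that the classes and attribute keys already live in the annotations, so nobody has to write them down a second time in a schema file.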

The 6-Stage Pipeline

Input Text/URL
 → [1] Tokenization       (core/tokenizer.py)
 → [2] Chunking           (chunking.py)
 → [3] Prompt Construction (prompting.py)
 → [4] Batched LLM Inference (providers/gemini.py)
 → [5] Resolution + Alignment (resolver.py)
 → [6] Multi-Pass Merge + Emission (annotation.py)
 → AnnotatedDocument (with char-level grounding)

The single entry point is extract() in extraction.py. It orchestrates the factory, annotator, and resolver. Here's what each stage actually does.

Stage 1 — Tokenization

LangExtract ships two tokenizers. The default RegexTokenizer splits text into WORD, NUMBER, and PUNCTUATION tokens using a compiled regex — fast for Latin-script text. The UnicodeTokenizer uses the regex library's \X grapheme cluster pattern for CJK, Thai, Hangul, and emoji.

Critically, this is not an LLM tokenizer. It's used exclusively for alignment and sentence boundary detection — completely separate from whatever tokenization the LLM uses internally. Every token stores its CharInterval(start_pos, end_pos) back to the original string. This is the foundation of source grounding.
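
A small sketch of the idea, assuming a much simpler regex than the library's: every token keeps its start and end offsets, so any token can be sliced straight back out of the original string. The pattern and the Token class are illustrative, not LangExtract's.

```python
import re
from dataclasses import dataclass

# Assumed pattern: a run of letters, a run of digits, or one symbol.
TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")


@dataclass(frozen=True)
class Token:
    kind: str       # "WORD", "NUMBER", or "PUNCTUATION"
    start_pos: int  # inclusive offset into the original text
    end_pos: int    # exclusive offset into the original text


def tokenize(text):
    """Split text into typed tokens that remember their character span."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        piece = match.group()
        if piece.isdigit():
            kind = "NUMBER"
        elif piece.isalpha():
            kind = "WORD"
        else:
            kind = "PUNCTUATION"
        tokens.append(Token(kind, match.start(), match.end()))
    return tokens


text = "Lisinopril 10mg, daily."
tokens = tokenize(text)
```

Because whitespace is skipped but offsets are preserved, text[t.start_pos:t.end_pos] always returns the token exactly as it appeared in the source.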

Stage 2 — Chunking

Long documents get split into overlapping chunks so each LLM call fits within the context window. The chunker prefers breaking at line boundaries (tracked during tokenization) to avoid splitting mid-sentence.
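
A simplified sketch of line-boundary chunking, under two assumptions of mine: the budget is counted in characters, and overlap between chunks is left out to keep the example short. The function name and signature are hypothetical.

```python
def chunk_by_lines(text, max_chars):
    """Return (start, end) character spans that only break at line ends."""
    spans = []
    start = 0  # where the current chunk begins
    pos = 0    # running offset of the line being examined
    for line in text.splitlines(keepends=True):
        line_end = pos + len(line)
        # Close the current chunk if adding this line would overflow it.
        if pos > start and line_end - start > max_chars:
            spans.append((start, pos))
            start = pos
        pos = line_end
    if pos > start:
        spans.append((start, pos))
    return spans


text = "aaaa\nbbbb\ncccc\n"
spans = chunk_by_lines(text, max_chars=10)
```

Returning offsets instead of substrings matters: each chunk's start offset is what lets a position found inside a chunk be translated back to a position in the full document. A single line longer than max_chars becomes its own oversized chunk in this sketch.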

Stage 3 — Prompt Construction

Each chunk gets wrapped in a structured prompt that includes the extraction instructions, the few-shot examples, and the chunk text. The prompt format is designed to elicit structured JSON output that the resolver can parse reliably.
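
As a rough sketch, prompt assembly could look like the following. The "Text:" / "Output:" layout and the build_prompt helper are assumptions for illustration; the library's actual prompt template may differ.

```python
import json


def build_prompt(description, examples, chunk_text):
    """Assemble instructions, few-shot examples, and the chunk into one prompt."""
    parts = [description.strip(), ""]
    for example in examples:
        parts.append("Text: " + example["text"])
        # Each example shows the model the JSON shape expected back.
        parts.append("Output: " + json.dumps({"extractions": example["extractions"]}))
        parts.append("")
    parts.append("Text: " + chunk_text)
    parts.append("Output: ")
    return "\n".join(parts)


few_shot = [
    {
        "text": "Take Aspirin 81mg daily.",
        "extractions": [{"medication": "Aspirin", "dosage": "81mg"}],
    },
]
prompt = build_prompt(
    "Extract medications and dosages.",
    few_shot,
    "Start Lisinopril 10mg once daily.",
)
```

Ending the prompt on an open "Output: " line nudges the model to continue with JSON in the same shape as the examples.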

Stage 4 — Batched LLM Inference

Chunks are sent to the LLM in batches. The default provider is Gemini. The library handles rate limiting, retries, and response parsing at this layer.
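
The general pattern, batching plus retry with backoff, can be sketched with the standard library. Nothing here is the library's provider code: the model callable is a stub, and the retry count, delays, and use of RuntimeError as the transient failure are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_with_retry(model, prompt, max_attempts=3, base_delay=0.01):
    """Call the model, backing off exponentially on transient failures."""
    for attempt in range(max_attempts):
        try:
            return model(prompt)
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)


def infer_in_batches(model, prompts, batch_size=2):
    """Run prompts batch by batch, keeping outputs in input order."""
    outputs = []
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        for i in range(0, len(prompts), batch_size):
            batch = prompts[i:i + batch_size]
            outputs.extend(pool.map(lambda p: call_with_retry(model, p), batch))
    return outputs


calls = {}


def flaky_model(prompt):
    """Stub model: fails once per prompt, then echoes it in upper case."""
    calls[prompt] = calls.get(prompt, 0) + 1
    if calls[prompt] == 1:
        raise RuntimeError("simulated rate limit")
    return prompt.upper()


outputs = infer_in_batches(flaky_model, ["a", "b", "c"])
```

Keeping outputs in input order is what lets each response be paired with the chunk, and the chunk offset, it came from.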

Stage 5 — The Alignment Trick (The Important Part)

This is where LangExtract earns its position. After the LLM returns extracted entities, the resolver maps each extraction_text back to its character position in the source using three matching tiers, plus a fourth outcome for when all of them fail:

  1. Exact match: SequenceMatcher finds the extraction verbatim in the source. alignment_status: exact.
  2. Partial match: if the exact span isn't found, try matching a substring or normalized form. alignment_status: partial.
  3. Fuzzy sliding window: a sliding window scan using a Counter pre-check (fast character frequency filter) followed by SequenceMatcher.ratio() for similarity scoring. alignment_status: fuzzy.
  4. Unaligned: nothing matched above the threshold. The extraction is flagged as alignment_status: unaligned, which means the LLM produced text not present in the source.

The Counter pre-check is an efficiency trick: before running the expensive SequenceMatcher comparison, it checks whether the character frequency distributions of the candidate and window are similar enough to be worth comparing. This prunes most non-matches cheaply.
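
Here's a stripped-down sketch of the exact and fuzzy tiers, to show why the Counter pre-check is safe. It makes several simplifications that are mine, not the library's: exact matching uses str.find, the partial tier is skipped, the window slides over characters, and the 0.75 threshold is arbitrary.

```python
from collections import Counter
from difflib import SequenceMatcher


def align(extraction_text, source, threshold=0.75):
    """Return (status, span) for one extraction; span is (start, end) or None."""
    if not extraction_text:
        return "unaligned", None

    # Exact tier: the extraction appears verbatim in the source.
    start = source.find(extraction_text)
    if start != -1:
        return "exact", (start, start + len(extraction_text))

    # Fuzzy tier: slide a window the same width as the extraction.
    target = extraction_text.lower()
    target_counts = Counter(target)
    n = len(target)
    best_ratio, best_span = 0.0, None
    for i in range(max(1, len(source) - n + 1)):
        window = source[i:i + n].lower()
        # Pre-check: shared character counts cap the ratio SequenceMatcher
        # could return, so windows below the threshold are skipped cheaply.
        shared = sum((target_counts & Counter(window)).values())
        if 2 * shared / (n + len(window)) < threshold:
            continue
        ratio = SequenceMatcher(None, target, window, autojunk=False).ratio()
        if ratio > best_ratio:
            best_ratio, best_span = ratio, (i, i + len(window))

    if best_ratio >= threshold:
        return "fuzzy", best_span
    return "unaligned", None


source = "Start Lisinopril 10mg once daily for hypertension."
exact = align("Lisinopril 10mg", source)
fuzzy = align("lisinopril 10 mg", source)
missing = align("Metoprolol 50mg", source)
```

The pre-check never discards a real match: ratio() is twice the number of matched characters divided by the combined length, and the matched characters can't exceed the characters the two strings share.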

Stage 6 — Multi-Pass Merge

When num_extraction_passes > 1, the library runs N independent extraction passes over the same document and merges results using a first-pass-wins strategy to avoid duplicates:

def _merge_non_overlapping_extractions(all_extractions):
    merged = list(all_extractions[0])  # first pass wins
    for pass_extractions in all_extractions[1:]:
        for extraction in pass_extractions:
            if not any(_extractions_overlap(extraction, existing)
                       for existing in merged):
                merged.append(extraction)
    return merged

Overlap is determined by character interval intersection. Multiple passes improve recall — the LLM may miss entities on the first pass that it catches on the second.
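
The merge function above depends on an overlap check that isn't shown. A plausible sketch, assuming half-open (start, end) intervals, looks like this. The helper names and the dict-based extractions are hypothetical.

```python
def intervals_overlap(a, b):
    """True if two half-open (start, end) character intervals intersect."""
    return a[0] < b[1] and b[0] < a[1]


def merge_passes(all_passes):
    """First-pass-wins merge: later passes only add non-overlapping spans."""
    if not all_passes:
        return []
    merged = list(all_passes[0])
    for later_pass in all_passes[1:]:
        for item in later_pass:
            if not any(intervals_overlap(item["span"], kept["span"])
                       for kept in merged):
                merged.append(item)
    return merged


# Spans index into: "Start Lisinopril 10mg once daily for hypertension."
pass_1 = [{"text": "Lisinopril", "span": (6, 16)}]
pass_2 = [
    {"text": "Lisinopril 10mg", "span": (6, 21)},  # overlaps pass 1, dropped
    {"text": "hypertension", "span": (37, 49)},    # new span, kept
]
merged = merge_passes([pass_1, pass_2])
```

One consequence of first-pass-wins is visible here: the longer "Lisinopril 10mg" span from the second pass is discarded even though it is arguably more complete, because an earlier pass already claimed those characters.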

LangExtract vs Traditional NER

| Capability              | Traditional NER     | LangExtract                          |
|-------------------------|---------------------|--------------------------------------|
| Entity categories       | Fixed (pre-trained) | User-defined via examples            |
| Schema definition       | Model retraining    | Natural language + few-shot examples |
| Source grounding        | None                | Character-level, every extraction    |
| Hallucination detection | None                | unaligned status flag                |
| Multi-pass recall       | No                  | Yes (configurable passes)            |
| Multilingual support    | Model-dependent     | Unicode tokenizer for CJK/Thai/etc.  |

Where This Matters

The use cases where alignment_status pays off are any domain where you can't trust unverifiable AI output: clinical notes extraction, legal contract review, financial document parsing, compliance evidence gathering. The unaligned flag gives you a systematic way to route extractions to human review rather than hoping the model didn't hallucinate.

Schema bootstrapping from examples (rather than requiring formal JSON schema definitions) also significantly reduces the barrier to deployment. Domain experts can annotate examples; they don't need to write schema files.

The library is early — the API will change — but the core idea of character-level grounding as a first-class output is the right direction for production LLM extraction pipelines.
