Google's LangExtract Just Solved LLM Hallucinations
LangExtract is an open-source Python library from Google (~10,000 lines) that uses LLMs to extract structured, source-grounded information from unstructured text.
I spent a week reading through LangExtract's 10,000-line codebase because the paper didn't explain the alignment trick — and the alignment trick is the whole point. Most LLM extraction pipelines produce entities with no provenance: you get a result, but you can't verify where it came from in the source document. LangExtract solves this at the character level, and the engineering behind it is worth understanding if you're building anything that requires auditable AI output.
What It Does
LangExtract extracts user-defined structured entities from unstructured text and maps every extraction back to its exact character position in the source. It's not limited to classic NER categories (PERSON, ORG, LOCATION). You define what to extract via natural language instructions and a few annotated examples — the library handles chunking, batched inference, output parsing, and alignment.
Every extraction produces:
- extraction_class: a user-defined type ("medication", "character", "urgency_indicator")
- extraction_text: the verbatim span from the source document
- attributes: key-value metadata ({"dosage": "10mg", "frequency": "daily"})
- char_interval: exact start/end character positions in the original text
- alignment_status: confidence of grounding (exact, fuzzy, partial, or unaligned)
The alignment_status field is what makes this useful for regulated contexts. An unaligned result means the resolver could not find the extracted text in the source, which is the signature of a hallucinated or heavily paraphrased extraction. You can filter those out programmatically or flag them for human review.
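A minimal sketch of that filtering step follows. The `Extraction` dataclass here is a hypothetical stand-in that mirrors the fields listed above, not the library's own class, and the status strings are the ones this article uses.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Extraction:
    # Hypothetical stand-in for the library's extraction object.
    extraction_class: str
    extraction_text: str
    char_interval: Optional[Tuple[int, int]]  # (start, end), end exclusive
    alignment_status: str  # "exact", "fuzzy", "partial", or "unaligned"


def split_by_grounding(extractions):
    """Separate grounded extractions from ones that need human review."""
    grounded, review = [], []
    for ex in extractions:
        if ex.alignment_status == "unaligned" or ex.char_interval is None:
            review.append(ex)
        else:
            grounded.append(ex)
    return grounded, review


source = "Start Lisinopril 10mg once daily for hypertension."
extractions = [
    Extraction("medication", "Lisinopril", (6, 16), "exact"),
    Extraction("medication", "Metformin", None, "unaligned"),  # not in source
]
grounded, review = split_by_grounding(extractions)
```

Anything in `review` goes to a person; anything in `grounded` can be checked mechanically by slicing the source with its `char_interval`.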
Basic Usage
import langextract as lx

result = lx.extract(
    text_or_documents="Start Lisinopril 10mg once daily for hypertension.",
    prompt_description="Extract medications, dosages, and frequencies.",
    examples=[...],
    model_id="gemini-2.5-flash",
)

The examples parameter is how you define the schema: no JSON schema files required. You provide annotated examples and LangExtract bootstraps the extraction schema from them. This is significantly lower friction than defining formal schemas upfront.
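To make "schema from examples" concrete, here is a rough sketch of how extraction classes and attribute keys could be collected from annotated examples. The `AnnotatedExample` and `ExampleExtraction` classes and the `infer_schema` helper are hypothetical names invented for illustration; the library's own example types and bootstrapping logic may look different.

```python
from dataclasses import dataclass, field


@dataclass
class ExampleExtraction:
    # Hypothetical: one labeled span inside an example text.
    extraction_class: str
    extraction_text: str
    attributes: dict = field(default_factory=dict)


@dataclass
class AnnotatedExample:
    # Hypothetical: an example text plus its labeled spans.
    text: str
    extractions: list


def infer_schema(examples):
    """Collect each extraction class and the attribute keys seen for it."""
    schema = {}
    for example in examples:
        for ex in example.extractions:
            # Updating a set with a dict adds the dict's keys.
            schema.setdefault(ex.extraction_class, set()).update(ex.attributes)
    return {cls: sorted(keys) for cls, keys in schema.items()}


examples = [
    AnnotatedExample(
        text="Take Metformin 500mg twice daily.",
        extractions=[
            ExampleExtraction(
                "medication",
                "Metformin",
                {"dosage": "500mg", "frequency": "twice daily"},
            ),
        ],
    )
]
schema = infer_schema(examples)
```

The point of the design is that a domain expert only has to label spans; the set of classes and attributes falls out of the labels.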
The 6-Stage Pipeline
Input Text/URL
→ [1] Tokenization (core/tokenizer.py)
→ [2] Chunking (chunking.py)
→ [3] Prompt Construction (prompting.py)
→ [4] Batched LLM Inference (providers/gemini.py)
→ [5] Resolution + Alignment (resolver.py)
→ [6] Multi-Pass Merge + Emission (annotation.py)
→ AnnotatedDocument (with char-level grounding)

The single entry point is extract() in extraction.py. It orchestrates the factory, annotator, and resolver. Here's what each stage actually does.
Stage 1 — Tokenization
LangExtract ships two tokenizers. The default RegexTokenizer splits text into WORD, NUMBER, and PUNCTUATION tokens using a compiled regex — fast for Latin-script text. The UnicodeTokenizer uses the regex library's \X grapheme cluster pattern for CJK, Thai, Hangul, and emoji.
Critically, this is not an LLM tokenizer. It's used exclusively for alignment and sentence boundary detection — completely separate from whatever tokenization the LLM uses internally. Every token stores its CharInterval(start_pos, end_pos) back to the original string. This is the foundation of source grounding.
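The idea of a tokenizer that remembers where every token came from can be sketched in a few lines. This is an illustrative approximation, not the library's RegexTokenizer: the regex, the `Token` class, and the `tokenize` function are assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    kind: str        # "WORD", "NUMBER", or "PUNCTUATION"
    start_pos: int   # offset of the first character in the source
    end_pos: int     # offset one past the last character


# One named group per token kind; whitespace is skipped between matches.
_TOKEN_RE = re.compile(
    r"(?P<NUMBER>\d+)|(?P<WORD>[^\W\d_]+)|(?P<PUNCTUATION>[^\w\s]|_)"
)


def tokenize(text):
    """Split text into tokens that each carry their character interval."""
    return [
        Token(m.group(), m.lastgroup, m.start(), m.end())
        for m in _TOKEN_RE.finditer(text)
    ]


text = "Start Lisinopril 10mg once daily."
tokens = tokenize(text)
```

Because every token keeps `start_pos` and `end_pos`, any run of tokens can be mapped back to an exact slice of the original string, which is what later alignment relies on.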
Stage 2 — Chunking
Long documents get split into overlapping chunks so each LLM call fits within the context window. The chunker prefers breaking at line boundaries (tracked during tokenization) to avoid splitting mid-sentence.
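A simplified sketch of boundary-aware chunking is shown here. It packs whole lines into chunks under a character budget and records each chunk's starting offset so positions can be translated back to the full document. Overlap between chunks is left out to keep the example short, and `chunk_by_lines` and `max_chars` are names chosen for illustration rather than the library's API.

```python
def chunk_by_lines(text, max_chars):
    """Greedily pack whole lines into chunks of at most max_chars characters.

    Returns (start_offset, chunk_text) pairs. A single line longer than
    max_chars becomes its own oversized chunk in this sketch.
    """
    chunks = []
    start = 0  # offset where the current chunk begins
    pos = 0    # offset where the next line begins
    for line in text.splitlines(keepends=True):
        line_end = pos + len(line)
        # Close the current chunk before a line that would overflow it.
        if line_end - start > max_chars and pos > start:
            chunks.append((start, text[start:pos]))
            start = pos
        pos = line_end
    if start < len(text):
        chunks.append((start, text[start:]))
    return chunks


document = "line one\nline two\nline three\n"
chunks = chunk_by_lines(document, max_chars=18)
```

An extraction found at offset `k` inside a chunk sits at `start_offset + k` in the original document, so grounding survives the split.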
Stage 3 — Prompt Construction
Each chunk gets wrapped in a structured prompt that includes the extraction instructions, the few-shot examples, and the chunk text. The prompt format is designed to elicit structured JSON output that the resolver can parse reliably.
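As a rough picture of what such a prompt could look like, the sketch below stitches instructions, few-shot examples, and the chunk into one string. The layout, the "Text:" and "Output:" labels, and the `build_prompt` function are assumptions for illustration only; the real template in prompting.py may be structured differently.

```python
import json


def build_prompt(description, examples, chunk_text):
    """Assemble instructions, few-shot examples, and the chunk into one prompt.

    `examples` is a list of (input_text, extractions) pairs, where
    extractions is a list of plain dicts.
    """
    parts = [description.strip(), ""]
    for example_text, extractions in examples:
        parts.append(f"Text: {example_text}")
        parts.append("Output: " + json.dumps({"extractions": extractions}))
        parts.append("")
    # The chunk goes last, with an open "Output:" for the model to complete.
    parts.append(f"Text: {chunk_text}")
    parts.append("Output:")
    return "\n".join(parts)


description = "Extract medications, dosages, and frequencies."
few_shot = [
    (
        "Take Metformin 500mg twice daily.",
        [{"extraction_class": "medication", "extraction_text": "Metformin"}],
    )
]
chunk = "Start Lisinopril 10mg once daily for hypertension."
prompt = build_prompt(description, few_shot, chunk)
```

Showing the examples in the same JSON shape the parser expects is what nudges the model toward output the resolver can read.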
Stage 4 — Batched LLM Inference
Chunks are sent to the LLM in batches. The default provider is Gemini. The library handles rate limiting, retries, and response parsing at this layer.
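The batching-plus-retry pattern can be sketched generically as below. Nothing here is the library's provider code: `batched`, `call_with_retries`, `run_inference`, and the `flaky_model` stub are hypothetical, and a real provider would catch its client's specific rate-limit errors rather than `RuntimeError`.

```python
import time
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of up to batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def call_with_retries(fn, batch, max_attempts=3, base_delay=1.0):
    """Call fn(batch), retrying on RuntimeError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(batch)
        except RuntimeError:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))


def run_inference(prompts, model_call, batch_size=2, base_delay=1.0):
    """Send prompts to model_call in batches and collect outputs in order."""
    outputs = []
    for batch in batched(prompts, batch_size):
        outputs.extend(call_with_retries(model_call, batch, base_delay=base_delay))
    return outputs


calls = {"n": 0}


def flaky_model(batch):
    # Stub model: fails once to simulate a transient error, then echoes.
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated transient error")
    return [p.upper() for p in batch]


outputs = run_inference(["a", "b", "c"], flaky_model, base_delay=0.0)
```

Keeping outputs in prompt order matters, because each response has to be matched back to the chunk (and chunk offset) it came from.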
Stage 5 — The Alignment Trick (The Important Part)
This is where LangExtract earns its keep. After the LLM returns extracted entities, the resolver maps each extraction_text back to its character position in the source using a three-tier alignment strategy, with an unaligned fallback when every tier fails:
- Exact match: SequenceMatcher finds the extraction verbatim in the source. alignment_status: exact.
- Partial match: if the exact span isn't found, try matching a substring or normalized form. alignment_status: partial.
- Fuzzy sliding window: a sliding window scan using a Counter pre-check (fast character frequency filter) followed by SequenceMatcher.ratio() for similarity scoring. alignment_status: fuzzy.
- Unaligned: nothing matched above the threshold. The extraction is flagged as alignment_status: unaligned, which means the LLM produced text not present in the source.
The Counter pre-check is an efficiency trick: before running the expensive SequenceMatcher comparison, it checks whether the character frequency distributions of the candidate and window are similar enough to be worth comparing. This prunes most non-matches cheaply.
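The tiers and the pre-check can be approximated with the standard library. This is a character-level sketch under several assumptions, not the resolver's actual code: it uses `str.find` for the exact tier (the article says the library uses SequenceMatcher there), treats a case-insensitive hit as the "normalized" partial tier, and the `align` function and 0.8 threshold are made up for the example.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def align(source, extraction_text, threshold=0.8):
    """Return (status, start, end) locating extraction_text in source."""
    if not extraction_text:
        return "unaligned", None, None

    # Tier 1: verbatim match.
    start = source.find(extraction_text)
    if start != -1:
        return "exact", start, start + len(extraction_text)

    # Tier 2: normalized match (here: case-insensitive only).
    match = re.search(re.escape(extraction_text), source, re.IGNORECASE)
    if match:
        return "partial", match.start(), match.end()

    # Tier 3: fuzzy sliding window over same-length character windows.
    target = extraction_text.lower()
    target_counts = Counter(target)
    n = len(extraction_text)
    best_ratio, best_start, best_end = 0.0, None, None
    for i in range(max(1, len(source) - n + 1)):
        raw = source[i:i + n]
        window = raw.lower()
        # Cheap pre-check: shared character counts give an upper bound on
        # SequenceMatcher.ratio(), so windows below it can be skipped.
        shared = sum((target_counts & Counter(window)).values())
        if 2 * shared / (len(target) + len(window)) < threshold:
            continue
        ratio = SequenceMatcher(None, target, window).ratio()
        if ratio > best_ratio:
            best_ratio, best_start, best_end = ratio, i, i + len(raw)
    if best_ratio >= threshold:
        return "fuzzy", best_start, best_end
    return "unaligned", None, None


source = "Start Lisinopril 10mg once daily for hypertension."
exact = align(source, "Lisinopril")
partial = align(source, "lisinopril")
fuzzy = align(source, "Lisinopril 10 mg")
missing = align(source, "Metformin 500mg")
```

The pre-check is safe to use as a filter because the number of matching characters SequenceMatcher can find never exceeds the number of characters the two strings have in common, so a window rejected by the Counter test could not have passed the threshold anyway.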
Stage 6 — Multi-Pass Merge
When num_extraction_passes > 1, the library runs N independent extraction passes over the same document and merges results using a first-pass-wins strategy to avoid duplicates:
def _merge_non_overlapping_extractions(all_extractions):
    merged = list(all_extractions[0])  # first pass wins
    for pass_extractions in all_extractions[1:]:
        for extraction in pass_extractions:
            if not any(_extractions_overlap(extraction, existing)
                       for existing in merged):
                merged.append(extraction)
    return merged

Overlap is determined by character interval intersection. Multiple passes improve recall: the LLM may miss entities on the first pass that it catches on the second.
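The overlap test itself is not shown above, so here is one plausible version together with a tiny end-to-end merge. It assumes half-open (start, end) intervals and represents each extraction as a (text, interval) tuple; `intervals_overlap` and `merge_non_overlapping` are illustrative names, not the library's helpers.

```python
def intervals_overlap(a, b):
    """True if half-open character intervals (start, end) share a position."""
    return a[0] < b[1] and b[0] < a[1]


def merge_non_overlapping(passes):
    """Keep everything from the first pass, then add only new, disjoint spans."""
    merged = list(passes[0])  # first pass wins
    for later_pass in passes[1:]:
        for text, interval in later_pass:
            if not any(intervals_overlap(interval, kept) for _, kept in merged):
                merged.append((text, interval))
    return merged


# Offsets refer to:
# "Start Lisinopril 10mg once daily for hypertension."
pass_1 = [("Lisinopril", (6, 16))]
pass_2 = [("Lisinopril 10mg", (6, 21)), ("hypertension", (37, 49))]
merged = merge_non_overlapping([pass_1, pass_2])
```

The second pass's "Lisinopril 10mg" is dropped because it overlaps a span the first pass already claimed, while "hypertension" is new territory and gets added.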
LangExtract vs Traditional NER
| Capability | Traditional NER | LangExtract |
|---|---|---|
| Entity categories | Fixed (pre-trained) | User-defined via examples |
| Schema definition | Model retraining | Natural language + few-shot examples |
| Source grounding | None | Character-level, every extraction |
| Hallucination detection | None | unaligned status flag |
| Multi-pass recall | No | Yes (configurable passes) |
| Multilingual support | Model-dependent | Unicode tokenizer for CJK/Thai/etc. |
Where This Matters
alignment_status pays off in any domain where you can't accept unverifiable AI output: clinical note extraction, legal contract review, financial document parsing, compliance evidence gathering. The unaligned flag gives you a systematic way to route extractions to human review rather than hoping the model didn't hallucinate.
Schema bootstrapping from examples (rather than requiring formal JSON schema definitions) also significantly reduces the barrier to deployment. Domain experts can annotate examples; they don't need to write schema files.
The library is early — the API will change — but the core idea of character-level grounding as a first-class output is the right direction for production LLM extraction pipelines.