Building Hallucination Detection for LLMs: My Open Source Journey Contributing to Instructor

Ruthvik Bandari — Thu, 25 Dec 2025 09:29:24 GMT

How I identified a critical gap in one of Python's most popular LLM libraries and built two features to solve it.

Introduction

When I started exploring ways to contribute to open source, I wanted to find a project where I could make a real impact - not just fix typos or update documentation, but solve a genuine problem that developers face every day.

After weeks of research, I found that opportunity in Instructor, a Python library with 12,000+ GitHub stars and 3+ million monthly downloads. What I discovered was a critical gap that affects every developer working with LLMs: there was no way to know if the extracted data was actually true.

This is the story of how I researched, designed, and implemented two complementary features - GroundCheck and Confidence Scoring - that together provide a complete solution for LLM extraction reliability.

Part 1: The Research Phase

Understanding the Landscape

My journey began with a simple question: What problems do developers face when extracting structured data from LLMs?

I spent time:

Reading GitHub issues across popular LLM libraries
Analyzing discussions on Reddit, Twitter, and Discord
Reviewing academic papers on LLM reliability
Testing existing solutions and their limitations

The Gap I Discovered

Instructor is brilliant at what it does - it uses Pydantic to validate that LLM outputs match your expected schema. But I noticed something crucial:

Instructor validates STRUCTURE, not TRUTH.

Here's what I mean:

python

# Source text
source = "Invoice #12345 from Acme Corp. Total: $500"

# LLM extraction (passes all Pydantic validation!)
extracted = {
    "invoice_number": "12345",     # ✅ Actually in source
    "vendor": "Acme Corp",         # ✅ Actually in source
    "total": 500,                  # ✅ Actually in source
    "currency": "USD",             # ❌ HALLUCINATED - not in source!
    "payment_terms": "Net 30",     # ❌ HALLUCINATED - not in source!
}

Every field passes type validation. The JSON is perfectly formed. But two fields are completely fabricated by the LLM. In domains like healthcare, finance, or legal, this could be catastrophic.
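
To make the gap concrete, here is a minimal sketch of the simplest possible grounding check, applied to the invoice above. It is not the GroundCheck implementation: the helper name `ungrounded_fields` is made up for illustration, and it only does a case-insensitive substring search.

```python
source = "Invoice #12345 from Acme Corp. Total: $500"

extracted = {
    "invoice_number": "12345",
    "vendor": "Acme Corp",
    "total": 500,
    "currency": "USD",
    "payment_terms": "Net 30",
}


def ungrounded_fields(source_text: str, data: dict) -> list[str]:
    """Return the field names whose value does not appear verbatim in the source."""
    haystack = source_text.lower()
    return [
        name
        for name, value in data.items()
        if str(value).lower() not in haystack
    ]


flagged = ungrounded_fields(source, extracted)
print(flagged)  # ['currency', 'payment_terms']
```

Even this naive check separates the three grounded fields from the two fabricated ones, which is exactly the signal schema validation cannot provide.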

Validating the Problem

Before writing any code, I needed to confirm this was a real problem, not just a theoretical concern. I found:

No existing solution in Instructor for source grounding verification
Multiple GitHub issues from users asking about extraction reliability
Academic research reporting LLM hallucination rates of 15-30% in extraction tasks
Real-world incidents where hallucinated data caused business problems

The problem was real. Now I needed to design a solution.

Part 2: Designing the Solution

Research: How Do Humans Verify Information?

I started by thinking about how humans verify extracted information:

Exact matching - Is this exact phrase in the document?
Fuzzy matching - Is something similar in the document? (handles typos, OCR errors)
Numeric matching - Is this number present? (handles format differences like $1,234.56 vs 1234.56)
Semantic matching - Is the meaning present, even if worded differently?

This became the foundation for GroundCheck.

Research: How Do We Know When LLMs Are Uncertain?

For the second feature, I researched how to measure LLM confidence:

Method	Description	Pros	Cons
Temperature	Controls randomness	Easy to use	Doesn't measure confidence
Self-consistency	Run N times, check agreement	Accurate	Expensive (N API calls)
Verbalized confidence	Ask "how confident are you?"	Simple	Often inaccurate
Token logprobs	Actual token probabilities	Direct read of the model's own token probabilities	Requires parsing

I chose token logprobs because:

Zero extra API calls (data is already in the response)
Reflects the probability the model actually assigned to each token (a signal of certainty, not a guarantee of correctness)
Sub-millisecond processing time
No additional dependencies

The Two-Feature Architecture

I realized the best solution was two complementary features:

┌─────────────────────────────────────────────────────────┐
│                 LLM Extraction                          │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│  Feature 1: Confidence Scoring                          │
│  "How sure was the model when generating this?"         │
│  Uses: Token log probabilities                          │
│  Cost: Zero extra API calls                             │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│  Feature 2: GroundCheck                                 │
│  "Does this value actually exist in the source?"        │
│  Uses: Multi-strategy text matching                     │
│  Cost: Zero API calls (local processing)                │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│  Combined Reliability Score                             │
│  High confidence + Grounded = Reliable ✅               │
│  Low confidence OR Not grounded = Review needed ⚠️      │
└─────────────────────────────────────────────────────────┘
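
The decision rule in the last box can be sketched in a few lines. This is an illustration under my own naming: `FieldReliability` and `needs_review` are hypothetical, and the 0.85 default simply mirrors the threshold used in the integration example later in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldReliability:
    confidence: float  # model confidence in [0, 1], e.g. derived from token logprobs
    grounded: bool     # whether the value was found in the source text


def needs_review(field: FieldReliability, min_confidence: float = 0.85) -> bool:
    """Apply the rule from the diagram: low confidence OR not grounded."""
    return field.confidence < min_confidence or not field.grounded


print(needs_review(FieldReliability(confidence=0.97, grounded=True)))   # False
print(needs_review(FieldReliability(confidence=0.97, grounded=False)))  # True
print(needs_review(FieldReliability(confidence=0.40, grounded=True)))   # True
```

Either signal alone is enough to send a field to review; only fields that pass both checks are treated as reliable.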

Part 3: Implementation

Feature 1: GroundCheck

Core Architecture

python

class GroundCheck:
    """Verify extracted data is grounded in source text."""

    def verify(self, source_text: str, extracted_data: dict) -> VerificationResult:
        # For each field, try verification strategies in order:
        # 1. Exact match (fastest, highest confidence)
        # 2. Numeric match (handles formatting)
        # 3. Fuzzy match (handles typos/OCR)
        # 4. Semantic match (handles paraphrasing)
        pass
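
The "try strategies in order" idea from the comments above can be sketched as a small loop. Everything here is an assumption for illustration: `first_match`, `exact`, and `shared_words` are hypothetical names, `shared_words` is only a stand-in for a slower strategy, and the 0.8 threshold is borrowed from the decorator example later in this post.

```python
from typing import Callable, Optional

# A strategy returns a confidence in [0, 1]; 0.0 means "no match".
Strategy = Callable[[str, str], float]


def exact(source: str, value: str) -> float:
    """Cheapest check: case-insensitive substring search."""
    return 0.99 if value.lower() in source.lower() else 0.0


def shared_words(source: str, value: str) -> float:
    """Hypothetical slower fallback: fraction of the value's words present in the source."""
    wanted = value.lower().split()
    present = set(source.lower().split())
    return sum(word in present for word in wanted) / len(wanted) if wanted else 0.0


STRATEGIES: list[tuple[str, Strategy]] = [("exact", exact), ("shared_words", shared_words)]


def first_match(
    source: str,
    value: str,
    strategies: list[tuple[str, Strategy]],
    threshold: float = 0.8,
) -> tuple[Optional[str], float]:
    """Run strategies in order and stop at the first one that clears the threshold."""
    for name, strategy in strategies:
        score = strategy(source, value)
        if score >= threshold:
            return name, score
    return None, 0.0


doc = "Invoice #12345 from Acme Corp. Total: $500"
print(first_match(doc, "acme corp", STRATEGIES))   # ('exact', 0.99)
print(first_match(doc, "Corp. Acme", STRATEGIES))  # ('shared_words', 1.0)
print(first_match(doc, "USD", STRATEGIES))         # (None, 0.0)
```

Ordering the list from cheapest to most expensive means the slower strategies only run for values the fast ones could not place.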

Verification Strategies

1. Exact Matching

python

def _exact_match(self, source: str, value: str) -> tuple:
    """Case-insensitive verbatim search."""
    idx = source.lower().find(value.lower())
    if idx != -1:
        end = idx + len(value)
        evidence = source[idx:end]  # matched text, in its original casing
        return (0.99, evidence, (idx, end))
    return (0.0, None, None)

2. Numeric Matching

python

from typing import Any

def _numeric_match(self, source: str, value: Any) -> tuple:
    """Handle $1,234.56 vs 1234.56 variations."""
    # Extract all numbers from source
    # Compare with tolerance (0.01 for floats)
    # Return confidence 0.90-0.95
    ...  # outline only
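
Here is a minimal sketch of the numeric idea, assuming a simple regex is enough to find numbers in the source. The function name `find_number` and the regex are mine, not the code from the PR, and the sketch ignores currencies, percentages, and locale-specific separators.

```python
import re
from typing import Optional

# Matches numbers such as 500, 1,234.56, or 1234.56 (optional thousands separators).
_NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def find_number(source: str, value: float, tolerance: float = 0.01) -> Optional[str]:
    """Return the first number in `source` within `tolerance` of `value`, as written."""
    for match in _NUMBER.finditer(source):
        raw = match.group()
        if abs(float(raw.replace(",", "")) - value) <= tolerance:
            return raw
    return None


print(find_number("Total due: $1,234.56", 1234.56))                    # 1,234.56
print(find_number("Invoice #12345 from Acme Corp. Total: $500", 500))  # 500
print(find_number("Invoice #12345 from Acme Corp. Total: $500", 42))   # None
```

Returning the text as it appears in the source, rather than the parsed number, keeps the evidence readable when it is shown to a reviewer.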

3. Fuzzy Matching

python

def _fuzzy_match(self, source: str, value: str) -> tuple:
    """Using rapidfuzz for typos/OCR errors."""
    # Sliding window over source
    # Token sort ratio comparison
    # Return best match with confidence
    ...  # outline only
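
The sliding-window approach can be sketched with the standard library alone. Note the swap: this uses `difflib.SequenceMatcher` instead of rapidfuzz's token sort ratio, so the scores will differ from the real feature, and `best_fuzzy_match` is a hypothetical name.

```python
from difflib import SequenceMatcher


def best_fuzzy_match(source: str, value: str) -> tuple[float, str]:
    """Slide a window as wide as the value (in words) over the source; return the best (ratio, window)."""
    words = source.split()
    width = max(1, len(value.split()))
    best_score, best_window = 0.0, ""
    for start in range(max(1, len(words) - width + 1)):
        window = " ".join(words[start:start + width])
        score = SequenceMatcher(None, value.lower(), window.lower()).ratio()
        if score > best_score:
            best_score, best_window = score, window
    return best_score, best_window


# "Acrne" simulates a typical OCR confusion of "m" with "rn".
ocr_text = "Invoice from Acrne Corp. Total: $500"
score, window = best_fuzzy_match(ocr_text, "Acme Corp")
print(window, round(score, 2))
```

A value that is absent from the text, such as "Net 30", still gets some best window back, which is why a threshold on the score is needed before treating the match as evidence.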

4. Semantic Matching (Optional)

python

def _semantic_match(self, source: str, value: str) -> tuple:
    """Embedding-based similarity."""
    # Encode source sentences and value
    # Cosine similarity comparison
    # Return confidence 0.70-0.90
    ...  # outline only

Key Design Decisions

Cascading fallback: Try fastest method first, fall back to slower methods only if needed
Optional dependencies: Core features work without rapidfuzz or sentence-transformers
Field-level results: Know exactly which fields are problematic
Evidence extraction: Show what matched in the source

Feature 2: Confidence Scoring

The Math Behind It

LLMs generate tokens one at a time, each with a probability distribution. The logprob is the log of that probability:

logprob = -0.01  →  probability = e^(-0.01) = 0.99  →  Very confident
logprob = -1.00  →  probability = e^(-1.00) = 0.37  →  Somewhat confident
logprob = -3.00  →  probability = e^(-3.00) = 0.05  →  Not confident
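
The conversion is a single call to `math.exp`, since the logprobs are natural logarithms. The helper name `to_probability` is only for illustration.

```python
import math


def to_probability(logprob: float) -> float:
    """Convert a natural-log probability back to a probability in (0, 1]."""
    return math.exp(logprob)


for logprob in (-0.01, -1.00, -3.00):
    print(f"{logprob:+.2f} -> {to_probability(logprob):.2f}")
# -0.01 -> 0.99
# -1.00 -> 0.37
# -3.00 -> 0.05
```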

Implementation

python

from statistics import geometric_mean
from typing import Any


class ConfidenceScorer:
    """Calculate confidence from token logprobs."""

    def score(self, response: Any, extracted_data: dict) -> ConfidenceResult:
        # 1. Extract logprobs from response
        tokens = self.extract_logprobs_openai(response)

        # 2. Map tokens to fields
        field_tokens = self.map_tokens_to_fields(tokens, extracted_data)

        # 3. Calculate per-field confidence (geometric mean)
        field_confidence = {}
        for field, toks in field_tokens.items():
            probabilities = [t["probability"] for t in toks]
            field_confidence[field] = geometric_mean(probabilities)

        # 4. Return results with interpretation
        return ConfidenceResult(...)
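
Step 2, mapping tokens to fields, is the least obvious part, so here is one way it could work. This is a sketch under several assumptions and not the code from the PR: the input is a plain list of `(token, logprob)` pairs (for OpenAI chat completions these would come from `response.choices[0].logprobs.content`), the model's output is a flat JSON object, and each value is serialized the same way `json.dumps` would write it. The name `field_confidence` is hypothetical.

```python
import json
import math


def field_confidence(tokens: list[tuple[str, float]], data: dict) -> dict[str, float]:
    """Score each field by the tokens that overlap its value in the generated JSON."""
    text = "".join(token for token, _ in tokens)

    # Record the character span each token covers in the reconstructed text.
    spans, position = [], 0
    for token, logprob in tokens:
        spans.append((position, position + len(token), logprob))
        position += len(token)

    scores = {}
    for name, value in data.items():
        key = json.dumps(name)
        key_at = text.find(key)
        if key_at == -1:
            continue
        literal = json.dumps(value)
        start = text.find(literal, key_at + len(key))
        if start == -1:
            continue
        end = start + len(literal)
        logprobs = [lp for s, e, lp in spans if s < end and e > start]
        # exp(mean logprob) equals the geometric mean of the token probabilities.
        scores[name] = math.exp(sum(logprobs) / len(logprobs))
    return scores


tokens = [
    ('{"', -0.01), ("vendor", -0.02), ('":', -0.01), (' "', -0.01),
    ("Acme", -0.05), (" Corp", -0.10), ('",', -0.01), (' "', -0.01),
    ("currency", -0.02), ('":', -0.01), (' "', -0.2), ("USD", -2.3), ('"}', -0.3),
]
scores = field_confidence(tokens, {"vendor": "Acme Corp", "currency": "USD"})
print({name: round(score, 2) for name, score in scores.items()})
```

Tokens that straddle a value's boundary, such as the one holding the closing quote and comma, are counted toward that value here. A stricter version could trim them.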

Why Geometric Mean?

I chose geometric mean over arithmetic mean because it's more conservative:

python

# Arithmetic mean: [0.99, 0.99, 0.10] → 0.69
# Geometric mean:  [0.99, 0.99, 0.10] → 0.46

# The geometric mean correctly penalizes that one uncertain token
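
Both numbers are easy to reproduce with the standard library. The sketch also shows an equivalent form that is handy in practice: the geometric mean of the probabilities is the exponential of the average logprob, so the raw logprobs can be averaged directly.

```python
import math
from statistics import fmean, geometric_mean

probabilities = [0.99, 0.99, 0.10]

arithmetic = fmean(probabilities)
geometric = geometric_mean(probabilities)

# Same value computed from logprobs: exp(mean(log p)).
from_logprobs = math.exp(fmean([math.log(p) for p in probabilities]))

print(f"{arithmetic:.2f} {geometric:.2f}")  # 0.69 0.46
```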

Performance Optimization

The entire scoring process:

Zero API calls - Uses data already in the response
< 1ms processing - Simple math operations
Zero dependencies - Pure Python standard library

Part 4: Integration with Instructor

Making It Feel Native

I wanted users to be able to use these features as naturally as any other Instructor feature:

python

# Before: Just extraction
result = client.chat.completions.create(
    response_model=Invoice,
    messages=[...]
)

# After: Extraction + Reliability
from instructor import verify_extraction, score_confidence

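# create_with_completion returns the parsed model plus the raw completion,
# which carries the logprobs
result, response = client.chat.completions.create_with_completion(
    response_model=Invoice,
    messages=[...],
    logprobs=True  # Enable for confidence scoring
)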

# Check confidence
confidence = score_confidence(response, result.model_dump())

# Check grounding
grounding = verify_extraction(source_text, result.model_dump())

# Combined reliability
is_reliable = confidence.overall >= 0.85 and grounding.is_reliable

Multiple Integration Patterns

I implemented several ways to use GroundCheck:

1. Direct Function Call

python

result = verify_extraction(source_text, extracted_data)

2. Decorator Pattern

python

@with_grounding(source_text=document, threshold=0.8)
def extract_invoice():
    return client.chat.completions.create(...)
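
For readers curious how a decorator like this can be built, here is a simplified sketch. It is not the `with_grounding` implementation: `require_grounding` and `UngroundedFieldsError` are hypothetical names, the wrapped function returns a plain dict instead of calling an LLM, and the check is a bare substring search with no threshold.

```python
import functools
from typing import Any, Callable


class UngroundedFieldsError(ValueError):
    """Hypothetical error raised when extracted values are missing from the source."""


def require_grounding(source_text: str) -> Callable:
    """Decorator factory: reject results whose values are not found in `source_text`."""
    haystack = source_text.lower()

    def decorator(extract: Callable[..., dict]) -> Callable[..., dict]:
        @functools.wraps(extract)
        def wrapper(*args: Any, **kwargs: Any) -> dict:
            data = extract(*args, **kwargs)
            missing = [k for k, v in data.items() if str(v).lower() not in haystack]
            if missing:
                raise UngroundedFieldsError(f"not found in source: {missing}")
            return data

        return wrapper

    return decorator


document = "Invoice #12345 from Acme Corp. Total: $500"


@require_grounding(document)
def extract_vendor() -> dict:
    return {"vendor": "Acme Corp"}  # stand-in for an LLM call


@require_grounding(document)
def extract_currency() -> dict:
    return {"currency": "USD"}  # stand-in for a hallucinated field


print(extract_vendor())  # {'vendor': 'Acme Corp'}
try:
    extract_currency()
except UngroundedFieldsError as error:
    print(error)  # not found in source: ['currency']
```

The appeal of the decorator form is that the extraction function itself stays unchanged; the grounding check wraps around it.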

3. Wrapper Class

python

grounded = GroundedExtractor(client)
result = grounded.extract(response_model=Invoice, source_text=doc, ...)

4. Pydantic Validator

python

from typing import Annotated
from pydantic import BaseModel, BeforeValidator

class Invoice(BaseModel):
    vendor: Annotated[str, BeforeValidator(grounding_validator(source))]

Part 5: Testing Strategy

Test Categories

Category	Tests	Purpose
Basic Functionality	10	Core verification logic
Edge Cases	8	Empty inputs, Unicode, special chars
Real-world Scenarios	6	Invoice, medical, legal documents
Integration	6	Decorator, wrapper, validator
Performance	4	Processing time < 10ms
Error Handling	4	Graceful degradation

Example Test: Medical Record Hallucination

python

def test_medical_record(self):
    """High-stakes scenario test."""
    clinical_note = """
    Patient: John Smith, DOB 03/15/1980
    Vital Signs: BP 145/92, HR 88
    Plan: Start aspirin 81mg daily
    """

    extracted = {
        "patient_name": "John Smith",
        "medication": "aspirin 81mg",
        "allergies": "Penicillin",      # HALLUCINATED!
        "surgery_scheduled": "Yes",      # HALLUCINATED!
    }

    result = verify_extraction(clinical_note, extracted)

    # Must catch these dangerous hallucinations
    assert "allergies" in result.flagged_fields
    assert "surgery_scheduled" in result.flagged_fields

Results

============================= 38 passed in 4.48s =============================

All 38 tests pass consistently.

Part 6: Documentation

Good code deserves good documentation. I created:

Concept Documentation (docs/concepts/groundcheck.md, docs/concepts/confidence.md)
- Problem explanation
- Quick start guide
- API reference
- Best practices
Working Examples (examples/groundcheck/, examples/confidence/)
- Basic usage
- Real-world scenarios
- Mock mode for testing without API key
Inline Documentation
- Comprehensive docstrings
- Type hints throughout
- Usage examples in docstrings

Part 7: The Open Source Process

Forking and Setup

bash

# Fork the repository
gh repo fork 567-labs/instructor

# Clone and setup
git clone https://github.com/Ruthvik-Bandari/instructor.git
cd instructor
python -m venv venv
source venv/bin/activate
pip install -e ".[dev]"

Development Workflow

1. Create feature branch

bash

git checkout -b feature/groundcheck-verification

2. Implement incrementally
- Core functionality first
- Tests as I go
- Documentation at the end

3. Code quality checks

bash

ruff check instructor/groundcheck.py --fix
ruff format instructor/groundcheck.py
pytest tests/ -v

4. Submit PR with comprehensive description

Responding to Code Review

The Ellipsis bot provided automated review. Key feedback:

Issue	My Response
Duplicate fuzzy matching calls	Fixed: Store result for reuse
Hardcoded threshold in __post_init__	Fixed: Removed override
Wrong method label for complex fields	Fixed: Added AGGREGATE method

Part 8: Results and Impact

Contribution Statistics

Metric	Value
Lines of Code	2,297
Tests	38
Files Created	8
Documentation Pages	2
Examples	2

What Users Can Now Do

python

from instructor import (
    # Hallucination Detection
    GroundCheck,
    verify_extraction,
    HallucinationError,

    # Confidence Scoring
    ConfidenceScorer,
    score_confidence,
    ConfidenceLevel,
    LowConfidenceError,
)

# Complete reliability check
confidence = score_confidence(response, data)
grounding = verify_extraction(source_text, data)

if confidence.overall < 0.85:
    print(f"⚠️ Low confidence fields: {confidence.low_confidence_fields}")

if not grounding.is_reliable:
    print(f"⚠️ Hallucinated fields: {grounding.flagged_fields}")

Lessons Learned

1. Research Before Coding

Spending time understanding the problem space saved me from building the wrong thing.

2. Design for Integration

Features should feel native to the library, not bolted on.

3. Test Real Scenarios

Unit tests are good, but real-world scenario tests catch practical issues.

4. Document As You Go

Writing documentation helped me think through edge cases.

5. Optimize for Developer Experience

Zero extra API calls, optional dependencies, multiple integration patterns.

What's Next?

Potential future enhancements:

Async Support - async def verify() for high-throughput applications
Streaming Integration - Verify partial extractions as they stream
Custom Verification Strategies - Plugin architecture for domain-specific matching
Confidence Calibration - Historical accuracy tracking

Conclusion

What started as a desire to contribute to open source became a deep dive into LLM reliability. By identifying a real problem, researching solutions thoroughly, and implementing with care for developer experience, I was able to create something genuinely useful.

The combination of GroundCheck (hallucination detection) and Confidence Scoring (model certainty) provides a complete solution for knowing when to trust LLM extractions - critical for any production application.

PR #1968: github.com/567-labs/instructor/pull/1968

Ruthvik Bandari is a Master's student in Applied AI at Northeastern University. Connect with me on LinkedIn or GitHub.