<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Hallucination Detection for LLMs]]></title><description><![CDATA[Hallucination Detection for LLMs]]></description><link>https://hallucination-detection-for-llms.hashnode.dev</link><generator>RSS for Node</generator><lastBuildDate>Thu, 18 Jun 2026 16:58:54 GMT</lastBuildDate><atom:link href="https://hallucination-detection-for-llms.hashnode.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Building Hallucination Detection for LLMs: My Open Source Journey Contributing to Instructor]]></title><description><![CDATA[How I identified a critical gap in one of Python's most popular LLM libraries and built two features to solve it.

Introduction
When I started exploring ways to contribute to open source, I wanted to find a project where I could make a real impact - ...]]></description><link>https://hallucination-detection-for-llms.hashnode.dev/building-hallucination-detection-for-llms-my-open-source-journey-contributing-to-instructor</link><guid isPermaLink="true">https://hallucination-detection-for-llms.hashnode.dev/building-hallucination-detection-for-llms-my-open-source-journey-contributing-to-instructor</guid><category><![CDATA[northeastern university cps]]></category><category><![CDATA[groundchecks]]></category><category><![CDATA[hallucinations]]></category><category><![CDATA[AI Hallucinations]]></category><category><![CDATA[contribution to open source]]></category><category><![CDATA[instructor]]></category><category><![CDATA[applied ai]]></category><category><![CDATA[AI]]></category><category><![CDATA[ML]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[neural networks]]></category><dc:creator><![CDATA[Ruthvik Bandari]]></dc:creator><pubDate>Thu, 25 Dec 2025 09:29:24 GMT</pubDate><content:encoded><![CDATA[<p><em>How I identified a critical gap in one of Python's most popular LLM libraries and built two features to solve it.</em></p>
<hr />
<h2 id="heading-introduction">Introduction</h2>
<p>When I started exploring ways to contribute to open source, I wanted to find a project where I could make a <strong>real impact</strong> - not just fix typos or update documentation, but solve a genuine problem that developers face every day.</p>
<p>After weeks of research, I found that opportunity in <strong>Instructor</strong>, a Python library with 12,000+ GitHub stars and 3+ million monthly downloads. What I discovered was a critical gap that affects every developer working with LLMs: <strong>there was no way to know if the extracted data was actually true</strong>.</p>
<p>This is the story of how I researched, designed, and implemented two complementary features - <strong>GroundCheck</strong> and <strong>Confidence Scoring</strong> - that together provide a complete solution for LLM extraction reliability.</p>
<hr />
<h2 id="heading-part-1-the-research-phase">Part 1: The Research Phase</h2>
<h3 id="heading-understanding-the-landscape">Understanding the Landscape</h3>
<p>My journey began with a simple question: <em>What problems do developers face when extracting structured data from LLMs?</em></p>
<p>I spent time:</p>
<ul>
<li><p>Reading GitHub issues across popular LLM libraries</p>
</li>
<li><p>Analyzing discussions on Reddit, Twitter, and Discord</p>
</li>
<li><p>Reviewing academic papers on LLM reliability</p>
</li>
<li><p>Testing existing solutions and their limitations</p>
</li>
</ul>
<h3 id="heading-the-gap-i-discovered">The Gap I Discovered</h3>
<p>Instructor is brilliant at what it does - it uses Pydantic to validate that LLM outputs match your expected schema. But I noticed something crucial:</p>
<p><strong>Instructor validates STRUCTURE, not TRUTH.</strong></p>
<p>Here's what I mean:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Source text</span>
source = <span class="hljs-string">"Invoice #12345 from Acme Corp. Total: $500"</span>

<span class="hljs-comment"># LLM extraction (passes all Pydantic validation!)</span>
extracted = {
    <span class="hljs-string">"invoice_number"</span>: <span class="hljs-string">"12345"</span>,     <span class="hljs-comment"># ✅ Actually in source</span>
    <span class="hljs-string">"vendor"</span>: <span class="hljs-string">"Acme Corp"</span>,         <span class="hljs-comment"># ✅ Actually in source</span>
    <span class="hljs-string">"total"</span>: <span class="hljs-number">500</span>,                  <span class="hljs-comment"># ✅ Actually in source</span>
    <span class="hljs-string">"currency"</span>: <span class="hljs-string">"USD"</span>,             <span class="hljs-comment"># ❌ HALLUCINATED - not in source!</span>
    <span class="hljs-string">"payment_terms"</span>: <span class="hljs-string">"Net 30"</span>,     <span class="hljs-comment"># ❌ HALLUCINATED - not in source!</span>
}
</code></pre>
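<p>Every field passes type validation. The JSON is perfectly formed. But two fields have no support in the source: <code>currency</code> is at best inferred from the <code>$</code> sign, and <code>payment_terms</code> is invented outright. In domains like healthcare, finance, or legal, this could be catastrophic.</p>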
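<p>A few lines of plain Python are enough to see the gap. The sketch below is a deliberately naive check, not the GroundCheck implementation: it lowercases everything and looks for each extracted value as a substring of the source. The helper name <code>naive_ungrounded_fields</code> is made up for this illustration.</p>

```python
source = "Invoice #12345 from Acme Corp. Total: $500"
extracted = {
    "invoice_number": "12345",
    "vendor": "Acme Corp",
    "total": 500,
    "currency": "USD",
    "payment_terms": "Net 30",
}


def naive_ungrounded_fields(source_text, data):
    """Return the fields whose value never appears verbatim in the source."""
    haystack = source_text.lower()
    return [
        field
        for field, value in data.items()
        if str(value).lower() not in haystack
    ]


ungrounded = naive_ungrounded_fields(source, extracted)
print(ungrounded)  # ['currency', 'payment_terms']
```

<p>Substring search alone misses reformatted numbers and typos, which is why a single strategy is not enough.</p>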
<h3 id="heading-validating-the-problem">Validating the Problem</h3>
<p>Before writing any code, I needed to confirm this was a real problem, not just a theoretical concern. I found:</p>
<ol>
<li><p><strong>No existing solution</strong> in Instructor for source grounding verification</p>
</li>
<li><p><strong>Multiple GitHub issues</strong> from users asking about extraction reliability</p>
</li>
<li><p><strong>Academic research</strong> confirming LLM hallucination rates of 15-30% in extraction tasks</p>
</li>
<li><p><strong>Real-world incidents</strong> where hallucinated data caused business problems</p>
</li>
</ol>
<p>The problem was real. Now I needed to design a solution.</p>
<hr />
<h2 id="heading-part-2-designing-the-solution">Part 2: Designing the Solution</h2>
<h3 id="heading-research-how-do-humans-verify-information">Research: How Do Humans Verify Information?</h3>
<p>I started by thinking about how humans verify extracted information:</p>
<ol>
<li><p><strong>Exact matching</strong> - Is this exact phrase in the document?</p>
</li>
<li><p><strong>Fuzzy matching</strong> - Is something similar in the document? (handles typos, OCR errors)</p>
</li>
<li><p><strong>Numeric matching</strong> - Is this number present? (handles format differences like $1,234.56 vs 1234.56)</p>
</li>
<li><p><strong>Semantic matching</strong> - Is the meaning present, even if worded differently?</p>
</li>
</ol>
<p>This became the foundation for <strong>GroundCheck</strong>.</p>
<h3 id="heading-research-how-do-we-know-when-llms-are-uncertain">Research: How Do We Know When LLMs Are Uncertain?</h3>
<p>For the second feature, I researched how to measure LLM confidence:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Method</td><td>Description</td><td>Pros</td><td>Cons</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Temperature</strong></td><td>Controls randomness</td><td>Easy to use</td><td>Doesn't measure confidence</td></tr>
<tr>
<td><strong>Self-consistency</strong></td><td>Run N times, check agreement</td><td>Accurate</td><td>Expensive (N API calls)</td></tr>
<tr>
<td><strong>Verbalized confidence</strong></td><td>Ask "how confident are you?"</td><td>Simple</td><td>Often inaccurate</td></tr>
<tr>
<td><strong>Token logprobs</strong></td><td>Actual token probabilities</td><td>Direct read of the model's own probabilities</td><td>Requires parsing</td></tr>
</tbody>
</table>
</div><p>I chose <strong>token logprobs</strong> because:</p>
<ul>
<li><p>Zero extra API calls (the logprobs arrive in the same response once <code>logprobs=True</code> is set)</p>
</li>
<li><p>Reflects the probability the model itself assigned to each token it generated</p>
</li>
<li><p>Sub-millisecond processing time</p>
</li>
<li><p>No additional dependencies</p>
</li>
</ul>
<h3 id="heading-the-two-feature-architecture">The Two-Feature Architecture</h3>
<p>I realized the best solution was <strong>two complementary features</strong>:</p>
<pre><code class="lang-plaintext">┌─────────────────────────────────────────────────────────┐
│                 LLM Extraction                          │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│  Feature 1: Confidence Scoring                          │
│  "How sure was the model when generating this?"         │
│  Uses: Token log probabilities                          │
│  Cost: Zero extra API calls                             │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│  Feature 2: GroundCheck                                 │
│  "Does this value actually exist in the source?"        │
│  Uses: Multi-strategy text matching                     │
│  Cost: Zero API calls (local processing)                │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│  Combined Reliability Score                             │
│  High confidence + Grounded = Reliable ✅               │
│  Low confidence OR Not grounded = Review needed ⚠️      │
└─────────────────────────────────────────────────────────┘
</code></pre>
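<p>The decision rule in the last box can be written as a small predicate. This is a sketch under assumptions: the 0.85 cutoff is borrowed from the usage example later in this post, the function name <code>needs_review</code> is hypothetical, and the per-field numbers are invented for illustration rather than real model output.</p>

```python
def needs_review(confidence, grounded, min_confidence=0.85):
    """Reliable only if confident AND grounded; anything else goes to review."""
    return confidence < min_confidence or not grounded


# Hypothetical per-field results: (confidence, grounded)
fields = {
    "invoice_number": (0.98, True),
    "vendor": (0.95, True),
    "currency": (0.91, False),  # confident, but not found in the source
    "total": (0.60, True),      # found in the source, but the model was unsure
}

review_queue = [
    name for name, (conf, ok) in fields.items() if needs_review(conf, ok)
]
print(review_queue)  # ['currency', 'total']
```

<p>The two checks fail independently here, which is the argument for running both rather than picking one.</p>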
<hr />
<h2 id="heading-part-3-implementation">Part 3: Implementation</h2>
<h3 id="heading-feature-1-groundcheck">Feature 1: GroundCheck</h3>
<h4 id="heading-core-architecture">Core Architecture</h4>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">GroundCheck</span>:</span>
    <span class="hljs-string">"""Verify extracted data is grounded in source text."""</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">verify</span>(<span class="hljs-params">self, source_text: str, extracted_data: dict</span>) -&gt; VerificationResult:</span>
        <span class="hljs-comment"># For each field, try verification strategies in order:</span>
        <span class="hljs-comment"># 1. Exact match (fastest, highest confidence)</span>
        <span class="hljs-comment"># 2. Numeric match (handles formatting)</span>
        <span class="hljs-comment"># 3. Fuzzy match (handles typos/OCR)</span>
        <span class="hljs-comment"># 4. Semantic match (handles paraphrasing)</span>
        <span class="hljs-keyword">pass</span>
</code></pre>
<h4 id="heading-verification-strategies">Verification Strategies</h4>
<p><strong>1. Exact Matching</strong></p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_exact_match</span>(<span class="hljs-params">self, source: str, value: str</span>) -&gt; tuple:</span>
    <span class="hljs-string">"""Case-insensitive verbatim search."""</span>
    idx = source.lower().find(value.lower())
    <span class="hljs-keyword">if</span> idx != <span class="hljs-number">-1</span>:
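        evidence = source[idx:idx + len(value)]  <span class="hljs-comment"># matched text, in the source's original casing</span>
        <span class="hljs-keyword">return</span> (<span class="hljs-number">0.99</span>, evidence, (idx, idx + len(value)))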
    <span class="hljs-keyword">return</span> (<span class="hljs-number">0.0</span>, <span class="hljs-literal">None</span>, <span class="hljs-literal">None</span>)
</code></pre>
<p><strong>2. Numeric Matching</strong></p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_numeric_match</span>(<span class="hljs-params">self, source: str, value: Any</span>) -&gt; tuple:</span>
    <span class="hljs-string">"""Handle $1,234.56 vs 1234.56 variations."""</span>
    <span class="hljs-comment"># Extract all numbers from source</span>
    <span class="hljs-comment"># Compare with tolerance (0.01 for floats)</span>
    <span class="hljs-comment"># Return confidence 0.90-0.95</span>
</code></pre>
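<p>As a rough illustration of the idea (not the code in the PR), numeric matching can be sketched with a regular expression that pulls every number out of the source, followed by a tolerance comparison. The names <code>numbers_in</code> and <code>numeric_match</code> are hypothetical, and this version returns a plain boolean rather than a confidence tuple. It also ignores negative numbers and non-US number formats.</p>

```python
import re

# A digit, then any mix of digits and thousands separators, then an optional fraction.
NUMBER_PATTERN = re.compile(r"\d[\d,]*(?:\.\d+)?")


def numbers_in(text):
    """Parse every number in the text, dropping thousands separators."""
    return [
        float(match.group().replace(",", ""))
        for match in NUMBER_PATTERN.finditer(text)
    ]


def numeric_match(source, value, tolerance=0.01):
    """Return True if `value` is within `tolerance` of any number in `source`."""
    target = float(value)
    return any(abs(found - target) <= tolerance for found in numbers_in(source))


print(numeric_match("Total due: $1,234.56", 1234.56))  # True
print(numeric_match("Total due: $1,234.56", 1234.65))  # False
```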
<p><strong>3. Fuzzy Matching</strong></p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_fuzzy_match</span>(<span class="hljs-params">self, source: str, value: str</span>) -&gt; tuple:</span>
    <span class="hljs-string">"""Using rapidfuzz for typos/OCR errors."""</span>
    <span class="hljs-comment"># Sliding window over source</span>
    <span class="hljs-comment"># Token sort ratio comparison</span>
    <span class="hljs-comment"># Return best match with confidence</span>
</code></pre>
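<p>The sliding-window idea can be tried without installing anything. The sketch below swaps rapidfuzz's token sort ratio for the standard library's <code>difflib.SequenceMatcher</code>, so its scores will not match what rapidfuzz would report. The 0.8 threshold and the function name are assumptions made for this example.</p>

```python
from difflib import SequenceMatcher


def fuzzy_match(source, value, threshold=0.8):
    """Slide a window of len(value) over source; return (score, evidence, span)."""
    needle = value.lower()
    haystack = source.lower()
    width = len(needle)
    if width == 0 or width > len(haystack):
        return (0.0, None, None)

    best = (0.0, None, None)
    for start in range(len(haystack) - width + 1):
        window = haystack[start:start + width]
        ratio = SequenceMatcher(None, needle, window).ratio()
        if ratio > best[0]:
            # Report the evidence in the source's original casing.
            best = (ratio, source[start:start + width], (start, start + width))

    # Anything below the threshold counts as "not found".
    return best if best[0] >= threshold else (0.0, None, None)


# "Corp0ration" simulates an OCR error (zero instead of the letter o).
score, evidence, span = fuzzy_match(
    "Vendor: Acme Corp0ration, Invoice 7", "Acme Corporation"
)
print(round(score, 2), evidence, span)
```

<p>This brute-force loop compares every window, which is fine for short documents but is the kind of work a dedicated library does far faster.</p>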
<p><strong>4. Semantic Matching</strong> (Optional)</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_semantic_match</span>(<span class="hljs-params">self, source: str, value: str</span>) -&gt; tuple:</span>
    <span class="hljs-string">"""Embedding-based similarity."""</span>
    <span class="hljs-comment"># Encode source sentences and value</span>
    <span class="hljs-comment"># Cosine similarity comparison</span>
    <span class="hljs-comment"># Return confidence 0.70-0.90</span>
</code></pre>
<h4 id="heading-key-design-decisions">Key Design Decisions</h4>
<ol>
<li><p><strong>Cascading fallback</strong>: Try fastest method first, fall back to slower methods only if needed</p>
</li>
<li><p><strong>Optional dependencies</strong>: Core features work without rapidfuzz or sentence-transformers</p>
</li>
<li><p><strong>Field-level results</strong>: Know exactly which fields are problematic</p>
</li>
<li><p><strong>Evidence extraction</strong>: Show what matched in the source</p>
</li>
</ol>
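<p>The cascading fallback can be sketched as an ordered list of strategies where the first one to clear a threshold wins. Everything here is illustrative: the two toy strategies, the <code>verify_field</code> name, the result dictionary, and the 0.8 threshold are assumptions, not the structure used in the PR.</p>

```python
def exact(source, value):
    """Strategy 1: case-insensitive substring search (cheapest, tried first)."""
    return 0.99 if str(value).lower() in source.lower() else 0.0


def _strip_number_format(text):
    """Drop currency symbols and thousands separators."""
    return text.replace(",", "").replace("$", "")


def numeric(source, value):
    """Strategy 2 (toy): compare after removing number formatting."""
    found = _strip_number_format(str(value)) in _strip_number_format(source)
    return 0.90 if found else 0.0


# Ordered from fastest to slowest; later entries run only if earlier ones fail.
STRATEGIES = [("exact", exact), ("numeric", numeric)]


def verify_field(source, value, threshold=0.8):
    """Try each strategy in order and stop at the first that clears the threshold."""
    for name, strategy in STRATEGIES:
        score = strategy(source, value)
        if score >= threshold:
            return {"grounded": True, "method": name, "confidence": score}
    return {"grounded": False, "method": None, "confidence": 0.0}


doc = "Invoice from Acme Corp. Total: $1,234.56"
print(verify_field(doc, "acme corp")["method"])  # exact
print(verify_field(doc, "1234.56")["method"])    # numeric
print(verify_field(doc, "Net 30")["grounded"])   # False
```

<p>Because each result records which method matched, a caller gets the field-level detail described in points 3 and 4 for free.</p>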
<h3 id="heading-feature-2-confidence-scoring">Feature 2: Confidence Scoring</h3>
<h4 id="heading-the-math-behind-it">The Math Behind It</h4>
<p>LLMs generate tokens one at a time, each with a probability distribution. The <strong>logprob</strong> is the log of that probability:</p>
<pre><code class="lang-plaintext">logprob = -0.01  →  probability = e^(-0.01) = 0.99  →  Very confident
logprob = -1.00  →  probability = e^(-1.00) = 0.37  →  Somewhat confident
logprob = -3.00  →  probability = e^(-3.00) = 0.05  →  Not confident
</code></pre>
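<p>The conversion is a single call to <code>math.exp</code>. The short sketch below reproduces the three rows above and adds one related fact: logprobs add where probabilities multiply. The helper name and the sample logprobs are made up for illustration.</p>

```python
import math


def to_probability(logprob):
    """Convert a natural-log probability back to a probability in [0, 1]."""
    return math.exp(logprob)


for logprob in (-0.01, -1.00, -3.00):
    print(f"logprob={logprob:.2f} -> probability={to_probability(logprob):.2f}")

# The joint probability of several tokens is the exponential of the summed logprobs.
field_logprobs = [-0.01, -0.05, -0.10]
joint_probability = math.exp(sum(field_logprobs))
print(f"joint probability of the field: {joint_probability:.2f}")
```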
<h4 id="heading-implementation">Implementation</h4>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ConfidenceScorer</span>:</span>
    <span class="hljs-string">"""Calculate confidence from token logprobs."""</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">score</span>(<span class="hljs-params">self, response: Any, extracted_data: dict</span>) -&gt; ConfidenceResult:</span>
        <span class="hljs-comment"># 1. Extract logprobs from response</span>
        tokens = self.extract_logprobs_openai(response)

        <span class="hljs-comment"># 2. Map tokens to fields</span>
        field_tokens = self.map_tokens_to_fields(tokens, extracted_data)

        <span class="hljs-comment"># 3. Calculate per-field confidence (geometric mean)</span>
        field_confidence = {}
        <span class="hljs-keyword">for</span> field, toks <span class="hljs-keyword">in</span> field_tokens.items():
            probabilities = [t[<span class="hljs-string">"probability"</span>] <span class="hljs-keyword">for</span> t <span class="hljs-keyword">in</span> toks]
            field_confidence[field] = geometric_mean(probabilities)  <span class="hljs-comment"># keep every field's score</span>

        <span class="hljs-comment"># 4. Return results with interpretation</span>
        <span class="hljs-keyword">return</span> ConfidenceResult(...)
</code></pre>
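<p>One way to picture step 2 is to align tokens and fields by character position. The sketch below is an assumption about how such a mapping could work, not the PR's <code>map_tokens_to_fields</code>: it takes <code>(token, logprob)</code> pairs (with the OpenAI chat completions API these would come from the entries of <code>response.choices[0].logprobs.content</code>), rebuilds the generated JSON string, and averages the tokens that overlap each value. The token boundaries and logprobs are invented, and nested objects or escaped characters are not handled.</p>

```python
import json
import math
from statistics import geometric_mean


def field_confidences(tokens, extracted):
    """tokens: (text, logprob) pairs whose concatenation is the generated JSON."""
    # Record the character span each token occupies in the generated text.
    spans, pos = [], 0
    for text, logprob in tokens:
        spans.append((pos, pos + len(text), logprob))
        pos += len(text)
    generated = "".join(text for text, _ in tokens)

    scores = {}
    for field, value in extracted.items():
        key_text = json.dumps(field)
        key_at = generated.find(key_text)
        if key_at == -1:
            continue  # key not found; leave the field unscored
        # Look for the serialized value after its key, e.g. "vendor": "Acme Corp"
        value_text = json.dumps(value)
        start = generated.find(value_text, key_at + len(key_text))
        if start == -1:
            continue  # value not found verbatim; leave the field unscored
        end = start + len(value_text)
        # Every token overlapping the value counts, including ones that
        # straddle its edges (such as a token holding the opening quote).
        probs = [math.exp(lp) for s, e, lp in spans if s < end and e > start]
        scores[field] = geometric_mean(probs)
    return scores


tokens = [
    ('{"', -0.01), ("vendor", -0.02), ('":', -0.01), (' "', -0.01),
    ("Acme", -0.05), (" Corp", -0.10), ('",', -0.01), (' "', -0.01),
    ("total", -0.02), ('":', -0.01), (" ", -0.01), ("500", -2.30), ("}", -0.01),
]
scores = field_confidences(tokens, {"vendor": "Acme Corp", "total": 500})
print({field: round(score, 2) for field, score in scores.items()})
```

<p>In this made-up example the vendor tokens are all high probability while the single <code>500</code> token is not, so only <code>total</code> ends up with a low score.</p>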
<h4 id="heading-why-geometric-mean">Why Geometric Mean?</h4>
<p>I chose geometric mean over arithmetic mean because it's more conservative:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Arithmetic mean: [0.99, 0.99, 0.10] → 0.69</span>
<span class="hljs-comment"># Geometric mean:  [0.99, 0.99, 0.10] → 0.46</span>

<span class="hljs-comment"># The geometric mean correctly penalizes that one uncertain token</span>
</code></pre>
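<p>Both numbers can be reproduced with the standard library's <code>statistics</code> module. The sketch also shows a handy identity: the geometric mean of the probabilities equals the exponential of the average logprob, so the score can be computed straight from logprobs. Variable names are illustrative.</p>

```python
import math
from statistics import fmean, geometric_mean

probabilities = [0.99, 0.99, 0.10]

arithmetic = fmean(probabilities)
geometric = geometric_mean(probabilities)
print(f"arithmetic={arithmetic:.2f} geometric={geometric:.2f}")

# Same result computed from logprobs: exp(mean(log p)).
logprobs = [math.log(p) for p in probabilities]
from_logprobs = math.exp(fmean(logprobs))
print(math.isclose(geometric, from_logprobs))
```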
<h4 id="heading-performance-optimization">Performance Optimization</h4>
<p>The entire scoring process:</p>
<ul>
<li><p><strong>Zero API calls</strong> - Uses data already in the response</p>
</li>
<li><p><strong>&lt; 1ms processing</strong> - Simple math operations</p>
</li>
<li><p><strong>Zero dependencies</strong> - Pure Python standard library</p>
</li>
</ul>
<hr />
<h2 id="heading-part-4-integration-with-instructor">Part 4: Integration with Instructor</h2>
<h3 id="heading-making-it-feel-native">Making It Feel Native</h3>
<p>I wanted users to be able to use these features as naturally as any other Instructor feature:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Before: Just extraction</span>
result = client.chat.completions.create(
    response_model=Invoice,
    messages=[...]
)

<span class="hljs-comment"># After: Extraction + Reliability</span>
<span class="hljs-keyword">from</span> instructor <span class="hljs-keyword">import</span> verify_extraction, score_confidence

result, response = client.chat.completions.create_with_completion(
    response_model=Invoice,
    messages=[...],
    logprobs=<span class="hljs-literal">True</span>  <span class="hljs-comment"># Enable for confidence scoring</span>
)

<span class="hljs-comment"># Check confidence (response is the raw completion that carries the logprobs)</span>
confidence = score_confidence(response, result.model_dump())

<span class="hljs-comment"># Check grounding</span>
grounding = verify_extraction(source_text, result.model_dump())

<span class="hljs-comment"># Combined reliability</span>
is_reliable = confidence.overall &gt;= <span class="hljs-number">0.85</span> <span class="hljs-keyword">and</span> grounding.is_reliable
</code></pre>
<h3 id="heading-multiple-integration-patterns">Multiple Integration Patterns</h3>
<p>I implemented several ways to use GroundCheck:</p>
<p><strong>1. Direct Function Call</strong></p>
<pre><code class="lang-python">result = verify_extraction(source_text, extracted_data)
</code></pre>
<p><strong>2. Decorator Pattern</strong></p>
<pre><code class="lang-python"><span class="hljs-meta">@with_grounding(source_text=document, threshold=0.8)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">extract_invoice</span>():</span>
    <span class="hljs-keyword">return</span> client.chat.completions.create(...)
</code></pre>
<p><strong>3. Wrapper Class</strong></p>
<pre><code class="lang-python">grounded = GroundedExtractor(client)
result = grounded.extract(response_model=Invoice, source_text=doc, ...)
</code></pre>
<p><strong>4. Pydantic Validator</strong></p>
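<pre><code class="lang-python"><span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Annotated
<span class="hljs-keyword">from</span> pydantic <span class="hljs-keyword">import</span> BaseModel, BeforeValidator
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Invoice</span>(<span class="hljs-params">BaseModel</span>):</span>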
    vendor: Annotated[str, BeforeValidator(grounding_validator(source))]
</code></pre>
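<p>To show what the decorator pattern involves, here is a minimal sketch of a decorator factory with the same call shape. It is not the PR's <code>with_grounding</code>: the check is a plain substring search, <code>threshold</code> is assumed to mean the minimum fraction of fields that must be found in the source, and <code>UngroundedFieldsError</code> is a hypothetical exception name. The decorated function returns a hard-coded dict in place of a real LLM call.</p>

```python
from functools import wraps


class UngroundedFieldsError(ValueError):
    """Raised when too few extracted values appear in the source (hypothetical)."""


def with_grounding(source_text, threshold=0.8):
    """Decorator factory: check the wrapped function's dict result against source_text."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            data = func(*args, **kwargs)
            haystack = source_text.lower()
            missing = [k for k, v in data.items() if str(v).lower() not in haystack]
            # Fraction of fields found in the source; empty results pass trivially.
            grounded_ratio = 1 - len(missing) / len(data) if data else 1.0
            if grounded_ratio < threshold:
                raise UngroundedFieldsError(f"ungrounded fields: {missing}")
            return data
        return wrapper
    return decorator


document = "Invoice #12345 from Acme Corp. Total: $500"


@with_grounding(source_text=document, threshold=0.8)
def extract_invoice():
    # Stand-in for client.chat.completions.create(...)
    return {"vendor": "Acme Corp", "payment_terms": "Net 30"}


try:
    extract_invoice()
    outcome = "accepted"
except UngroundedFieldsError as err:
    outcome = str(err)
print(outcome)  # ungrounded fields: ['payment_terms']
```

<p>Raising at the call site keeps ungrounded data from flowing further, at the cost of forcing every caller to handle the exception. The direct function call is the gentler option when a warning is enough.</p>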
<hr />
<h2 id="heading-part-5-testing-strategy">Part 5: Testing Strategy</h2>
<h3 id="heading-test-categories">Test Categories</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Category</td><td>Tests</td><td>Purpose</td></tr>
</thead>
<tbody>
<tr>
<td>Basic Functionality</td><td>10</td><td>Core verification logic</td></tr>
<tr>
<td>Edge Cases</td><td>8</td><td>Empty inputs, Unicode, special chars</td></tr>
<tr>
<td>Real-world Scenarios</td><td>6</td><td>Invoice, medical, legal documents</td></tr>
<tr>
<td>Integration</td><td>6</td><td>Decorator, wrapper, validator</td></tr>
<tr>
<td>Performance</td><td>4</td><td>Processing time &lt; 10ms</td></tr>
<tr>
<td>Error Handling</td><td>4</td><td>Graceful degradation</td></tr>
</tbody>
</table>
</div><h3 id="heading-example-test-medical-record-hallucination">Example Test: Medical Record Hallucination</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_medical_record</span>(<span class="hljs-params">self</span>):</span>
    <span class="hljs-string">"""High-stakes scenario test."""</span>
    clinical_note = <span class="hljs-string">"""
    Patient: John Smith, DOB 03/15/1980
    Vital Signs: BP 145/92, HR 88
    Plan: Start aspirin 81mg daily
    """</span>

    extracted = {
        <span class="hljs-string">"patient_name"</span>: <span class="hljs-string">"John Smith"</span>,
        <span class="hljs-string">"medication"</span>: <span class="hljs-string">"aspirin 81mg"</span>,
        <span class="hljs-string">"allergies"</span>: <span class="hljs-string">"Penicillin"</span>,      <span class="hljs-comment"># HALLUCINATED!</span>
        <span class="hljs-string">"surgery_scheduled"</span>: <span class="hljs-string">"Yes"</span>,      <span class="hljs-comment"># HALLUCINATED!</span>
    }

    result = verify_extraction(clinical_note, extracted)

    <span class="hljs-comment"># Must catch these dangerous hallucinations</span>
    <span class="hljs-keyword">assert</span> <span class="hljs-string">"allergies"</span> <span class="hljs-keyword">in</span> result.flagged_fields
    <span class="hljs-keyword">assert</span> <span class="hljs-string">"surgery_scheduled"</span> <span class="hljs-keyword">in</span> result.flagged_fields
</code></pre>
<h3 id="heading-results">Results</h3>
<pre><code class="lang-plaintext">============================= 38 passed in 4.48s =============================
</code></pre>
<p>All 38 tests pass consistently.</p>
<hr />
<h2 id="heading-part-6-documentation">Part 6: Documentation</h2>
<p>Good code deserves good documentation. I created:</p>
<ol>
<li><p><strong>Concept Documentation</strong> (<code>docs/concepts/groundcheck.md</code>, <code>docs/concepts/confidence.md</code>)</p>
<ul>
<li><p>Problem explanation</p>
</li>
<li><p>Quick start guide</p>
</li>
<li><p>API reference</p>
</li>
<li><p>Best practices</p>
</li>
</ul>
</li>
<li><p><strong>Working Examples</strong> (<code>examples/groundcheck/</code>, <code>examples/confidence/</code>)</p>
<ul>
<li><p>Basic usage</p>
</li>
<li><p>Real-world scenarios</p>
</li>
<li><p>Mock mode for testing without API key</p>
</li>
</ul>
</li>
<li><p><strong>Inline Documentation</strong></p>
<ul>
<li><p>Comprehensive docstrings</p>
</li>
<li><p>Type hints throughout</p>
</li>
<li><p>Usage examples in docstrings</p>
</li>
</ul>
</li>
</ol>
<hr />
<h2 id="heading-part-7-the-open-source-process">Part 7: The Open Source Process</h2>
<h3 id="heading-forking-and-setup">Forking and Setup</h3>
<pre><code class="lang-bash"><span class="hljs-comment"># Fork the repository</span>
gh repo fork 567-labs/instructor

<span class="hljs-comment"># Clone and setup</span>
git <span class="hljs-built_in">clone</span> https://github.com/Ruthvik-Bandari/instructor.git
<span class="hljs-built_in">cd</span> instructor
python -m venv venv
<span class="hljs-built_in">source</span> venv/bin/activate
pip install -e <span class="hljs-string">".[dev]"</span>
</code></pre>
<h3 id="heading-development-workflow">Development Workflow</h3>
<ol>
<li><strong>Create feature branch</strong></li>
</ol>
<pre><code class="lang-bash">git checkout -b feature/groundcheck-verification
</code></pre>
<ol start="2">
<li><p><strong>Implement incrementally</strong></p>
<ul>
<li><p>Core functionality first</p>
</li>
<li><p>Tests as I go</p>
</li>
<li><p>Documentation at the end</p>
</li>
</ul>
</li>
<li><p><strong>Code quality checks</strong></p>
</li>
</ol>
<pre><code class="lang-bash">ruff check instructor/groundcheck.py --fix
ruff format instructor/groundcheck.py
pytest tests/ -v
</code></pre>
<ol start="4">
<li><strong>Submit PR with comprehensive description</strong></li>
</ol>
<h3 id="heading-responding-to-code-review">Responding to Code Review</h3>
<p>The Ellipsis bot provided an automated review. Its key feedback and my responses:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Issue</td><td>My Response</td></tr>
</thead>
<tbody>
<tr>
<td>Duplicate fuzzy matching calls</td><td>Fixed: Store result for reuse</td></tr>
<tr>
<td>Hardcoded threshold in <code>__post_init__</code></td><td>Fixed: Removed override</td></tr>
<tr>
<td>Wrong method label for complex fields</td><td>Fixed: Added AGGREGATE method</td></tr>
</tbody>
</table>
</div><hr />
<h2 id="heading-part-8-results-and-impact">Part 8: Results and Impact</h2>
<h3 id="heading-contribution-statistics">Contribution Statistics</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Metric</td><td>Value</td></tr>
</thead>
<tbody>
<tr>
<td>Lines of Code</td><td><strong>2,297</strong></td></tr>
<tr>
<td>Tests</td><td><strong>38</strong></td></tr>
<tr>
<td>Files Created</td><td><strong>8</strong></td></tr>
<tr>
<td>Documentation Pages</td><td><strong>2</strong></td></tr>
<tr>
<td>Examples</td><td><strong>2</strong></td></tr>
</tbody>
</table>
</div><h3 id="heading-what-users-can-now-do">What Users Can Now Do</h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> instructor <span class="hljs-keyword">import</span> (
    <span class="hljs-comment"># Hallucination Detection</span>
    GroundCheck,
    verify_extraction,
    HallucinationError,

    <span class="hljs-comment"># Confidence Scoring</span>
    ConfidenceScorer,
    score_confidence,
    ConfidenceLevel,
    LowConfidenceError,
)

<span class="hljs-comment"># Complete reliability check</span>
confidence = score_confidence(response, data)
grounding = verify_extraction(source_text, data)

<span class="hljs-keyword">if</span> confidence.overall &lt; <span class="hljs-number">0.85</span>:
    print(<span class="hljs-string">f"⚠️ Low confidence fields: <span class="hljs-subst">{confidence.low_confidence_fields}</span>"</span>)

<span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> grounding.is_reliable:
    print(<span class="hljs-string">f"⚠️ Hallucinated fields: <span class="hljs-subst">{grounding.flagged_fields}</span>"</span>)
</code></pre>
<hr />
<h2 id="heading-lessons-learned">Lessons Learned</h2>
<h3 id="heading-1-research-before-coding">1. Research Before Coding</h3>
<p>Spending time understanding the problem space saved me from building the wrong thing.</p>
<h3 id="heading-2-design-for-integration">2. Design for Integration</h3>
<p>Features should feel native to the library, not bolted on.</p>
<h3 id="heading-3-test-real-scenarios">3. Test Real Scenarios</h3>
<p>Unit tests are good, but real-world scenario tests catch practical issues.</p>
<h3 id="heading-4-document-as-you-go">4. Document As You Go</h3>
<p>Writing documentation helped me think through edge cases.</p>
<h3 id="heading-5-optimize-for-developer-experience">5. Optimize for Developer Experience</h3>
<p>Zero extra API calls, optional dependencies, multiple integration patterns.</p>
<hr />
<h2 id="heading-whats-next">What's Next?</h2>
<p>Potential future enhancements:</p>
<ol>
<li><p><strong>Async Support</strong> - <code>async def verify()</code> for high-throughput applications</p>
</li>
<li><p><strong>Streaming Integration</strong> - Verify partial extractions as they stream</p>
</li>
<li><p><strong>Custom Verification Strategies</strong> - Plugin architecture for domain-specific matching</p>
</li>
<li><p><strong>Confidence Calibration</strong> - Historical accuracy tracking</p>
</li>
</ol>
<hr />
<h2 id="heading-conclusion">Conclusion</h2>
<p>What started as a desire to contribute to open source became a deep dive into LLM reliability. By identifying a real problem, researching solutions thoroughly, and implementing with care for developer experience, I was able to create something genuinely useful.</p>
<p>The combination of <strong>GroundCheck</strong> (hallucination detection) and <strong>Confidence Scoring</strong> (model certainty) provides a complete solution for knowing when to trust LLM extractions - critical for any production application.</p>
<p><strong>PR #1968</strong>: <a target="_blank" href="https://github.com/567-labs/instructor/pull/1968">github.com/567-labs/instructor/pull/1968</a></p>
<hr />
<p><em>Ruthvik Bandari is a Master's student in Applied AI at Northeastern University. Connect with me on</em> <a target="_blank" href="https://www.linkedin.com/in/ruthvik-nath-bandari-908b00247/"><em>LinkedIn</em></a> <em>or</em> <a target="_blank" href="https://github.com/Ruthvik-Bandari"><em>GitHub</em></a><em>.</em></p>
]]></content:encoded></item></channel></rss>