When you ship an LLM-powered structured-output endpoint, it’s easy to assume JSON mode guarantees reliability. But what happens when the model returns valid JSON that doesn’t match your schema? Or worse, silently truncates critical data?
That’s exactly what happened to me in March. My team deployed a GPT-4.1-based endpoint with strict JSON mode and a green evaluation suite. Three weeks later, a downstream billing job had silently skipped 4,200 records over a weekend. The output was technically valid JSON—but it wasn’t our JSON.
That experience forced me to rethink how we handle structured outputs. Since then, I’ve built and deployed four more LLM systems, and the failures consistently stem from the same blind spots—even with JSON mode enabled. While JSON mode catches some issues like truncation or basic type mismatches, it misses critical failure modes such as hallucinated keys, semantic drift, or schema-version mismatches.
Why JSON Mode Alone Isn’t Enough
Analyzing two months of incident logs across enterprise deployments, I identified six common failure patterns that JSON mode fails to prevent:
- Silent truncation: the model exhausts `max_tokens` mid-object, returning partial JSON that parses but omits critical data.
- Hallucinated keys: the model invents field names like `customer_id` in place of the required `client_id`, even with `strict: true` enabled.
- Type coercion: numbers come back as formatted strings (e.g., `"1,499.00"` instead of `1499.00`), which breaks downstream processing.
- Semantic drift: the JSON is schema-compliant, but the values are wrong: incorrect customer IDs, amounts, or country codes.
- Refusals in disguise: safety filters trigger, but the model wraps its refusal in JSON (e.g., `{"refusal": "I can't help with that"}`), which parses as a valid response.
- Schema-version desync: a new field is added to the schema, but an in-flight worker uses an old version, causing batch failures until someone notices.
JSON mode catches truncation and type coercion only sometimes. The other four require proactive validation, iterative healing, and continuous monitoring.
A Production-Grade Toolkit for Structured Outputs
To address these gaps, I built a modular Python toolkit consisting of four core components:
- A strict validator that runs after JSON mode to catch what JSON mode misses.
- A healing retry loop that feeds validation errors back to the model—not a blind retry.
- A cost-bounded fallback chain to prevent runaway token usage from bad prompts.
- A drift detector to track parse compliance and field-distribution shifts over time.
This toolkit is designed to be dropped into a FastAPI service with minimal configuration.
File Structure Overview
```
llm_structured/
├── schemas.py         # Pydantic models with versioning
├── validator.py       # Strict validation beyond JSON mode
├── healer.py          # Healing retry loop
├── budget.py          # Per-request and global cost caps
├── chain.py           # Multi-provider fallback with circuit breaker
├── observability.py   # Metrics + drift detection
└── service.py         # FastAPI endpoint that ties it all together
```

Install Dependencies
```
pip install pydantic==2.7.4 openai==1.30.0 anthropic==0.30.0 \
    tenacity==8.3.0 prometheus-client==0.20.0 fastapi==0.111.0 \
    uvicorn==0.30.1 httpx==0.27.0
```

1. Schema Design with Versioning
Schema versioning isn't just an academic exercise; it's a production safety net. When two services run different schema versions during a deployment, downstream jobs fail silently or throw obscure errors. Including a `schema_version` field in every output forces consumers to validate compatibility explicitly.
Here’s how we implement it using Pydantic:
```python
from decimal import Decimal
from typing import Literal

from pydantic import BaseModel, ConfigDict, Field, field_validator


class InvoiceLineV2(BaseModel):
    # OpenAI's strict mode rejects open schemas; extra="forbid" makes
    # model_json_schema() emit "additionalProperties": false. (Strict mode
    # also expects every key to be required, so fields with defaults may
    # need adjusting for your provider.)
    model_config = ConfigDict(extra="forbid")

    schema_version: Literal["2.0"] = "2.0"
    client_id: str = Field(min_length=3, max_length=64)
    amount: Decimal = Field(gt=0, decimal_places=2)
    currency: Literal["USD", "EUR", "GBP", "INR"]
    invoice_date: str = Field(pattern=r"^\d{4}-\d{2}-\d{2}$")
    line_items: list[str] = Field(min_length=1, max_length=50)
    confidence: float = Field(ge=0.0, le=1.0)

    @field_validator("amount", mode="before")
    @classmethod
    def coerce_amount(cls, v):
        # Normalize strings like "$1,499.00" before Decimal validation runs.
        if isinstance(v, str):
            cleaned = v.replace(",", "").replace("$", "").strip()
            return Decimal(cleaned)
        return v


def schema_for_prompt(model: type[BaseModel]) -> dict:
    """Return a JSON-schema dict suitable for OpenAI response_format."""
    return {
        "type": "json_schema",
        "json_schema": {
            "name": model.__name__,
            "schema": model.model_json_schema(),
            "strict": True,
        },
    }
```

The `schema_version` field ensures every output carries its generation context. Downstream systems can immediately reject outputs with unknown versions instead of attempting to process incompatible data.
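On the consumer side, that version gate is a few lines. Here's an illustrative sketch; `SUPPORTED_VERSIONS` and `accept` are my own naming for this example, not files in the toolkit:

```python
# Hypothetical consumer-side compatibility gate; names are illustrative.
SUPPORTED_VERSIONS = {"2.0"}

def accept(payload: dict) -> InvoiceLineV2:
    version = payload.get("schema_version")
    if version not in SUPPORTED_VERSIONS:
        # Fail fast instead of guessing at the semantics of unknown fields.
        raise ValueError(f"unsupported schema_version: {version!r}")
    return InvoiceLineV2.model_validate(payload)
```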
2. Strict Validation Beyond JSON Mode
JSON mode with `strict: true` validates types and required fields, but it ignores refusals, semantically wrong values, and partial truncation. A second-pass validator catches these gaps.
Here’s the core validation logic:
```python
import json
import re

from pydantic import BaseModel, ValidationError

# Common phrasings that signal a refusal wrapped in otherwise valid JSON.
REFUSAL_PATTERNS = [
    r"i can'?t help",
    r"i'?m not able to",
    r"as an ai",
    r"i'?m unable to provide",
]


class ValidationResult:
    def __init__(self, ok: bool, value=None, errors=None, raw=None):
        self.ok = ok
        self.value = value
        self.errors = errors or []
        self.raw = raw


def validate(raw: str, model: type[BaseModel]) -> ValidationResult:
    if not raw or not raw.strip():
        return ValidationResult(False, errors=["empty_response"], raw=raw)

    lower = raw.lower()
    for pat in REFUSAL_PATTERNS:
        if re.search(pat, lower):
            return ValidationResult(False, errors=["refusal_detected"], raw=raw)

    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as e:
        return ValidationResult(False, errors=[f"json_decode: {e}"], raw=raw)

    try:
        instance = model.model_validate(parsed)
    except ValidationError as e:
        return ValidationResult(False, errors=_format_errors(e), raw=raw)

    return ValidationResult(True, value=instance, raw=raw)


def _format_errors(e: ValidationError) -> list[str]:
    # Flatten Pydantic's error objects into "field.path: message" strings.
    out = []
    for err in e.errors():
        loc = ".".join(str(p) for p in err["loc"])
        out.append(f"{loc}: {err['msg']}")
    return out
```

This validator does three critical things:
- Detects refusals by searching for common phrases in the raw output.
- Parses JSON safely and provides detailed validation errors.
- Formats errors in plain English to avoid overwhelming the model during retries.
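A quick usage sketch, assuming the `InvoiceLineV2` model from earlier; the sample payload is invented for illustration:

```python
import logging

log = logging.getLogger(__name__)

# Invented sample output for illustration.
raw_output = (
    '{"schema_version": "2.0", "client_id": "acme-001", "amount": "1,499.00", '
    '"currency": "USD", "invoice_date": "2024-03-18", "line_items": ["Pro plan"], '
    '"confidence": 0.93}'
)

result = validate(raw_output, InvoiceLineV2)
if result.ok:
    invoice = result.value  # InvoiceLineV2; "1,499.00" coerced to Decimal("1499.00")
else:
    log.warning("validation failed: %s", result.errors)
```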
3. Healing Retries: Iterative Improvement
Blind retries with the same prompt and `temperature=0` often reproduce the same failure. Instead, the healing loop feeds validation errors back to the model in a structured way, telling it exactly what went wrong and how to fix it.
For example, if the model returns `customer_id` instead of `client_id`, the retry prompt includes:

The output contains a field `customer_id`, but the schema requires `client_id`. Please regenerate the output using the correct field name.
This approach significantly improves success rates in one or two iterations.
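Here's a minimal sketch of the loop, assuming the `validate` and `schema_for_prompt` helpers from earlier and the official OpenAI client; the function name, attempt cap, and model string are placeholders you'd adapt:

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

def generate_with_healing(
    messages: list[dict],
    model_cls: type[BaseModel],
    max_attempts: int = 3,
    llm_model: str = "gpt-4.1",
) -> BaseModel:
    """Retry on validation failure, feeding the errors back to the model."""
    history = list(messages)
    for _ in range(max_attempts):
        response = client.chat.completions.create(
            model=llm_model,
            messages=history,
            response_format=schema_for_prompt(model_cls),
            temperature=0,
        )
        raw = response.choices[0].message.content or ""
        result = validate(raw, model_cls)
        if result.ok:
            return result.value
        # Show the model its own output plus the specific validation errors.
        history.append({"role": "assistant", "content": raw})
        history.append({
            "role": "user",
            "content": (
                "The previous output failed validation:\n"
                + "\n".join(f"- {err}" for err in result.errors)
                + "\nRegenerate the full JSON object with these issues fixed."
            ),
        })
    raise RuntimeError(f"still failing validation after {max_attempts} attempts")
```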
4. Cost Control and Fallback Chains
Bad prompts can burn thousands of dollars in token costs. A cost-bounded fallback chain mitigates this by:
- Setting per-request and global token budgets.
- Falling back to cheaper providers (e.g., Anthropic, open-source models) if the primary model fails or exceeds budget.
- Using circuit breakers to prevent cascading failures.
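Condensed into one sketch, the pattern looks roughly like this; the class names, breaker thresholds, and per-token prices are illustrative, and `invoke` stands in for the provider-specific call:

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class Budget:
    """Per-request token cap plus a rolling global spend cap (USD)."""
    max_tokens_per_request: int = 4096  # enforced when building the request (not shown)
    max_usd_global: float = 50.0
    spent_usd: float = 0.0

    def charge(self, tokens: int, usd_per_1k: float) -> None:
        self.spent_usd += tokens / 1000 * usd_per_1k
        if self.spent_usd > self.max_usd_global:
            raise RuntimeError("global budget exhausted")

@dataclass
class Provider:
    name: str
    usd_per_1k_tokens: float  # illustrative prices, not real quotes
    failures: int = 0
    open_until: float = 0.0   # circuit breaker: skip until this monotonic time

    def available(self) -> bool:
        return time.monotonic() >= self.open_until

def call_with_fallback(
    providers: list[Provider],
    budget: Budget,
    invoke: Callable[[Provider], tuple[str, int]],
) -> str:
    """Try providers in order, skipping any whose circuit is open."""
    for p in providers:
        if not p.available():
            continue
        try:
            raw, tokens = invoke(p)  # returns (raw_output, tokens_used)
        except Exception:
            p.failures += 1
            if p.failures >= 3:  # trip the breaker for 60 seconds
                p.open_until = time.monotonic() + 60.0
            continue
        budget.charge(tokens, p.usd_per_1k_tokens)  # may abort the whole chain
        p.failures = 0
        return raw
    raise RuntimeError("all providers failed or circuit-open")
```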
5. Observability and Drift Detection
Continuous monitoring tracks:
- Parse compliance rates.
- Field-distribution shifts (e.g., sudden spikes in `confidence` values).
- Latency and cost per request.
Prometheus metrics and drift alerts ensure anomalies are caught before they impact downstream systems.
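At its core, this boils down to a counter, a histogram, and a rolling-window comparison. A minimal sketch using prometheus-client and a simple mean-shift test (a real deployment might prefer a KS test or population-stability index; the baseline and tolerance values are illustrative):

```python
from collections import deque
from statistics import fmean

from prometheus_client import Counter, Histogram

PARSE_RESULTS = Counter(
    "llm_parse_results_total", "Validation outcomes", ["outcome"]
)
REQUEST_COST = Histogram("llm_request_cost_usd", "Cost per request (USD)")

class DriftDetector:
    """Flag when the rolling mean of a field shifts past a tolerance."""

    def __init__(self, baseline_mean: float, window: int = 500, tolerance: float = 0.15):
        self.baseline = baseline_mean
        self.values = deque(maxlen=window)
        self.tolerance = tolerance

    def observe(self, value: float) -> bool:
        """Record a value; return True once a full window has drifted."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False
        return abs(fmean(self.values) - self.baseline) > self.tolerance

confidence_drift = DriftDetector(baseline_mean=0.9)

def record(result, cost_usd: float) -> None:
    PARSE_RESULTS.labels(outcome="ok" if result.ok else "fail").inc()
    REQUEST_COST.observe(cost_usd)
    if result.ok and confidence_drift.observe(result.value.confidence):
        # Hook this to your alerting of choice; print is a placeholder.
        print("ALERT: confidence distribution drifted from baseline")
```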
Final Thoughts
Relying solely on JSON mode for structured outputs is a risky shortcut. The reality is that LLMs introduce subtle, recurring failure modes—hallucinated keys, semantic drift, and schema mismatches—that demand layered defense.
This toolkit doesn’t just validate outputs; it actively heals them, controls costs, and monitors for drift. By integrating it early, you can avoid the silent data loss that derails billing systems, customer workflows, and trust in AI-powered services.