iToverDose / Startups · 29 April 2026 · 17:04

New benchmark exposes flaws in LLM structured outputs beyond schema checks

Structured output benchmarks often overlook value accuracy in LLM-generated JSON. A new benchmark reveals surprising gaps even in top models like GPT-5 and Claude, with rankings shifting dramatically across text, images, and audio.

Hacker News · 2 min read

When deploying language models in production workflows, developers frequently rely on structured outputs—such as JSON—to ensure consistency in tasks like invoice parsing or meeting transcript processing. Yet even when models return schema-compliant JSON, critical errors can lurk within the values themselves. A misstated invoice date, an incorrectly ordered transcript entry, or a plausible but wrong numeric field can disrupt downstream systems while passing every structural check.
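The gap is easy to reproduce. The sketch below uses a hypothetical invoice schema and a stdlib-only type check (standing in for a real JSON Schema validator) to show a record that is structurally valid yet carries a wrong date; the field names and values are illustrative, not drawn from the benchmark itself.

```python
import json

# Hypothetical invoice schema: required fields and their expected types.
INVOICE_SCHEMA = {"invoice_id": str, "date": str, "total": float}

def is_schema_valid(record: dict) -> bool:
    """Structural check only: correct keys, correct value types."""
    return set(record) == set(INVOICE_SCHEMA) and all(
        isinstance(record[k], t) for k, t in INVOICE_SCHEMA.items()
    )

# The model's JSON parses and type-checks cleanly...
model_output = json.loads(
    '{"invoice_id": "INV-42", "date": "2026-03-01", "total": 199.0}'
)
# ...but the human-verified ground truth has a different date.
ground_truth = {"invoice_id": "INV-42", "date": "2026-01-03", "total": 199.0}

print(is_schema_valid(model_output))  # True: the structure is fine
print(model_output == ground_truth)   # False: the date value is wrong
```

Any pipeline that stops at the first check would silently accept the bad invoice.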

To address this gap, researchers have introduced the Structured Output Benchmark (SOB), a new evaluation framework that not only checks schema compliance but also verifies the factual accuracy of every value in the output. Unlike existing benchmarks such as JSONSchemaBench, which focus solely on structural validation, SOB pairs each test case with a manually verified ground truth. This ensures that models are penalized not just for invalid schemas, but for hallucinations, omissions, or subtle inaccuracies that might otherwise evade detection.

Key Features of the Structured Output Benchmark (SOB)
- Validates both JSON schema compliance and value-level accuracy
- Tests across three modalities: text, images, and audio
- Uses human-verified ground truth for every record
- Measures "structured hallucinations"—plausible yet incorrect values
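A value-level metric of the kind the benchmark describes can be sketched as a per-field comparison against the verified ground truth. This is a minimal illustration of the idea, not SOB's actual scoring code, and the transcript fields are invented for the example.

```python
def value_accuracy(pred: dict, truth: dict) -> float:
    """Fraction of ground-truth fields the model reproduced exactly.
    A minimal sketch of field-level scoring; SOB's real metric may
    weight fields or normalize values differently."""
    if not truth:
        return 1.0
    correct = sum(1 for key, expected in truth.items()
                  if pred.get(key) == expected)
    return correct / len(truth)

# A transcript entry with the right speaker and topic but wrong order.
pred = {"speaker": "Alice", "order": 2, "topic": "Q3 budget"}
truth = {"speaker": "Alice", "order": 1, "topic": "Q3 budget"}

print(value_accuracy(pred, truth))  # 2 of 3 fields correct -> ~0.667
```

A schema-only benchmark would score this output as a complete pass; a value-level one docks it for the misordered entry.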

The benchmark’s findings reveal unexpected performance variations among leading models. While proprietary models like GPT-5.4 and GPT-5 dominate in some areas, open-source alternatives such as GLM-4.7 and Gemma-4-31B demonstrate surprising strength. For instance, GLM-4.7 ranks second overall and leads in text-based tasks, while Gemma-4-31B outperforms all competitors in image-based structured outputs. Meanwhile, Gemini-2.5-Flash takes the top spot for audio processing.

These results challenge assumptions about model scaling and specialization. Even smaller models like Phi-4 (14B) outperform larger proprietary models in specific text-based evaluations, and GLM-4.7 surpasses GPT-5 and Claude-Sonnet-4.6 in value accuracy metrics. The data suggest that reliable structured-output performance is not solely a function of model size or commercial backing, but of how well a model's training aligns with structured, factual outputs.

One of the most insidious challenges highlighted by SOB is the prevalence of "structured hallucinations"—errors where the output adheres to the schema and appears plausible, yet contains factual inaccuracies. For example, in an audio transcription task, a model might return "target_market_age": "25 to 35" when the ground truth specifies "15 to 35 years". Such discrepancies are invisible to traditional validation tools, which only check structure and types. This underscores the need for field-level, context-aware accuracy checks in production deployments.
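Catching such a discrepancy requires comparing the content of a value rather than its surface string, since "15 to 35" and "15 to 35 years" should match while "25 to 35" should not. The normalizer below is a hypothetical helper built around the article's age-range example, not code from the benchmark.

```python
import re

def normalize_range(value: str) -> tuple:
    """Reduce a free-text range like '15 to 35 years' to its numeric
    bounds, so comparisons ignore formatting but not content.
    Hypothetical helper for the article's example, not SOB's code."""
    return tuple(int(n) for n in re.findall(r"\d+", value))

truth = "15 to 35 years"

# Structured hallucination: plausible, schema-valid, factually wrong.
print(normalize_range("25 to 35") == normalize_range(truth))  # False
# Same fact, different formatting: should (and does) match.
print(normalize_range("15 to 35") == normalize_range(truth))  # True
```

An exact string comparison would flag both outputs as wrong, while a schema check would flag neither; context-aware normalization separates formatting noise from genuine errors.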

The team behind SOB emphasizes its role as a diagnostic tool rather than a definitive ranking system. Their goal is to push the industry toward more reliable deterministic workflows by shining a light on the limitations of current structured output evaluations. As AI systems increasingly power critical business processes, the ability to trust model outputs at both the schema and value levels will become a defining factor in their adoption. The benchmark’s open-source nature invites collaboration, ensuring that future model improvements are measured against a more rigorous standard.

AI summary

SOB measures value accuracy in AI models' structured outputs, testing content correctness alongside JSON schema compliance. Details on the leadership race between GLM-4.7 and GPT-5.4, and the hidden dangers of structured hallucinations.
