New benchmark exposes flaws in LLM structured outputs beyond schema checks
Structured output benchmarks often check only whether LLM-generated JSON conforms to a schema and overlook whether the values inside it are correct. A new benchmark reveals gaps in value accuracy even in top models such as GPT-5 and Claude, and model rankings shift considerably across text, images, and audio.
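To make the distinction concrete, here is a minimal sketch of how a response can pass a schema-style check while still containing a wrong value. It is not the benchmark's actual scoring method; the field names, the `EXPECTED_TYPES` table, and the exact-match `value_accuracy` metric are all assumptions chosen for illustration.

```python
import json

# Hypothetical expected field types; this stands in for a real JSON Schema check.
EXPECTED_TYPES = {"invoice_id": str, "total": float, "currency": str}


def is_schema_valid(record: dict) -> bool:
    """True when the record has exactly the expected keys with the expected types."""
    return record.keys() == EXPECTED_TYPES.keys() and all(
        isinstance(record[key], expected) for key, expected in EXPECTED_TYPES.items()
    )


def value_accuracy(record: dict, reference: dict) -> float:
    """Fraction of reference fields whose value matches exactly (illustrative metric)."""
    matches = sum(record.get(key) == value for key, value in reference.items())
    return matches / len(reference)


# Made-up ground truth and a made-up model response with two digits transposed in "total".
reference = {"invoice_id": "INV-1001", "total": 249.5, "currency": "EUR"}
model_output = json.loads(
    '{"invoice_id": "INV-1001", "total": 294.5, "currency": "EUR"}'
)

schema_ok = is_schema_valid(model_output)
accuracy = value_accuracy(model_output, reference)
print(schema_ok, round(accuracy, 2))  # prints: True 0.67
```

The response is well-formed and correctly typed, so a schema-only check reports success, yet one of its three fields is wrong. A benchmark that scores only the first check would not see that error.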