AI agents are increasingly deployed in critical workflows, yet their failures—hallucinations, looping behavior, context drops, or tool misuse—often go undetected until it’s too late. When these systems break, the go-to solution is typically another large language model (LLM), tasked with sifting through execution traces to pinpoint errors. But what if the solution doesn’t need to be another LLM?
New research from Pisama demonstrates that purpose-built heuristic detectors can outperform even the latest frontier models in identifying agent failures, offering a faster, cheaper, and more reliable alternative for many use cases. The findings challenge the prevailing assumption that LLMs are the best tool for the job in agent reliability monitoring.
The limits of LLM-based failure detection
The default approach to diagnosing AI agent failures involves feeding execution traces—sequences of tool calls, inputs, and outputs—into an LLM and asking it to identify what went wrong. This method, known as LLM-as-judge, relies on the model’s ability to interpret high-dimensional, unstructured data and reason about errors. However, benchmarks reveal significant limitations.
On the TRAIL benchmark developed by Patronus AI, which contains 148 real-world agent traces with 841 human-labeled failures across 21 categories, even the most advanced models struggle. GPT-5.4 correctly identifies only 11.9% of failures, while Claude Sonnet 4.6 performs worse still at 6.9%. These results highlight the inherent difficulty of parsing complex agent behaviors through semantic reasoning alone.
Heuristic detectors: Faster, cheaper, and more precise
Pisama’s research introduces a tiered system that leverages heuristic detectors—rule-based checks designed to catch specific failure patterns—before escalating to LLMs for nuanced cases. The approach delivers dramatic improvements in both accuracy and efficiency.
On the TRAIL benchmark, Pisama’s 20 core heuristic detectors achieved a joint accuracy of 60.1% with 100% precision, meaning every detected failure was genuine. In contrast, the best-performing LLM baseline identified only 11.9% of failures. The heuristic detectors also operate at zero cost and complete in just 21 seconds, compared to the latency and expense associated with LLM-based methods.
The breakdown across failure categories further underscores the advantages of heuristic detection:
- Context handling: Heuristics achieved an F1 score of 0.978, compared to 0.00 for LLMs, highlighting their strength in detecting when agents ignore critical input elements like dates or proper nouns.
- Specification matching: Detectors scored a perfect 1.000 F1 in identifying when outputs fail to meet stated requirements, such as when an agent ignores a request for a REST API and instead outputs an HTML form.
- Loop and resource abuse: Heuristics detected 100% of looping behaviors, a category where LLMs perform poorly.
- Hallucination detection: By correlating tool failures with output claims, heuristic detectors achieved an F1 of 0.884, compared to 0.59 for the best LLM baseline.
These results suggest that many agent failures leave structural signatures—repeated states, missing elements, or tool misuse—that are more effectively captured by rule-based patterns than by LLMs attempting to reason about intent.
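A context-handling check of this kind reduces to pattern matching: extract salient elements from the request and verify they survive into the answer. The regexes and the naive proper-noun guess below are illustrative assumptions, not Pisama's actual rules:

```python
import re

def dropped_context_elements(request: str, answer: str) -> list[str]:
    """Return salient input elements (dates, capitalized proper nouns)
    that never appear in the agent's final answer."""
    # Dates like 2024-06-01 or 6/1/2024 (illustrative patterns only)
    dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}/\d{1,2}/\d{4}\b", request)
    # Naive proper-noun guess: a capitalized word not at sentence start
    nouns = re.findall(r"(?<=[a-z,;] )([A-Z][a-z]+)", request)
    salient = set(dates) | set(nouns)
    return sorted(e for e in salient if e not in answer)
```

Crude as it is, a check like this never misfires on agents that do carry the elements forward, which is the kind of structural precision behind the 100%-precision result above.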
Multi-agent systems: Where heuristics need reinforcement
While heuristic detectors excel in identifying when a failure occurs, they struggle with who caused it—a critical challenge in multi-agent environments. For example, determining whether a failure stems from a web search agent selecting the wrong link or an orchestrator providing ambiguous instructions requires semantic understanding.
To address this, Pisama employs a hybrid approach: heuristic detectors first identify the step where a failure occurred, then a lightweight LLM call attributes blame. This tiered system delivers superior performance. On the Who&When benchmark (ICML 2025), Pisama’s heuristic detectors alone achieved an agent accuracy of 31.0% and step accuracy of 16.8%. When combined with a single Sonnet 4 call for attribution, accuracy jumped to 60.3% for agent identification and 24.1% for step localization, matching GPT-5.4 Mini’s 60.3% agent accuracy while edging past its 22.4% step accuracy.
The hybrid method reduces reliance on expensive LLM calls while ensuring high accuracy in complex scenarios. By reserving LLMs for nuanced tasks like blame attribution, the system minimizes costs without sacrificing reliability.
The bigger picture: Why simple rules often win
The success of heuristic detectors aligns with broader research in decision science, particularly the work of Gerd Gigerenzer. His studies demonstrate that in uncertain environments, simple rules focused on the most diagnostic cues frequently outperform complex models that attempt to weigh all available information. Agent failure detection fits this pattern: execution traces are high-dimensional, but a single structural signal, such as state repetition or a failed tool call, often carries most of the information needed to identify an error.
This principle explains why heuristic detectors outperform LLMs in many failure categories. Instead of trying to interpret the agent’s behavior semantically, they measure concrete, quantifiable patterns that are inherently tied to failure modes. For instance, a loop detector doesn’t need to understand why an agent is stuck—it simply counts repeated tool calls. Similarly, a hallucination detector doesn’t parse the agent’s output for coherence; it checks whether the output aligns with tool results.
What’s next for agent reliability
Heuristic detectors are not a silver bullet. They excel at identifying known failure patterns but struggle with novel or context-dependent errors that require deep semantic reasoning. For example, determining whether a failure results from ambiguous instructions or a cascading error across agents remains a challenge for rule-based systems.
However, the hybrid approach—combining heuristics for detection with LLMs for attribution—offers a promising path forward. As AI agents become more complex and widely deployed, the need for reliable, scalable failure detection will only grow. Heuristic detectors provide a cost-effective, high-precision solution for many common failure modes, while LLMs handle the edge cases that demand deeper analysis.
The future of agent reliability may lie not in choosing between heuristics and LLMs, but in strategically deploying both where they perform best.
AI summary
New research shows heuristic detectors identify 60% of AI agent failures with zero false positives—far surpassing LLM-based methods. Learn how rule-based systems improve reliability and reduce costs.