AI benchmarks often reduce complex security evaluations to a single percentage, masking critical differences between models. A recent study that dissected 700 AI-generated functions across five security domains found that the overall "safest" model was the worst performer at specific tasks, while the "most dangerous" model excelled in niche categories.
Researchers evaluated five leading models—Haiku 4.5, Sonnet 4.5, Opus 4.6, Gemini 2.5 Flash, and Gemini 2.5 Pro—by analyzing their vulnerability rates and fix rates across diverse security functions. The findings challenge the assumption that a single model can dominate all use cases, emphasizing the need for domain-specific evaluations.
The Flaw in Aggregate Rankings
Traditional AI security benchmarks rank models by overall vulnerability rate, which often produces a misleading hierarchy. A model might score well in aggregate yet fail in critical areas like authentication or database operations: a model that is 30% vulnerable in authentication but 90% vulnerable in file I/O posts the same 60% aggregate as one that is uniformly mediocre across every domain. Aggregates flatten nuanced performance into a single number, obscuring where models truly excel or underperform.
Take the following aggregate rankings from a prior study:
- Haiku 4.5: 49% vulnerability rate (labeled "safest")
- Sonnet 4.5: 62%
- Gemini 2.5 Flash: 64%
- Opus 4.6: 65%
- Gemini 2.5 Pro: 73% (labeled "most dangerous")
These rankings suggest a clear winner and loser, but they ignore the context of how each model performs in specific domains. The reality? The "safest" model in aggregate might be the worst at fixing critical security issues, while the "most dangerous" model could dominate in remediation.
Domain-Specific Performance Reveals Surprising Leaders
Breaking down the 700 functions into five security domains—Database Operations, Authentication, File I/O, Command Execution, and Configuration & Secrets—reveals stark contrasts in model strengths. Here’s how each domain reshaped the rankings:
Vulnerability Champions: Lowest Rates by Domain
- Database Operations (PostgreSQL):
  - Haiku 4.5: 39% (fewest vulnerabilities)
  - Opus 4.6: 61%
  - Sonnet 4.5: 71%
  - Gemini 2.5 Flash: 75%
  - Gemini 2.5 Pro: 96%
Haiku’s simplicity shines here: it generates minimal, parameterized queries and avoids common pitfalls like hardcoded credentials. In contrast, Gemini 2.5 Pro’s production-grade code, complete with connection pooling and error handling, gives security rules far more surface area to flag.
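The study doesn’t publish the generated code, but the winning pattern reads like the following minimal sketch. The Node.js pg client and the query itself are assumptions, not the study’s exact output:

```typescript
import { Pool } from "pg";

// Credentials come from the environment, never from string literals.
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Parameterized query: user input is bound as $1, so it can never be
// interpreted as SQL. String-concatenated queries are what scanners flag.
export async function findUserByEmail(email: string) {
  const result = await pool.query(
    "SELECT id, email FROM users WHERE email = $1",
    [email]
  );
  return result.rows[0] ?? null;
}
```

The fewer moving parts a function has, the fewer lines a security rule can object to, which is precisely the dynamic the study observed.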
- Authentication (JWT, bcrypt):
  - Haiku 4.5: 29% (most secure)
  - Sonnet 4.5: 39%
  - Gemini 2.5 Flash: 43%
  - Gemini 2.5 Pro: 43%
  - Opus 4.6: 50%
Opus 4.6 stands out for its consistent failure in generating secure JWT payloads, embedding sensitive user data in every instance. Meanwhile, Gemini 2.5 Flash achieved a perfect score in JWT generation, using minimal payloads with only user IDs.
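The contrast is easy to sketch with the jsonwebtoken package (the library choice is an assumption; the minimal-payload pattern is the one the study credits to Gemini 2.5 Flash):

```typescript
import jwt from "jsonwebtoken";

// Minimal-payload pattern: only the user ID enters the token. JWT
// payloads are merely base64url-encoded, so anything placed in them
// (emails, roles, other PII) is readable by whoever holds the token.
export function issueToken(userId: string): string {
  const secret = process.env.JWT_SECRET;
  if (!secret) throw new Error("JWT_SECRET is not configured");

  return jwt.sign({ sub: userId }, secret, {
    algorithm: "HS256", // pin the algorithm explicitly
    expiresIn: "15m",   // short-lived access token
  });
}
```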
- File I/O (Uploads, Reads, Deletes):
  - Gemini 2.5 Pro: 86% (best in class)
  - Haiku 4.5: 93%
  - Opus 4.6: 93%
  - Sonnet 4.5: 100%
File I/O proves the hardest category for all models, as dynamic filenames trigger security rules like detect-non-literal-fs-filename. Even here, Gemini 2.5 Pro’s tendency to add path sanitization gives it a slight edge.
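A sketch of that sanitization pattern, assuming a Node.js stack; the upload root and function name are illustrative:

```typescript
import path from "node:path";
import { promises as fs } from "node:fs";

const UPLOAD_DIR = path.resolve("uploads"); // illustrative storage root

export async function readUpload(userSuppliedName: string): Promise<Buffer> {
  // Strip any directory components, defeating "../" traversal tricks.
  const safeName = path.basename(userSuppliedName);
  const fullPath = path.resolve(UPLOAD_DIR, safeName);

  // Belt and braces: confirm the resolved path stays inside the root.
  if (!fullPath.startsWith(UPLOAD_DIR + path.sep)) {
    throw new Error("path traversal attempt blocked");
  }

  // The argument is still dynamic, which is what rules like
  // detect-non-literal-fs-filename key on; the added sanitization
  // is what gave Gemini 2.5 Pro its slight edge here.
  return fs.readFile(fullPath);
}
```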
- Command Execution (Shell Operations):
  - Haiku 4.5: 50% (most secure)
  - Sonnet 4.5: 75%
  - Gemini 2.5 Flash: 82%
  - Gemini 2.5 Pro: 93%
  - Opus 4.6: 96%
Haiku’s advantage lies in its occasional use of library APIs (e.g., archiver) instead of shell commands, reducing security rule violations.
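A sketch of that library-over-shell pattern, using archiver as the text mentions (the wrapper function itself is illustrative):

```typescript
import { createWriteStream } from "node:fs";
import archiver from "archiver";

// Spawning `zip -r out.zip dir` with dynamic paths trips command-
// injection rules; the archiver library keeps everything in-process.
export function zipDirectory(sourceDir: string, outPath: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const output = createWriteStream(outPath);
    const archive = archiver("zip", { zlib: { level: 9 } });

    output.on("close", () => resolve());
    archive.on("error", reject);

    archive.pipe(output);
    archive.directory(sourceDir, false); // false: no wrapping folder
    archive.finalize();
  });
}
```

No shell means no shell metacharacters to escape, which is why this route produces fewer findings.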
- Configuration & Secrets:
  - Gemini 2.5 Flash: 21% (most secure)
  - Sonnet 4.5: 25%
  - Opus 4.6: 25%
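Low rates in this category typically come down to one habit: reading secrets from the environment instead of hardcoding them. A minimal sketch, with hypothetical variable names:

```typescript
// Hardcoded secrets are the classic finding in this category:
//   const apiKey = "sk-live-abc123"; // <- flagged immediately
// Reading from the environment, with a fail-fast check, avoids it.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing required env var: ${name}`);
  return value;
}

export const config = {
  apiKey: requireEnv("API_KEY"),        // hypothetical names
  dbPassword: requireEnv("DB_PASSWORD"),
};
```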
Remediation Champions: Highest Fix Rates by Domain
- Database Operations:
  - Gemini 2.5 Pro: 93% (highest fix rate)
  - Gemini 2.5 Flash: 67%
- Authentication:
  - Opus 4.6: 100% (perfect fix rate)
  - Gemini 2.5 Pro: 58%
- File I/O:
  - Opus 4.6: 73%
  - Haiku 4.5: 58%
- Configuration & Secrets:
  - Gemini 2.5 Flash / Opus 4.6: 100% (perfect fix rate)
  - Sonnet 4.5: 43%
- Command Execution:
  - Opus 4.6: 19% (the domain’s best, yet the weakest champion score of any category)
  - Haiku 4.5: 7%
The Takeaway: No Universal Winner
The study underscores a critical insight: there is no single "best" AI model for all security tasks. A model’s performance is highly dependent on the domain and the specific requirements of the function. For instance:
- Haiku 4.5 excels in database operations and authentication but struggles with command execution remediation.
- Opus 4.6 dominates in authentication remediation but fails consistently in JWT generation.
- Gemini 2.5 Pro leads in database remediation but lags in vulnerability rates for the same category.
This variability suggests that organizations should evaluate AI models based on their intended use cases rather than relying on aggregate benchmarks. The right model for a database-heavy application may be entirely unsuitable for file handling or authentication workflows.
A Practical Framework for AI Security Evaluation
To avoid the pitfalls of aggregate rankings, consider these steps when selecting an AI model for security-critical tasks:
- Define your primary security domains. Identify the most common functions your applications will perform (e.g., database queries, authentication, file operations).
- Benchmark models per domain. Use domain-specific prompts and security rules to evaluate performance, rather than relying on a single aggregate score (see the tally sketch after this list).
- Prioritize remediation rates. A model with a high fix rate may be more valuable than one with a low vulnerability rate, especially if it reduces manual intervention.
- Test edge cases. Ensure the model handles atypical scenarios, such as dynamic filenames or complex authentication flows, without triggering false positives.
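A minimal sketch of such a per-domain tally; the data shape and scoring harness are assumptions, not the study’s tooling:

```typescript
type Finding = {
  model: string;
  domain: string;     // e.g. "Authentication", "File I/O"
  vulnerable: boolean;
  fixed: boolean;     // did the model remediate when asked?
};

// Aggregate per (model, domain) instead of per model, so a strong
// overall score can't hide a weak domain.
function rateByDomain(findings: Finding[]): void {
  const buckets = new Map<string, { total: number; vuln: number; fixed: number }>();
  for (const f of findings) {
    const key = `${f.model} / ${f.domain}`;
    const b = buckets.get(key) ?? { total: 0, vuln: 0, fixed: 0 };
    b.total += 1;
    if (f.vulnerable) {
      b.vuln += 1;
      if (f.fixed) b.fixed += 1;
    }
    buckets.set(key, b);
  }
  for (const [key, b] of buckets) {
    const vulnRate = ((100 * b.vuln) / b.total).toFixed(0);
    const fixRate = b.vuln ? ((100 * b.fixed) / b.vuln).toFixed(0) : "n/a";
    console.log(`${key}: ${vulnRate}% vulnerable, ${fixRate}% fix rate`);
  }
}
```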
As AI systems become more integrated into security workflows, the need for granular, domain-specific evaluations will only grow. The days of relying on a single benchmark number are over—real-world security demands precision, not generalization.
The future of AI security lies in understanding where each model shines, not in crowning a single victor. By embracing this nuance, organizations can make smarter, safer choices—tailoring their AI stack to their unique needs rather than chasing an illusory top rank.
AI summary
AI model comparisons reduced to a single number can be misleading. A study analyzing 700 security functions reveals which model is safest at which task.