AI code is polished—but hidden bugs can slip through unseen

In 2026, a cash-in-transit (CIT) company faced a costly logistics nightmare: 42% of ATMs ran out of cash because their routing model relied on averages. A machine learning solution cut the stockout rate to 25%, but the real shock came when testing revealed three deep flaws in the AI-generated code, flaws invisible to linters, reviewers, and even some automated tests.

The project, which combined deterministic routing with a machine learning model, was running smoothly on paper. Yet when researchers compared its quantum approximate optimization algorithm (QAOA) implementation against a classical mixed-integer linear programming (MILP) baseline, the results made no sense. The numbers looked right, with identical energy values and identical outputs, but the constraints were being violated. Something was fundamentally broken.

From 42% stockouts to a 51% cost cut

Most ATM cash routing models assume demand follows predictable patterns. But real-world cash needs fluctuate unpredictably. For a fleet of vehicles servicing 20 ATMs with daily demand variations, a classical capacitated vehicle routing problem (CVRP) formulation planned on average demand fails spectacularly. In this case, the traditional approach resulted in stockouts for 42% of ATMs, leading to emergency refills, overtime labor, and lost customer trust.

The team’s solution layered two approaches:

A deterministic baseline using CVRP
A decision-focused learning model (SPO+) that adapts to demand uncertainty

By integrating uncertainty directly into the optimization process, the model reduced stockouts from 42% to 25%, a relative reduction of about 40%, while cutting operational costs by just over half. The gains were undeniable, but the implementation process highlighted a growing blind spot in AI-driven development: code that looks correct isn’t necessarily behaving correctly.

Quantum-inspired AI code with glaring contradictions

The project expanded into quantum computing territory when researchers explored QAOA using Q# to solve the same routing problem. Within hours, an AI assistant delivered a polished, complete codebase with tables, charts, and performance metrics. Everything looked professional and aligned with expectations.

Then the inconsistencies appeared.

The AI-generated QAOA code returned the same energy value as the MILP solver despite visibly violating capacity constraints. Even more puzzling, the results for p=1 and p=3 (two different circuit depths) matched to four decimal places. A deeper QAOA circuit has more tunable parameters, so it should produce better, or at least different, results. This wasn’t just a typo. It pointed to a flaw in the logic itself.

# phase2c_qaoa_simulator.py, line 56
LAMBDA_C = 0.5  # Penalty weight was far too small

The penalty weight for violating vehicle capacity was set to 0.5, while actual route costs ranged from 1.2 to 3.5. With a total demand of 330,000 TL and a vehicle capacity of 250,000 TL, violating the constraint incurred a penalty of just 0.051 units, less than 1.5% of the most expensive route’s cost. QAOA couldn’t distinguish between a valid and an invalid solution because the penalty was practically invisible.

Changing the penalty to 40.0—exceeding the maximum route cost—fixed the issue. Suddenly, violations mattered, and the model’s behavior aligned with expectations.
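The arithmetic behind those two penalty values can be reproduced with a short sketch. The article does not show the penalty formula, so the quadratic form below (weight times the squared, capacity-normalized excess) is an assumption, chosen because it matches the quoted 0.051 figure. The function names are hypothetical.

```python
# ASSUMPTION: the source does not show the penalty formula. This sketch assumes
#   penalty = lambda_c * ((total_demand - capacity) / capacity) ** 2
# which happens to reproduce the 0.051 figure quoted in the text.
TOTAL_DEMAND_TL = 330_000
CAPACITY_TL = 250_000
MAX_ROUTE_COST = 3.5  # upper end of the 1.2 to 3.5 range quoted in the text


def capacity_penalty(lambda_c: float, demand: float, capacity: float) -> float:
    """Quadratic penalty on the normalized capacity excess (zero if within capacity)."""
    excess = max(0.0, demand - capacity) / capacity
    return lambda_c * excess ** 2


def penalty_is_visible(lambda_c: float) -> bool:
    """True when a violation costs more than the most expensive route."""
    return capacity_penalty(lambda_c, TOTAL_DEMAND_TL, CAPACITY_TL) > MAX_ROUTE_COST


for lam in (0.5, 40.0):
    p = capacity_penalty(lam, TOTAL_DEMAND_TL, CAPACITY_TL)
    print(f"lambda={lam:>5}: penalty={p:.4f}, visible={penalty_is_visible(lam)}")
```

Under this assumed form, a weight of 0.5 yields a penalty of about 0.0512, while 40.0 yields about 4.096, which is above the 3.5 maximum route cost. A check like `penalty_is_visible` is cheap to run before any optimizer is invoked.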

Silent contradictions and the wrong metric

The second bug emerged from the MILP solver itself. The model enforced two constraints: visit all ATMs and never exceed vehicle capacity. But total demand (330k TL) exceeded total capacity (250k TL). No solution could satisfy both conditions. Yet the solver returned "Optimal" without warning.

Method            Energy    Constraints
PuLP/MILP         -2.5599   ✅ Satisfied
QAOA p=1          -2.5599   ⚠️ Violation
QAOA p=3          -2.5599   ⚠️ Violation

The contradiction went unnoticed because the solver’s output looked valid. No static analysis tool flags a model that claims optimality while violating its own constraints. The issue only surfaced when humans scrutinized the input data.
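A pre-solve sanity check on the input data is one way to surface this kind of mismatch before any solver runs. The sketch below is a hypothetical helper, not code from the project. The per-ATM split of the 330,000 TL total is invented for illustration, and only the totals come from the text.

```python
def check_capacity_feasible(demands_tl, vehicle_capacities_tl):
    """Raise ValueError if serving every ATM cannot fit in total vehicle capacity.

    Returns the remaining slack (capacity minus demand) otherwise.
    """
    total_demand = sum(demands_tl)
    total_capacity = sum(vehicle_capacities_tl)
    if total_demand > total_capacity:
        raise ValueError(
            f"Infeasible: total demand {total_demand:,} TL exceeds "
            f"total capacity {total_capacity:,} TL by "
            f"{total_demand - total_capacity:,} TL"
        )
    return total_capacity - total_demand


# HYPOTHETICAL split across five ATMs; only the 330,000 TL total is from the text.
demands = [90_000, 80_000, 70_000, 50_000, 40_000]

try:
    check_capacity_feasible(demands, [250_000])
except ValueError as err:
    print(err)
```

Note that this is only a necessary condition. Passing it does not prove that a feasible assignment of ATMs to vehicles exists, but failing it proves that none does.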

The third bug was subtler: the comparison metric was measuring the wrong thing. The team compared p=1 and p=3 by looking at the argmax-bit solution—the most likely bitstring output. For every value of p, this always returned [1,1,1,1,1], yielding identical results regardless of circuit depth. In reality, the quantum circuits were running differently, but the measurement method couldn’t detect the difference.
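The difference between the two metrics can be illustrated with a toy example. The bitstrings, energies, and probabilities below are made up for illustration and are not the project's data. The point is only that two output distributions can share the same most-likely bitstring while having different expectation energies.

```python
def argmax_bitstring(probs):
    """Most likely bitstring in a measurement distribution."""
    return max(probs, key=probs.get)


def expected_energy(probs, energy):
    """Probability-weighted average energy over all measured bitstrings."""
    return sum(p * energy[bits] for bits, p in probs.items())


# ILLUSTRATIVE values only; not taken from the project.
energy = {"11111": -2.5, "11110": -2.0, "01111": -1.0}
dist_p1 = {"11111": 0.40, "11110": 0.30, "01111": 0.30}  # stand-in for a shallow circuit
dist_p3 = {"11111": 0.70, "11110": 0.20, "01111": 0.10}  # stand-in for a deeper circuit

for name, dist in (("p=1", dist_p1), ("p=3", dist_p3)):
    print(name, argmax_bitstring(dist), round(expected_energy(dist, energy), 4))
```

Both distributions report `11111` as the argmax, so an argmax-based comparison sees no difference, while the expectation energy separates them.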

Why static checks fail—and what to do instead

All three bugs shared a common trait: the code was syntactically correct, logically structured, and visually polished. No linter would catch them. No architecture rule would trigger. Code review would likely approve it as "looks good to go."

The problem wasn’t in the code’s form—it was in its behavior.

Penalty weights that are too small to influence results
Models that return "optimal" for impossible solutions
Metrics that measure the wrong property

These issues can only be caught by executing the code and validating its outputs against intended behavior. Yet modern development workflows rely heavily on static analysis—tools that read code but don’t run it.

Introducing the Golden Demo: turning specs into executable tests

Most software projects start with a specification—a document describing what the software should do. But specifications are often static, written in prose, and disconnected from implementation. The Spec-Driven Development movement argues for executable specifications: small, deterministic reference implementations that represent the intended behavior.

The Golden Demo concept builds on this idea:

During planning, create a minimal golden example—an executable reference that solves a subset of the problem with known correct outputs.
After development, run both the golden example and the actual code against the same test vectors. Any divergence triggers a drift report.
At merge time, enforce that the code not only compiles and passes unit tests but also produces behavior consistent with the golden example.
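The three steps above could be sketched as a small drift check. This is a minimal illustration under stated assumptions, not an existing tool. `drift_report`, `golden_cost`, and `candidate_cost` are hypothetical names, and the penalized cost function is an assumed stand-in for the real objective.

```python
def drift_report(golden, candidate, test_vectors, tol=1e-9):
    """Run both implementations on the same inputs and list every disagreement."""
    drifts = []
    for vector in test_vectors:
        expected = golden(*vector)
        actual = candidate(*vector)
        if abs(expected - actual) > tol:
            drifts.append({"input": vector, "expected": expected, "actual": actual})
    return drifts


def _penalized_cost(route_cost, load, capacity, weight):
    # ASSUMED objective: route cost plus a quadratic penalty on normalized excess load.
    excess = max(0.0, load - capacity) / capacity
    return route_cost + weight * excess ** 2


def golden_cost(route_cost, load, capacity):
    """Reference behavior: penalty weight above the maximum route cost."""
    return _penalized_cost(route_cost, load, capacity, weight=40.0)


def candidate_cost(route_cost, load, capacity):
    """Implementation under test: weight too small to matter."""
    return _penalized_cost(route_cost, load, capacity, weight=0.5)


vectors = [
    (2.0, 200_000, 250_000),  # within capacity: both implementations agree
    (2.0, 330_000, 250_000),  # over capacity: the weak penalty shows up as drift
]
for drift in drift_report(golden_cost, candidate_cost, vectors):
    print(drift)
```

A merge gate could then fail whenever `drift_report` returns a non-empty list. The value of the check depends on the test vectors: a set containing only within-capacity cases would report no drift here.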

In the routing project, a Golden Demo would have caught all three bugs:

Bug 1: A test vector requiring lambda > max_route_cost would have failed immediately when the penalty weight was set to 0.5.
Bug 2: A constraint check in the golden demo would have flagged the impossible demand-capacity mismatch before MILP even ran.
Bug 3: Comparing expectation energy (not argmax outputs) would have revealed differences between p=1 and p=3.

This approach doesn’t replace linters or security scans—it complements them. It shifts the focus from "does the code compile?" to "does it do what we intended?"

The future of AI-assisted development

As AI tools generate more code—from full applications to quantum algorithms—the risk of subtle, behavior-level bugs grows. Static analysis can’t catch them. Code reviews often miss them. And users certainly won’t tolerate them once they’re in production.

The solution lies in shifting from code correctness to behavioral correctness. Approaches like the Golden Demo bridge the gap between specification and execution, helping ensure that AI-generated code doesn’t just look good but works as intended.

The era of polished but flawed AI code isn’t inevitable. It’s a solvable problem—one executable test at a time.

AI summary

Three critical bugs surfacing in a single optimization project exposed the biggest blind spot in modern software development. So why did they go unnoticed? Here is the answer and the proposed fix.
