How LLMs miss critical bugs for months—lessons from a 3-month debugging loop

That afternoon, a Slack alert from a monitoring bot claimed a script had never executed—even though it had successfully pulled 81 weather observations just two minutes earlier. Unraveling the discrepancy took three grueling hours, but the deeper issue had been festering for months beneath the surface.

For three consecutive months, a cron health alert had triggered two or three times per week. Each alert prompted a debugging session where an LLM was asked to diagnose why a script was reporting "NEVER RUN." The model would analyze logs, propose a plausible fix, and confirm with variations of "Yep, that’s it—we’ve got it." After applying the suggested changes, the issue would temporarily disappear—only to resurface within weeks on a different script. The pattern repeated itself like clockwork.

The root of the problem wasn’t the model’s responses but the workflow itself. I treated each debugging session as an isolated incident, and the model carried no memory of previous interactions. It saw only the current context, while I read each fix as progress toward a permanent solution. The illusion of resolution was strong enough to mask a persistent structural flaw.

I maintain around 66 scheduled scripts on a personal VPS. This story is just one example from that collection. Earlier benchmarking of frontier models had provided useful insights for planning, but it failed to account for the hallucination risks inherent in automating critical debugging tasks.

The brittle architecture behind the alerts

The cron health monitor functions like a dead man’s switch. Each scheduled script is expected to send a heartbeat ping upon execution. If the monitor doesn’t receive the ping within the expected window, it triggers a Slack alert. Services like Healthchecks.io, Cronitor, and Dead Man’s Snitch handle this exact use case—providing HTTP endpoints that scripts ping, then alerting if a ping is missed.
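The dead man’s switch boils down to one comparison: how long ago the last ping arrived versus how long the script is allowed to stay silent. Here is a minimal sketch in shell, assuming timestamps are Unix epoch seconds and that a last-ping value of 0 means "no ping on record". The function name `heartbeat_status` and its arguments are hypothetical, not taken from my monitor or from any of the services above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify one check from its last heartbeat.
# Arguments (integers, in seconds):
#   $1 = epoch time of the last ping (ASSUMPTION: 0 means "never pinged")
#   $2 = current epoch time
#   $3 = allowed silence window (schedule period plus grace time)
heartbeat_status() {
  local last_ping="$1"
  local now="$2"
  local window="$3"

  if [ "$last_ping" -eq 0 ]; then
    echo "NEVER RUN"      # registered, but no ping was ever received
  elif [ $(( now - last_ping )) -gt "$window" ]; then
    echo "OVERDUE"        # pinged before, but not within the window
  else
    echo "OK"
  fi
}
```

Note that the monitor only knows about pings, not about executions. A script that runs correctly but never sends its heartbeat would land in the first branch.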

My system replicated this pattern but with a critical difference: it lacked the guardrails of a commercial service. A SaaS monitor would reject attempts to register a slug that had never sent a ping, but my homemade version permitted this oversight. The architecture relied on three loosely connected components that had to align perfectly:

crontab.txt (slug tag)
checks.json (registry)
Source code with a health_run() function

The edges between these components were dangerously underdefined. It was possible to register a script in checks.json without adding a health_run() call in the source. A cron line could reference a slug that didn’t exist in the registry. An alert could be muted indefinitely without touching the underlying code. Each new cron script I deployed inherited these same oversights, perpetuating the cycle of alerts and superficial fixes.
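The missing edges can be checked mechanically. The sketch below cross-checks the three components under assumed formats, none of which are confirmed by my actual files: each cron line ends with a `# slug=<name>` tag, checks.json contains `"slug": "<name>"` entries, and a script pings by calling `health_run("<name>")` with a double-quoted literal. All function names are hypothetical, and the registry is scanned as plain text rather than parsed as JSON.

```shell
#!/usr/bin/env bash
# Sketch of a three-way consistency check. ASSUMED formats (hypothetical):
#   crontab.txt : each job line ends with "# slug=<name>"
#   checks.json : each check has a "slug": "<name>" entry
#   sources     : a script pings via health_run("<name>")

# Print the slugs tagged on active (non-comment) crontab lines.
crontab_slugs() {
  sed -n '/^[ ]*#/d; s/.*# slug=\([A-Za-z0-9_-]*\).*/\1/p' "$1" | sort -u
}

# Print the slugs found in checks.json (text scan, not a JSON parser).
registry_slugs() {
  grep -o '"slug"[ ]*:[ ]*"[^"]*"' "$1" | sed 's/.*"\([^"]*\)"$/\1/' | sort -u
}

# Report every broken edge; return non-zero if any was found.
check_edges() {
  local crontab_file="$1"
  local registry_file="$2"
  local src_dir="$3"
  local status=0
  local slug

  # Edge 1: every cron tag must exist in the registry.
  for slug in $(crontab_slugs "$crontab_file"); do
    if ! registry_slugs "$registry_file" | grep -Fxq "$slug"; then
      echo "cron slug not in registry: $slug"
      status=1
    fi
  done

  # Edge 2: every registered slug must have a health_run() call somewhere.
  for slug in $(registry_slugs "$registry_file"); do
    if ! grep -rqF "health_run(\"$slug\")" "$src_dir"; then
      echo "registered but never pings: $slug"
      status=1
    fi
  done

  return "$status"
}
```

Run as `check_edges crontab.txt checks.json ./scripts`, this exits non-zero and names each offending slug, so it can gate a deploy instead of waiting for the next alert.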

"Didn’t we just fix this?" — Me, that afternoon, realizing the structural issue had never been addressed.

The near-catastrophic discovery

The breakthrough came when I initiated a single, consolidated debugging session. I loaded Opus 4.6 for architectural review while Codex CLI (GPT-5) rewrote the validator and applied tags to all 66 cron lines. The entire process took about 35 minutes.

"[K-2SO] The chaos has been slightly inconvenienced." — Opus 4.6, after the second audit round but before uncovering a critical flaw in crontab-apply.sh that could have erased every scheduled job on the VPS.

(K-2SO is an internal persona used to prompt Claude Code with a sardonic tone.)

After four audit passes, Sonnet 4.5—equipped with shellcheck—flagged a bug that neither Opus nor Codex had detected during implementation or earlier reviews. The crontab-apply.sh script was designed to safely install a new crontab, but its logic was dangerously reversed:

# BEFORE: install first, verify after
crontab crontab.txt                      # new crontab goes live immediately
crontab -l > verify.txt                  # capture what is now installed
diff crontab.txt verify.txt || exit 1    # mismatch is detected only after the damage is done

If the diff failed, the script would exit with an error, but the corrupted crontab.txt was already live. A malformed cron file could have silently wiped every scheduled job on the VPS with no recovery path.

The correct sequence flips the order of operations:

# AFTER: back up, verify first, install only on success
crontab -l > backup.txt                  # save the live crontab so a rollback is possible
restore_backup() { crontab backup.txt; } # assumed definition; the real helper is not shown here
crontab -n crontab.txt || exit 1         # dry-run syntax check; the flag differs between cron implementations, check crontab(1)
crontab crontab.txt                      # install only if validation passes
crontab -l > verify.txt                  # read back what was actually installed
diff crontab.txt verify.txt || { restore_backup; exit 1; }  # roll back on mismatch

This bug had existed in the script since its creation. Opus and Codex both reviewed the file but missed the flaw. I had never scrutinized the script either, relying entirely on the models to catch critical errors. Before this session, shellcheck wasn’t part of the review pipeline at all, yet it proved indispensable in catching this oversight.

Sonnet 4.5 didn’t outperform Opus or Codex in reasoning; it simply had a deterministic tool in the loop. The model read shellcheck’s output and escalated the finding. A validator that never fails isn’t a validator; it’s a confidence booster with no real oversight.
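One cheap way to find out whether a validator can fail at all is to feed it input that is known to be bad and require a rejection. The sketch below shows the idea with a stand-in validator; both function names are hypothetical, and `validate_crontab` only counts fields, so it is not a real cron syntax check (it would also reject `@reboot` shortcuts and environment lines).

```shell
#!/usr/bin/env bash
# Stand-in validator (hypothetical): accept a file only if every non-blank,
# non-comment line has five schedule fields plus a command.
validate_crontab() {
  awk '
    NF == 0   { next }       # blank line
    $1 ~ /^#/ { next }       # comment line
    NF < 6    { bad = 1 }    # too few fields for "m h dom mon dow command"
    END       { exit bad + 0 }
  ' "$1"
}

# Canary: run the given validator command on a known-bad file.
# Returns 0 only if the validator rejected it, i.e. it is able to fail.
validator_can_fail() {
  local bad_file
  local rc=0
  bad_file=$(mktemp)
  printf '%s\n' 'not a cron line' > "$bad_file"
  if "$@" "$bad_file"; then
    echo "validator accepted garbage; do not trust it"
    rc=1
  fi
  rm -f "$bad_file"
  return "$rc"
}
```

Calling `validator_can_fail validate_crontab` succeeds, while `validator_can_fail true` fails, because `true` accepts everything. Running the canary before the real validation turns "it passed" into a statement that carries information.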

The transformation in numbers

The consolidated debugging session delivered measurable improvements across multiple dimensions:

Session duration: 3 hours (down from months of fragmented effort)
Total bugs identified: 18 (9 during implementation, 9 across 4 audit passes)
Audit passes completed: 4 (Codex cold audit + qa-bash + qa-python + Opus final)
Bugs caught per pass: 5, 1, 3, 0
Cron lines tagged: 66 (0% coverage → 94% tagged)
Scripts never pinging: handful → 1 (legitimate edge case)
Alerts muted until 2099: 15 → 1 (only legitimate intentional mute)

Key principles for working with LLMs

1. Prioritize deterministic tools

Tools like shellcheck, pytest-cov, mypy, or schema validators should be the first line of defense. These systems either find bugs or don’t—with no probabilistic layer involved. LLMs excel at tasks that require reasoning where deterministic checks fall short, but stacking more LLM passes is not a substitute for a single, reliable validator.
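As a rough illustration of putting the deterministic layer first, the sketch below gates a directory of shell scripts before any model sees them. The function name `lint_shell_scripts` is hypothetical. It always runs `bash -n` (a parse-only syntax check built into bash) and adds shellcheck when that tool is installed; `bash -n` is a much weaker check and is used here only so the gate still does something without shellcheck.

```shell
#!/usr/bin/env bash
# Sketch of a deterministic pre-review gate (hypothetical function name).
# Checks every *.sh file directly under the given directory and returns
# non-zero if any check reports a problem.
lint_shell_scripts() {
  local dir="$1"
  local status=0
  local script

  for script in "$dir"/*.sh; do
    [ -e "$script" ] || continue              # no *.sh files in the directory

    # Parse-only syntax check; nothing in the script is executed.
    if ! bash -n "$script" 2>/dev/null; then
      echo "syntax error: $script"
      status=1
    fi

    # Static analysis, only when shellcheck is available on this machine.
    if command -v shellcheck >/dev/null 2>&1; then
      shellcheck "$script" || status=1
    fi
  done

  return "$status"
}
```

A wrapper such as `lint_shell_scripts ./scripts || exit 1` at the top of the review pipeline means the model is only asked to reason about files that have already passed the mechanical checks.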

2. Consolidate debugging into focused sessions

Treating each alert as an isolated incident prevents the model from recognizing patterns. A single, comprehensive session with full context allows the model to identify systemic issues rather than superficial symptoms.
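Part of that full context is the alert history itself. The sketch below counts alerts per script so that a recurring pattern across different slugs becomes visible at a glance. It assumes a hypothetical log with one alert per line in the form `<date> <slug> <status>`; my real alerts lived in Slack, so this format and the function name `alerts_by_slug` are illustrative only.

```shell
#!/usr/bin/env bash
# ASSUMED log format (hypothetical), one alert per line:
#   <date> <slug> <status>     e.g. status NEVER_RUN
# Prints "<count> <slug>" for every slug with NEVER_RUN alerts,
# most frequent first, ties broken alphabetically.
alerts_by_slug() {
  awk '
    $3 == "NEVER_RUN" { count[$2]++ }
    END { for (slug in count) print count[slug], slug }
  ' "$1" | sort -k1,1nr -k2,2
}
```

If the output lists many different slugs with a few alerts each, the problem is unlikely to live in any single script, which is exactly what per-alert debugging sessions cannot show.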

3. Implement layered audits

Combine model-driven reviews with deterministic scanning. After multiple model passes, introduce tools like shellcheck or linters to catch overlooked syntax or logic flaws. The synergy between probabilistic and deterministic validation is far more robust than either approach alone.

4. Question the illusion of resolution

If a fix feels too easy or the model expresses unwarranted confidence, pause and validate. The human tendency to accept "good enough" solutions can prolong issues indefinitely when dealing with automated systems.

Automating critical systems with LLMs offers immense potential, but it also demands rigorous safeguards. The lesson isn’t to abandon LLMs but to pair them with deterministic checks, structured workflows, and a healthy skepticism toward their output. The next time a model claims "Fixed" when nothing has changed, you’ll know where to start looking.

AI summary

For three months, an LLM kept answering "problem solved," yet the bugs persisted. In this true story, discover the lessons learned and how to break out of systematic errors.