
How we replaced keyword tricks with real AI logic in our agent system

After years of relying on shallow string matching, an audit revealed that 15 of our AI agent's 22 behavior triggers were essentially smoke and mirrors. Discover how a single semantic-layer upgrade transformed the agent's performance and reliability overnight.


A recent deep dive into our AI agent’s core logic exposed a troubling truth: nearly two-thirds of its automated behaviors weren’t actually intelligent. They relied on oversimplified string comparisons and keyword tricks disguised as sophisticated functions. The discovery forced us to rebuild the entire system around a single, shared semantic layer—leading to measurable improvements in task recognition, feedback analysis, and contradiction detection.

The moment of reckoning: when shortcuts stopped working

The breaking point arrived when the agent failed to recognize two nearly identical requests: “optimize database queries” and “speed up SQL performance.” Both clearly described the same goal, yet the system returned a similarity score of zero. A quick audit of the remaining 21 primitive behaviors revealed a pattern of deception. Ten were merely keyword matches dressed in technical jargon, while five were outright theater—such as an “adversarial analysis” function that reduced sophisticated logic to a hash-based boolean check.

Most of these behaviors were performing elementary string operations in code that looked advanced. The contradiction detector searched for the word “not” near other terms. The deduplication primitive compared tasks using exact string matches after lowercasing. Even the similarity checker used Jaccard similarity, which ignored synonyms and paraphrases entirely. These weren’t broken features; they were elaborate façades.
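For a sense of how shallow these primitives were, here is a minimal reconstruction of the pattern the audit kept finding. It is illustrative rather than the original implementation, but the word-overlap score and the “not” search below are what the old similarity checker and contradiction detector amounted to:

def old_similarity(a: str, b: str) -> float:
    # Jaccard overlap on lowercased word sets: paraphrases and synonyms
    # share no tokens, so semantically identical requests score zero.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

def old_contradiction_check(a: str, b: str) -> bool:
    # "Contradiction detection" that just looks for the word "not"
    # alongside some overlapping terms.
    return "not" in b.lower().split() and old_similarity(a, b) > 0

print(old_similarity("optimize database queries", "speed up SQL performance"))  # 0.0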

Building a unified semantic foundation instead of patching individual flaws

The initial instinct was to upgrade each flawed primitive separately—adding TF-IDF vectors here, WordNet synonyms there, a more advanced POS tagger elsewhere. But maintaining ten distinct NLP pipelines would introduce new complexity, version conflicts, and edge cases. A better solution emerged: consolidate all semantic processing into a single, shared embedding module.

We selected the all-MiniLM-L6-v2 model from the sentence-transformers library, a compact 22MB neural network that generates 384-dimensional vector embeddings for text. On a standard AMD Ryzen 5 machine with 14GB RAM and no dedicated GPU, the model processes a single sentence in about 80 milliseconds. While not fast enough for real-time chat interactions, it’s perfectly suited for background analysis and batch operations that characterize our system’s primitives.

The core semantic module is straightforward but robust:

from functools import lru_cache

try:
    from sentence_transformers import SentenceTransformer
    _model = SentenceTransformer('all-MiniLM-L6-v2')
    _SEMANTIC_AVAILABLE = True
except Exception:
    # Missing package or a failed model load both degrade gracefully.
    _SEMANTIC_AVAILABLE = False

@lru_cache(maxsize=512)
def embed(text: str):
    """Return a normalized 384-dim embedding, or None if the model is unavailable."""
    if not _SEMANTIC_AVAILABLE:
        return None
    return _model.encode(text, normalize_embeddings=True)

def similarity(a: str, b: str) -> float:
    """Cosine similarity of two texts, with a word-overlap fallback."""
    ea, eb = embed(a), embed(b)
    if ea is None or eb is None:
        # Fallback to Jaccard similarity on lowercased word sets
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / max(len(sa | sb), 1)
    # Embeddings are normalized, so the dot product is the cosine similarity
    return float(ea @ eb)

Three design choices proved critical:

  • The try/except import pattern ensures graceful degradation. If the model fails to load or the package is missing, every primitive automatically falls back to the original Jaccard similarity. The system remains functional, albeit less intelligent, which is essential for constrained deployment environments.
  • The lru_cache with 512 entries drastically reduces redundant computations. In practice, task descriptions and feedback snippets repeat frequently during a session, achieving a 60–70% cache hit rate. Cached lookups drop from 80ms to nearly zero, accelerating the entire pipeline.
  • Normalized embeddings allow the dot product (ea @ eb) to function as cosine similarity directly, eliminating the need for separate normalization steps and simplifying the codebase (a short usage sketch follows this list).
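Here is that usage sketch, assuming the module above is in scope; the call pattern and cache inspection are illustrative, not production code:

# The first encode of each unique string pays the ~80 ms cost; repeats
# are served from the 512-entry lru_cache.
score = similarity("optimize database queries", "speed up SQL performance")
print(f"semantic similarity: {score:.3f}")

# lru_cache exposes hit/miss counters, a convenient way to verify the
# cache hit rate mentioned above.
print(embed.cache_info())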

The dramatic improvement: from zeros to meaningful scores

The transformation in performance was more dramatic than anticipated. When comparing semantically equivalent but lexically distinct phrases, the new system produced scores that aligned with human intuition, while the old keyword-based approach returned zero or near-zero scores across the board.

Here’s a comparison of old versus new similarity scores for common task variations:

  • “optimize database queries” vs “speed up SQL performance”: 0.000 → 0.736
  • “fix the login bug” vs “users can’t sign in”: 0.000 → 0.682
  • “refactor auth module” vs “clean up authentication code”: 0.250 → 0.814
  • “add dark mode” vs “implement dark theme”: 0.000 → 0.891
  • “improve error messages” vs “better error handling”: 0.167 → 0.593
  • “update dependencies” vs “bump package versions”: 0.000 → 0.547

The new scores aren’t perfect: 0.547 for “update dependencies” versus “bump package versions” still leaves room for refinement. They are, however, a usable signal where the old approach produced zero or near-zero values, and the primitives can now operate on tuned thresholds. Deduplication uses a 0.75 cutoff to avoid merging unrelated tasks, while similarity detection uses 0.60 to favor over-suggestion over missed matches.
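As a rough sketch of how the deduplication threshold plays out in practice (the function name and structure here are hypothetical, not the actual primitive):

DEDUP_THRESHOLD = 0.75  # the deduplication cutoff described above

def dedupe_tasks(tasks):
    # Keep a task only if it is not semantically close to any task already kept.
    kept = []
    for task in tasks:
        if all(similarity(task, existing) < DEDUP_THRESHOLD for existing in kept):
            kept.append(task)
    return kept

# "add dark mode" and "implement dark theme" score 0.891, so they merge;
# "update dependencies" stays separate.
print(dedupe_tasks(["add dark mode", "implement dark theme", "update dependencies"]))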

The contradiction detector now runs a two-stage process: it first filters candidate pairs by semantic similarity, then applies logical analysis only to the pairs that survive. This outperforms the previous keyword-based “look for the word ‘not’” method and reduces false positives across the board.
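A sketch of that shape follows. The stage-two check is passed in as a callable, since its internals aren’t shown here, and the 0.6 gate is illustrative rather than a tuned production value:

from typing import Callable, List, Tuple

def detect_contradictions(statements: List[str],
                          logic_check: Callable[[str, str], bool],
                          gate: float = 0.6) -> List[Tuple[str, str]]:
    # Stage 1: semantic pre-filter. Only statements that talk about the
    # same thing are worth a contradiction check, which is what cuts
    # the false positives.
    candidates = [(a, b)
                  for i, a in enumerate(statements)
                  for b in statements[i + 1:]
                  if similarity(a, b) >= gate]
    # Stage 2: the heavier logical analysis runs only on the survivors.
    return [(a, b) for a, b in candidates if logic_check(a, b)]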

Lessons learned: simplicity scales, complexity collapses

The journey from keyword-based illusions to a unified semantic layer revealed a fundamental truth about agent systems: modularity without semantic unity breeds maintenance hell. Each specialized NLP pipeline introduced hidden costs—downloads, dependencies, fallback logic, and edge cases that multiplied with every primitive.

By consolidating semantic processing into a single module, we reduced 21 independent pipelines to one that all primitives share. The system became easier to maintain, more reliable, and significantly more accurate. Most importantly, it finally began to function as intended—not as a collection of clever string tricks, but as an agent capable of genuine understanding.

Moving forward, we’ll extend this semantic layer to additional components and explore model quantization to reduce latency further. The goal isn’t just to make the agent smarter—it’s to make it genuinely intelligent in the way it was always meant to be.

AI summary

Do you suspect your AI systems are running on nothing more than keyword matching? Discover how adding a single semantic layer can boost their performance.
