iToverDose/Startups · 28 MAY 2026 · 16:01

How SQL query logs can stop AI agents from inventing fake joins

Enterprise AI agents often return wrong answers when they guess table relationships in complex databases. New tools now mine actual query history to guide these agents toward reliable data paths.

VentureBeat · 3 min read

AI agents struggle when left to navigate sprawling database ecosystems without guidance. When Miro’s analytics team let agents query its Snowflake warehouse directly, the results were wrong more than 65% of the time. The issue wasn’t the underlying model—it was the absence of context. With over 10,000 tables and no semantic framework to clarify which data served which business questions, agents made poor routing decisions.

DataHub is addressing this gap with a new context intelligence layer launching Thursday. Instead of relying on raw schema or static metadata, the platform analyzes years of SQL query history to build a semantic index that agents can query in real time. The capability works over the Model Context Protocol (MCP) and with popular agent frameworks including LangChain, Google’s Agent Development Kit, and CrewAI.

Behind this innovation is DataHub’s decade-long expertise in metadata management. Founded by former LinkedIn data infrastructure leaders, the company evolved from an open-source project into an enterprise-grade solution with over 15,000 contributors and 3,000 production deployments worldwide. Shirshanka Das, co-founder and CTO, emphasized the breakthrough during an interview with VentureBeat: "For the first time, enterprises can convert years of analyst query history into a living knowledge base where agents stop inventing joins because they reference real, validated query patterns."

From lineage tracking to agent navigation

DataHub originated at LinkedIn as a dual-purpose project: simplify data discovery across teams while ensuring proper usage standards. After nearly six years of internal development, the team open-sourced the platform in early 2020. Its core function—lineage tracking—remains critical for compliance audits, incident response, and onboarding new engineers.

The platform’s strength lies in its broad connectivity. Postgres ranks as the most frequently connected source globally, followed by MySQL, Oracle, and major cloud warehouses such as Snowflake and Google BigQuery. With support for over 100 metadata sources, DataHub has built robust infrastructure for extracting and parsing SQL logs—capabilities now repurposed for agent context.

Das noted a fundamental shift in data consumption: "The interface has moved from humans to autonomous agents."

Filtering noise to uncover proven query patterns

Context Intelligence operates as a new capability layer atop DataHub’s existing open-source foundation. The underlying machinery isn’t new: the platform has spent years parsing warehouse query logs for lineage tracking. Now it applies the same infrastructure to build a semantic index that agents can query at runtime.

The process begins by filtering warehouse logs to isolate high-value queries. DataHub targets "golden queries"—frequently executed analyst queries and scheduled pipelines that encode proven business logic. These patterns form the basis for semantic anchors—structured definitions that agents can retrieve before generating SQL.
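The filtering step can be pictured with a small Python sketch. It is an illustration only, not DataHub's implementation: the log format (dicts with `sql` and an optional `scheduled` flag), the `min_runs` threshold, and both function names are assumptions made for this example.

```python
import re
from collections import Counter


def normalize(sql: str) -> str:
    """Collapse literals and whitespace so repeated runs of one query group together."""
    sql = re.sub(r"'[^']*'", "?", sql)          # replace string literals
    sql = re.sub(r"\b\d+(\.\d+)?\b", "?", sql)  # replace standalone numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()


def golden_queries(log, min_runs=3):
    """Keep query shapes that ran at least min_runs times or came from a scheduled job.

    `log` is a hypothetical list of dicts: {"sql": str, "scheduled": bool (optional)}.
    """
    counts = Counter(normalize(entry["sql"]) for entry in log)
    scheduled = {normalize(e["sql"]) for e in log if e.get("scheduled")}
    return sorted(q for q, n in counts.items() if n >= min_runs or q in scheduled)


log = [
    {"sql": "SELECT * FROM orders WHERE id = 1"},
    {"sql": "SELECT * FROM orders WHERE id = 2"},
    {"sql": "select *  from orders where id = 3"},
    {"sql": "SELECT 1"},  # one-off ad hoc query, filtered out
    {"sql": "SELECT region, SUM(amount) FROM orders GROUP BY region", "scheduled": True},
]
print(golden_queries(log))
```

Normalizing literals before counting is the design choice that matters here: without it, the same analyst query run with different filter values would look like many rare queries instead of one frequent one.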

Das described the transformation as "inverting text to SQL," where patterns extracted from validated queries replace guesswork with reliable routing instructions. The system also incorporates human validation through Context Hub, allowing domain experts to review AI-generated context, resolve conflicting definitions, and simulate changes before publishing.
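One way to read "agents stop inventing joins" is as a lookup against join conditions that already appear in validated queries. The sketch below is a simplified illustration under stated assumptions: it uses regular expressions rather than a real SQL parser, handles only simple unqualified table names and `a.col = b.col` conditions, and every name in it (`join_edges`, `build_join_index`, `is_known_join`) is hypothetical rather than part of DataHub.

```python
import re

# Words that can follow a table name but are not aliases.
_NOT_ALIAS = r"on|join|where|left|right|inner|outer|full|cross|group|order|limit|using|having|union"
TABLE_RE = re.compile(
    rf"\b(?:from|join)\s+([\w.]+)(?:\s+(?:as\s+)?(?!(?:{_NOT_ALIAS})\b)(\w+))?"
)
EQ_RE = re.compile(r"\b(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)")


def join_edges(sql):
    """Return the set of ((table, column), (table, column)) pairs joined in one query."""
    sql = sql.lower()
    aliases = {}
    for table, alias in TABLE_RE.findall(sql):
        aliases[alias or table] = table
        aliases[table] = table
    edges = set()
    for la, lc, ra, rc in EQ_RE.findall(sql):
        if la in aliases and ra in aliases and aliases[la] != aliases[ra]:
            left, right = sorted([(aliases[la], lc), (aliases[ra], rc)])
            edges.add((left, right))
    return edges


def build_join_index(golden_sql):
    """Union the join edges seen across all validated queries."""
    index = set()
    for sql in golden_sql:
        index |= join_edges(sql)
    return index


def is_known_join(index, left, right):
    """True if this (table, column) pair was joined in some validated query."""
    return tuple(sorted([left, right])) in index


golden = [
    "SELECT c.name, SUM(o.amount) FROM orders o "
    "JOIN customers c ON o.customer_id = c.id GROUP BY c.name",
]
index = build_join_index(golden)
print(is_known_join(index, ("orders", "customer_id"), ("customers", "id")))
print(is_known_join(index, ("orders", "id"), ("customers", "id")))
```

An agent could run a check like `is_known_join` before emitting SQL and fall back to asking for clarification when a proposed join has no precedent in the index.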

Miro’s 10,000-table Snowflake challenge

Miro, the digital whiteboard platform, encountered this problem firsthand while testing analytics agents against its Snowflake environment. Ronald Angel, Miro’s data platform product manager, explained that exposing raw schema with over 10,000 tables led agents to incorrect answers more than 65% of the time. The scale overwhelmed the agents’ ability to determine valid table relationships.

The solution involved rearchitecting data access around well-defined data products. Miro’s current setup routes user requests from interfaces like Claude Chat through a context layer where DataHub’s MCP maps natural language to the correct data assets. The system then hands off to Snowflake’s MCP for SQL generation.

Angel highlighted the importance of semantic signals: the context layer pulls metadata, entity relationships, query history, and business intent for each table—specifically which business question each entity addresses. These signals enable the agent to identify the right database entities before writing SQL rather than relying on schema alone.
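To make the idea of picking entities before writing SQL concrete, here is a deliberately naive Python sketch. It assumes a hypothetical catalog in which each entity carries a `business_question`, a `description`, and a `query_count`, and it ranks entities by plain keyword overlap with the user's request. A production context layer would likely use richer signals; nothing here reflects how DataHub or Miro actually score entities.

```python
import re


def tokens(text):
    """Lowercased alphanumeric word set for a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_entities(question, catalog, top_k=3):
    """Rank catalog entities by word overlap with the question, then by usage.

    `catalog` is a hypothetical list of dicts with the keys
    name, business_question, description, query_count.
    """
    q = tokens(question)
    scored = []
    for entity in catalog:
        text = entity["business_question"] + " " + entity["description"]
        overlap = len(q & tokens(text))
        if overlap:  # entities with no shared words are never offered to the agent
            scored.append((overlap, entity["query_count"], entity["name"]))
    scored.sort(key=lambda s: (-s[0], -s[1], s[2]))
    return [name for _, _, name in scored[:top_k]]


catalog = [
    {"name": "finance.monthly_revenue",
     "business_question": "What is monthly revenue by region?",
     "description": "Recognized revenue per month",
     "query_count": 420},
    {"name": "raw.events_tmp",
     "business_question": "",
     "description": "Scratch copy of clickstream events",
     "query_count": 3},
    {"name": "sales.orders",
     "business_question": "How many orders were placed?",
     "description": "One row per order with revenue amount",
     "query_count": 900},
]
print(rank_entities("What was revenue by region last month?", catalog))
```

The point of the sketch is the ordering of steps: the candidate set is narrowed from the whole catalog to a handful of relevant entities first, and only that shortlist would be handed to the SQL-generating step.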

Context stack integration across enterprise vendors

Several data vendors now offer contextual memory capabilities to support agentic AI. Pinecone, Oracle, and Redis have all introduced layers that store and retrieve structured context for AI systems. On the platform side, Microsoft’s Fabric IQ consolidates enterprise data and AI workflows into a unified environment.

DataHub’s approach differs by focusing on query history rather than static embeddings or manual annotations. By leveraging real analyst behavior, the platform provides agents with actionable context derived from production-proven patterns. This method reduces hallucinations in joins and calculations while maintaining compatibility with existing agent frameworks.

As AI agents proliferate in enterprise settings, the demand for reliable, context-aware navigation will only grow. Tools that bridge the gap between raw data and agent reasoning—like DataHub’s Context Intelligence—could become essential infrastructure for accurate, explainable analytics.

AI summary

AI agents querying databases were getting answers wrong more than 65% of the time. DataHub’s Context Intelligence layer analyzes years of SQL queries to provide a semantic index that guides agents and minimizes errors.
