Benchmarking GraphRAG vs Traditional RAG for Indian Health Research

Researchers are developing a rigorous benchmarking platform to evaluate how different AI retrieval pipelines handle multi-hop medical questions in Indian public health literature. The system tests three approaches—LLM-only, traditional RAG with vector search, and GraphRAG using knowledge graphs—on a corpus of over 9,000 papers covering diabetes, tuberculosis, maternal health, and malaria.

The Challenge of Multi-Hop Medical Queries

Traditional retrieval-augmented generation (RAG) systems often struggle with questions that require connecting seemingly unrelated medical concepts. For example, when asked how diabetes affects tuberculosis treatment outcomes, a standard RAG system might return relevant chunks about diabetes, TB, and HbA1c levels. However, chunk-level similarity search does not represent the relationships between these concepts, and those relationships are essential to answering the question comprehensively.

Three key failure modes emerge in such scenarios:

Indirect relationships remain invisible: A query about rifampicin's impact on glycemic control in diabetic TB patients requires linking enzyme induction to glucose metabolism—information that may not appear in any single paper.

Entity roles get confused: Questions about MDR-TB treatment in pediatric patients might retrieve adult-focused studies due to similar keywords but different population contexts.

Corpus-wide aggregation becomes impossible: Queries asking for the most common comorbidities in Indian TB literature cannot be answered by examining individual chunks alone; they require synthesizing information across the entire corpus.

Building a Dedicated Benchmarking Platform

The platform evaluates three retrieval strategies using the same large language model (LLM) and identical queries to ensure fair comparison:

LLM-only: No retrieval, relying solely on the model's training data
Basic RAG: Uses FAISS for vector search with cross-encoder reranking
GraphRAG: Implements TigerGraph for multi-hop traversal of knowledge graphs
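
The fair-comparison setup described above can be sketched as a small harness that sends every query through every pipeline and times each call. This is a minimal illustration, not the platform's code: the `Pipeline` interface, the `RunRecord` fields, and the three stand-in functions are all assumptions made for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: a pipeline maps a question to an answer string.
Pipeline = Callable[[str], str]


@dataclass
class RunRecord:
    pipeline: str
    query: str
    answer: str
    latency_s: float


def run_benchmark(pipelines: dict[str, Pipeline], queries: list[str]) -> list[RunRecord]:
    """Send every query through every pipeline and time each call."""
    records = []
    for query in queries:
        for name, pipeline in pipelines.items():
            start = time.perf_counter()
            answer = pipeline(query)
            latency = time.perf_counter() - start
            records.append(RunRecord(name, query, answer, latency))
    return records


# Stand-in pipelines (assumptions). Real ones would wrap the shared LLM,
# the FAISS index, and the TigerGraph traversal respectively.
def llm_only(question: str) -> str:
    return f"[llm-only] answer to: {question}"


def basic_rag(question: str) -> str:
    return f"[basic-rag] answer to: {question}"


def graph_rag(question: str) -> str:
    return f"[graph-rag] answer to: {question}"


PIPELINES = {"llm_only": llm_only, "basic_rag": basic_rag, "graph_rag": graph_rag}
QUERIES = ["How does diabetes affect tuberculosis treatment outcomes?"]

records = run_benchmark(PIPELINES, QUERIES)
for record in records:
    print(f"{record.pipeline}: {record.latency_s:.6f}s")
```

Keeping the query list and the model fixed while only the pipeline varies is what makes the per-pipeline numbers comparable.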

Performance metrics include token usage, computational cost, response latency, LLM-as-a-Judge quality assessments, and BERTScore F1 measurements. While the corpus ingestion pipeline is complete, researchers are currently populating the Indian public health research database from PubMed Central.
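
To make the BERTScore F1 metric concrete, the sketch below reproduces its core idea: each token embedding in one text is greedily matched to its most similar token embedding in the other, and precision and recall are averaged over those best matches. The two-dimensional vectors are toy values invented for illustration. Real BERTScore uses contextual embeddings from a transformer model and offers optional IDF weighting and baseline rescaling, none of which is modeled here.

```python
import math

Vector = list[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_f1(candidate: list[Vector], reference: list[Vector]) -> float:
    """F1 over greedy best-match cosine similarities, as in BERTScore."""
    # Precision: how well each candidate token is covered by the reference.
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy embeddings (assumption): the candidate covers one of two reference tokens.
candidate_tokens = [[1.0, 0.0]]
reference_tokens = [[1.0, 0.0], [0.0, 1.0]]

score = greedy_match_f1(candidate_tokens, reference_tokens)
print(round(score, 4))
```

Here precision is 1.0 and recall is 0.5, so the harmonic mean penalizes the answer for leaving half of the reference uncovered.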

Constructing the Indian Health Research Corpus

The team used NCBI's E-utilities API with domain-specific MeSH queries to compile papers from Indian institutions. The Python implementation runs each search once, keeps the result set on the Entrez history server, and retrieves it in batches:

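from Bio import Entrez

# NCBI asks for a contact address on every E-utilities request.
Entrez.email = "your@email.com"

def fetch_pmids(domain_query: str, max_results: int = 3000) -> list[str]:
    """Return up to max_results PubMed IDs matching domain_query."""
    # Run the search once and keep the result set on NCBI's history server.
    # The pubmed database is queried because the parsing below relies on
    # PubMed-format records (PubmedArticle -> MedlineCitation -> PMID);
    # efetch against db="pmc" returns JATS full-text XML instead.
    handle = Entrez.esearch(
        db="pubmed",
        term=domain_query,
        usehistory="y",
        retmax=0,
    )
    search_results = Entrez.read(handle)
    handle.close()

    web_env = search_results["WebEnv"]
    query_key = search_results["QueryKey"]
    total = int(search_results["Count"])
    limit = min(total, max_results)

    pmids: list[str] = []
    batch_size = 200
    for start in range(0, limit, batch_size):
        # Page through the stored result set without re-running the search.
        fetch_handle = Entrez.efetch(
            db="pubmed",
            retmode="xml",
            retstart=start,
            retmax=min(batch_size, limit - start),
            webenv=web_env,
            query_key=query_key,
        )
        records = Entrez.read(fetch_handle)
        fetch_handle.close()
        pmids.extend(
            str(r["MedlineCitation"]["PMID"]) for r in records["PubmedArticle"]
        )
    return pmids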

A sample query for tuberculosis research incorporated MeSH terms and institutional filters:

(tuberculosis[MeSH] OR "TB"[tiab] OR "MDR-TB"[tiab]) AND ("India"[Affiliation] OR "Indian"[Affiliation]) AND (epidemiology[MeSH] OR "public health"[tiab] OR "clinical trial"[tiab])
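
Since the corpus covers several disease domains, queries of this shape are easier to maintain when assembled from term lists rather than written by hand. The helper below is a hypothetical sketch (the function names are not from the original pipeline) that rebuilds the tuberculosis query above from its three groups of terms.

```python
def any_of(terms: list[str]) -> str:
    """Join terms into a parenthesized OR group."""
    return "(" + " OR ".join(terms) + ")"


def build_domain_query(
    topic_terms: list[str],
    affiliation_terms: list[str],
    scope_terms: list[str],
) -> str:
    """AND together the topic, affiliation, and scope groups."""
    groups = (topic_terms, affiliation_terms, scope_terms)
    return " AND ".join(any_of(group) for group in groups)


TB_QUERY = build_domain_query(
    topic_terms=["tuberculosis[MeSH]", '"TB"[tiab]', '"MDR-TB"[tiab]'],
    affiliation_terms=['"India"[Affiliation]', '"Indian"[Affiliation]'],
    scope_terms=["epidemiology[MeSH]", '"public health"[tiab]', '"clinical trial"[tiab]'],
)
print(TB_QUERY)
```

Only the topic terms would need to change for the other domains; the affiliation and scope groups can be shared.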

During corpus construction, several challenges emerged:

Approximately 8% of papers lacked abstracts, requiring fallback to full-text extraction
Affiliation strings varied widely (AIIMS, All India Institute of Medical Sciences, New Delhi 110029) and needed standardization
Duplicate papers from PMC versioning required careful deduplication
Retraction Watch cross-references became essential for maintaining medical integrity
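
The deduplication step listed above might look roughly like the following. The article does not say how duplicates were identified, so everything here is an assumption: records are plain dictionaries with hypothetical `doi`, `title`, and `version` fields, the DOI is the primary key with a normalized title as fallback, and the highest version wins.

```python
import re


def dedup_key(record: dict) -> str:
    """Prefer the DOI as identity; fall back to a normalized title."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return "doi:" + doi
    title = re.sub(r"[^a-z0-9]+", " ", record["title"].lower()).strip()
    return "title:" + title


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep one record per key, choosing the highest version number."""
    best: dict[str, dict] = {}
    for record in records:
        key = dedup_key(record)
        if key not in best or record.get("version", 1) > best[key].get("version", 1):
            best[key] = record
    return list(best.values())


# Toy records (assumption): two versions of one paper plus an unrelated paper.
papers = [
    {"doi": "10.1000/EXAMPLE.1", "title": "TB and diabetes in India", "version": 1},
    {"doi": "10.1000/example.1", "title": "TB and diabetes in India", "version": 2},
    {"doi": None, "title": "Malaria surveillance: a review", "version": 1},
]

unique_papers = deduplicate(papers)
print(len(unique_papers))
```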

Affiliation filtering used multi-pass regex patterns covering country mentions, known Indian institution abbreviations (AIIMS, JIPMER, ICMR, PGIMER, NIMHANS, CMC Vellore), and major city names, achieving a 2-3% false positive rate.
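
A simplified version of that multi-pass filter is sketched below. The institution abbreviations are the ones named above, but the exact patterns and the city list are assumptions chosen for illustration, not the team's actual rules.

```python
import re
from typing import Optional

# Pass 1: country mentions. The word boundary keeps "Indiana" from matching.
COUNTRY = re.compile(r"\bIndian?\b", re.IGNORECASE)
# Pass 2: known institution abbreviations (case-sensitive on purpose).
INSTITUTIONS = re.compile(r"\b(?:AIIMS|JIPMER|ICMR|PGIMER|NIMHANS|CMC\s+Vellore)\b")
# Pass 3: major city names. This particular list is an assumption.
CITIES = re.compile(
    r"\b(?:New Delhi|Mumbai|Chennai|Kolkata|Bengaluru|Hyderabad|Puducherry)\b",
    re.IGNORECASE,
)

PASSES = [("country", COUNTRY), ("institution", INSTITUTIONS), ("city", CITIES)]


def match_indian_affiliation(affiliation: str) -> Optional[str]:
    """Return the name of the first pass that matches, or None."""
    for name, pattern in PASSES:
        if pattern.search(affiliation):
            return name
    return None


samples = [
    "All India Institute of Medical Sciences, New Delhi 110029",
    "AIIMS",
    "Department of Medicine, Indiana University, Indianapolis, USA",
]
results = [match_indian_affiliation(s) for s in samples]
print(results)
```

Broad patterns such as a bare "Indian" will still catch some non-Indian institutions, which is the kind of error a residual false positive rate reflects.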

Knowledge Graph Design Determines Retrieval Success

The knowledge graph schema represents the most critical architectural decision. An overly sparse graph yields no results, while an overloaded one produces spurious connections. The final design includes 10 vertex types and 10 edge types:

Vertex types:

Disease
Treatment
Biomarker
Population
GeographicRegion
Intervention
Outcome
Study
Institution
Comorbidity

Edge types with semantic meaning:

TREATS: Treatment → Disease
ASSOCIATED_WITH: Disease ↔ Disease or Disease ↔ Biomarker
MEASURED_BY: Disease → Biomarker
RISK_FACTOR_FOR: Biomarker/Population → Disease
COMPLICATES: Disease ↔ Disease (bidirectional)
REPORTS_OUTCOME: Study → Outcome
STUDIED_IN: Study → Population/GeographicRegion
CO_OCCURS_WITH: Disease ↔ Disease (same study context)
CONDUCTED_BY: Study → Institution
PART_OF: GeographicRegion → GeographicRegion (hierarchy)
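
One way to keep extraction honest against this schema is to encode it as data and reject edges whose endpoint types do not fit. The sketch below does that in plain Python. It mirrors the vertex and edge lists above, but the representation and the `is_valid_edge` helper are illustrative assumptions and are not TigerGraph schema syntax.

```python
VERTEX_TYPES = {
    "Disease", "Treatment", "Biomarker", "Population", "GeographicRegion",
    "Intervention", "Outcome", "Study", "Institution", "Comorbidity",
}

# edge type -> (allowed source vertex types, allowed target vertex types)
EDGE_SCHEMA: dict[str, tuple[set[str], set[str]]] = {
    "TREATS": ({"Treatment"}, {"Disease"}),
    "ASSOCIATED_WITH": ({"Disease"}, {"Disease", "Biomarker"}),
    "MEASURED_BY": ({"Disease"}, {"Biomarker"}),
    "RISK_FACTOR_FOR": ({"Biomarker", "Population"}, {"Disease"}),
    "COMPLICATES": ({"Disease"}, {"Disease"}),
    "REPORTS_OUTCOME": ({"Study"}, {"Outcome"}),
    "STUDIED_IN": ({"Study"}, {"Population", "GeographicRegion"}),
    "CO_OCCURS_WITH": ({"Disease"}, {"Disease"}),
    "CONDUCTED_BY": ({"Study"}, {"Institution"}),
    "PART_OF": ({"GeographicRegion"}, {"GeographicRegion"}),
}

# Edge types marked with a two-way arrow in the list above.
UNDIRECTED = {"ASSOCIATED_WITH", "COMPLICATES", "CO_OCCURS_WITH"}


def is_valid_edge(edge_type: str, source_type: str, target_type: str) -> bool:
    """Check an extracted edge against the schema before loading it."""
    if edge_type not in EDGE_SCHEMA:
        return False
    sources, targets = EDGE_SCHEMA[edge_type]
    if source_type in sources and target_type in targets:
        return True
    # Undirected edges may be written in either direction.
    return edge_type in UNDIRECTED and source_type in targets and target_type in sources


print(is_valid_edge("TREATS", "Treatment", "Disease"))
print(is_valid_edge("TREATS", "Disease", "Treatment"))
```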

Each edge carries a confidence score from the extraction model, with edges below 0.65 filtered during retrieval. This quality gate proved essential for meaningful traversal. The final graph contains 17,830 vertices and 142,000 edges, which works out to an average of roughly 8.0 edges per vertex, and has a diameter of approximately six hops.
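
The confidence gate and the reported graph statistics are easy to check in a few lines. In the sketch below, the 0.65 threshold and the vertex and edge counts come from the text, while the `Edge` class and the three sample edges are made up for illustration.

```python
from dataclasses import dataclass

MIN_CONFIDENCE = 0.65  # threshold stated in the article


@dataclass(frozen=True)
class Edge:
    source: str
    edge_type: str
    target: str
    confidence: float


def filter_edges(edges: list[Edge], threshold: float = MIN_CONFIDENCE) -> list[Edge]:
    """Drop edges whose extraction confidence is below the threshold."""
    return [edge for edge in edges if edge.confidence >= threshold]


# Sample edges with invented confidence scores.
sample_edges = [
    Edge("Rifampicin", "TREATS", "Tuberculosis", 0.95),
    Edge("Type 2 diabetes", "MEASURED_BY", "HbA1c", 0.65),
    Edge("Type 2 diabetes", "ASSOCIATED_WITH", "Anemia", 0.40),
]
kept = filter_edges(sample_edges)
print([edge.target for edge in kept])

# Sanity check on the reported statistics: edges per vertex.
edges_per_vertex = 142_000 / 17_830
print(round(edges_per_vertex, 2))
```

An edge scoring exactly 0.65 is kept here, since the text only says that edges below the threshold are filtered.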

Multi-Hop Retrieval in Practice

Consider the query: What is the impact of diabetes on TB treatment outcomes in India?

The GraphRAG system first extracts entities from the query using spaCy's biomedical model. The traversal then follows semantic paths through the knowledge graph, connecting diabetes treatments to TB outcomes via intermediate nodes like HbA1c levels and patient populations. This approach mirrors how medical researchers naturally synthesize information across multiple studies.
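
The traversal step can be illustrated with an in-memory stand-in. The platform itself uses TigerGraph, so the breadth-first search below is only a sketch of the idea: the toy edges, their confidence scores, and the `find_paths` function are assumptions, and entity extraction is skipped by passing the start and goal entities directly.

```python
from collections import deque

# (source, edge type, target, confidence); toy data invented for illustration.
TOY_EDGES = [
    ("Type 2 diabetes", "MEASURED_BY", "HbA1c", 0.90),
    ("HbA1c", "RISK_FACTOR_FOR", "Tuberculosis", 0.80),
    ("Type 2 diabetes", "COMPLICATES", "Tuberculosis", 0.85),
    ("Rifampicin", "TREATS", "Tuberculosis", 0.95),
    ("Type 2 diabetes", "ASSOCIATED_WITH", "Anemia", 0.40),
]

Step = tuple[str, str, str]


def find_paths(edges, start: str, goal: str, max_hops: int = 3,
               min_confidence: float = 0.65) -> list[list[Step]]:
    """Breadth-first search for directed paths from start to goal."""
    adjacency: dict[str, list[tuple[str, str]]] = {}
    for source, edge_type, target, confidence in edges:
        if confidence >= min_confidence:  # apply the quality gate first
            adjacency.setdefault(source, []).append((edge_type, target))

    paths: list[list[Step]] = []
    queue = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if node == goal and path:
            paths.append(path)
            continue
        if len(path) == max_hops:
            continue
        visited = {start} | {step[2] for step in path}
        for edge_type, target in adjacency.get(node, []):
            if target not in visited:  # avoid cycles within one path
                queue.append((target, path + [(node, edge_type, target)]))
    return paths


paths = find_paths(TOY_EDGES, "Type 2 diabetes", "Tuberculosis")
for path in paths:
    print(" -> ".join(f"{s} [{t}] {d}" for s, t, d in path))
```

Each returned path is a chain of typed edges, which is the structure a GraphRAG prompt can present to the LLM as evidence for how two entities are connected.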

While benchmark numbers are pending, early prototypes suggest that graph-based retrieval preserves semantic relationships that vector search alone does not capture. The team anticipates that GraphRAG will show superior performance on complex medical queries requiring multi-hop reasoning, though comprehensive testing awaits completion of the Indian health corpus.

The platform represents an important step toward developing AI systems that can match the nuanced understanding required for medical research literature, particularly in specialized domains like Indian public health.

AI summary

Graph-based search systems outperform traditional search systems and are notably better at understanding the relationships between concepts that are needed to answer multi-hop questions.
