
Graph RAG: Why Vector Search Alone Fails Multi-Hop Agent Queries

2026-03-24 · Updated 2026-04-03 · 16 min read · Igor Bobriakov
TL;DR
  • Vector-only RAG fails on multi-hop queries — questions requiring 3+ relationship traversals across entities return incomplete or hallucinated answers because cosine similarity has no concept of graph distance.
  • Native graph storage can resolve multi-hop entity chains much more directly than repeated semantic retrievals, which compound latency and uncertainty at each step.
  • The production pattern is a hybrid retriever: vector search for semantic entry-point discovery, then Cypher traversal for relationship expansion — neither alone is sufficient.
  • Entity extraction quality is the single biggest failure surface in Graph RAG pipelines; broken entity detection quickly turns into broken graph edges and silent retrieval failures.
  • LangGraph's persistent checkpointing pairs naturally with Neo4j agent memory: the graph stores structured relational facts while the checkpoint store holds conversational state — they serve different retrieval access patterns.
  • Schema rigidity is the long-term maintenance cost most teams underestimate: adding a new entity type to a live knowledge graph without a migration strategy breaks existing Cypher queries silently.

Vector similarity search answers “what text is most like this query?” — it cannot answer “what entities are connected to this entity, through which relationships, under what constraints?” That distinction seems academic until your AI agent tries to answer a compliance question that spans four entity types and three relationship hops. At that point, cosine distance produces confident-sounding hallucinations, and your pipeline has no mechanism to detect the failure.

Graph RAG resolves this by pairing a property graph — typically Neo4j — with your vector index. The vector index handles semantic entry-point discovery; the graph handles relational expansion. In a pharmaceutical regulatory deployment processing roughly 2M document ingestions per month, we found that hybrid Graph RAG answered multi-hop compliance queries with measurably fewer factual gaps than the vector-only baseline, because relationship traversal is deterministic where embedding similarity is probabilistic. This post walks through the production architecture, the Cypher patterns that actually work, and the failure modes you will hit before you expect them.

Graph-native traversal handles multi-hop entity chains far more directly than sequential vector lookups, because each extra semantic retrieval adds both latency and uncertainty.

Why Vector-Only RAG Breaks on Relational Queries

Standard RAG pipelines — as covered in our production-ready RAG checklist — retrieve the top-k semantically similar chunks and inject them as context. For single-concept questions this works. The failure surface is multi-hop relational queries: questions where the answer requires traversing a chain of entity relationships that are never co-located in a single document chunk.

Consider the query: “Which regulations apply to Drug X given its molecule class and the jurisdictions where it currently holds approval?” A vector search will retrieve chunks mentioning Drug X. It may or may not retrieve chunks about the relevant molecule class. It almost certainly won’t retrieve the jurisdiction-specific regulation that applies only because of the intersection of molecule class and approval status — that relationship exists in the corpus structure, not in any single passage’s embedding.

Vector-Only Retrieval — Multi-Hop Failure

Top-k = 5 chunks about “Drug X” returned. Molecule class relationship missing. Jurisdiction-regulation intersection not present in any single chunk. LLM synthesizes a plausible but factually incomplete answer. No retrieval error is raised — the pipeline believes it succeeded.

Hybrid Graph RAG — Explicit Traversal

Vector search seeds on “Drug X” node. Cypher traversal follows BELONGS_TO → MoleculeClass → GOVERNED_BY → Regulation and HAS_APPROVAL → Jurisdiction → ENFORCES → Regulation. The intersection of both paths is returned as structured triples, and the LLM receives a complete, verifiable relationship chain.

The second failure mode is subtler: Semantic Silence. When a vector retriever finds no chunks above its similarity threshold, it returns nothing — or worse, returns marginally relevant chunks anyway. The LLM has no signal that the retrieval was empty versus genuinely sparse. In a graph, a traversal that returns zero results is an explicit, detectable signal. Your agent can branch on empty-graph responses in a way it cannot branch on low-confidence vector scores.

Warning: A vector retriever returning top-k results with similarity scores of 0.61–0.65 looks identical in your pipeline to a retriever returning scores of 0.91–0.95. Without a calibrated similarity threshold and explicit “no confident results” branching, your agent will hallucinate answers to questions it effectively couldn’t retrieve. Graph traversal that returns an empty result set is semantically cleaner — empty is unambiguous.
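To make that branch explicit in code, here is a minimal routing sketch — the retrieve call and relevance_score field refer to the hybrid retriever built later in this post, and the threshold value is illustrative, not a recommendation:

from typing import Any

# Minimal routing sketch: treat an empty graph result and a low-confidence
# vector result as distinct, explicit branches instead of passing weak
# context to the LLM. Threshold is illustrative — calibrate on your corpus.
MIN_ENTRY_SCORE = 0.72

def retrieve_or_refuse(retriever, question: str) -> dict[str, Any]:
    results = retriever.retrieve(question)
    if not results:
        # Empty traversal is unambiguous: the fact is not in the graph.
        return {"route": "no_answer",
                "reason": "No matching subgraph — question is outside the knowledge boundary."}
    confident = [r for r in results if r.get("relevance_score", 0.0) >= MIN_ENTRY_SCORE]
    if not confident:
        # Entry points matched only weakly — ask for clarification rather than guess.
        return {"route": "clarify",
                "reason": "Entry-point similarity below calibrated threshold."}
    return {"route": "answer", "context": confident}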

Neo4j Graph Schema Design for Agent Memory

The graph schema is the most consequential design decision in a Graph RAG system — more consequential than the choice of embedding model or retrieval strategy. A schema designed for reporting queries (wide nodes, denormalized properties) will produce Cypher traversal patterns that are brittle and slow. A schema designed for traversal (normalized relationships, typed edges) will support the multi-hop patterns your agents actually need.

The core principle: relationships are first-class data, not foreign keys. Every relationship in your graph should encode a fact that agents need to traverse, not just a join condition.

// Schema: Pharmaceutical compliance knowledge graph
// Node types with vector embedding support
CREATE CONSTRAINT drug_id IF NOT EXISTS
FOR (d:Drug) REQUIRE d.id IS UNIQUE;
CREATE CONSTRAINT regulation_code IF NOT EXISTS
FOR (r:Regulation) REQUIRE r.code IS UNIQUE;
// Create vector index on Regulation nodes for semantic entry-point discovery
CREATE VECTOR INDEX regulation_embedding IF NOT EXISTS
FOR (r:Regulation) ON (r.embedding)
OPTIONS {indexConfig: {
`vector.dimensions`: 1536,
`vector.similarity_function`: 'cosine'
}};
// Sample graph construction
MERGE (d:Drug {id: 'DRUG-001', name: 'Compound Alpha', smiles: 'CC(=O)Oc1ccccc1C(=O)O'})
MERGE (mc:MoleculeClass {id: 'MC-NSAID', category: 'NSAID'})
MERGE (j:Jurisdiction {id: 'JX-EU', name: 'European Union', region: 'EMEA'})
MERGE (r:Regulation {code: 'EMA-2019-001'})
  ON CREATE SET r.text = 'NSAIDs approved in EMEA require cardiovascular risk labeling.'
  // r.embedding is populated at ingestion time — MERGE rejects null property values
MERGE (a:Approval {id: 'APR-001', date: '2021-03-15', status: 'ACTIVE'})
MERGE (d)-[:BELONGS_TO]->(mc)
MERGE (d)-[:HAS_APPROVAL]->(a)
MERGE (a)-[:IN_JURISDICTION]->(j)
MERGE (mc)-[:GOVERNED_BY]->(r)
MERGE (j)-[:ENFORCES]->(r);

Two schema decisions that matter in production: first, attach vector embeddings to the nodes that agents query semantically — Regulation nodes and Drug nodes in this schema, not to relationship edges. Second, keep relationship types semantically precise. GOVERNED_BY and ENFORCES encode different facts about the same regulation; collapsing them to a generic RELATED_TO destroys the traversal logic your agents depend on.

Diagram 2: Knowledge graph schema for a pharmaceutical compliance domain — nodes, relationship types, and vector embedding attachment points.
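To close the loop on the embedding placeholder above, here is a minimal ingestion-time population sketch — the connection details, batch shape, and embed_regulations helper are illustrative, not part of the deployment described earlier. The same write path serves event-driven re-embedding when a source document changes:

# Sketch: attach embeddings to Regulation nodes at ingestion time.
# Model, URI, and credentials are placeholders — use your own configuration.
from langchain_openai import OpenAIEmbeddings
from neo4j import GraphDatabase

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # 1536 dims, matches the index
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def embed_regulations(batch: list[dict]) -> None:
    """batch items look like {'code': 'EMA-2019-001', 'text': '...'} from the ingestion pipeline."""
    vectors = embeddings.embed_documents([item["text"] for item in batch])
    with driver.session() as session:
        for item, vector in zip(batch, vectors):
            # A plain SET of a float list is picked up by the vector index.
            session.run(
                "MATCH (r:Regulation {code: $code}) SET r.embedding = $embedding",
                code=item["code"],
                embedding=vector,
            )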

The Hybrid Retriever: Vector Entry Points + Cypher Expansion

The retrieval architecture that works in production is a two-phase hybrid: vector search identifies semantic entry-point nodes, then Cypher traversal expands the subgraph around those nodes. Neither phase is optional. Vector search without graph expansion misses relational context; graph traversal without vector entry points requires exact entity name matching, which fails on paraphrase and abbreviation.

We build this as a LangGraph tool node — see our LangGraph stateful workflows guide for the tool node pattern — that the agent calls with a natural language question. The tool handles both phases and returns merged, ranked context.

Diagram 1: Hybrid Graph RAG retrieval architecture — vector search locates semantic entry points, Cypher traversal expands the relational subgraph, merged context feeds the LLM.

from __future__ import annotations

import os
from typing import Any

from langchain_openai import OpenAIEmbeddings
from neo4j import GraphDatabase


class GraphRAGRetriever:
    """
    Hybrid retriever: vector search for semantic entry points,
    Cypher traversal for relational subgraph expansion.

    Production note: pre-warm the Neo4j connection pool at startup.
    Creating a new driver per query adds ~40ms overhead at p99 under load.
    """

    def __init__(self, neo4j_uri: str, neo4j_auth: tuple[str, str]) -> None:
        self.driver = GraphDatabase.driver(neo4j_uri, auth=neo4j_auth)
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        # Connection pool is initialized here, not at first query
        self.driver.verify_connectivity()

    def retrieve(self, query: str, top_k: int = 5, hop_depth: int = 2) -> list[dict[str, Any]]:
        """
        Phase 1: vector search for semantic entry points.
        Phase 2: Cypher traversal from entry points up to hop_depth.
        Returns merged, deduplicated context triples.
        """
        query_embedding = self.embeddings.embed_query(query)
        entry_nodes = self._vector_search(query_embedding, top_k=top_k)
        if not entry_nodes:
            # Explicit empty signal — do not fall through to hallucination
            return []
        subgraph = self._expand_subgraph(
            # Regulation nodes are keyed by `code` (see the uniqueness constraint in the schema)
            regulation_codes=[n["code"] for n in entry_nodes],
            hop_depth=hop_depth,
        )
        return self._merge_and_rank(entry_nodes, subgraph)

    def _vector_search(
        self, embedding: list[float], top_k: int
    ) -> list[dict[str, Any]]:
        """
        Semantic entry-point discovery via Neo4j vector index.
        Returns Regulation nodes most similar to the query.
        """
        with self.driver.session() as session:
            result = session.run(
                """
                CALL db.index.vector.queryNodes(
                    'regulation_embedding',
                    $top_k,
                    $embedding
                ) YIELD node, score
                WHERE score > 0.72
                RETURN node.code AS code,
                       node.text AS text,
                       score
                ORDER BY score DESC
                """,
                embedding=embedding,
                top_k=top_k,
            )
            return [dict(record) for record in result]

    def _expand_subgraph(
        self, regulation_codes: list[str], hop_depth: int
    ) -> list[dict[str, Any]]:
        """
        Cypher variable-length traversal from seed regulation nodes.
        Traverses back through MoleculeClass and Jurisdiction to Drug nodes.

        Warning: unbounded variable-length paths on large graphs will
        cause query timeouts. Always set an explicit depth ceiling.
        Cypher does not allow parameters inside variable-length bounds,
        so the depth is validated, clamped, and interpolated into the query string.
        """
        depth = max(1, min(int(hop_depth), 3))  # clamp before string interpolation
        cypher = f"""
            MATCH (r:Regulation)
            WHERE r.code IN $codes
            MATCH path = (d:Drug)-[*1..{depth}]-(r)
            WITH d, r,
                 [rel IN relationships(path) | type(rel)] AS rel_chain,
                 [node IN nodes(path) |
                     labels(node)[0] + ': ' + coalesce(node.name, node.code, node.id)] AS node_chain
            RETURN d.name AS drug,
                   r.code AS regulation,
                   r.text AS regulation_text,
                   rel_chain,
                   node_chain
            LIMIT 50
        """
        with self.driver.session() as session:
            result = session.run(cypher, codes=regulation_codes)
            return [dict(record) for record in result]

    def _merge_and_rank(
        self,
        entry_nodes: list[dict[str, Any]],
        subgraph: list[dict[str, Any]],
    ) -> list[dict[str, Any]]:
        """
        Merge semantic scores with graph proximity.
        Subgraph triples that connect through high-score entry nodes rank higher.
        Deduplication by (drug, regulation) pair.
        """
        seen: set[tuple[str, str]] = set()
        merged: list[dict[str, Any]] = []
        score_map = {n["code"]: n["score"] for n in entry_nodes}
        for record in subgraph:
            key = (record.get("drug", ""), record.get("regulation", ""))
            if key in seen:
                continue
            seen.add(key)
            # Inherit semantic score from the entry-point regulation
            record["relevance_score"] = score_map.get(record.get("regulation", ""), 0.0)
            merged.append(record)
        return sorted(merged, key=lambda x: x["relevance_score"], reverse=True)

    def close(self) -> None:
        self.driver.close()

Expert Insight: Set a Hard Similarity Threshold, Not Just Top-K

The WHERE score > 0.72 filter in _vector_search is not arbitrary. Without it, Neo4j’s queryNodes returns top_k results regardless of quality, and you will expand subgraphs from irrelevant seed nodes. Tune the threshold on your own corpus before shipping to production.
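One way to tune it, sketched below with illustrative probe questions: collect raw entry-point scores with no threshold filter for queries the graph should and should not answer, then pick a cut-off that separates the two distributions. The driver and embeddings handles are assumed to be the same objects the retriever holds.

# Calibration sketch: raw similarity scores for in-domain vs out-of-domain probes.
def raw_entry_scores(driver, embeddings, probes: list[str], top_k: int = 5) -> list[float]:
    scores: list[float] = []
    with driver.session() as session:
        for question in probes:
            emb = embeddings.embed_query(question)
            result = session.run(
                "CALL db.index.vector.queryNodes('regulation_embedding', $k, $emb) "
                "YIELD node, score RETURN score",
                k=top_k, emb=emb,
            )
            scores.extend(record["score"] for record in result)
    return scores

in_domain = raw_entry_scores(driver, embeddings,
    ["NSAID labeling requirements in the EU", "approval status of Compound Alpha"])
out_of_domain = raw_entry_scores(driver, embeddings,
    ["office parking policy", "quarterly travel budget"])
# A workable threshold sits above most out-of-domain scores and below most
# in-domain ones; re-check it whenever the embedding model or corpus changes.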

Wiring the Retriever into a LangGraph Agent

The retriever above is stateless — it answers a single query against the graph. Connecting it to a LangGraph agent gives it access to conversational state, tool-call history, and the ability to issue follow-up traversal queries when the first result is insufficient. The pattern is a standard tool node, but the memory architecture splits across two stores: Neo4j holds durable structured knowledge, while LangGraph’s checkpoint store holds ephemeral conversational state. Conflating these two stores is the most common architectural mistake we see in Graph RAG deployments.

from __future__ import annotations

import os
from typing import Annotated, TypedDict

from langchain_core.messages import AIMessage, BaseMessage
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import END, START, StateGraph
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode

# _retriever instantiated at module load — connection pool pre-warmed.
# GraphRAGRetriever is the class from the previous listing.
_retriever = GraphRAGRetriever(
    neo4j_uri=os.environ["NEO4J_URI"],
    neo4j_auth=(os.environ["NEO4J_USER"], os.environ["NEO4J_PASSWORD"]),
)


@tool
def graph_rag_search(query: str, hop_depth: int = 2) -> str:
    """
    Search the pharmaceutical knowledge graph for regulatory information.

    Use this tool when the question involves drug regulations, approvals,
    molecule classes, or jurisdiction-specific requirements.

    Args:
        query: Natural language question about drug regulations.
        hop_depth: Relationship traversal depth (1-3). Use 2 for most queries,
            3 only for cross-jurisdictional multi-drug comparisons.
    """
    results = _retriever.retrieve(query, hop_depth=hop_depth)
    if not results:
        return "No regulatory information found for this query in the knowledge graph."
    context_lines = []
    for r in results[:8]:  # cap context window contribution
        chain = " -> ".join(r.get("node_chain", []))
        context_lines.append(
            f"Drug: {r.get('drug', 'N/A')} | "
            f"Regulation: {r.get('regulation', 'N/A')} | "
            f"Path: {chain} | "
            f"Text: {r.get('regulation_text', '')[:200]}"
        )
    return "\n".join(context_lines)


tools = [graph_rag_search]
llm = ChatOpenAI(model="gpt-4o", temperature=0).bind_tools(tools)


class AgentState(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]


def agent_node(state: AgentState) -> AgentState:
    response = llm.invoke(state["messages"])
    return {"messages": [response]}


def should_continue(state: AgentState) -> str:
    last = state["messages"][-1]
    if isinstance(last, AIMessage) and last.tool_calls:
        return "tools"
    return END


graph_builder = StateGraph(AgentState)
graph_builder.add_node("agent", agent_node)
graph_builder.add_node("tools", ToolNode(tools))
graph_builder.add_edge(START, "agent")
graph_builder.add_conditional_edges("agent", should_continue, {"tools": "tools", END: END})
graph_builder.add_edge("tools", "agent")

# MemorySaver for conversational state — separate from Neo4j's structured knowledge
checkpointer = MemorySaver()
graph = graph_builder.compile(checkpointer=checkpointer)

In Graph RAG architectures, LangGraph’s checkpoint store and Neo4j serve fundamentally different memory functions: checkpoints hold ephemeral conversational state with a session lifetime, while the knowledge graph holds durable structured facts with a corpus lifetime. Merging them into a single store couples retrieval latency to conversation history size.
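A short usage sketch of that split — the thread_id is illustrative — shows each conversation getting its own checkpoint thread while every thread shares the same Neo4j-backed tool:

from langchain_core.messages import HumanMessage

# Conversational state is keyed by thread_id in the checkpointer;
# the knowledge graph is shared across every thread.
config = {"configurable": {"thread_id": "compliance-session-042"}}  # illustrative id

first_turn = graph.invoke(
    {"messages": [HumanMessage(content="Which regulations apply to Compound Alpha in the EU?")]},
    config=config,
)

# A follow-up on the same thread_id resumes from the checkpoint, so the agent
# sees the earlier exchange without re-running the graph traversal for it.
follow_up = graph.invoke(
    {"messages": [HumanMessage(content="Does the cardiovascular labeling requirement apply here?")]},
    config=config,
)
print(follow_up["messages"][-1].content)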

Entity Extraction: The Failure Surface Nobody Talks About

Graph RAG gets extensive coverage at the retrieval layer. The ingestion layer — specifically entity extraction — is where production pipelines actually break. Every relationship in your Neo4j graph was created by an entity extraction step that ran over source documents. If that step mis-classifies an entity type, assigns the wrong relationship direction, or normalizes entity names inconsistently (“Compound Alpha” vs “compound-alpha” vs “ALPHA”), the graph edge is broken. The retrieval layer has no way to detect or repair a broken edge — it simply returns nothing, or worse, traverses to the wrong node.

In a financial document processing deployment handling inbound regulatory filings, our team found that a NER model with roughly 6–7% entity classification error rate on domain-specific abbreviations produced broken graph edges for approximately 1 in 15 ingest batches. The symptom at query time was not an error — it was a silent retrieval gap where agents returned answers that omitted material facts. The fix was a two-stage extraction pipeline: Claude Sonnet 4.6 for initial entity extraction with structured output, followed by a graph consistency check that validated relationship directionality against the schema before committing edges.
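The consistency check is the cheap half of that pipeline. A minimal sketch follows — the triple shape and ALLOWED_EDGES dictionary are assumptions about an internal format, not a library API — rejecting extracted relationships whose type or direction is not in the schema before they ever become edges:

# Pre-commit schema consistency check (sketch).
# ALLOWED_EDGES encodes the schema from the earlier Cypher block:
# (source_label, relationship_type, target_label) triples that may exist.
ALLOWED_EDGES = {
    ("Drug", "BELONGS_TO", "MoleculeClass"),
    ("Drug", "HAS_APPROVAL", "Approval"),
    ("Approval", "IN_JURISDICTION", "Jurisdiction"),
    ("MoleculeClass", "GOVERNED_BY", "Regulation"),
    ("Jurisdiction", "ENFORCES", "Regulation"),
}

def validate_extracted_edges(triples: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split extractor output into committable edges and rejects for human review.

    Each triple is assumed to look like:
    {"source_label": "Drug", "rel_type": "BELONGS_TO", "target_label": "MoleculeClass", ...}
    """
    accepted, rejected = [], []
    for t in triples:
        key = (t["source_label"], t["rel_type"], t["target_label"])
        reversed_key = (t["target_label"], t["rel_type"], t["source_label"])
        if key in ALLOWED_EDGES:
            accepted.append(t)
        elif reversed_key in ALLOWED_EDGES:
            # Right entities, wrong direction — the most common extractor mistake.
            rejected.append({**t, "reason": "reversed_direction"})
        else:
            rejected.append({**t, "reason": "edge_not_in_schema"})
    return accepted, rejected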

Warning: When your agent generates Cypher queries dynamically from natural language, syntactically valid queries that traverse the wrong relationship direction or reference a property that doesn’t exist on a given node type return empty result sets — not errors. This looks identical to “no data found” at the application layer. Production guard: validate LLM-generated Cypher against your schema using EXPLAIN before execution, and log every zero-result query for human review. Zero-result frequency above roughly 10% of queries indicates a schema-query mismatch, not sparse data.
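A sketch of that guard, assuming the official Neo4j Python driver: EXPLAIN compiles the query without executing it, so unknown labels or property keys surface as planner notifications instead of silent empty results. The logger name and rejection policy are illustrative.

import logging

logger = logging.getLogger("graph_rag.cypher_guard")

def explain_guard(session, cypher: str, params: dict) -> bool:
    """Return True if the generated query plans cleanly; log and reject otherwise."""
    try:
        summary = session.run("EXPLAIN " + cypher, params).consume()
    except Exception as exc:  # syntax errors never reach execution
        logger.warning("Rejected generated Cypher: %s | query=%s", exc, cypher)
        return False
    if summary.notifications:
        # e.g. unknown-label / unknown-property warnings — usually a schema-query mismatch
        logger.warning("Planner notifications for generated Cypher: %s | query=%s",
                       summary.notifications, cypher)
        return False
    return True

def run_guarded(session, cypher: str, params: dict) -> list:
    if not explain_guard(session, cypher, params):
        return []
    records = list(session.run(cypher, params))
    if not records:
        # Zero-result queries go to human review; a rate above ~10% points
        # at schema drift or bad query generation, not sparse data.
        logger.info("Zero-result generated Cypher: %s", cypher)
    return records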

What Breaks at Scale

The architecture above works cleanly at tens of thousands of nodes. The failure modes emerge at production scale — graphs above 5M nodes with high concurrent query load — and most of them are invisible in development.

Schema Drift is the most operationally painful failure mode. When you ingest a new document type that introduces a new entity category, your existing Cypher traversal queries silently miss it because they were written for the original schema. We discovered this in a deployment where a document category change introduced ClinicalTrial nodes that the existing traversal queries never reached, because no relationship path connected them to the Drug entry nodes the agent was seeding from. Six weeks of ingested trials were unreachable until the schema and traversal queries were updated together.

Variable-length path explosion is the performance failure mode. A Cypher query with [*1..4] (variable-length paths up to depth 4) on a dense graph can return millions of paths before the LIMIT clause fires. Neo4j evaluates paths before filtering, so the query planner materializes the full path set in memory. Always set hop_depth to 2 for standard agent queries and 3 only for explicitly cross-domain questions — and enforce this ceiling in your tool definition, not as a suggestion to the LLM.
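Concretely, the ceiling belongs in the tool body, not the docstring. A sketch variant of the earlier graph_rag_search tool — the MAX_HOP_DEPTH constant and the trimmed output format are illustrative:

from langchain_core.tools import tool

MAX_HOP_DEPTH = 3  # hard ceiling enforced in code, not suggested in the prompt

@tool
def graph_rag_search_bounded(query: str, hop_depth: int = 2) -> str:
    """Knowledge-graph search with a hard traversal ceiling (bounded variant of graph_rag_search)."""
    safe_depth = max(1, min(int(hop_depth), MAX_HOP_DEPTH))  # clamp whatever the model requested
    results = _retriever.retrieve(query, hop_depth=safe_depth)
    if not results:
        return "No regulatory information found for this query in the knowledge graph."
    return "\n".join(
        f"Drug: {r.get('drug', 'N/A')} | Regulation: {r.get('regulation', 'N/A')}"
        for r in results[:8]
    )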

Production Observation: 3-hop max

In production Graph RAG deployments, the vast majority of answerable agent queries resolve within 2 relationship hops. Queries that genuinely require 4+ hops almost always indicate a schema design problem — the answer should be reachable more directly, or the question is outside the graph’s knowledge boundary.

Embedding staleness is the failure mode with the longest lag before detection. When documents are updated but their graph node embeddings are not re-indexed, the vector search phase returns stale entry points. The graph traversal then expands from nodes whose semantic content no longer matches the source document, producing answers that were accurate six months ago. You need an embedding refresh pipeline tied to document update events, not a batch job running on a fixed schedule.

| Failure Mode | Detection Signal | Production Fix | Latency to Detection |
| --- | --- | --- | --- |
| Entity extraction errors | Zero-result query rate > 10% | Two-stage extraction + schema validation before commit | Days to weeks |
| Schema drift | New entity type queries return empty | Schema-versioned Cypher queries + migration tests on ingest | Weeks to months |
| Path explosion | Query p99 latency spike, heap pressure | Hard hop-depth ceiling in tool layer, EXPLAIN validation | Minutes (immediate) |
| Embedding staleness | Semantic drift in retrieval quality metrics | Event-driven re-embedding on document update | Weeks to months |
| LLM-generated Cypher mismatch | High zero-result rate on dynamic queries | Schema injection into Cypher-generation prompt + EXPLAIN guard | Hours to days |

The underlying pattern across all five failure modes: Graph RAG moves complexity from the retrieval layer to the ingestion and schema maintenance layers. Vector-only RAG fails loudly at retrieval time. Graph RAG fails quietly at ingestion time, sometimes weeks before the failure surfaces in agent output quality. Build your observability around ingestion pipeline health — zero-result query rates, entity extraction confidence distributions, and schema coverage metrics — not just retrieval latency and LLM response quality.

For teams integrating Graph RAG into broader data pipelines, the knowledge graph is a derived artifact of your document corpus — treating it as a static store rather than a continuously maintained data product is the architectural decision that causes most of the failures above. The data engineering foundation for AI agents applies here: your graph is only as current and correct as the pipeline that populates it.

Frequently Asked Questions

What is Graph RAG and how does it differ from standard RAG?

Graph RAG augments retrieval-augmented generation by storing knowledge in a property graph (such as Neo4j) rather than — or in addition to — a flat vector index. Standard RAG retrieves document chunks by semantic similarity, which works for single-concept lookups but breaks on multi-hop questions that require traversing relationships between entities. Graph RAG uses graph traversal (Cypher queries) to follow entity relationships explicitly, enabling answers that require chaining facts across multiple nodes.

When should I use Neo4j for RAG versus a standard vector database like Pinecone?

Use Neo4j when your query patterns require relationship traversal — “What regulations apply to this drug given its molecule class and the jurisdictions where it’s approved?” is a graph query, not a semantic search. Use a vector database like Pinecone when queries are primarily concept-matching against unstructured text and entities are largely independent. The strongest production architectures use both: vector search for semantic entry-point discovery, Neo4j for relational expansion. For a deeper comparison of retrieval strategies, see our RAG vs. fine-tuning decision guide.

What are the main failure modes in a Graph RAG pipeline?

The three most common production failures are: (1) entity extraction errors — NER models that miss or mis-classify entities create broken graph edges and silent retrieval gaps; (2) schema drift — ingesting new document types without updating the graph schema breaks existing Cypher traversal queries; (3) query translation failures — LLM-generated Cypher that looks syntactically valid but traverses the wrong relationship direction or uses incorrect property names, returning empty result sets rather than errors.

How does Graph RAG integrate with LangGraph agents?

The standard integration uses a custom LangGraph tool node that wraps a Neo4j retriever. The agent calls the tool with a natural language question, the tool runs hybrid retrieval (vector for entry points, Cypher for expansion), and returns structured context as ranked triples or subgraph summaries. LangGraph’s persistent state checkpointing stores conversational history while Neo4j stores durable structured knowledge — they operate on different memory timescales and should not be conflated.

Engineer Intelligence with ActiveWizards

Building a Graph RAG pipeline that needs to survive multi-hop queries, schema evolution, and production ingestion load? Our team has deployed knowledge graph architectures at scale across regulated industries — we know where the silent failures hide.

Contact Us for a Consultation


About the author

Igor Bobriakov

AI Architect. Author of Production-Ready AI Agents. 15 years deploying production AI platforms and agentic systems for enterprise clients and deep-tech startups.