#4113 Step-by-Step KG Queries
Describe step-by-step how to query a Neo4j knowledge graph containing nodes, edges and embedded documents from Python
Querying a Neo4j knowledge graph—especially one that mixes structured graph data (nodes/edges) with unstructured data (embedded documents)—requires a blend of the Neo4j Python Driver and, typically, a Vector Index for the embeddings.
Here is the step-by-step workflow to get your Python environment talking to your graph.
1. Set Up Your Environment
First, install the official driver. If you are handling embeddings, you’ll likely need a library like openai or sentence-transformers to process your query text into a vector.
```bash
pip install neo4j
```
2. Establish the Connection
You need to create a GraphDatabase driver instance. It's best practice to manage the driver via a class or a context manager so the connection is closed properly.
```python
from neo4j import GraphDatabase

class KnowledgeGraphQuery:
    def __init__(self, uri, user, password):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def close(self):
        self.driver.close()
```
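For example, assuming a local instance reachable at bolt://localhost:7687 (the URI and credentials here are illustrative; substitute your own deployment's values):

```python
# Connection details are illustrative -- replace with your deployment's values.
kg = KnowledgeGraphQuery("bolt://localhost:7687", "neo4j", "your-password")
try:
    pass  # run queries here
finally:
    kg.close()  # releases the driver's connection pool
```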
3. Basic Cypher Query (Nodes & Edges)
To query standard relationships (e.g., "Find all documents written by Author X"), you use the .run() method within a session.
```python
# Add this as a method of KnowledgeGraphQuery.
def get_docs_by_author(self, author_name):
    query = """
    MATCH (a:Author {name: $name})-[:WROTE]->(d:Document)
    RETURN d.title AS title, d.summary AS summary
    """
    with self.driver.session() as session:
        result = session.run(query, name=author_name)
        return [record.data() for record in result]
```
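A call might then look like this (the author name is purely illustrative):

```python
docs = kg.get_docs_by_author("Ada Lovelace")  # illustrative author name
for doc in docs:
    print(doc["title"], "-", doc["summary"])
```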
4. Querying Embedded Documents (Vector Search)
If your documents contain high-dimensional vectors (embeddings), you don't use a standard MATCH. Instead, you use Neo4j’s Vector Index.
Note: This assumes you have already created a vector index in Neo4j (e.g., named `document_embeddings`).
```python
# Add this as a method of KnowledgeGraphQuery.
def search_semantic_docs(self, query_vector, top_k=3):
    # query_vector must be a plain Python list of floats whose length
    # matches the dimension of the 'document_embeddings' index.
    query = """
    CALL db.index.vector.queryNodes('document_embeddings', $k, $vector)
    YIELD node AS doc, score
    MATCH (doc)-[:HAS_TOPIC]->(t:Topic)
    RETURN doc.title AS title, score, collect(t.name) AS topics
    """
    with self.driver.session() as session:
        result = session.run(query, vector=query_vector, k=top_k)
        return [record.data() for record in result]
```
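To produce `query_vector`, encode the question text with the same model that generated your stored embeddings; otherwise the similarity scores are meaningless. A minimal sketch using sentence-transformers (the model name is an assumption; use whichever model your index was actually built with):

```python
from sentence_transformers import SentenceTransformer

# Assumption: the stored document embeddings were produced by this model.
model = SentenceTransformer("all-MiniLM-L6-v2")

# .encode() returns a NumPy array; the driver expects a plain list of floats.
query_vector = model.encode("How do vector indexes handle similarity?").tolist()
results = kg.search_semantic_docs(query_vector, top_k=3)
```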
5. The Full Workflow Example
Here is how you pull it all together (a code sketch follows the table):
| Step | Action | Description |
|---|---|---|
| 1 | Embed Query | Convert the user's natural language question into a vector. |
| 2 | Vector Search | Find the top N most similar document nodes. |
| 3 | Graph Traversal | Traverse edges from those documents to find related entities (Authors, Topics, Dates). |
| 4 | Return/Augment | Use the results to answer the query or feed into an LLM (RAG). |
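Put into code, the pipeline might look like this sketch. It reuses the `kg` instance and the `model` embedding object from the sketches above, and only stubs out the RAG hand-off at step 4:

```python
def answer_question(kg, model, question, top_k=3):
    # Step 1: embed the natural-language question.
    query_vector = model.encode(question).tolist()

    # Steps 2 & 3: vector search plus graph traversal happen in a single
    # Cypher call inside search_semantic_docs().
    hits = kg.search_semantic_docs(query_vector, top_k=top_k)

    # Step 4: format the hits as an answer, or as context for an LLM (RAG) prompt.
    return "\n".join(
        f"{h['title']} (score {h['score']:.3f}), topics: {h['topics']}" for h in hits
    )
```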
Pro-Tip: Handling Data Types
Neo4j is strict about parameter and property types. When passing parameters from Python (a short illustration follows):
- Vector parameters must be plain lists of floats (convert NumPy arrays with .tolist()).
- Dictionaries are the cleanest way to pass node properties.
- Convert results with .data() or .values() to move them from Neo4j driver objects into standard Python types.
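A quick illustration of all three points, reusing the driver from the `kg` instance above (the property values and the three-element vector are illustrative; a real vector's length must match your index's dimension):

```python
props = {"title": "Graph RAG Primer", "year": 2025}  # dict -> node properties
embedding = [0.12, -0.05, 0.33]  # list of floats -> vector property (illustrative)

with kg.driver.session() as session:
    session.run(
        "MERGE (d:Document {title: $title}) "
        "SET d += $props, d.embedding = $embedding",
        title=props["title"],
        props=props,
        embedding=embedding,
    )
    rows = session.run("MATCH (d:Document) RETURN d.title AS title").data()
    # rows is now a list of plain dicts, e.g. [{'title': 'Graph RAG Primer'}]
```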
Would you like me to show you how to write the Cypher command to create the Vector Index for your embedded documents first?
Describe how to use the LLM to inform the Cypher query and for the Cypher query results to inform a follow-up LLM prompt for interpretation
This workflow is the backbone of GraphRAG (Graph Retrieval-Augmented Generation). It turns a natural language question into a precise database query and then translates the raw data back into a human-friendly answer.
Here is the two-stage "sandwich" approach:
Stage 1: Natural Language to Cypher (The "Text-to-Graph" Step)
Instead of hard-coding queries, you use the LLM as a translator. For this to work, the LLM needs to know your Schema (the labels, properties, and relationship types in your graph).
The Workflow:
1. Extract Schema: Pull the graph metadata (e.g., `CALL apoc.meta.schema()`).
2. Prompt the LLM: Provide the schema and the user's question.
3. Sanitize: Clean the LLM's output to ensure it's a valid Cypher string.
Example Prompt Logic:
"You are a Cypher expert. Given this schema:
(User)-[:PURCHASED]->(Product).Convert this question to Cypher: 'What products did users from Texas buy in 2025?'
Return ONLY the Cypher query."
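A minimal Stage 1 sketch. Here `llm_complete` is a hypothetical stand-in for whatever chat-completion call you use (OpenAI, Gemini, a local model), and the sanitization is deliberately simple:

```python
def generate_cypher(llm_complete, question, schema):
    # llm_complete(prompt) -> str is a hypothetical wrapper around your LLM API.
    prompt = (
        "You are a Cypher expert. Given this schema:\n"
        f"{schema}\n"
        f"Convert this question to Cypher: {question}\n"
        "Return ONLY the Cypher query."
    )
    raw = llm_complete(prompt).strip()
    # Sanitize: models often wrap output in markdown fences; strip them.
    if raw.startswith("```"):
        raw = raw.strip("`").removeprefix("cypher").strip()
    return raw
```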
Stage 2: Results to Interpretation (The "Graph-to-Text" Step)
Once Neo4j returns a list of nodes, edges, or paths, they are often in a messy JSON format. You feed this "context" back to the LLM to synthesize the final answer.
Step-by-Step Implementation in Python
```python
# 'llm' is a placeholder object wrapping your model's API;
# its generate_cypher() is sketched in Stage 1 above.

# 1. GENERATE THE QUERY
schema_info = "Node: Document {title, text_vector}, Node: Concept {name}. Edge: MENTIONS"
user_prompt = "Find documents that mention 'Quantum Computing' and summarize their connection."

# LLM call 1 returns, e.g.:
# "MATCH (d:Document)-[:MENTIONS]->(c:Concept {name: 'Quantum Computing'}) RETURN d.title"
cypher_query = llm.generate_cypher(user_prompt, schema_info)

# 2. EXECUTE IN NEO4J
with driver.session() as session:
    graph_results = session.run(cypher_query).data()

# 3. INTERPRET WITH LLM
interpretation_prompt = f"""
The user asked: {user_prompt}
I found the following data in the knowledge graph: {graph_results}
Based on these specific database results, provide a clear, conversational answer.
If the data is empty, say you don't know.
"""
final_answer = llm.ask(interpretation_prompt)
print(final_answer)
```
Why this "Loop" is Powerful
By using this bi-directional flow, you solve the two biggest problems in AI search:
| Problem | Solution with Graph + LLM |
|---|---|
| Hallucination | The LLM is forced to base its answer on real facts retrieved from your nodes/edges. |
| Complex Reasoning | The LLM can't "see" relationships easily; Cypher can traverse 5+ hops of relationships to find hidden patterns the LLM would miss. |
| Semantic Gap | Cypher is rigid; the LLM helps by "fuzzy matching" user intent to the correct graph labels. |
Key Considerations
- Prompt Injection: Never execute LLM-generated Cypher directly in a production environment without a "read-only" user role or a validation layer to block commands like `DETACH DELETE` (a minimal guard is sketched below).
- Context Window: If your Cypher query returns 500 nodes, don't dump all of them into the follow-up prompt. Use Python to filter for the most relevant fields first.
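As a rough illustration of that validation layer, you could reject any generated query containing write or DDL keywords before executing it. A denylist like this is a first line of defense only, not a substitute for a read-only database role:

```python
import re

# Crude denylist of write/DDL keywords in LLM-generated Cypher.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP)\b", re.IGNORECASE
)

def assert_read_only(cypher: str) -> str:
    # Raise before the query ever reaches the database.
    if WRITE_CLAUSES.search(cypher):
        raise ValueError(f"Refusing to execute non-read-only Cypher: {cypher!r}")
    return cypher
```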
Would you like a specific system prompt template to help the LLM generate more accurate Cypher queries?
From <https://gemini.google.com/app/> Google Gemini 3.0 Pro (Thinking)
Artemis III Integrated Hazard Reports
Schema:
Entities: Program, Mission, Hazard_Report, Subsystem, Hazard_Cause, Hazard_Cause_Control, Hazard_Cause_Control_Verif, Document
Relationships:
```
(Program)-[:HAS_MISSION]->(Mission)
(Mission)-[:HAS_SUBSYSTEM]->(Subsystem)
(Mission)-[:HAS_HR]->(Hazard_Report)
(Hazard_Report)-[:REFS_SUBSYSTEM]->(Subsystem)
(Hazard_Report)-[:HAS_CAUSE]->(Hazard_Cause)
(Hazard_Cause)-[:HAS_CONTROL]->(Hazard_Cause_Control)
(Hazard_Cause_Control)-[:HAS_VERIF]->(Hazard_Cause_Control_Verif)
(Mission)-[:HAS_DOC]->(Document)
(Hazard_Report)-[:HAS_DOC]->(Document)
(Hazard_Cause)-[:HAS_DOC]->(Document)
(Hazard_Cause_Control)-[:HAS_DOC]->(Document)
(Hazard_Cause_Control_Verif)-[:HAS_DOC]->(Document)
```
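Given this schema, a representative traversal from a mission down to its verifications might look like the sketch below. The property names (hr.title, c.description, ctl.description, v.status, and the name properties on Program and Mission) are assumptions; substitute whatever your graph actually stores:

```python
query = """
MATCH (p:Program {name: $program})-[:HAS_MISSION]->(m:Mission {name: $mission}),
      (m)-[:HAS_HR]->(hr:Hazard_Report)-[:HAS_CAUSE]->(c:Hazard_Cause),
      (c)-[:HAS_CONTROL]->(ctl:Hazard_Cause_Control)
      -[:HAS_VERIF]->(v:Hazard_Cause_Control_Verif)
RETURN hr.title AS hazard_report, c.description AS cause,
       ctl.description AS control, v.status AS verification_status
"""
with driver.session() as session:
    rows = session.run(query, program="Artemis", mission="Artemis III").data()
```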