#4161 Knowledge Graph of U.S. Constitution and Bill of Rights; +Website
#4161 Explain step-by-step how to implement GraphRAG in Python. Include sequence of queries, with both pre-processing and post-processing.
Implementing GraphRAG (Graph Retrieval-Augmented Generation) is a powerful way to enhance standard RAG systems. While standard RAG retrieves isolated text chunks using vector similarity, GraphRAG builds a Knowledge Graph (KG) from your documents, allowing the LLM to understand complex, multi-hop relationships and answer broad, thematic questions ("Global Search") as well as specific entity questions ("Local Search").
Here is a step-by-step breakdown of how to implement the GraphRAG pipeline in Python, from pre-processing (indexing) to querying and post-processing.
Phase 1: Pre-Processing (The Indexing Pipeline)
This phase transforms your raw text documents into a structured Knowledge Graph with summarized communities.
1. Document Ingestion & Chunking
First, you need to break your source documents into manageable pieces.
Action: Load your text (PDFs, text files) and split them into chunks. Because GraphRAG relies heavily on LLM extraction, slightly larger chunks (e.g., 600–1200 tokens) often work better than tiny semantic chunks.
Python Libraries:
langchain.text_splitter, llama_index.
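As a minimal sketch of this step (assuming LangChain is installed and a local constitution.txt file; the sizes are illustrative, and RecursiveCharacterTextSplitter counts characters rather than tokens):
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Character-based splitter; at roughly 4 characters per token, 3200 chars is about 800 tokens
splitter = RecursiveCharacterTextSplitter(chunk_size=3200, chunk_overlap=200)

with open("constitution.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

chunks = splitter.split_text(raw_text)
print(f"Created {len(chunks)} chunks")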
2. Entity & Relationship Extraction (LLM Query 1)
This is where the "Graph" begins. You pass each chunk to an LLM and ask it to extract nodes (entities) and edges (relationships).
Action: Prompt the LLM to identify key entities (People, Organizations, Concepts) and how they relate.
Example Prompt: "Extract all entities and their relationships from the following text. Output in JSON format:
[{"source": "Entity A", "target": "Entity B", "relationship": "description of relation"}]"Python Libraries:
openai,pydantic(for structured output).
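A minimal sketch of that extraction call, assuming an OPENAI_API_KEY in the environment; the Triple model, prompt wording, and model name are illustrative choices, and a production pipeline would add retries and stricter JSON handling:
import json
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

class Triple(BaseModel):
    source: str
    target: str
    relationship: str

def extract_triples(chunk: str) -> list[Triple]:
    """Ask the LLM for source/target/relationship triples and validate them with Pydantic."""
    prompt = (
        "Extract all entities and their relationships from the following text. "
        'Output only a JSON array like [{"source": "...", "target": "...", "relationship": "..."}].\n\n'
        + chunk
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    triples = []
    for item in json.loads(response.choices[0].message.content):
        try:
            triples.append(Triple(**item))
        except Exception:
            continue  # skip malformed items rather than failing the whole chunk
    return triples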
3. Graph Construction & Resolution
Combine the extracted entities and relationships into a mathematical graph structure.
Action: Initialize a graph. Add nodes and edges. You must also perform Entity Resolution (e.g., merging "IBM", "International Business Machines", and "IBM Corp" into a single node) using string matching or LLM evaluation.
Python Libraries:
networkx (for in-memory graphs) or neo4j (for persistent graph databases).
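A minimal sketch of the construction step with networkx, consuming the triples from the extraction sketch above; the alias table is a stand-in for a real entity-resolution pass (fuzzy string matching or an LLM judgment):
import networkx as nx

def build_graph(triples) -> nx.DiGraph:
    """Build a directed graph from extracted triples, with a naive entity-resolution pass."""
    # Hypothetical alias table; in practice this comes from fuzzy string matching
    # (e.g. rapidfuzz) or an LLM pass that groups name variants.
    aliases = {"ibm corp": "ibm", "international business machines": "ibm"}

    def resolve(name: str) -> str:
        key = name.strip().lower()
        return aliases.get(key, key)

    G = nx.DiGraph()
    for t in triples:
        src, dst = resolve(t.source), resolve(t.target)
        # Store the relationship description on the edge for later summarization
        G.add_edge(src, dst, relationship=t.relationship)
    return G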
4. Community Detection
To answer broad questions about the entire dataset, GraphRAG groups related entities into "communities."
Action: Run a hierarchical clustering algorithm on your graph. The Leiden algorithm is the standard for GraphRAG, as it efficiently groups densely connected nodes.
Python Libraries:
graspologic, cdlib, or igraph.
5. Community Summarization (LLM Query 2)
This is the secret sauce of GraphRAG's "Global Search."
Action: For each community detected in Step 4, gather the descriptions of its nodes and edges, and pass them to the LLM to generate a summary of that community.
Example Prompt: "Given the following entities and relationships that form a community, write a comprehensive summary of what this community represents and its key themes."
Phase 2: The Query Sequence (Retrieval)
When a user asks a question, GraphRAG decides between a Local Search (specific entities) or a Global Search (dataset-wide themes).
6. User Query Processing (LLM Query 3 - Optional)
Action: Pass the user's query to an LLM to extract key entities from the question itself, or use an LLM router to decide if the question requires a Local or Global search.
7A. Local Search Retrieval (For specific questions)
Action: Match the entities in the user's query to nodes in your Knowledge Graph. Retrieve those nodes, their immediate neighbors (1-hop or 2-hop edges), and the original text chunks associated with them.
Bonus: You can also run standard Vector Search in parallel and combine the results.
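A minimal sketch of that local retrieval with networkx, assuming each node carries a source_chunks attribute pointing back to the chunks it was extracted from (a convention of this sketch, not a library feature):
import networkx as nx

def local_search_context(G: nx.DiGraph, query_entities, hops: int = 1):
    """Collect the matched entities, their n-hop neighbourhood, and the chunks attached to them."""
    context_nodes = set()
    undirected = G.to_undirected()
    for entity in query_entities:
        if entity in G:
            # ego_graph returns the node plus everything within `hops` edges of it
            context_nodes.update(nx.ego_graph(undirected, entity, radius=hops).nodes)
    subgraph = G.subgraph(context_nodes)
    # Each node is assumed to carry a "source_chunks" list added during indexing
    chunk_ids = set()
    for _, data in subgraph.nodes(data=True):
        chunk_ids.update(data.get("source_chunks", []))
    return subgraph, chunk_ids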
7B. Global Search Retrieval (For broad questions like "What are the main themes of this dataset?")
Action: Retrieve the Community Summaries generated in Step 5. Since there might be many summaries, you often process them in batches or rank them based on relevance to the query.
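Here is a simplified map-reduce sketch over those summaries (assuming the community_reports dictionary built during summarization and an OpenAI client; the prompts and batch size are illustrative):
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def global_search(question, community_reports, batch_size=5):
    """Map step: ask for a partial answer per batch of summaries. Reduce step: merge them."""
    summaries = list(community_reports.values())
    partial_answers = []
    for i in range(0, len(summaries), batch_size):
        batch = "\n\n".join(summaries[i:i + batch_size])
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0,
            messages=[{"role": "user", "content":
                f"Community summaries:\n{batch}\n\nQuestion: {question}\n"
                "Answer using only these summaries; reply 'not relevant' if they do not help."}],
        )
        partial_answers.append(resp.choices[0].message.content)
    # Reduce: combine the partial answers into one final response
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content":
            f"Question: {question}\n\nCombine these partial answers into one coherent answer:\n\n"
            + "\n\n".join(partial_answers)}],
    )
    return resp.choices[0].message.content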
Phase 3: Post-Processing (Generation & Synthesis)
8. Context Assembly
Action: Format the retrieved graph data into a readable context window for the LLM. Instead of just pasting text chunks, you provide a structured list of relationships.
Example format:
- Entities: Entity A (Description), Entity B (Description)
- Relationships: Entity A -> [Relationship] -> Entity B
- Relevant Summaries: [Community Summary Text]
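A small sketch of that assembly step, assuming the subgraph from local search and a list of relevant community summaries; the attribute names ("description", "relationship") are conventions of this sketch:
def assemble_context(subgraph, community_summaries):
    """Format retrieved graph data into the structured block shown above."""
    lines = ["Entities:"]
    for node, data in subgraph.nodes(data=True):
        lines.append(f"- {node}: {data.get('description', '')}")
    lines.append("Relationships:")
    for src, dst, data in subgraph.edges(data=True):
        lines.append(f"- {src} -> [{data.get('relationship', 'RELATES_TO')}] -> {dst}")
    lines.append("Relevant Summaries:")
    lines.extend(f"- {summary}" for summary in community_summaries)
    return "\n".join(lines)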
9. Final Answer Generation (LLM Query 4)
Action: Prompt the LLM with the assembled context and the user's original query to generate the final response.
Example Prompt: "Using only the provided graph entities, relationships, and community summaries, answer the user's question. Cite the specific nodes or chunks you used to formulate your answer."
10. Formatting and Citations
Action: Parse the LLM's output. If the LLM referenced specific nodes or community summaries, map those back to the original source documents to provide the user with exact citations and confidence scores.
Summary of the LLM Query Sequence:
Pre-processing: Extract Entities/Edges → Pre-processing: Summarize Graph Communities.
Querying: Extract Entities from User Query.
Post-processing: Synthesize Final Answer from Graph Context.
Would you like me to provide a minimal, working Python script using NetworkX and LangChain that demonstrates the extraction and graph-building steps?
What does it mean, "Run a hierarchical clustering algorithm on your graph."? What does it actually do?
That phrase is definitely a bit of dense data-science jargon! Let’s strip away the technical terms and look at what it actually does.
When you build a Knowledge Graph from your documents, you end up with a massive, tangled web of thousands of entities (people, places, concepts) connected by lines (relationships). If you ask an AI to "look at the graph and tell me the main themes," it gets overwhelmed.
Running a hierarchical clustering algorithm is basically asking the computer to organize that giant tangled web into neat, nested folders based on who hangs out with whom.
Here is the breakdown of what that actually means and what the algorithm does in practice:
1. The "Clustering" Part (Finding the Cliques)
Imagine your graph is a map of a high school social network.
What it does: The algorithm scans the whole network looking for "dense" areas. It searches for groups of nodes (students) that have a ton of connections to each other, but very few connections to the outside graph.
The Result: It identifies the distinct cliques—the theater kids, the basketball team, the debate club. In a business document, it might group together "HR Policies," "Payroll Systems," and "Employee Benefits" because they are constantly mentioned together.
2. The "Hierarchical" Part (The Nesting Dolls)
"Hierarchical" means the algorithm doesn't just stop at making one set of groups; it builds layers of groups within groups, like Russian nesting dolls or a corporate org chart.
Here is how the algorithm builds those layers:
Level 0 (The Raw Graph): Every individual entity and relationship. It's too noisy to read.
Level 1 (Micro-Communities): The algorithm groups the closest, most tightly knit entities together. (Example: It groups three specific engineers and the specific software feature they are building).
Level 2 (Macro-Communities): The algorithm zooms out. It looks at the Level 1 groups and says, "Which of these groups interact heavily?" It clumps them together. (Example: It groups several software feature teams together into an "Engineering Department" node).
Level 3 (Global Communities): It zooms out again, grouping the Level 2 clusters until you have just a few massive categories that cover the whole dataset. (Example: Grouping Engineering, Sales, and HR into "Company Operations").
Why do we do this in GraphRAG?
We do this so the LLM can generate summaries at every single level.
Once the algorithm finishes grouping everything, the GraphRAG system asks the LLM to write a summary for every Level 1 clique, then every Level 2 department, and so on.
Because of this, if a user asks a highly specific question ("What is Engineer Bob working on?"), the AI can pull the Level 1 summary. But if a user asks a massive, global question ("What are the biggest challenges facing the company overall?"), the AI doesn't have to read a million individual data points—it just reads the Level 3 summaries, which already contain the big-picture themes.
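Here is a tiny, non-visual sketch of what the algorithm returns on a toy graph (assuming graspologic is installed; the graph and the low max_cluster_size are purely illustrative, and the exact clusters depend on what Leiden finds):
import networkx as nx
from graspologic.partition import hierarchical_leiden

# Two tight triangles joined by a single "bridge" edge
G = nx.Graph()
G.add_edges_from(
    [("a", "b"), ("b", "c"), ("a", "c"),   # triangle 1
     ("x", "y"), ("y", "z"), ("x", "z"),   # triangle 2
     ("c", "x")],                          # weak bridge
    weight=1.0,
)

# Each assignment records which node sits in which cluster at which level;
# clusters larger than max_cluster_size get split again at a deeper level.
for assignment in hierarchical_leiden(G, max_cluster_size=3):
    print(assignment.level, assignment.node, assignment.cluster)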
Would you like to see a quick, visual example in Python using a toy dataset (like a 10-node graph) to show exactly how it groups them?
I understand nodes, edges and properties of each. How are cluster groupings and levels accomplished within the database tool?
That is the exact right question to ask next. It’s one thing to understand the concept of a cluster, but how do you actually represent a "folder within a folder" when your database tool only allows for nodes, edges, and properties?
In graph databases (like Neo4j) or memory libraries (like NetworkX), we don't actually have "folders." Instead, we use the graph's native building blocks to construct the hierarchy.
There are two primary ways this is accomplished under the hood: The Property-Based Approach and The Structural (Meta-Node) Approach.
1. The Property-Based Approach (Tagging)
This is the simplest method. After your Python script runs the clustering algorithm (like Leiden) in memory, it figures out which nodes belong to which group at which level. It then updates the original nodes in the database by injecting new properties.
How it works: You add key-value pairs to the existing entities.
Database state:
- Node: Bob
- Properties: {"type": "Person", "Level_1_Community": "C_14", "Level_2_Community": "C_A"}
Why use it: It’s incredibly fast for querying. If you want to pull all nodes in a specific micro-community, you just query the database to filter where Level_1_Community == "C_14".
The downside: It doesn't give you a clean place to store the LLM-generated summaries of those communities.
2. The Structural Approach (Meta-Nodes)
This is the standard approach for true GraphRAG. Instead of just tagging existing nodes, we actually create brand new nodes in the database to represent the clusters themselves. We call these "Meta-nodes" or "Community Nodes."
Here is how the hierarchy is physically built step-by-step in the database:
Step A: The Base Layer (Level 0) You have your original extracted nodes and edges.
[Node: Person {name: "Bob"}][Node: Concept {name: "GraphRAG"}]
Step B: Creating Level 1 Communities
The algorithm decides Bob and GraphRAG belong together. Your script tells the database to create a new node to represent this cluster, and draws a new type of edge (e.g., IN_COMMUNITY) connecting the base nodes to it.
New Node:
[Node: Community {id: "Level_1_Comm_A", level: 1}]New Edges: *
(Bob) -[:IN_COMMUNITY]-> (Level_1_Comm_A)(GraphRAG) -[:IN_COMMUNITY]-> (Level_1_Comm_A)
Storing the Summary: You generate the summary for this cluster and store it directly as a property on this new node:
Level_1_Comm_A.summary = "Bob is learning about GraphRAG."
Step C: Nesting to Level 2 To create the next level of the hierarchy, you repeat the process, but this time you cluster the Level 1 Community Nodes together.
New Node:
[Node: Community {id: "Level_2_Comm_X", level: 2}]New Edges: *
(Level_1_Comm_A) -[:PARENT_COMMUNITY]-> (Level_2_Comm_X)(Level_1_Comm_B) -[:PARENT_COMMUNITY]-> (Level_2_Comm_X)
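A minimal sketch of Steps B and C executed from Python with the neo4j driver; the labels, relationship type, and summary text mirror the example above, and the connection details are placeholders:
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your-password"))

CREATE_COMMUNITY = """
MERGE (c:Community {id: $community_id})
SET c.level = $level, c.summary = $summary
WITH c
MATCH (e:Entity) WHERE e.name IN $members
MERGE (e)-[:IN_COMMUNITY]->(c)
"""

with driver.session() as session:
    # Create the Level 1 community node, attach its summary, and link its members
    session.run(
        CREATE_COMMUNITY,
        community_id="Level_1_Comm_A",
        level=1,
        summary="Bob is learning about GraphRAG.",
        members=["Bob", "GraphRAG"],
    )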
How the Database Tool Actually Executes This
Usually, the database tool itself isn't doing the heavy math.
You pull the nodes and edges out of the database into Python.
You run the clustering algorithm using a library like graspologic.
You pass the results back to the database, writing the new Community Nodes and IN_COMMUNITY edges.
(Note: Enterprise tools like Neo4j have built-in plugins—like the Graph Data Science (GDS) library—that can run algorithms like Leiden directly inside the database using a single Cypher query, automatically creating these properties or relationships without data leaving the server).
Would you like to see what the actual Cypher query (for Neo4j) looks like to retrieve a Level 2 community summary and all the Level 0 entities inside it?
Would this be useful, and if so how would be applied to a knowledge graph of the U.S. Constitution and Bill of Rights?
Yes, it would be incredibly useful! In fact, legal and governance documents like the U.S. Constitution and the Bill of Rights are perfect candidates for GraphRAG.
Standard RAG (vector search) struggles with legal documents because it relies on breaking text into isolated chunks.
GraphRAG solves this by mapping the explicitly interconnected nature of the law. Here is exactly how the hierarchical clustering we just discussed would be applied to the U.S. Constitution.
Level 0: The Base Graph (The Raw Data)
First, the AI reads the Constitution and extracts the raw entities and how they relate.
Nodes (Entities):
- Structural: Article I, Article II, 1st Amendment, 14th Amendment.
- Actors: The President, Congress, The Supreme Court, The States, Citizens.
- Concepts: Veto, Impeachment, Due Process, Freedom of Speech, Treason.
Edges (Relationships):
- [Congress] -OVERRIDES-> [Veto]
- [President] -NOMINATES-> [Supreme Court Justices]
- [21st Amendment] -REPEALS-> [18th Amendment]
- [5th Amendment] -GUARANTEES-> [Due Process]
Level 1: Micro-Communities (Specific Legal Mechanisms)
Once the base graph is built, the clustering algorithm looks for dense connections. It will naturally group related clauses and amendments that govern a specific mechanism, even if they were written centuries apart.
Cluster A (The Criminal Justice Cluster): The algorithm groups the 4th (search & seizure), 5th (self-incrimination/due process), 6th (speedy trial), and 8th (cruel & unusual punishment) Amendments together. The LLM writes a summary: "This community defines the constitutional rights of individuals accused of crimes and the limits of law enforcement."
Cluster B (The Electoral Cluster): It groups Article II (Executive Branch), the 12th Amendment (Electoral College revision), and the 20th Amendment (Lame Duck period). The LLM writes a summary: "This community dictates the rules, timeline, and modification of presidential elections."
Level 2: Macro-Communities (Broad Constitutional Themes)
The algorithm zooms out and looks at how the Level 1 clusters interact with each other. This is where the profound thematic structure of the Constitution emerges.
Cluster X (Separation of Powers): It groups the Legislative, Executive, and Judicial micro-clusters together, because there are dozens of "edges" (like vetoes, confirmations, and judicial review) connecting them. The LLM summary explains the Checks and Balances system.
Cluster Y (Federalism & State Powers): It groups the 10th Amendment (powers reserved to states), Article IV (state relations), and the Supremacy Clause.
Cluster Z (Civil Liberties): It groups the First Amendment cluster, the Criminal Justice cluster, and the 14th Amendment (Equal Protection).
The Practical Benefit for the User
Because you have created these nested groupings and summarized them, your system can now flawlessly answer two entirely different types of questions:
Local Search (Specific): "What is the standard for treason?" The system finds the specific [Treason] node in Level 0 and gives you the exact definition from Article III, Section 3.
Global Search (Thematic): "How did the balance of power between the Federal Government and the States change over time?" A standard RAG would fail completely here. But GraphRAG simply pulls the Level 2 summary for "Federalism & State Powers" and the summary for "Civil Liberties" (which includes the post-Civil War 14th Amendment), allowing the LLM to give a comprehensive, highly accurate essay on how federal power expanded.
To make this work well, you usually have to give the AI a strict "Ontology" (a blueprint of what types of nodes and edges it is allowed to create) before it reads the document.
Would you like to see what an Ontology blueprint for a legal document looks like in Python?
Yes
This is where the magic of GraphRAG really happens. To do this, we are going to use a library called graspologic.
(Fun fact: graspologic is the exact library Microsoft uses under the hood for their official, open-source GraphRAG implementation because it has a built-in hierarchical_leiden function).
Here is the exact code to run the algorithm on our Constitution graph and build those nested "folders."
The Community Detection Script
Make sure you have the library installed (pip install graspologic).
import networkx as nx
from graspologic.partition import hierarchical_leiden
# 1. Convert to Undirected Graph
# The Leiden algorithm looks at structural density (who is hanging out with whom).
# It works best when we ignore the "direction" of the relationship and just look
# at the fact that a connection exists.
G_undirected = G.to_undirected()
# 2. Run the Hierarchical Leiden Algorithm!
# This single line of code does all the heavy mathematical lifting. It finds the
# cliques, groups them, zooms out, and groups them again into levels.
community_partitions = hierarchical_leiden(G_undirected)
# 3. Inject the results back into our original Directed Graph
# 'community_partitions' gives us a list of assignments containing the Node ID,
# the Level of the hierarchy, and the Community ID it belongs to.
print("--- CLUSTERING RESULTS ---")
for partition in community_partitions:
node_id = partition.node
level = partition.level
cluster_id = partition.cluster
# We update the original NetworkX node with its new "folder" tags
# Example: G.nodes["president_us"]["Level_1_Community"] = "Cluster_4"
community_key = f"Level_{level}_Community"
G.nodes[node_id][community_key] = cluster_id
print(f"Node: '{node_id}' --> Assigned to Level {level}, Cluster {cluster_id}")
What just happened?
When you run this script, the algorithm scans the network and assigns every single node to a nested cluster. The output will look something like this:
--- CLUSTERING RESULTS ---
Node: 'president_us' --> Assigned to Level 0, Cluster 1
Node: 'presidential_veto' --> Assigned to Level 0, Cluster 1
Node: 'congress' --> Assigned to Level 0, Cluster 2
...
Node: 'president_us' --> Assigned to Level 1, Cluster 8
Node: 'presidential_veto' --> Assigned to Level 1, Cluster 8
Node: 'congress' --> Assigned to Level 1, Cluster 8
Notice what happened in Level 1? At Level 0 (the micro-level), the President and Congress were in separate clusters. But at Level 1 (the macro-level), the algorithm recognized they share so many edges (vetoes, override votes, impeachments) that it merged them into a single massive cluster: Cluster 8.
Your in-memory NetworkX graph is now fully partitioned. You have successfully grouped all the raw data into highly accurate, thematic buckets.
The final step of the pre-processing pipeline is to take all the entities inside "Cluster 8", bundle their text together, and ask the LLM to write a Community Summary for it.
Would you like me to show you the prompt and Python code to generate those Global Search summaries?
Yes
Here is the Python code and the prompt engineering required to generate the Community Summaries.
This is the final step of the indexing phase. By creating these summaries, you are essentially pre-computing the answers to broad, global questions so the LLM doesn't have to read thousands of raw entities at query time.
The Community Summarization Script
This script loops through the "folders" (clusters) we just created, gathers all the entities and relationships inside them, and asks the LLM to write a comprehensive report.
import os
from openai import OpenAI
import networkx as nx
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
def generate_community_summary(community_id: int, subgraph: nx.DiGraph) -> str:
"""Gathers all data in a cluster and prompts the LLM for a summary."""
# 1. Gather all nodes (entities) in this community
entity_descriptions = []
for node_id, data in subgraph.nodes(data=True):
entity_descriptions.append(
f"- {data.get('name', node_id)} ({data.get('entity_type', 'Unknown')}): {data.get('description', '')}"
)
# 2. Gather all edges (relationships) in this community
relationship_descriptions = []
for source, target, data in subgraph.edges(data=True):
rel_type = data.get('relationship_type', 'RELATES_TO')
citation = data.get('citation', 'No citation')
relationship_descriptions.append(
f"- {source} -> {rel_type} -> {target} [Source: {citation}]"
)
# 3. Format the context for the LLM
context_string = "### Entities in this Community:\n" + "\n".join(entity_descriptions) + "\n\n"
context_string += "### Relationships in this Community:\n" + "\n".join(relationship_descriptions)
# 4. The Summarization Prompt (This is the industry-standard style for GraphRAG)
prompt = f"""
You are an expert legal analyst. You are provided with a localized community of
extracted entities and their relationships from the U.S. Constitution.
Your task is to write a comprehensive summary of this community.
Focus on:
1. What is the overarching theme or legal mechanism this community represents?
2. How do the main actors interact within this community?
3. What are the key limitations or powers defined here?
Community Data:
{context_string}
Write a 2-3 paragraph summary explaining the holistic meaning of this network.
"""
# 5. Call the LLM
response = client.chat.completions.create(
model="gpt-4o-mini", # Mini is great and cost-effective for summarization
messages=[{"role": "user", "content": prompt}],
temperature=0.3
)
return response.choices[0].message.content
# --- Execution ---
# Let's say we want to summarize the "Level 1" communities
communities_at_level_1 = set(
data.get("Level_1_Community") for _, data in G.nodes(data=True) if "Level_1_Community" in data
)
# Create a dictionary to store our generated summaries
community_reports = {}
print("Generating Community Summaries...")
for cluster_id in communities_at_level_1:
# Get all nodes that belong to this specific cluster
nodes_in_cluster = [
node for node, data in G.nodes(data=True)
if data.get("Level_1_Community") == cluster_id
]
# Create a sub-graph of just this community
subgraph = G.subgraph(nodes_in_cluster)
# Generate the summary
summary = generate_community_summary(cluster_id, subgraph)
community_reports[cluster_id] = summary
print(f"\n--- Summary for Cluster {cluster_id} ---")
print(summary)
What This Does to Your System
If you run this on the Constitution, the LLM looks at the raw data for Cluster 8 (which contains the President, Congress, Vetoes, and Override Votes) and outputs something like this:
Summary for Cluster 8
This community represents the legislative check-and-balance mechanism known as the Presidential Veto. The primary actors are Congress and the President of the United States.
Under Article I, Section 7, Congress is granted the power to pass legislation, but the President is granted the power to limit that legislation by returning it with objections (a veto). However, the network demonstrates that the President's power is not absolute; Congress retains the ultimate authority to override the veto and force the bill into law, establishing a strict balance of power between the executive and legislative branches.
You have officially built the GraphRAG Index!
You now have:
The raw text documents.
A Knowledge Graph of specific entities and relationships.
Pre-computed thematic summaries of how those entities interact.
Would you like to move to Phase 2 and see the code for how to actually query this system when a user asks a question like, "How are laws passed?"
From <https://gemini.google.com/app/> Google Gemini 3.1 (Pro)
Suggest 25 great .com domain names for a website that provides a knowledge graph of the U.S. Constitution and Bill of Rights, and all of the queries and analysis of these documents that a knowledge graph supports.
Structuring the foundational documents of the U.S. into a knowledge graph is a fantastic approach to enabling deep semantic search and complex relationship queries.
Here are 25 .com domain name ideas, categorized by the specific branding angle they highlight for a highly analytical, data-driven application.
Graph & Architecture Focus
These names emphasize the technical structure of the platform, appealing to users who understand nodes, edges, and data relationships.
ConstitutionGraph.com
CivicNodes.com
TheFoundingGraph.com
LibertyEdges.com
GraphingTheRepublic.com
Semantic & Query Focus
These options highlight the interactive nature of the site, focusing on the ability to actively parse, query, and analyze the documents.
SemanticConstitution.com
QueryTheFramers.com
ConstitutionalLogic.com
RightsQuery.com
ParseTheRepublic.com
Analytical & Discovery Focus
These names lean into the insights and discoveries that a well-structured ontology can provide when mapping historical and legal texts.
TheCivicOntology.com
DecodingLiberty.com
AmendmentAnalysis.com
ConstitutionalMapper.com
MappingRights.com
Short & Brandable
Punchy, memorable domains that combine the concepts of law, civics, and data structures into a single, cohesive brand.
ConstiGraph.com
LexGraphia.com
CivicGraph.com
RightsMap.com
FramersData.com
Heritage & Connection Focus
These domains emphasize how the different articles, sections, and amendments relate to and reference one another.
ConnectedConstitution.com
LinkedLiberty.com
FoundingLinks.com
ConstitutionalTies.com
NetworkOfRights.com
- ConstitutionalMapper.com
- ConstitutionalLogic.com
- SemanticConstitution.com
- ConstitutionGraph.com
- AmendmentAnalysis.com
- RightsQuery.org
Would you like me to draft a Python script using a library like spaCy or NetworkX to begin extracting the initial entities and relationships from the text of the Constitution?
Yes
Here is a Python script that builds a foundational knowledge graph using the Preamble and Article I, Section 1 of the Constitution.
It uses spaCy for Natural Language Processing (to extract entities and noun chunks), NetworkX to construct the graph logic, and PyVis to generate an interactive map.
Prerequisites
Before running the script, you will need to install the required libraries and download the spaCy English language model:
pip install spacy networkx pyvis
python -m spacy download en_core_web_sm
The Knowledge Graph Script
import spacy
import networkx as nx
from pyvis.network import Network
# Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_sm")
# Core texts to analyze
documents = {
"Preamble": "We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America.",
"Article I, Section 1": "All legislative Powers herein granted shall be vested in a Congress of the United States, which shall consist of a Senate and House of Representatives."
}
# Initialize a NetworkX graph
G = nx.Graph()
for section_title, text in documents.items():
# Add the document section as a central node
G.add_node(section_title, title=section_title, group="Document", size=25, color="#ff5733")
doc = nlp(text)
# Extract named entities (like "United States", "Senate")
entities = {ent.text for ent in doc.ents}
# Extract significant noun chunks for broader concepts (like "legislative Powers")
chunks = {chunk.text for chunk in doc.noun_chunks if len(chunk.text.split()) < 4}
# Combine entities and chunks
concepts = entities.union(chunks)
for concept in concepts:
# Basic filtering to remove overly short strings or pure stop words
if len(concept) > 3 and concept.lower() not in ["we", "order", "all", "which"]:
clean_concept = concept.replace("\n", " ").strip()
# Add the concept node and connect it to the document section
G.add_node(clean_concept, title=clean_concept, group="Concept", size=15, color="#33C1FF")
G.add_edge(section_title, clean_concept, weight=1)
# Convert NetworkX graph to an interactive PyVis network
net = Network(height="800px", width="100%", bgcolor="#222222", font_color="white", select_menu=True)
net.from_nx(G)
# Toggle physics for a stable layout
net.toggle_physics(True)
# Write out to a standalone HTML file.
# Writing to a separate file and opening it in a new browser tab cleanly
# bypasses iframe sandbox rendering errors often encountered within Jupyter notebooks.
output_file = "constitution_graph.html"
net.write_html(output_file)
print(f"Graph generated successfully: {output_file}")
print("Open this file in your web browser to explore the interactive nodes.")
How it Works
Extraction: The script passes the raw text through spaCy to isolate the meaningful actors and concepts (e.g., "Congress", "legislative Powers", "domestic Tranquility").
Structuring: NetworkX maps the relationships, assigning edges between the foundational text (the Preamble/Article) and the concepts it contains.
Visualization: PyVis renders this logic into an interactive HTML canvas where you can drag nodes, zoom in, and use the select menu to highlight specific entities. Generating a standalone HTML file ensures the graph displays reliably outside of restricted notebook environments.
Would you like to expand this to pull the full text of the Constitution from an API or text file, or would you prefer to refine the NLP extraction logic first?
Full text of the Constitution from a text file
Scaling up to the full text is the right move, but we need to manage how the data is parsed so we don't end up with an unreadable "hairball" of thousands of nodes.
When dealing with a highly structured document like the Constitution, using a regex approach to definitively catch and extract the structural boundaries (Articles, Sections, and Amendments) is much cleaner and more reliable than relying on an NLP model to guess where those sections begin and end. We can use regex to build the backbone of the graph, and then let spaCy extract the specific entities within those chunks.
If you need a clean, plain-text copy of the document, you can grab the UTF-8 version from Project Gutenberg.
Here is the updated script designed to parse a local constitution.txt file:
import spacy
import networkx as nx
from pyvis.network import Network
import re
# Load the NLP model
nlp = spacy.load("en_core_web_sm")
def process_constitution(filepath, max_entities=400):
G = nx.Graph()
try:
with open(filepath, 'r', encoding='utf-8') as f:
text = f.read()
except FileNotFoundError:
print(f"Error: Could not find '{filepath}'. Please ensure the file is in the same directory.")
return None
# Regex to definitively catch document structure (e.g., "Article I", "Section 1", "Amendment I")
header_pattern = re.compile(r'^\s*(Article\s+[IVXLCDM]+|Section\s+\d+|Amendment\s+[IVXLCDM]+)', re.IGNORECASE)
# Split text into paragraphs based on double newlines
paragraphs = [p.strip() for p in text.split('\n\n') if len(p.strip()) > 10]
current_section = "Preamble / Introduction"
entity_count = 0
for para in paragraphs:
# Hard cap to prevent the resulting HTML canvas from locking up the browser
if entity_count >= max_entities:
print(f"Reached cap of {max_entities} entities. Stopping to preserve graph performance.")
break
# Check if the paragraph is a structural header
match = header_pattern.match(para)
if match:
# Update the current section context and skip processing the header text itself
current_section = match.group(1).title()
continue
# Ensure the parent section node exists in the graph
if not G.has_node(current_section):
G.add_node(current_section, title=current_section, group="Document", size=25, color="#ff5733")
# Parse the actual text block for entities
doc = nlp(para)
# Extract named entities, ignoring overly short strings
entities = {ent.text.replace("\n", " ").strip() for ent in doc.ents if len(ent.text.strip()) > 3}
for entity in entities:
# Basic noise filtering (you can expand this exclusion list)
if entity.lower() not in ["united states", "constitution", "shall", "congress"]:
if not G.has_node(entity):
G.add_node(entity, title=entity, group="Concept", size=15, color="#33C1FF")
entity_count += 1
# Link the extracted entity back to its structural section
G.add_edge(current_section, entity, weight=1)
return G
# Build the Graph
graph = process_constitution("constitution.txt", max_entities=500)
if graph:
# Render the interactive map
net = Network(height="800px", width="100%", bgcolor="#222222", font_color="white", select_menu=True)
net.from_nx(graph)
# Toggle physics for a stable layout
net.toggle_physics(True)
# Writing to a standalone HTML file cleanly bypasses sandbox
# rendering errors often encountered in notebook environments.
output_file = "full_constitution_graph.html"
net.write_html(output_file)
print(f"Graph generated successfully: {output_file}")
Key Adjustments Made
Regex Chunking: The script now watches for headers like Article I or Amendment V to build the primary "Document" nodes, naturally connecting the concepts back to their specific constitutional location.
Node Capping: I added a max_entities limit. Physics engines in browser-based graphs start to choke when you push past several hundred nodes simultaneously. You can dial this number up or down depending on your hardware.
Noise Filtering: Added a basic exclusion list so that high-frequency, low-value words don't artificially centralize the graph.
Would you like to explore adding a semantic similarity pass (perhaps using a lightweight embedding model) to merge duplicate concepts like "the Senate" and "Senators" before they are plotted on the graph?
Yes
Adding a semantic similarity pass is a highly effective way to deduplicate the graph. Instead of relying on exact string matches, we can convert the extracted text into vector embeddings and merge nodes that are mathematically close in meaning.
A great tool for this is the sentence-transformers library, which runs locally and is perfect for lightweight, fast clustering of short phrases without needing to call an external LLM.
Prerequisites Update
You will need to install the sentence transformers library and scikit-learn (for calculating the cosine similarity between the vectors):
pip install sentence-transformers scikit-learn
The Semantic Merging Script
This updated script introduces a clustering step. Before anything is added to the graph, it encodes all discovered entities, compares them, and maps variations (like "Senators") to a canonical base concept (like "Senate").
import spacy
import networkx as nx
from pyvis.network import Network
import re
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Load the NLP and Embedding models
nlp = spacy.load("en_core_web_sm")
# all-MiniLM-L6-v2 is fast, lightweight, and great for general semantic similarity
embedder = SentenceTransformer('all-MiniLM-L6-v2')
def merge_similar_entities(entities, similarity_threshold=0.85):
"""
Groups semantically similar entities and returns a mapping dictionary
pointing variations to a single canonical name.
"""
if not entities:
return {}
entity_list = list(entities)
embeddings = embedder.encode(entity_list)
similarity_matrix = cosine_similarity(embeddings)
canonical_mapping = {}
processed_indices = set()
for i in range(len(entity_list)):
if i in processed_indices:
continue
# The first instance becomes our canonical base for this cluster
canonical_name = entity_list[i]
canonical_mapping[canonical_name] = canonical_name
processed_indices.add(i)
# Find all other entities similar to this one
for j in range(i + 1, len(entity_list)):
if j not in processed_indices and similarity_matrix[i][j] >= similarity_threshold:
# Map the variation to the canonical name
canonical_mapping[entity_list[j]] = canonical_name
processed_indices.add(j)
return canonical_mapping
def process_constitution_with_semantics(filepath, max_entities=400):
G = nx.Graph()
try:
with open(filepath, 'r', encoding='utf-8') as f:
text = f.read()
except FileNotFoundError:
print(f"Error: Could not find '{filepath}'")
return None
header_pattern = re.compile(r'^\s*(Article\s+[IVXLCDM]+|Section\s+\d+|Amendment\s+[IVXLCDM]+)', re.IGNORECASE)
paragraphs = [p.strip() for p in text.split('\n\n') if len(p.strip()) > 10]
current_section = "Preamble"
raw_extractions = []
# Pass 1: Extract everything using regex boundaries and spaCy
for para in paragraphs:
match = header_pattern.match(para)
if match:
current_section = match.group(1).title()
continue
doc = nlp(para)
entities = {ent.text.replace("\n", " ").strip() for ent in doc.ents if len(ent.text.strip()) > 3}
for entity in entities:
if entity.lower() not in ["united states", "constitution", "shall", "congress"]:
raw_extractions.append((current_section, entity))
# Pass 2: Semantic Deduplication
unique_entities = set([e for _, e in raw_extractions])
print(f"Discovered {len(unique_entities)} unique raw entities. Running semantic compression...")
entity_mapping = merge_similar_entities(unique_entities, similarity_threshold=0.85)
canonical_entities = set(entity_mapping.values())
print(f"Compressed down to {len(canonical_entities)} canonical entities.")
# Pass 3: Build the Graph using the canonical mappings
entity_count = 0
for section, raw_entity in raw_extractions:
if entity_count >= max_entities:
break
canonical_entity = entity_mapping[raw_entity]
# Add Section Node
if not G.has_node(section):
G.add_node(section, title=section, group="Document", size=25, color="#ff5733")
# Add Canonical Concept Node
if not G.has_node(canonical_entity):
G.add_node(canonical_entity, title=canonical_entity, group="Concept", size=15, color="#33C1FF")
entity_count += 1
# Link them
G.add_edge(section, canonical_entity, weight=1)
return G
# Execute
graph = process_constitution_with_semantics("constitution.txt", max_entities=500)
if graph:
net = Network(height="800px", width="100%", bgcolor="#222222", font_color="white", select_menu=True)
net.from_nx(graph)
net.toggle_physics(True)
output_file = "semantic_constitution_graph.html"
net.write_html(output_file)
print(f"Graph generated successfully: {output_file}")
How the Deduplication Works
Extraction (Pass 1): We sweep through the document, gathering all entities and associating them with their structural chunk (Article, Amendment, etc.).
Vectorization (Pass 2): We feed all those raw strings into the MiniLM model, turning them into dense mathematical vectors.
Cosine Similarity: We calculate the distance between every vector. If two phrases score above 0.85 (meaning they share high semantic overlap), we group them.
Graphing (Pass 3): When we finally build the NetworkX graph, any time the script encounters "Senators" or "the Senate", it looks at our mapping dictionary, sees they point to the same canonical base concept, and routes the edge to that single node.
This results in a much cleaner, tighter ontology that focuses on actual concepts rather than grammatical variations.
Would you like me to show you how to add a third node type to represent the relationships themselves (e.g., [President] -> appoints -> [Judges]), upgrading this from a basic property graph to a true semantic triple store?
Yes
To upgrade this into a true semantic triple store, we need to shift from extracting disconnected entities to extracting Subject-Predicate-Object (SPO) triples.
This means instead of just knowing that "Congress" and "Taxes" exist in Article I, we want the graph to explicitly state: [Congress] -> (shall have power to lay) -> [Taxes].
To achieve this, we can leverage spaCy's Dependency Parser. Rather than just looking at the types of words, the parser analyzes the grammatical structure of the sentence to find verbs (the predicate) and map them to their nominal subjects and direct objects.
Here is the updated script. Notice that we are switching to a Directed Graph (nx.DiGraph()) because semantic triples have a specific direction (A acts upon B).
The Semantic Triple Extraction Script
import spacy
import networkx as nx
from pyvis.network import Network
import re
# Load the NLP model
nlp = spacy.load("en_core_web_sm")
def extract_triples(sentence_doc):
"""
Uses dependency parsing to find Subject-Verb-Object relationships.
"""
triples = []
for token in sentence_doc:
# Look for the main action (verb)
if token.pos_ == "VERB":
subject = None
object_ = None
# Trace left children to find the subject
for child in token.lefts:
if child.dep_ in ["nsubj", "nsubjpass"]:
# Grab the core noun rather than the whole massive subtree
subject = child.lemma_.lower()
break
# Trace right children to find the direct object or prepositional object
for child in token.rights:
if child.dep_ in ["dobj", "pobj", "attr"]:
object_ = child.lemma_.lower()
break
if subject and object_:
# We have a complete triple
predicate = token.lemma_.lower()
triples.append((subject, predicate, object_))
return triples
def build_triple_graph(filepath, max_edges=300):
# Use a Directed Graph for A -> B relationships
G = nx.DiGraph()
try:
with open(filepath, 'r', encoding='utf-8') as f:
text = f.read()
except FileNotFoundError:
print(f"Error: Could not find '{filepath}'")
return None
# Regex to definitively catch document structure
header_pattern = re.compile(r'^\s*(Article\s+[IVXLCDM]+|Section\s+\d+|Amendment\s+[IVXLCDM]+)', re.IGNORECASE)
paragraphs = [p.strip() for p in text.split('\n\n') if len(p.strip()) > 10]
edge_count = 0
current_section = "Preamble"
for para in paragraphs:
if edge_count >= max_edges:
break
match = header_pattern.match(para)
if match:
current_section = match.group(1).title()
continue
# Parse the paragraph
doc = nlp(para)
# Break paragraph into individual sentences for accurate dependency parsing
for sent in doc.sents:
triples = extract_triples(sent)
for sub, pred, obj in triples:
if edge_count >= max_edges:
break
# Filter out overly generic pronouns or single-letter errors
if len(sub) > 2 and len(obj) > 2 and sub not in ["who", "which", "that"]:
# Add nodes
if not G.has_node(sub):
G.add_node(sub, title=sub, group="Subject", size=15, color="#33C1FF")
if not G.has_node(obj):
G.add_node(obj, title=obj, group="Object", size=15, color="#FFC300")
# Add the directed edge with the predicate as the label
G.add_edge(sub, obj, title=pred, label=pred, arrows="to")
edge_count += 1
return G
# Execute
graph = build_triple_graph("constitution.txt", max_edges=400)
if graph:
# Render the interactive map
net = Network(height="800px", width="100%", directed=True, bgcolor="#222222", font_color="white", select_menu=True)
net.from_nx(graph)
# Toggle physics for a stable layout
net.toggle_physics(True)
# Write to a standalone HTML file to bypass iframe sandbox rendering errors
output_file = "triple_constitution_graph.html"
net.write_html(output_file)
print(f"Graph generated successfully: {output_file}")
How the Triple Extraction Works
Dependency Mapping: Instead of just grouping nouns, the script looks for verbs (token.pos_ == "VERB").
Subject/Object Tracing: Once a verb is found, it traverses the syntax tree to the left to find the actor (nsubj), and to the right to find the receiver (dobj or pobj).
Directed Edges: In NetworkX, G.add_edge(sub, obj, label=pred) explicitly draws an arrow from the Subject node to the Object node, writing the Verb directly onto the line connecting them in the PyVis output.
This creates a highly analytical view of the text. You will start to see hubs emerge around key actors (like "president" or "congress") with arrows pointing outward detailing exactly what they are empowered to do.
Would you like me to write a function that exports these extracted triples into standard Cypher queries, so you can load this structured data directly into a dedicated graph database like Neo4j?
Yes
Exporting this data to Cypher is exactly how you transition this from a visualization script into a robust, queryable backend.
When generating Cypher queries, the best practice is to use the MERGE command instead of CREATE. MERGE acts like a "match or create" operation, ensuring that we don't create duplicate nodes for "congress" or "president" every time they appear in a new triple.
Additionally, Neo4j requires relationship types (the predicates/verbs) to be formatted as single words without spaces, conventionally in uppercase.
Here is the function to generate production-ready Cypher queries from your extracted triples.
The Cypher Export Function
You can add this function directly to the bottom of the previous script. Instead of just passing the triples to NetworkX, we will also route them through this formatter.
import re
def generate_cypher_queries(triples, output_file="constitution_cypher.cql"):
"""
Takes a list of (Subject, Predicate, Object) tuples and generates
Neo4j-compatible Cypher MERGE statements.
"""
cypher_statements = []
# Optional: Add a constraint to ensure node names are unique and lookups are fast
cypher_statements.append("CREATE CONSTRAINT IF NOT EXISTS FOR (e:Entity) REQUIRE e.name IS UNIQUE;\n")
for sub, pred, obj in triples:
# Clean up the strings to prevent Cypher syntax errors
# Escape quotes if they exist in the extracted text
clean_sub = sub.replace('"', '\\"')
clean_obj = obj.replace('"', '\\"')
# Cypher relationships cannot have spaces and are conventionally UPPER_SNAKE_CASE
# e.g., "shall have" becomes "SHALL_HAVE"
clean_pred = re.sub(r'[^a-zA-Z0-9]+', '_', pred).strip('_').upper()
# Fallback if predicate gets stripped entirely
if not clean_pred:
clean_pred = "RELATES_TO"
# 1. MERGE the Subject node
query_sub = f'MERGE (s:Entity {{name: "{clean_sub}"}})'
# 2. MERGE the Object node
query_obj = f'MERGE (o:Entity {{name: "{clean_obj}"}})'
# 3. MERGE the Relationship between them
query_rel = f'MERGE (s)-[:{clean_pred}]->(o)'
# Combine into a single transaction statement
full_query = f"{query_sub}\n{query_obj}\nWITH s, o\n{query_rel};"
cypher_statements.append(full_query)
# Write the queries to a .cql file
with open(output_file, 'w', encoding='utf-8') as f:
f.write("\n\n".join(cypher_statements))
print(f"Exported {len(triples)} Cypher queries to {output_file}")
return cypher_statements
# --- Integration Example ---
# Assuming 'all_extracted_triples' is a list of (sub, pred, obj) gathered during the parsing loop:
#
# all_extracted_triples = [
# ("congress", "lay and collect", "taxes"),
# ("president", "shall be", "commander in chief")
# ]
#
# generate_cypher_queries(all_extracted_triples)
How to use the output
Running this function will generate a .cql (Cypher Query Language) file. The contents will look like this:
CREATE CONSTRAINT IF NOT EXISTS FOR (e:Entity) REQUIRE e.name IS UNIQUE;
MERGE (s:Entity {name: "congress"})
MERGE (o:Entity {name: "taxes"})
WITH s, o
MERGE (s)-[:LAY_AND_COLLECT]->(o);
You can take this file and execute it directly in the Neo4j Browser or pipe it in via the Neo4j command-line tools. The script uses the WITH s, o clause to carry the newly merged (or matched) variables over to the relationship creation step, ensuring everything binds together correctly in the database.
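If you would rather load the file programmatically, here is a small sketch that replays it with the official neo4j Python driver (the connection details are placeholders; it assumes each statement in the file ends with a semicolon, as generated above):
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your-password"))

with open("constitution_cypher.cql", "r", encoding="utf-8") as f:
    # Each generated statement ends with a semicolon, so split on it
    statements = [s.strip() for s in f.read().split(";") if s.strip()]

with driver.session() as session:
    for stmt in statements:
        session.run(stmt)

print(f"Executed {len(statements)} statements.")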
Would you like me to show you a few example Cypher queries you could run against this Neo4j database once it is populated, to demonstrate how to pull specific analytical insights (like listing all powers explicitly granted to a specific entity)?
Yes
Once you have your triples ingested into Neo4j via those MERGE statements, the real power of the graph emerges. You are no longer just doing keyword searches; you are querying the semantic architecture of the document.
Here are a few powerful Cypher queries that illustrate how to extract analytical insights from the Constitution graph.
1. The Power Profile (Degree Centrality)
If you want to quickly see which entities are granted the most explicit actions or authorities, you can count the outgoing relationships. This essentially profiles the "weight" of a node in the document.
// Find the entities with the most outgoing actions (predicates)
MATCH (subject:Entity)-[action]->(object:Entity)
RETURN subject.name AS Actor, count(action) AS NumberOfActions, collect(type(action)) AS SpecificActions
ORDER BY NumberOfActions DESC
LIMIT 10;
Why this is useful: It immediately highlights the primary actors (like "congress" or "president") and aggregates exactly what verbs are attached to them across the entire text.
2. The Scope of Authority (First-Degree Traversal)
If you want to isolate a specific branch of government and map out exactly what objects or concepts it has authority over, you can lock in the Subject node and return the subgraph.
// What exactly does Congress do, and to what?
MATCH (c:Entity {name: "congress"})-[action]->(target:Entity)
RETURN c.name AS Actor, type(action) AS Action, target.name AS Target
ORDER BY Action ASC;
Why this is useful: It acts as a definitive list of constitutional powers (e.g., Congress -> LAY_AND_COLLECT -> taxes, Congress -> DECLARE -> war).
3. Indirect Influence (Multi-Hop Pathfinding)
Graphs excel at finding indirect relationships that a standard relational database would struggle to surface without complex JOINs. You can query for paths where one entity affects another through an intermediary.
// Find 2-hop relationships starting from the President
MATCH path = (p:Entity {name: "president"})-[*2]->(target:Entity)
RETURN [node in nodes(path) | node.name] AS EntityChain,
[rel in relationships(path) | type(rel)] AS ActionChain
LIMIT 20;
Why this is useful: It reveals chain-of-command or multi-step processes (e.g., President -> APPOINTS -> Judges -> TRY -> Cases).
4. Shared Context (Intersection Searching)
You can find concepts that are structurally bound by multiple actors. For instance, what objects or concepts do both the Senate and the House of Representatives interact with?
// Find targets that both the Senate and the House act upon
MATCH (senate:Entity {name: "senate"})-[action1]->(target:Entity)<-[action2]-(house:Entity {name: "house of representatives"})
RETURN target.name AS SharedConcept, type(action1) AS SenateAction, type(action2) AS HouseAction;
Why this is useful: It automatically flags the intersection of powers, finding the overlapping nodes where checks and balances likely occur.
Would you like me to draft a Python function that uses an LLM to take a natural language question (like "What powers are shared by the House and Senate?"), dynamically translate it into one of these Cypher queries, execute it against Neo4j, and return the answer?
Yes
To connect natural language to your Neo4j database, the most effective approach is to use a framework like LangChain. LangChain has a dedicated package (langchain-neo4j) that contains a pre-built pipeline called GraphCypherQAChain.
This chain performs a few steps automatically:
It injects your database schema (node types, relationships, properties) into an LLM's prompt.
It passes the user's natural language question to the LLM and asks it to generate a valid Cypher query.
It executes that Cypher query directly against Neo4j.
It passes the resulting data back to the LLM to formulate a conversational answer.
Prerequisites
You will need to install the LangChain Neo4j and OpenAI integration packages:
pip install langchain-neo4j langchain-openai python-dotenv
The Text-to-Cypher Script
Here is the Python function to set up the connection and query your graph dynamically:
import os
from langchain_openai import ChatOpenAI
from langchain_neo4j import Neo4jGraph, GraphCypherQAChain
# Set your API keys and Neo4j credentials
# In production, these should be loaded from a .env file
os.environ["OPENAI_API_KEY"] = "sk-your-openai-api-key"
NEO4J_URI = "bolt://localhost:7687" # Or your Neo4j AuraDB URI
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "your-password"
def query_constitution_graph(user_question):
"""
Translates a natural language question into a Cypher query,
executes it against Neo4j, and returns a natural language response.
"""
# 1. Connect to the Graph Database
# This automatically loads the schema into memory so the LLM knows what to query
graph = Neo4jGraph(
url=NEO4J_URI,
username=NEO4J_USERNAME,
password=NEO4J_PASSWORD
)
# 2. Initialize the LLM
# A temperature of 0 is critical here to ensure the LLM prioritizes
# exact Cypher syntax over creative generation.
llm = ChatOpenAI(temperature=0, model="gpt-4o")
# 3. Build the QA Chain
# verbose=True allows you to see the generated Cypher query in your console.
# allow_dangerous_requests=True is required by LangChain to acknowledge that
# the LLM will be executing read commands against your database.
chain = GraphCypherQAChain.from_llm(
llm=llm,
graph=graph,
verbose=True,
allow_dangerous_requests=True
)
# 4. Invoke the Chain with the user's question
print(f"\n--- Analyzing Question: '{user_question}' ---")
try:
response = chain.invoke({"query": user_question})
return response["result"]
except Exception as e:
return f"An error occurred while querying the graph: {e}"
# --- Execution Example ---
if __name__ == "__main__":
question = "What powers are shared by the House and the Senate?"
answer = query_constitution_graph(question)
print("\n--- Final Answer ---")
print(answer)
How to use this
The Schema Context: The Neo4jGraph object automatically pings your database and retrieves the schema (e.g., it sees that you have Entity nodes connected by an APPOINTS relationship). The LLM uses this schema to ensure it only writes Cypher queries targeting labels that actually exist.
Model Selection: The gpt-4o model is highly recommended for this specific task. Generating robust graph queries requires strong logical reasoning, and smaller or faster models often struggle with Cypher syntax errors.
Security Note: The allow_dangerous_requests=True flag is a LangChain safety mechanism. It simply acknowledges that you are allowing an LLM to run queries against your database. It is highly recommended that the database credentials you provide have read-only access, preventing the LLM from accidentally generating and running DELETE or DROP commands.
Would you like me to show you how to inject a custom "Prompt Template" into this chain so you can instruct the LLM on specific constitutional terminology or guide it to write more complex semantic queries?
Yes
Taking direct control of the prompt is exactly the right move here. When dealing with highly specialized domains like constitutional law—especially when you've deterministically built your graph using precise regex extraction—you don't want to rely on the LLM's generic guesswork for querying.
LangChain's GraphCypherQAChain allows you to inject a custom PromptTemplate via the cypher_prompt parameter. This template must include two specific variables: {schema} (which LangChain automatically populates with your Neo4j node/relationship structures) and {question} (the user's input).
Here is how you define and inject a prompt that enforces your specific graph conventions:
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain_neo4j import Neo4jGraph, GraphCypherQAChain
import os
# Set your API keys and Neo4j credentials
os.environ["OPENAI_API_KEY"] = "sk-your-openai-api-key"
NEO4J_URI = "bolt://localhost:7687"
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "your-password"
# 1. Define the Custom Cypher Prompt
# Notice how we enforce the formatting rules established by your extraction script
CYPHER_GENERATION_TEMPLATE = """
You are an expert graph database developer mapping the US Constitution.
Translate the user's natural language question into a precise Cypher query.
STRICT RULES:
1. Use only the node labels and relationship types provided in the schema below.
2. All relationship types in this graph are in UPPER_SNAKE_CASE (e.g., SHALL_HAVE, APPOINTS).
3. All entity names are lowercase. If the user asks about "Congress", query for "congress".
4. Never generate queries that mutate the database (No CREATE, SET, DELETE, or DROP). Only use MATCH and RETURN.
5. If the user asks for a path or chain of events, use variable-length paths (e.g., -[*1..3]->).
Schema:
{schema}
Question:
{question}
Cypher Query:
"""
custom_cypher_prompt = PromptTemplate(
    input_variables=["schema", "question"],
    template=CYPHER_GENERATION_TEMPLATE
)

def query_with_custom_prompt(user_question):
    graph = Neo4jGraph(url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD)
    llm = ChatOpenAI(temperature=0, model="gpt-4o")

    # 2. Inject the custom prompt into the chain
    chain = GraphCypherQAChain.from_llm(
        llm=llm,
        graph=graph,
        cypher_prompt=custom_cypher_prompt,  # <--- Injected here
        verbose=True,
        allow_dangerous_requests=True
    )

    try:
        response = chain.invoke({"query": user_question})
        return response["result"]
    except Exception as e:
        return f"Error: {e}"

# Example Execution
if __name__ == "__main__":
    q = "What explicit powers does congress have regarding taxes?"
    print(query_with_custom_prompt(q))
Why this structure works well:
Enforcing Lowercase: Because your Python script extracted core entities like "congress" and "president" in lowercase, telling the LLM this fact prevents it from running MATCH (n {name: "Congress"}) and returning empty results.
Relationship Formatting: Explicitly reminding the LLM about UPPER_SNAKE_CASE stops it from trying to query [:shall have] (which is invalid Cypher) instead of [:SHALL_HAVE].
Security & Read-Only: The STRICT RULES section acts as a secondary safeguard against prompt injection, reinforcing that the LLM should only ever execute read operations.
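The chain actually makes two LLM calls: one to write the Cypher, and one to turn the query results into prose. If you also want to control the second step, GraphCypherQAChain.from_llm accepts a qa_prompt as well. A minimal sketch, reusing the llm, graph, and custom_cypher_prompt objects defined above (the {context} and {question} variable names mirror LangChain's default QA prompt):

from langchain_core.prompts import PromptTemplate

QA_TEMPLATE = """You are a constitutional law assistant.
Answer the question using only the graph query results below.
Quote constitutional language verbatim where possible, and say so if the
results do not contain the answer.

Results:
{context}

Question:
{question}

Answer:"""

custom_qa_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=QA_TEMPLATE
)

chain = GraphCypherQAChain.from_llm(
    llm=llm,
    graph=graph,
    cypher_prompt=custom_cypher_prompt,
    qa_prompt=custom_qa_prompt,   # customizes the answer-synthesis step
    verbose=True,
    allow_dangerous_requests=True
)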
Would you like me to write a quick Streamlit wrapper for this script, so you can interact with your AI-powered Constitutional query engine through a clean, web-based chat interface instead of the terminal?
From <https://gemini.google.com/app/> Google Gemini 3.1 (Pro)
Given this full chat session, are document embeddings and indexing included as part of the solution?
To give you a direct and candid answer: No, traditional document embeddings and vector indexing were not included in the Python code blocks I provided. The step-by-step guide I walked you through focused on a "pure" graph extraction approach (Document --> LLM Extraction --> NetworkX Graph --> Leiden Clustering --> LLM Summarization). I left embeddings out to isolate and explain the complex mechanics of how the graph itself is actually built and clustered.
However, in a true, production-ready GraphRAG system (such as Microsoft’s official open-source GraphRAG library), vector embeddings are an absolute requirement. They are the bridge that connects the user's natural language question to the structured graph.
Here is exactly how embeddings and vector indexing fit into the complete architecture:
1. The Indexing Phase (Adding Embeddings)
In the pipeline we built, we generated text descriptions for every Entity, Relationship, and Community.
You pass all of that generated text through an embedding model (like OpenAI's text-embedding-3-small) to convert it into mathematical vectors, and you store those vectors in a Vector Database (like Pinecone, Weaviate, or a Postgres database with pgvector). Three kinds of text get embedded (a short sketch follows the list):
Chunk Embeddings: The raw text chunks are embedded.
Node Embeddings: The LLM-generated descriptions of entities (e.g., "The President") are embedded.
Community Embeddings: The LLM-generated summaries of your Level 1 and Level 2 clusters are embedded.
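A minimal sketch of that indexing pass. The entity_descriptions dict here is a hypothetical stand-in for the output of the earlier extraction/summarization steps; the same pattern applies to community summaries and raw chunks:

from openai import OpenAI
import numpy as np

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical output of the extraction step: node id -> LLM-generated description.
entity_descriptions = {
    "president_us": "The chief executive of the United States ...",
    "congress": "The bicameral federal legislature ...",
}

def embed_texts(texts, model="text-embedding-3-small"):
    """Return one embedding vector per input string."""
    response = client.embeddings.create(model=model, input=texts)
    return [np.array(item.embedding) for item in response.data]

node_ids = list(entity_descriptions.keys())
vectors = embed_texts([entity_descriptions[n] for n in node_ids])

# In production these vectors would go into a vector store (pgvector, Pinecone, etc.);
# here we keep them in an in-memory dict for illustration.
embedding_index = dict(zip(node_ids, vectors))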
2. The Query Phase (The Hybrid Approach)
Without embeddings, finding the right starting point in a massive graph is slow and expensive because you have to ask an LLM to read the user's prompt and "guess" which exact nodes to look up.
With embeddings, the process becomes incredibly fast and accurate:
The Semantic Match: When a user asks, "How does the executive branch stop a law?", the system embeds that question into a vector.
Finding the Entry Point: It runs a lightning-fast similarity search against your Vector Database. The vector math immediately recognizes that "executive branch stop a law" is mathematically closest to the embedded description of the [presidential_veto] node and the [Cluster 8] community summary. (A minimal version of this lookup is sketched below.)
Graph Traversal: Now that the vector search has handed you the exact "entry nodes," your system can traverse the graph (finding the connected [Congress] node and related constitutional clauses) to assemble the final context window for the LLM.
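A minimal sketch of that entry-point lookup, reusing embed_texts and the in-memory embedding_index from the previous sketch (a real deployment would delegate this to the vector database's own similarity search):

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_entry_nodes(question, index, top_k=3):
    """Embed the question and return the node ids whose descriptions are most similar."""
    query_vector = embed_texts([question])[0]
    scored = [
        (node_id, cosine_similarity(query_vector, vec))
        for node_id, vec in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

entry_nodes = find_entry_nodes("How does the executive branch stop a law?", embedding_index)
print(entry_nodes)  # e.g. [("president_us", 0.41), ("congress", 0.33)]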
Summary
Vector databases and Knowledge Graphs are not competitors in GraphRAG; they are teammates. The vector search acts as the lightning-fast "index finger" pointing to the right area of the map, and the graph provides the actual roads and context surrounding that area.
Would you like me to show you the Python code for how to generate OpenAI embeddings for our Constitution nodes and run a quick similarity search to find the right entry point?
Since embedding was not included in the original design, how was the full text of the Constitution and Bill of Rights handled in the knowledge graph?
That is a fantastic question. If the system only relies on LLM-generated summaries and extracted entities, you run a massive risk of losing the exact, verbatim text of the law—which is obviously unacceptable for a document like the Constitution.
In a pure GraphRAG architecture (like the exact methodology Microsoft developed), the raw text is never discarded. Instead, the text chunks themselves become a physical part of the Knowledge Graph.
Here is exactly how the full text of the Constitution and Bill of Rights is handled and preserved without using a vector database:
1. Raw Text Chunks Become "TextUnit" Nodes
Before any extraction happens, the raw text of the Constitution is split into chunks (e.g., Article I, Section 1; Article I, Section 2).
Instead of just holding these in temporary memory, the system creates a specific type of node in the graph called a TextUnit (or Document Node). The entire raw text of that chunk is saved as a property inside that node.
2. The "Provenance" Edges (Entity-to-Text Mapping)
When we pass that chunk to the LLM and ask it to extract entities and relationships (like we did with our Pydantic script earlier), the system automatically draws a structural edge connecting every extracted entity back to the exact TextUnit it came from.
If you looked at the database, the structure would look like this:
[Node: TextUnit {id: "chunk_05"}]
  Property text: "The President, Vice President and all civil Officers of the United States, shall be removed from Office on Impeachment for, and Conviction of, Treason, Bribery, or other high Crimes and Misdemeanors."
[Node: LegalActor {id: "president_us"}]
[Node: LegalMechanism {id: "impeachment"}]
The Graph Connections:
(president_us) → [EXTRACTED_FROM] → (chunk_05)
(impeachment) → [EXTRACTED_FROM] → (chunk_05)
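A minimal sketch of building these provenance links with networkx (the labels and ids mirror the example above; if you are writing to Neo4j instead, the same pattern maps onto MERGE statements):

import networkx as nx

g = nx.MultiDiGraph()

# The raw chunk becomes its own node, with the verbatim text stored as a property.
g.add_node(
    "chunk_05",
    node_type="TextUnit",
    text=("The President, Vice President and all civil Officers of the United "
          "States, shall be removed from Office on Impeachment for, and "
          "Conviction of, Treason, Bribery, or other high Crimes and Misdemeanors."),
)

# Entities extracted from that chunk.
g.add_node("president_us", node_type="LegalActor")
g.add_node("impeachment", node_type="LegalMechanism")

# Provenance edges pointing each entity back to its source TextUnit.
g.add_edge("president_us", "chunk_05", relation="EXTRACTED_FROM")
g.add_edge("impeachment", "chunk_05", relation="EXTRACTED_FROM")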
3. How it is Used in Querying (Local Search)
Because this mapping exists, the system can provide incredibly accurate answers grounded in the exact source text.
Let's say a user asks: "What exact crimes can the President be impeached for?"
Entity Identification: The system (using keyword matching or an LLM router) identifies that the user is asking about the [president_us] and [impeachment] nodes.
Graph Traversal: The system jumps to those nodes in the Knowledge Graph and looks at the relationships between them.
Fetching the Raw Text: Crucially, the system follows the [EXTRACTED_FROM] edge backward to find chunk_05 and pulls the exact raw text of that chunk out of the node.
Context Assembly: It bundles the raw text of chunk_05 together with the entity descriptions and passes them to the LLM to generate the final answer. (A minimal sketch of these steps follows below.)
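Continuing the networkx sketch above, a minimal version of steps 2 through 4 (the helper name is illustrative, and the final LLM call is only indicated):

def fetch_source_chunks(graph, entity_ids):
    """Follow EXTRACTED_FROM edges from each entity back to its TextUnit nodes
    and return the verbatim chunk texts keyed by chunk id."""
    chunks = {}
    for entity_id in entity_ids:
        for _, target, data in graph.out_edges(entity_id, data=True):
            if data.get("relation") == "EXTRACTED_FROM":
                chunks[target] = graph.nodes[target]["text"]
    return chunks

entry_entities = ["president_us", "impeachment"]  # from the entity-identification step
source_chunks = fetch_source_chunks(g, entry_entities)

# Assemble the context window for the final LLM call.
context = "\n\n".join(f"[{chunk_id}] {text}" for chunk_id, text in source_chunks.items())
prompt = (
    "Answer using only the constitutional text below.\n\n"
    f"{context}\n\nQuestion: What exact crimes can the President be impeached for?"
)
# `prompt` would now be sent to the LLM of your choice.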
The Ultimate Benefit: Perfect Citations
Because every piece of knowledge is physically tethered to the original TextUnit node, the LLM isn't just generating an answer from memory; it is looking directly at the exact constitutional clause.
This allows your application to provide perfect traceability. When the AI answers the user's question, the user interface can provide a clickable footnote that says, "Source: Article II, Section 4," and display the exact, unedited text of the Constitution right next to the chat window.
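One lightweight way to support those footnotes is to store a human-readable source label as another property on each TextUnit node at ingestion time. A sketch extending the graph built above (the section label is illustrative):

# Record where each chunk came from when the document is ingested.
g.nodes["chunk_05"]["source"] = "Article II, Section 4"

# At answer time, every chunk used in the context window can be surfaced
# as a citation alongside the generated answer.
citations = [
    {"label": g.nodes[chunk_id]["source"], "text": text}
    for chunk_id, text in source_chunks.items()
]
for c in citations:
    print(f'Source: {c["label"]} -> "{c["text"][:60]}..."')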
Would you like to know anything else about how GraphRAG structures data, or perhaps how you would evaluate the accuracy of a system like this?
From <https://gemini.google.com/app/> Google Gemini 3.1 (Pro)
