#4267 Structural RAG
PageIndex: Reasoning-Based Vectorless RAG Framework
Describe an alternative to embedding and indexing for RAG. Consider a plain-text TOC for topics within a document.
While vector embeddings are the "gold standard" for RAG right now, they often feel like using a sledgehammer to crack a nut—especially for single, long documents where semantic search can lose the forest for the trees.
An excellent alternative is Structural Metadata Navigation (or "Map-Guided Retrieval"). Instead of turning text into high-dimensional math, you treat the document like a library with a highly efficient librarian.
The Concept: Structural Metadata Navigation
In this approach, you use a Plain Text Table of Contents (TOC) as a high-level "map" for the LLM. Instead of searching for similar meaning, you search for relevant locations.
How it Works
The Mapmaker: You (or a lightweight LLM pass) extract a detailed TOC with headers, sub-headers, and page/paragraph offsets.
The Router: When a user asks a question, you first send only the TOC and the question to the LLM.
The Fetcher: The LLM identifies which sections (e.g., "Section 4.2: Revenue Risks") are likely to contain the answer.
The Synthesis: The system pulls only those specific text blocks into the final prompt for the answer.
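The four steps above can be sketched as a pair of prompt builders. This is a minimal sketch, not a full pipeline: the section titles and content are illustrative, and `call_llm` (commented out) stands in for whatever model client you use.

```python
# Minimal sketch of the TOC-routing workflow. Section content is illustrative.
sections = {
    "4.1 Revenue Overview": "Revenue grew 12% year over year...",
    "4.2 Revenue Risks": "Key risks include customer concentration...",
}

def build_router_prompt(toc_titles, question):
    """Step 2 (The Router): send only the TOC and the question."""
    toc = "\n".join(toc_titles)
    return (f"Given this Table of Contents:\n{toc}\n\n"
            f"Which sections should I read to answer: '{question}'?\n"
            "Return only the section titles.")

def build_answer_prompt(selected_titles, question):
    """Step 4 (The Synthesis): feed only the chosen sections."""
    body = "\n\n".join(f"{t}:\n{sections[t]}" for t in selected_titles)
    return f"Using the text below, answer: {question}\n\n{body}"

router_prompt = build_router_prompt(sections.keys(), "What are the main revenue risks?")
# selected = call_llm(router_prompt)   # hypothetical LLM call, e.g. ["4.2 Revenue Risks"]
answer_prompt = build_answer_prompt(["4.2 Revenue Risks"], "What are the main revenue risks?")
```

The router call is cheap because it sees only the TOC, never the document body.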
Why This Beats Embeddings (In Specific Cases)
Vector databases are great for searching across 10,000 documents, but for a single 200-page manual, they often struggle with context fragmentation.
| Feature | Vector-Based RAG | TOC-Based (Structural) |
| --- | --- | --- |
| Precision | Hits keywords/concepts; can be "noisy." | Hits specific chapters; retains context. |
| Setup | Requires embedding models & Vector DBs. | Requires a parser and a simple text file. |
| Explainability | "The math said these chunks matched." | "The answer is in Chapter 3: Logistics." |
| Cost | High (Embedding tokens + DB hosting). | Low (Minimal "routing" tokens). |
| Relationships | Struggles with "How does A affect B?" | Excels at connecting related sections. |
Implementation Strategy: The "TOC-to-Text" Workflow
If you want to move away from embeddings, here is how you’d structure your data:
1. The Schema
Store your document as a JSON object where the keys are the TOC headers:
```json
{
  "1.0 Introduction": "Text content...",
  "2.0 System Requirements": "Text content...",
  "2.1 Hardware": "Text content..."
}
```
2. The Two-Step Prompt
Step 1 (The Search): "Given this Table of Contents: [TOC], which sections should I read to answer: '[User Query]'? Return only the section titles."
Step 2 (The Answer): "Using the text from [Selected Sections], answer the following: [User Query]."
Note: This works exceptionally well in 2026 because modern LLMs have massive context windows. You can often skip the "fetching" part and just feed the TOC + the entire document if it’s under ~1 million tokens, using the TOC as a "system hint" to improve focus.
When to Use This
Legal/Technical Docs: Where the hierarchy of information is strict and meaningful.
Coding Repos: Using a file tree as your "TOC" instead of embedding every function.
Books/Narratives: Where events follow a chronological order that vectors often ignore.
Would you like me to help you write a Python script that automatically extracts a TOC from a PDF or Markdown file to get this started?
This needs to support more than a 1st-level toc, but also a table for combined subjects: A-x-B can be found at
To handle multi-level hierarchies and cross-referenced subjects (the "A-x-B" problem), we move from a simple list to a Relational Document Map.
This approach is essentially a "Knowledge Map" that mimics how an expert human uses a textbook's index combined with its table of contents. It solves the biggest weakness of standard RAG: the inability to find information that lives at the intersection of two different topics.
1. The Multi-Level Nested TOC
A flat TOC loses the "parental" context. By using a nested Markdown-style TOC, you provide the LLM with the logical flow of the document.
Example Structure:
```
1.0 Enterprise Security
  1.1 Network Protocols
    1.1.1 Firewall Configurations (Ref: p. 12-15)
    1.1.2 VPN Encryptions (Ref: p. 16-20)
  1.2 Physical Security
    1.2.1 Biometric Access (Ref: p. 21)
```
This hierarchy allows the LLM to understand that "VPN Encryptions" is a sub-topic of "Network Protocols," which belongs to "Enterprise Security."
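That nesting can be built programmatically. Here is a small sketch that converts flat `(level, id, title)` rows, like the ones a header parser emits, into a nested TOC tree; the input rows are illustrative.

```python
# Sketch: turn flat (level, id, title) rows into a nested TOC dict so each
# entry keeps its "parental" context. Input rows are illustrative.
def nest_toc(rows):
    root = {"children": []}
    stack = [(0, root)]  # (level, node) pairs from root down to the current branch
    for level, toc_id, title in rows:
        node = {"id": toc_id, "title": title, "children": []}
        # Pop back up until the top of the stack is this entry's parent
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root["children"]

rows = [
    (1, "1.0", "Enterprise Security"),
    (2, "1.1", "Network Protocols"),
    (3, "1.1.1", "Firewall Configurations"),
    (3, "1.1.2", "VPN Encryptions"),
    (2, "1.2", "Physical Security"),
]
tree = nest_toc(rows)
```

Serializing `tree` as indented JSON gives the LLM the same parent-child signal the Markdown outline does.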
2. The Intersection Index (The A-x-B Table)
Standard TOCs are linear. An Intersection Index (or "Cross-Reference Matrix") captures relationships that don't fit in a single chapter. This is your "Subject A meets Subject B" solution.
You can provide this to the LLM as a structured table or a lookup list.
| Subject A (Theme) | Subject B (Department) | Intersection Location (A-x-B) |
| --- | --- | --- |
| Data Privacy | Marketing | Section 4.2: Lead Gen Compliance |
| Data Privacy | Engineering | Section 8.1: Database Encryption |
| Budgeting | HR | Section 2.5: Recruitment Costs |
| Budgeting | Engineering | Section 2.9: Server Infrastructure |
Why this is powerful:
If a user asks, "How does our data privacy policy affect our marketing email list?", a vector search might just pull the general "Data Privacy" chapter. But your Intersection Index points the LLM directly to Section 4.2, which contains the specific answer for that combination.
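In code, the intersection index can be as simple as a dict keyed by an unordered subject pair. A minimal sketch, using the table's example entries:

```python
# Sketch: an intersection index keyed by an unordered subject pair.
# frozenset makes the lookup order-independent (A-x-B equals B-x-A).
intersections = {
    frozenset({"Data Privacy", "Marketing"}): "Section 4.2: Lead Gen Compliance",
    frozenset({"Data Privacy", "Engineering"}): "Section 8.1: Database Encryption",
    frozenset({"Budgeting", "HR"}): "Section 2.5: Recruitment Costs",
}

def lookup_intersection(subject_a, subject_b):
    """Return the section holding the A-x-B intersection, or None."""
    return intersections.get(frozenset({subject_a, subject_b}))

print(lookup_intersection("Marketing", "Data Privacy"))
# Section 4.2: Lead Gen Compliance
```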
3. The Retrieval Logic: "The Librarian Agent"
Instead of one big "search," you use a lightweight Routing Prompt. You give the LLM the TOC and the Intersection Table and ask it to play "Librarian."
The Multi-Step Flow:
Identify Entities: The LLM looks at the user query and extracts the core subjects (e.g., "Privacy" and "Marketing").
Consult the Intersection Table: It checks if an A-x-B entry exists for those subjects.
Consult the Multi-Level TOC: It checks if those subjects appear as standalone headers or sub-headers.
Draft a "Reading List": It returns a list of specific sections (e.g., "Read 1.1.2 and 4.2").
Inject Text: Your system pulls the text for those specific sections and provides the final answer.
4. Example Implementation (JSON Format)
Storing this in a structured way makes it programmatically easy to "hop" between topics.
```json
{
  "document_map": {
    "toc": {
      "1.0": {"title": "Security", "children": ["1.1", "1.2"]},
      "1.1": {"title": "Network", "content_key": "sec_net_01"}
    },
    "intersections": [
      {
        "subjects": ["Security", "Remote Work"],
        "location": "Section 5.4: VPN Policy",
        "relevance": "High"
      }
    ]
  }
}
```
The "2026 Edge": Contextual Precision
In the current LLM landscape, "Dumb RAG" (just throwing chunks at a model) is being replaced by "Intent-Based Retrieval."
By using a TOC + Intersection Table, you are essentially giving the LLM a GPS for the document instead of a flashlight in a dark room. This reduces "hallucinations" significantly because the model knows exactly why it is looking at a specific piece of text.
Would you like me to generate a template for a "Cross-Reference Indexer" script that can scan a document and suggest these A-x-B intersections for you?
Yes
This Python script template provides a framework for building a Relational Document Map. Instead of using embeddings, it parses a document into sections, extracts a multi-level Table of Contents (TOC), and then scans for "A-x-B" intersections by identifying where major subjects overlap.
The Cross-Reference Indexer Template
This script assumes your document is in Markdown (which is standard for LLM-friendly RAG), but the logic can be adapted for plain text or PDFs.
```python
import re
import json


class DocumentMapper:
    def __init__(self, file_path):
        self.file_path = file_path
        self.sections = {}   # { "1.1 Title": "Content..." }
        self.toc = []        # List of headers for hierarchy
        # Define your "A" and "B" subjects
        self.subjects = ["Privacy", "Security", "Budget", "Marketing", "Engineering"]
        self.intersections = []

    def parse_markdown(self):
        """Extracts a multi-level TOC and splits content into sections."""
        with open(self.file_path, 'r') as f:
            content = f.read()
        # Regex for Markdown headers (e.g., #, ##, ###)
        header_pattern = re.compile(r'^(#{1,6})\s+(.*)$', re.MULTILINE)
        matches = list(header_pattern.finditer(content))
        for i, match in enumerate(matches):
            level = len(match.group(1))
            title = match.group(2).strip()
            start = match.end()
            end = matches[i + 1].start() if i + 1 < len(matches) else len(content)
            section_content = content[start:end].strip()
            section_id = f"{level}.{i} {title}"  # Unique key
            self.sections[section_id] = section_content
            self.toc.append({"level": level, "title": title, "id": section_id})

    def find_intersections(self):
        """The A-x-B engine: finds where two subjects meet in the same section."""
        for section_id, content in self.sections.items():
            found_subjects = [s for s in self.subjects if s.lower() in content.lower()]
            # If at least 2 subjects appear, we have an A-x-B intersection
            if len(found_subjects) >= 2:
                # Create pairwise combinations (A-x-B, A-x-C, etc.)
                for i in range(len(found_subjects)):
                    for j in range(i + 1, len(found_subjects)):
                        self.intersections.append({
                            "subject_a": found_subjects[i],
                            "subject_b": found_subjects[j],
                            "location": section_id,
                            "snippet": content[:100] + "..."  # For context
                        })

    def generate_map(self):
        """Outputs the final structured map for the LLM."""
        return {
            "toc": self.toc,
            "cross_references": self.intersections
        }


# --- Execution ---
# mapper = DocumentMapper("enterprise_manual.md")
# mapper.parse_markdown()
# mapper.find_intersections()
# print(json.dumps(mapper.generate_map(), indent=2))
```
How to Use This with an LLM
Once you run the script, you feed the Map (the JSON output) to the LLM as part of your system prompt.
Step 1: The "Router" Prompt
"You are a librarian. Using the provided Table of Contents and Cross-Reference Table, identify which specific sections contain information about the user's query: [USER QUERY]. Look for A-x-B intersections first."
Step 2: The "Extraction" Prompt
Your system then fetches the section_content from the Python dictionary for only those IDs and feeds them to the LLM:
"Based on the text from [SECTION 4.2] and [SECTION 8.1], answer the following..."
Why this works for "A-x-B"
Logical Context: Unlike vector chunks, which might only see "Marketing" or "Privacy," this script identifies the exact header where they collide.
Structural Integrity: If the information is in a sub-header (Level 3), the LLM knows it is part of a parent topic (Level 1) thanks to the `level` key in the TOC list.
Low Latency: This is pure string matching and regex. It’s nearly instant compared to generating and storing 1536-dimensional vectors.
Would you like me to refine the keyword detection to use a "Subject Dictionary" that includes synonyms (e.g., 'Privacy' = 'GDPR', 'Compliance')?
What about entity-relationship-entity mapping of a document?
Now we’re getting into the "DNA" of the document. What you’re describing is essentially GraphRAG (Graph-based Retrieval-Augmented Generation).
While a TOC maps the structure and an A-x-B table maps the topics, Entity-Relationship-Entity (ERE) mapping extracts the logic. It turns a document into a Knowledge Graph where information isn't just a list of sentences, but a web of "Triplets."
1. The Core Concept: The Triplet
An ERE map breaks a document down into atoms called triplets:
[Subject (Entity)] --{Relationship}--> [Object (Entity)]
Example from a Manual: "The Admin Panel (Entity A) configures (Relationship) User Permissions (Entity B)."
A-x-B logic: This specific triplet tells you exactly how A and B interact, which a TOC or keyword search might miss.
2. Why ERE is Superior for Complex Queries
Standard RAG often fails at "multi-hop" questions. If a user asks, "Who can change the firewall settings?", a standard search looks for "firewall settings."
An ERE map allows the LLM to follow a trail:
Firewall Settings <-- [Managed by] <-- Admin Panel
Admin Panel <-- [Accessed by] <-- IT Manager
Result: "The IT Manager."
3. The Implementation: "The Document Graph"
To build this, you don't just index text; you index nodes and edges.
The Mapping Schema
Instead of a flat table, you store your document as a collection of nodes:
| Entity A (Subject) | Relationship | Entity B (Object) | Source Section |
| --- | --- | --- | --- |
| Marketing Team | Accesses | Customer Database | Section 4.2 |
| Customer Database | Stored in | AWS S3 | Section 8.1 |
| AWS S3 | Encrypted via | KMS Keys | Section 8.3 |
The Retrieval Workflow
Extraction: You use an LLM to scan a section and output a list of triplets (JSON/CSV).
Graph Construction: You store these in a graph database (like Neo4j) or even a simple relational table.
Traversal: When a user asks a question, the system finds the starting entity and "walks" the graph to find the answer.
4. Comparing the Three Approaches
| Approach | Best For... | "The Vibe" |
| --- | --- | --- |
| TOC (Nested) | Navigation | "Where is the chapter on X?" |
| A-x-B Table | Discovery | "What happens when X meets Y?" |
| ERE Mapping | Reasoning | "How does X lead to Y through Z?" |
5. Conceptual Python Extractor (LLM-Driven)
To do this effectively, you need an LLM to act as the "Extractor." Here is how you’d structure the prompt for your script:
```python
# Conceptual logic for an ERE extractor
PROMPT = """
Extract all key entities and their relationships from the following text.
Format the output as a JSON list of triplets: {"subject": "", "relationship": "", "object": ""}.

Text: "The Marketing department uses the CRM to store customer data, which is then backed up to the Cloud."
"""

# Expected output:
# [
#   {"subject": "Marketing", "relationship": "uses", "object": "CRM"},
#   {"subject": "CRM", "relationship": "stores", "object": "Customer Data"},
#   {"subject": "Customer Data", "relationship": "backed up to", "object": "Cloud"}
# ]
```
The "2026" Reality: Hybrid Mapping
The most robust systems today don't pick just one. They use a Hybrid Document Map:
The TOC tells the system the physical location.
The ERE Graph tells the system the logical meaning.
The LLM uses both to synthesize the final answer.
This creates a "self-healing" RAG system. If the graph is missing a connection, the TOC helps the LLM find the right page to "re-read" and find the missing link.
Would you like me to show you how to structure a prompt that performs this ERE extraction while maintaining a link back to the TOC section headers?
Yes
To make this work, we use a technique called Contextual Anchoring. You provide the LLM with the specific "Section ID" from your Table of Contents (TOC) so that every logical relationship it finds is "pinned" to a physical location in the document.
Here is a prompt structure designed for a high-reasoning LLM (like Gemini 1.5 Pro or GPT-4o) to turn raw text into a TOC-Linked Knowledge Graph.
1. The "Anchor-Extraction" Prompt Template
You would send this prompt for each section identified in your earlier TOC parsing script.
System Prompt:
You are a Knowledge Graph Architect. Your task is to extract atomic facts from the provided text in the form of Entity-Relationship-Entity (ERE) triplets.
Rules:
Contextual Anchoring: Every triplet MUST be associated with the provided `Section_ID`.
Entity Consistency: Use clear, normalized names for entities (e.g., use "AWS S3" instead of "the cloud storage bucket").
Relationship Precision: Use active verbs (e.g., "encrypts," "requires," "manages").
A-x-B Focus: Pay special attention to relationships that connect different domains (e.g., Business Policy --> Technical Implementation).
User Input:
Section Context: Section 4.2: Marketing Data Compliance
Text Content: "The Marketing department uses the Segment platform to collect user IDs, which must be hashed using SHA-256 before being stored in the Snowflake data warehouse to meet GDPR requirements."
2. The Structured Output (JSON)
The LLM should return a JSON array that you can then save into a dictionary or a Graph Database.
```json
[
  {
    "subject": "Marketing Department",
    "relationship": "uses",
    "object": "Segment Platform",
    "section_id": "4.2",
    "tags": ["Department", "Software"]
  },
  {
    "subject": "User IDs",
    "relationship": "must be hashed via",
    "object": "SHA-256",
    "section_id": "4.2",
    "tags": ["Data", "Security Standard"]
  },
  {
    "subject": "Segment Platform",
    "relationship": "stores data in",
    "object": "Snowflake",
    "section_id": "4.2",
    "tags": ["Software", "Infrastructure"]
  },
  {
    "subject": "Data Storage",
    "relationship": "must meet",
    "object": "GDPR Requirements",
    "section_id": "4.2",
    "tags": ["Compliance"]
  }
]
```
3. Integrating this into your Workflow
By combining the TOC Parser with this ERE Extractor, your RAG system now has three layers of "intelligence" to query:
The Map (TOC): "Where is the Marketing chapter?"
The Index (A-x-B): "Where do Marketing and GDPR overlap?"
The Logic (ERE): "How does Marketing handle GDPR? (Marketing --> hashes --> User IDs --> meets --> GDPR)."
The "Librarian" Logic for 2026
When a user asks: "What encryption do we use for marketing data?"
Search Graph: The system finds the node `Marketing` and follows the edges to `User IDs` --> `SHA-256`.
Verify via TOC: It sees these nodes are anchored to `Section 4.2`.
Retrieve Text: It pulls the raw text of `Section 4.2` to provide a human-readable answer.
Synthesize: "According to Section 4.2, we use SHA-256 hashing for marketing data to comply with GDPR."
4. Why this is the "No-Embedding" Future
Precision: Zero "semantic drift." You aren't guessing if two vectors are close; you are following a defined logical path.
Auditability: Every claim the AI makes can be traced back to a specific `section_id` in your TOC.
Token Efficiency: You only feed the LLM the exact sections it needs, rather than the "top 5 most similar chunks," which often contain 80% noise.
Would you like me to write a Python function that converts this JSON output into a searchable NetworkX graph object?
Yes
Building a graph using NetworkX transforms your document from a static file into a navigable "brain." With this, you can perform Multi-Hop Retrieval, allowing the AI to follow a chain of logic across different sections of your document.
The Graph-RAG Builder
This script takes the JSON output from your ERE extractor and builds a directed graph. I've included a Multi-Hop Search function to demonstrate how you can find connections between disparate topics.
```python
import networkx as nx
import json


class DocumentGraph:
    def __init__(self):
        # A directed multigraph: relationships have a direction, and two
        # entities may be linked by more than one relationship
        self.G = nx.MultiDiGraph()

    def build_from_json(self, json_data):
        """Populates the graph with entities and relationships."""
        triplets = json.loads(json_data)
        for item in triplets:
            # Add nodes with metadata
            self.G.add_node(item['subject'], type=item['tags'][0] if item.get('tags') else "Entity")
            self.G.add_node(item['object'], type=item['tags'][-1] if item.get('tags') else "Entity")
            # Add the edge with the relationship and source section as attributes
            self.G.add_edge(
                item['subject'],
                item['object'],
                relation=item['relationship'],
                section=item['section_id']
            )
        print(f"Graph built with {self.G.number_of_nodes()} nodes and {self.G.number_of_edges()} edges.")

    def find_connection(self, start_node, end_node):
        """Finds the logical path (logic chain) between two entities."""
        try:
            path = nx.shortest_path(self.G, source=start_node, target=end_node)
            logic_chain = []
            for i in range(len(path) - 1):
                u, v = path[i], path[i + 1]
                # Get edge data (a MultiDiGraph may hold several edges between the same nodes)
                edge_data = self.G.get_edge_data(u, v)[0]
                logic_chain.append(
                    f"[{u}] --({edge_data['relation']})--> [{v}] "
                    f"(Source: Section {edge_data['section']})"
                )
            return logic_chain
        except nx.NetworkXNoPath:
            return ["No direct logical connection found."]
        except nx.NodeNotFound as e:
            return [f"Entity not found: {str(e)}"]


# --- Example Usage ---
data = """
[
    {"subject": "Marketing", "relationship": "uses", "object": "Segment", "section_id": "4.2"},
    {"subject": "Segment", "relationship": "stores data in", "object": "Snowflake", "section_id": "4.2"},
    {"subject": "Snowflake", "relationship": "is managed by", "object": "IT Ops", "section_id": "8.1"}
]
"""

doc_graph = DocumentGraph()
doc_graph.build_from_json(data)

# Query: How is Marketing connected to IT Ops?
print("\nLogic Chain Discovery:")
for step in doc_graph.find_connection("Marketing", "IT Ops"):
    print(step)
```
Why this is a "Power User" Move
By using NetworkX, you gain several capabilities that standard RAG lacks:
Shortest Path Analysis: As shown above, you can find the most direct link between two entities, even if they are mentioned hundreds of pages apart.
Centrality Measures: You can programmatically find the "most important" entities in your document (the nodes with the most edges), which is great for auto-generating summaries.
Community Detection: You can identify "clusters" of information—for example, seeing that while "Marketing" and "Legal" are different chapters, they form a tight cluster around the "Customer Data" node.
How to use this for Retrieval
When a user asks a question, your system can:
Extract the Subject and Goal from the question.
Run `find_connection(Subject, Goal)`.
Collect all the `section_id` values from the path.
Feed only those specific sections to the LLM for the final answer.
Would you like me to show you how to visualize this graph using a library like PyVis, so you can actually "see" your document's logical web?
Yes
Visualizing your document as an interactive web graph allows you to see the "clusters" of information and how different sections are physically and logically linked.
While NetworkX handles the math and logic of the graph, PyVis creates an interactive HTML file where you can drag nodes, zoom in on specific clusters, and hover over edges to see the relationships and source sections.
The Interactive Graph Visualizer
This script extends the previous logic to generate an interactive document_map.html file.
```python
from pyvis.network import Network
import networkx as nx


def visualize_document_graph(nx_graph, filename="document_map.html"):
    """
    Converts a NetworkX graph into an interactive PyVis HTML visualization.
    """
    # Initialize the PyVis network (standalone HTML, dark theme)
    net = Network(height="750px", width="100%", bgcolor="#222222",
                  font_color="white", directed=True)

    # Load the NetworkX graph into PyVis
    net.from_nx(nx_graph)

    # Enable physics for a "clean" force-directed layout
    net.toggle_physics(True)

    # Add tooltips and styling based on node attributes
    for node in net.nodes:
        node_id = node['id']
        # Fall back to a generic type if the NX node has none
        node_type = nx_graph.nodes[node_id].get('type', 'Entity')
        node['title'] = f"Type: {node_type}"  # Hover tooltip
        node['label'] = node_id
        # Color coding by type
        if node_type == "Department":
            node['color'] = "#ff4b4b"
        elif node_type == "Software":
            node['color'] = "#1f77b4"
        elif node_type == "Compliance":
            node['color'] = "#2ca02c"
        else:
            node['color'] = "#9467bd"

    # Add edge labels (the relationships)
    for edge in net.edges:
        # Fetch the 'relation' attribute from the NetworkX edge
        # Note: a MultiDiGraph stores edges in a dict keyed by edge index
        u, v = edge['from'], edge['to']
        edge_data = nx_graph.get_edge_data(u, v)[0]
        edge['label'] = edge_data['relation']
        edge['title'] = f"Source: Section {edge_data['section']}"  # Hover tooltip
        edge['arrows'] = 'to'

    # Save the interactive map
    net.save_graph(filename)
    print(f"Visualization saved to {filename}. Open this file in your browser.")


# --- Example Usage (assuming 'doc_graph.G' from the previous step) ---
# visualize_document_graph(doc_graph.G)
```
How to Interpret the Visual Map
Hub Nodes: Entities with many arrows pointing to them (e.g., "Customer Data" or "AWS") are your document's "critical infrastructure." If these sections change, the downstream effects are visible immediately on the map.
Isolated Islands: If you see a cluster of nodes that isn't connected to the rest of the graph, it indicates a topic in your document that is "siloed." This is often where standard RAG fails because it can't find a logical path to that information from other topics.
Bridge Edges: Edges that connect two large clusters (e.g., an edge connecting "Legal Requirements" to "Database Schema") represent the A-x-B intersections discussed earlier. These are the "logic bridges" that allow the LLM to answer cross-functional questions.
Implementation Tip: The "Node-to-Text" Feature
In a production environment, you can make the title attribute of the nodes or edges a clickable link. When clicked, it could trigger a function that displays the raw text from that specific TOC section in a side panel. This creates a "Logical Reader" where you navigate the document via its concepts rather than scrolling through pages.
Would you like to explore how to automate the "Entity Normalization" step so that "The Marketing Dept" and "Marketing" don't end up as two separate nodes?
How would I use the data to "search" a document for relevant content supporting the user's prompt?
To "search" without embeddings, you replace Semantic Similarity (calculating distances between vectors) with Logical Pathfinding.
Instead of asking "What text looks like this question?", you ask, "What nodes and sections connect the entities in this question?"
Here is the 4-step execution pipeline to turn your graph and TOC data into a high-precision search engine.
Step 1: Intent & Entity Extraction
When a user submits a prompt, you don't search with the whole sentence. You use a lightweight LLM call to extract the Subject, the Object, and the Relationship they are looking for.
User Prompt: "Does our marketing data storage comply with the new GDPR hashing standards?"
LLM Extraction:
Entities: ["Marketing Data", "GDPR", "Hashing"]
Intent: Verify Compliance/Relationship
Step 2: The "Graph Walk" (Multi-Hop Retrieval)
Now, you use your NetworkX graph to find the logical bridge between those entities. This is where you find the A-x-B connection.
```python
# The system "walks" the graph
path = doc_graph.find_connection("Marketing", "GDPR")

# Resulting logic chain:
# [Marketing] --(uses)--> [Segment] --(stores in)--> [Snowflake]
# [Snowflake] --(subject to)--> [GDPR Hashing Standards]
```
The "Search" result here isn't text—it's a list of Section IDs: ['4.2', '8.1', '9.5'].
Step 3: Structural Fetching (TOC-to-Text)
Once you have the Section IDs from the graph, you go back to your Section Dictionary (the one created by your TOC parser) and pull the raw text for only those sections.
Why this is better than "Chunking":
In standard RAG, you might get 5 disconnected "chunks" of 500 characters. In this system, you get the entirety of Section 4.2, ensuring the LLM has the full context of the paragraph, not just a snippet.
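The fetching step itself is a plain dictionary lookup over the section store your TOC parser built. A minimal sketch, with illustrative section IDs and text:

```python
# Sketch of Step 3: fetch whole sections by ID from the TOC parser's
# section dictionary. Section IDs and text are illustrative.
sections = {
    "4.2": "Marketing uses the Segment platform to collect user IDs...",
    "8.1": "Snowflake serves as the primary data warehouse...",
    "9.5": "GDPR hashing standards require SHA-256...",
}

def fetch_sections(section_ids, sections):
    """Return the full text of each requested section, tagged with its ID."""
    return [f"[Section {sid}]\n{sections[sid]}" for sid in section_ids if sid in sections]

context_blocks = fetch_sections(["4.2", "8.1"], sections)
```

Because each block is a whole section rather than a fixed-size chunk, no sentence is ever cut in half at a chunk boundary.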
Step 4: The Final Synthesis (Grounded Answer)
Finally, you pass the specific text blocks to the LLM with a "Grounded" prompt.
The Prompt sent to the LLM:
"Using only the provided text from Section 4.2 (Marketing Ops) and Section 8.1 (Data Infrastructure), answer the following: [USER PROMPT].
Retrieved Context:
[Text of 4.2...]
[Text of 8.1...]
If the answer is not in these sections, state that the relationship is undefined in the document."
Summary of the "Search" Workflow
| Step | Vector RAG (The "Old" Way) | Structural RAG (Your Way) |
| --- | --- | --- |
| Input | Raw prompt embedded as math. | Entities extracted as keywords. |
| Lookup | Nearest neighbors in a vector space. | Shortest path in a Knowledge Graph. |
| Accuracy | High risk of "similar but irrelevant" hits. | High precision; only follows defined links. |
| Attribution | "I think it's on page 4." | "According to the connection in Section 4.2..." |
The "Fallback" Strategy
If the Graph Walk fails (i.e., there is no direct path between "Marketing" and "GDPR"), you fall back to your Multi-Level TOC:
Search the TOC for the keyword "Marketing."
Search the TOC for "GDPR."
Provide the LLM with the Introduction and Summary of both chapters and ask it to find the link itself.
This layered approach ensures you never "hallucinate" a connection that doesn't exist.
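The TOC fallback can be a plain keyword scan over the header titles. A minimal sketch, with illustrative TOC entries:

```python
# Sketch of the TOC-keyword fallback: when no graph path exists, scan the
# TOC titles for each entity and return the matching section IDs instead.
toc = [
    {"id": "4.0", "title": "Marketing Operations"},
    {"id": "4.2", "title": "Lead Gen Compliance"},
    {"id": "9.0", "title": "GDPR Overview"},
]

def toc_fallback(entities, toc):
    """Return IDs of TOC entries whose titles mention any query entity."""
    hits = []
    for entry in toc:
        if any(e.lower() in entry["title"].lower() for e in entities):
            hits.append(entry["id"])
    return hits

print(toc_fallback(["Marketing", "GDPR"], toc))
# ['4.0', '9.0']
```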
Would you like me to show you how to write the "Intent Extractor" prompt that turns a user's natural language into the list of entities for the Graph search?
Yes
The Intent Extractor is the bridge between a messy human question and your clean, structured Graph. Its job is to act as a "Query Planner" that translates natural language into a set of coordinates (Nodes and Relationships) that your search engine can follow.
In 2026, the best practice is to use Schema-Guided Extraction. You don't just ask the LLM for "entities"; you provide a list of your existing Graph Nodes so the LLM can "normalize" the user's terms to match your data exactly.
1. The Intent Extractor Prompt
This prompt is designed to return a structured JSON object that tells your Python script exactly which nodes to "start" and "end" at.
System Prompt:
You are a GraphRAG Query Planner. Your goal is to decompose a user's question into a logical search path.
Known Entities (The Map): [INSERT YOUR LIST OF UNIQUE NODES HERE, e.g., "Marketing", "GDPR", "Snowflake", "IT Ops"]
Task:
Identify the Primary Subject (Where the search starts).
Identify the Target Object (What the user wants to know about).
Map the user's terms to the Known Entities. If a term is an alias (e.g., "The Data Warehouse"), map it to the correct Node (e.g., "Snowflake").
Output Format (JSON):
```
{
  "start_node": "Entity Name",
  "end_node": "Entity Name",
  "intent": "What is the specific relationship being questioned?",
  "requires_multi_hop": true/false
}
```
2. Python Implementation: The Search Controller
This script connects the LLM's output to your NetworkX graph logic.
```python
import json
import re


def search_document(user_query, doc_graph, llm_client):
    """
    The main search loop: Extract Intent -> Traverse Graph -> Return Sections.
    """
    # 1. Get the list of all nodes currently in your graph to guide the LLM
    known_entities = list(doc_graph.G.nodes)

    # 2. Call the LLM to extract the intent (conceptual LLM call)
    # response = llm_client.chat(prompt=INTENT_PROMPT, query=user_query, context=known_entities)
    # For the demo, assume the LLM returns this JSON:
    intent_json = {
        "start_node": "Marketing",
        "end_node": "IT Ops",
        "intent": "Identify data management ownership"
    }

    # 3. Perform the graph search
    print(f"Searching for logical path: {intent_json['start_node']} -> {intent_json['end_node']}")
    logic_chain = doc_graph.find_connection(intent_json['start_node'], intent_json['end_node'])

    if "No direct logical connection" in logic_chain[0]:
        return "I found no logical link between these topics in the document."

    # 4. Extract the Section IDs from the path by parsing the
    #    'Source: Section X.X' strings in the logic chain
    required_sections = set()
    for step in logic_chain:
        match = re.search(r"Section ([\d\.]+)", step)
        if match:
            required_sections.add(match.group(1))

    return sorted(required_sections)


# --- Example Output ---
# sections_to_read = search_document("Who in IT manages the marketing data?", my_graph, my_llm)
# print(f"To answer this, read these sections: {sections_to_read}")
# >> ['4.2', '8.1']
```
3. Handling "A-x-B" Normalization
The biggest risk in this system is a "Naming Mismatch" (e.g., the user says "The Marketing Team" but your graph node is just "Marketing").
Three ways to handle this in 2026:
The LLM Layer (Easiest): As shown above, you give the LLM the list of nodes and tell it to "pick the closest match."
Levenshtein Distance: A simple Python check that calculates the "edit distance" between words. If "Markting" is 90% similar to "Marketing," you treat them as the same.
Vector Anchoring (The Hybrid): You use a tiny local embedding model just for the Node Names (not the whole document). You embed the 50-100 node titles and find the closest mathematical match to the user's entity.
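Option 2 needs no extra dependencies: Python's standard library already ships a fuzzy matcher. This sketch uses `difflib.get_close_matches`, which ranks candidates by a similarity ratio rather than a strict Levenshtein distance; the node list is illustrative.

```python
# Sketch of fuzzy entity normalization using only the standard library.
from difflib import get_close_matches

known_nodes = ["Marketing", "IT Ops", "Snowflake", "Segment", "GDPR"]

def normalize_entity(term, nodes, cutoff=0.6):
    """Map a user's term to the closest known graph node, or None if nothing is close enough."""
    matches = get_close_matches(term, nodes, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(normalize_entity("Markting", known_nodes))
# Marketing
```

Tune `cutoff` to taste: too low and "IT" matches everything, too high and common typos fall through to the LLM layer.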
4. Why this "Search" beats Ctrl+F
Standard search looks for the word. This system looks for the logical connection.
If a user asks about "Marketing" and "IT Ops," and those two words never appear in the same sentence, a standard search will fail. But your Graph will see the "Segment" and "Snowflake" nodes sitting between them and tell you exactly which three sections to read to see how the connection works.
Would you like me to help you refine the "Final Synthesis" prompt—the one that takes these retrieved sections and writes the final answer for the user?
Yes
This is the final and most critical stage of the pipeline. The Synthesis Prompt is where you turn raw, retrieved text into a coherent answer. Unlike standard RAG, which often "dumps" chunks into a prompt, this version uses the Logic Chain we discovered in the graph to tell the LLM exactly how to connect the dots.
1. The "Grounded Analyst" Prompt Template
This prompt ensures the LLM doesn't just "summarize" but actually reasons through the connection between Subject A and Subject B.
System Prompt:
You are a High-Precision Technical Analyst. Your goal is to answer the user's question using ONLY the provided text fragments.
Constraints:
Strict Grounding: If the answer is not explicitly in the text, state "The provided sections do not contain this information."
Citation Requirement: Every claim must be followed by a citation in brackets, e.g., [Section 4.2].
Bridge Logic: If the answer requires connecting two different topics (e.g., Marketing and IT Ops), explicitly explain the "bridge" entity (e.g., Snowflake) that connects them.
No Hallucination: Do not use outside knowledge. Use the provided "Logic Chain" to guide your reasoning.
User Input Structure:
User Question: [INSERT USER QUERY]
Logic Chain (The Map): Marketing --uses-> Segment --stores_in-> Snowflake --managed_by-> IT Ops
Retrieved Document Fragments:
Section 4.2 (Marketing Ops): "The Marketing department utilizes the Segment platform for event tracking. Segment is configured to forward all user IDs..."
Section 8.1 (Data Infrastructure): "Snowflake serves as the primary data warehouse. It is managed exclusively by the IT Ops team to ensure security..."
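The user-input structure above can be assembled mechanically at query time. A minimal sketch — the `build_synthesis_prompt` helper and the `(section_id, title, text)` tuple shape are assumptions, not part of any library:

```python
def build_synthesis_prompt(question, logic_chain, fragments):
    """Assemble the 'Grounded Analyst' user prompt from retrieved parts.

    fragments: list of (section_id, title, text) tuples, in reading order.
    """
    blocks = "\n\n".join(
        f'Section {sid} ({title}): "{text}"' for sid, title, text in fragments
    )
    return (
        f"User Question: {question}\n\n"
        f"Logic Chain (The Map):\n{logic_chain}\n\n"
        f"Retrieved Document Fragments:\n{blocks}"
    )

prompt = build_synthesis_prompt(
    "Who manages the marketing data?",
    "Marketing --uses-> Segment --stores_in-> Snowflake --managed_by-> IT Ops",
    [("4.2", "Marketing Ops", "The Marketing department utilizes the Segment platform..."),
     ("8.1", "Data Infrastructure", "Snowflake serves as the primary data warehouse...")],
)
print(prompt)
```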
2. The Final Output (The "Gold" Answer)
By providing the Logic Chain as a "hint," the LLM produces a much more structured and trustworthy response:
Example Answer:
The Marketing department's data is managed by IT Ops through a multi-step pipeline. According to Section 4.2, Marketing uses the Segment platform to collect user IDs. This data is then stored in Snowflake [Section 4.2]. Section 8.1 confirms that the Snowflake infrastructure is managed exclusively by the IT Ops team. Therefore, IT Ops has ultimate oversight of the marketing data once it reaches the warehouse level.
3. Why this "No-Embedding" Approach Wins
| Problem | Standard Vector RAG | Your Structural/Graph RAG |
|---|---|---|
| Hallucinations | High (LLM tries to "fill gaps" between chunks). | Low (LLM is told the exact logical path to follow). |
| Missing Context | High (chunks are cut off mid-sentence). | Low (you retrieve whole sections from the TOC). |
| Explainability | "The math said these were similar." | "The graph showed A connects to B via C." |
| Maintenance | Re-embed everything when the doc changes. | Just update the specific node/section in the graph. |
4. Scaling to Multiple Documents
If you want to support multiple documents (e.g., an entire library of manuals), you simply add a "Document ID" to your graph nodes:
[Node: Marketing (Doc: HR_Manual_v2)]
[Node: Marketing (Doc: Marketing_Strategy_2026)]
The LLM can then perform a "Cross-Document Search" by finding where the same entity (e.g., "Snowflake") appears in two different manuals, effectively merging your entire company's knowledge into one logical web.
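One way to sketch this with NetworkX: namespace each node with its document ID so identical titles never collide, and keep the bare entity name as a node attribute so cross-document lookups can still find every copy. The `add_doc_edge` helper and `entity` attribute are illustrative conventions, not a library feature:

```python
import networkx as nx

g = nx.MultiDiGraph()

def add_doc_edge(g, doc_id, subject, relation, obj, section):
    """Add a triplet whose nodes are namespaced by document ID."""
    u, v = f"{subject} (Doc: {doc_id})", f"{obj} (Doc: {doc_id})"
    g.add_edge(u, v, relation=relation, section=section, doc=doc_id)
    # Record the bare entity name for cross-document searches
    g.nodes[u]["entity"] = subject
    g.nodes[v]["entity"] = obj

add_doc_edge(g, "HR_Manual_v2", "Marketing", "reports_to", "HR", "2.1")
add_doc_edge(g, "Marketing_Strategy_2026", "Marketing", "uses", "Segment", "4.2")

# Cross-document search: every node that represents the entity "Marketing"
hits = [n for n, d in g.nodes(data=True) if d.get("entity") == "Marketing"]
print(hits)
```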
Final Next Step
You now have a complete architecture: TOC Parser --> ERE Extractor --> NetworkX Graph --> Intent Planner --> Synthesis Prompt.
Would you like me to bundle this logic into a "Master Controller" Python class that coordinates all these steps into a single ask_document(query) function?
Yes
This "Master Controller" class integrates all the components we've built: the TOC Parser, the NetworkX Knowledge Graph, and the LLM-driven Intent/Synthesis engine.
This is a Structural RAG pipeline. It doesn't use a single vector embedding; instead, it relies on the logical "GPS" of your document.
The StructuralRAG Master Controller
```python
import re
import json
import networkx as nx

class StructuralRAG:
    def __init__(self, llm_client=None):
        self.sections = {}           # {"id": "content"}
        self.toc = []                # list of {level, title, id}
        self.graph = nx.MultiDiGraph()
        self.llm = llm_client        # placeholder for your LLM API (Gemini/OpenAI/etc.)

    # --- PHASE 1: INGESTION ---
    def ingest_document(self, markdown_text):
        """Parses a document into a nested TOC and section dictionary."""
        header_pattern = re.compile(r'^(#{1,6})\s+(.*)$', re.MULTILINE)
        matches = list(header_pattern.finditer(markdown_text))
        for i, match in enumerate(matches):
            level, title = len(match.group(1)), match.group(2).strip()
            start = match.end()
            end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown_text)
            section_id = f"{level}.{i}"
            self.sections[section_id] = markdown_text[start:end].strip()
            self.toc.append({"level": level, "title": title, "id": section_id})
        print(f"Ingested {len(self.toc)} sections.")

    # --- PHASE 2: KNOWLEDGE MAPPING ---
    def add_ere_triplets(self, triplets_json):
        """Populates the graph with Entity-Relationship-Entity data."""
        triplets = json.loads(triplets_json)
        for t in triplets:
            self.graph.add_edge(
                t['subject'], t['object'],
                relation=t['relationship'],
                section=t['section_id']
            )
        print(f"Graph updated: {self.graph.number_of_edges()} relationships mapped.")

    # --- PHASE 3: THE SEARCH ENGINE ---
    def _get_logical_path(self, start_node, end_node):
        """Finds the sections connecting two entities via the graph."""
        try:
            path = nx.shortest_path(self.graph, source=start_node, target=end_node)
            sections_to_read = []
            steps = []
            for i in range(len(path) - 1):
                u, v = path[i], path[i + 1]
                edge_data = self.graph.get_edge_data(u, v)[0]  # first parallel edge
                sections_to_read.append(edge_data['section'])
                steps.append(f"[{u}] --({edge_data['relation']})--> [{v}]")
            return sorted(set(sections_to_read)), " -> ".join(steps)
        except (nx.NetworkXNoPath, nx.NodeNotFound):
            return [], "No logical path found."

    # --- PHASE 4: THE MASTER EXECUTION ---
    def ask_document(self, query):
        """The main entry point for the user."""
        # 1. Intent extraction (simulated LLM call)
        # In prod: intent = self.llm.extract_intent(query, known_nodes=list(self.graph.nodes))
        print(f"\n[1] Analyzing Query: '{query}'")
        intent = {"start": "Marketing", "end": "IT Ops"}  # mocked for demo

        # 2. Graph traversal
        section_ids, logic_chain = self._get_logical_path(intent['start'], intent['end'])
        print(f"[2] Logic Chain: {logic_chain}")
        if not section_ids:
            return "I couldn't find a logical connection between those topics in this document."

        # 3. Context retrieval
        context_blocks = []
        for sid in section_ids:
            # Find the title for the section ID
            title = next((item['title'] for item in self.toc if item['id'] == sid), "Unknown Section")
            context_blocks.append(f"--- Section {sid}: {title} ---\n{self.sections[sid]}")
        full_context = "\n\n".join(context_blocks)
        print(f"[3] Retrieving {len(section_ids)} relevant sections...")

        # 4. Final synthesis (simulated LLM call)
        # In prod: answer = self.llm.generate_answer(query, logic_chain, full_context)
        return {
            "answer": "EXECUTIVE SUMMARY: (LLM would generate this based on context)",
            "sources": section_ids,
            "logic_chain": logic_chain,
            "raw_context_preview": full_context[:200] + "..."
        }

# --- EXAMPLE USAGE ---
rag = StructuralRAG()
rag.ingest_document("# Marketing Ops\nData is sent to Segment.\n# Data Infrastructure\nSnowflake stores Segment data and is managed by IT Ops.")
rag.add_ere_triplets(json.dumps([
    {"subject": "Marketing", "relationship": "sends data to", "object": "Segment", "section_id": "1.0"},
    {"subject": "Segment", "relationship": "is stored in", "object": "Snowflake", "section_id": "1.1"},
    {"subject": "Snowflake", "relationship": "is managed by", "object": "IT Ops", "section_id": "1.1"}
]))
result = rag.ask_document("How does Marketing connect to IT Ops?")
print(f"\nFinal Result:\n- Sections Used: {result['sources']}\n- Logic: {result['logic_chain']}")
```
How to Move This into Production
Persistence: Save your graph to disk so you don't have to re-extract triplets every time the server starts. Note that `nx.write_gpickle` was removed in NetworkX 3.0; use Python's `pickle` module directly.
Entity Normalization: In the `add_ere_triplets` method, use a fuzzy-matching library (like `thefuzz`) or a tiny embedding model to ensure "Marketing Department" and "Marketing" merge into the same node.
Streamlining the "Logic Chain": If the graph finds multiple paths, you can ask the LLM to choose the "most relevant" path before fetching the text sections.
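A minimal persistence sketch using the standard `pickle` module (the old `nx.write_gpickle` helper was removed in NetworkX 3.0); the file name `doc_map.pickle` is an arbitrary choice:

```python
import pickle
import networkx as nx

def save_graph(graph, path="doc_map.pickle"):
    """Persist the knowledge graph so triplet extraction runs only once."""
    with open(path, "wb") as f:
        pickle.dump(graph, f)

def load_graph(path="doc_map.pickle"):
    with open(path, "rb") as f:
        return pickle.load(f)

g = nx.MultiDiGraph()
g.add_edge("Marketing", "Segment", relation="uses", section="1.0")
save_graph(g)
restored = load_graph()
print(restored.number_of_edges())
```

As with any pickle file, only load graphs you wrote yourself; pickle is not safe against untrusted input.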
The "2026" Advantage
This system is deterministic. When your boss asks, "Why did the AI say the IT team manages marketing data?", you don't have to talk about "vector cosine similarity." You can literally show them the Graph Path and the Section Headers that led to that conclusion.
Would you like me to help you create a "Bulk Indexer" that runs the ERE extraction on a 500-page document in parallel to populate the graph faster?
How do you perceive this being applied to NASA NPR 7150.2 and NASA-STD-8739.8?
Applying these concepts to NASA NPR 7150.2 (Software Engineering Requirements) and NASA-STD-8739.8 (Software Assurance and Software Safety) transforms them from "dense PDF compliance burdens" into a dynamic, navigable Compliance Engine.
In the NASA context, the "A-x-B" problem is usually: "For a Class B software project (A), what are the Safety-Criticality (B) requirements I must follow?"
1. The Nested TOC: The "Requirement Map"
Instead of scrolling, the TOC Parser maps the distinct mandates of each document.
NPR 7150.2 (The "What"): Defines the 130+ SWE (Software Engineering) requirements.
NASA-STD-8739.8 (The "How"): Defines the SA (Software Assurance) activities.
By nesting these, your RAG system understands that NPR 7150.2, Section 3.6 (Software Assurance) is the parent "hook" that logically pulls in the entirety of NASA-STD-8739.8.
2. The Intersection Index: The "Compliance Matrix"
NASA documents already use an Appendix C (Requirements Mapping Matrix).
| Software Class (A) | Domain (B) | Intersection (A-x-B) | Resulting Requirement |
|---|---|---|---|
| Class A (Human Rated) | Software Safety | NPR 7150.2 § 3.7 x STD-8739.8 § 5.2 | Mandatory IV&V and Hazard Analysis |
| Class D (Sub-orbital) | Quality Assurance | NPR 7150.2 § 3.6 x STD-8739.8 § 4.1 | Basic Peer Review / Unit Testing |
When a user asks, "What do I need for a Class B mission's safety audit?", the "Search" follows this table to jump directly to the safety-criticality sub-sections of both documents simultaneously.
3. The ERE Mapping: "Traceability & Verification"
This is the most powerful application. NASA compliance is built on Traceability. We map the document as a logical chain of evidence.
The "Triplet" Logic:
[SWE-022 (Requirement)] --is_verified_by-> [SA-Activity-14 (Assurance Task)] --produces-> [Software Assurance Plan (Artifact)]
How the AI Uses the Graph:
User Question: "How do I verify our configuration management for a Safety-Critical Class A project?"
The "Graph Walk":
Start at Node: Class A.
Edge: "Requires Safety-Criticality Analysis" --> Node: NPR 7150.2 § 3.7.
Edge: "Implemented via" --> Node: NASA-STD-8739.8 § 7.3 (CM Assurance).
Edge: "Artifact Output" --> Node: Software Configuration Management Plan (SCMP).
4. Practical Application: "The Auditor Agent"
If you applied this architecture to these NASA standards, you could build a tool that does the following:
A. Automatic Tailoring
The user inputs their project's Software Classification (A-F).
B. Gap Analysis (A-x-B Validation)
The AI scans your Project Plan and compares it to the ERE Map.
AI: "You have satisfied SWE-057 (Design), but I don't see the corresponding ERE link to NASA-STD-8739.8 § 6.2 (Design Assurance). You are missing a Peer Review artifact."
C. The "Bidirectional" Query
Because it’s a graph, you can search backwards.
User: "We just finished a Static Code Analysis. Which NASA requirements does this satisfy?"
AI: "Static Analysis [Activity] satisfies SWE-135 (Testing) in NPR 7150.2 and SA-64 (Code Assurance) in NASA-STD-8739.8."
The Verdict
For NASA's high-reliability environment, Structural RAG is superior to Vector RAG. You cannot afford the "semantic fuzziness" of a vector database when lives and billions of dollars depend on a specific requirement ID like SWE-022. You need the Rigid Logic of a Graph and a TOC.
Would you like me to draft a sample "Compliance Logic Chain" for a specific NASA requirement (like SWE-141 for IV&V) to show exactly how the triplets would look?
How do you perceive this being applied to NASA software hazard reports?
Applying this architecture to NASA Software Hazard Reports (SHRs) moves the needle from "document storage" to Systemic Safety Intelligence.
In aerospace safety, the nightmare isn't a single failure; it’s a "cascading failure" where a software fix for one hazard inadvertently disables a control for another. Traditional RAG struggles here because it sees these as isolated text chunks. A Structural/Graph RAG sees them as a neural network of risks.
1. The "Failure Logic" ERE Mapping
In a Software Hazard Report, the relationships are more rigid than in a manual. We can map the entire safety argument as a Hazard Triplet:
Entity A (The Cause): "Buffer Overflow in Guidance Loop"
Relationship: --triggers->
Entity B (The Hazard): "Unintended Thruster Firing during Docking"
Relationship: --mitigated by->
Entity C (The Control): "Independent Watchdog Timer (SWE-121)"
Why this works:
When an engineer asks, "What happens if we delay the Watchdog Timer interval?", a graph search follows the "mitigated by" edge backward to the "Hazard" and then to the "Cause." It tells you exactly which safety certifications are now at risk.
2. The A-x-B Intersection for "Coupling" Analysis
One of the hardest things to find in NASA safety is Software Coupling—where two seemingly unrelated systems share a dependency.
| Subsystem A | Subsystem B | Intersection (A-x-B) | Hazard Implication |
|---|---|---|---|
| Propulsion | Communications | Common Power Bus | A Comms failure could mask a Propulsion leak. |
| Navigation | Thermal Control | Shared Processor | High CPU load in Nav could delay Thermal telemetry. |
By using an Intersection Table, your RAG system can flag "Hidden Hazards." If a user queries about "Propulsion," the AI doesn't just look at propulsion reports; it looks at the A-x-B intersections to see if "Communications" or "Power" hazards might secretly be relevant.
3. Creating a "Safety Traceability" Graph
NASA standards like NPR 7150.2 require "Bidirectional Traceability." You can build this into the graph nodes:
Node 1 (The Requirement): SWE-134 (Boundary Value Testing)
Node 2 (The Hazard Control): "Range checking on sensor input"
Node 3 (The Verification): "Unit Test #802"
The "Search" Use Case: If Unit Test #802 fails during a simulation, the "Librarian Agent" walks the graph in reverse:
Test 802 failed.
This verifies Control: Range Checking.
That control mitigates Hazard: Sensor Overload.
Result: The AI issues a high-priority alert: "Software Hazard 'Sensor Overload' is now UNCONTROLLED due to Test 802 failure."
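That reverse walk is two labeled-edge hops. A sketch, assuming the convention that verification edges point from the test to the control it covers (the `impacted_hazards` helper and node names are illustrative):

```python
import networkx as nx

# Illustrative safety-traceability graph from the example above
g = nx.DiGraph()
g.add_edge("Unit Test #802", "Range Checking", relation="verifies")
g.add_edge("Range Checking", "Sensor Overload", relation="mitigates")

def impacted_hazards(g, failed_test):
    """Follow test --verifies--> control --mitigates--> hazard."""
    controls = [v for _, v, d in g.out_edges(failed_test, data=True)
                if d["relation"] == "verifies"]
    hazards = []
    for control in controls:
        hazards += [v for _, v, d in g.out_edges(control, data=True)
                    if d["relation"] == "mitigates"]
    return hazards

for hazard in impacted_hazards(g, "Unit Test #802"):
    print(f"ALERT: Hazard '{hazard}' is now UNCONTROLLED due to Unit Test #802 failure.")
```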
4. The "Safety Navigator" Interface
Instead of a chat box, imagine a dashboard powered by this data:
The TOC Sidebar: Organizes hazards by Mission Phase (Pre-Launch, Ascent, Orbit).
The Graph Visualizer: Shows "Hot Spots"—nodes with the most edges. If "Memory Management" has 15 arrows pointing to 15 different hazards, it is a Single Point of Failure.
The Intent Extractor: When an auditor asks, "Are all Class A software hazards for re-entry verified?", the system identifies the "Re-entry" nodes, checks for "Verification" edges, and finds any nodes missing a link to a test result.
How this Addresses NASA-STD-8739.8
This standard specifically demands Software Safety Analysis. By using a graph-based RAG, you automate the "Analysis" part.
The "What-If" Engine: You can programmatically "delete" a node (simulating a system failure) and see which other nodes turn "Red" on your graph. This is Automated Fault Tree Analysis, something a standard vector-based RAG could never do.
Would you like me to help you design the "Hazard-to-Verification" node structure so you can track whether every hazard control has a corresponding test result in your graph?
From <https://gemini.google.com/app/d92b9389098c5847> Google Gemini 3.0 (Thinking)
