#4345 PageIndex Locally
Cloud: https://dash.pageindex.ai/subscription
Local: https://github.com/VectifyAI/PageIndex
- If I were to subscribe to PageIndex for $30 for 1-month and 1000 page credits, where do my ingested documents reside, and can I get to them next month after cancelling my subscription?
- How do I install PageIndex locally, requiring no PageIndex API Key or subscription?
- Would "pip install pageindex==0.3.0.dev1" be sufficient, or do I still need to download the GIT repository?
- So when a document is indexed using PageIndex, does the document remain in real-time memory only, or is it saved to a specified file store?
- Describe a Python application with GUI to select any one or multiple of already indexed documents for query
- Rewrite using the Pypi Ollama library
- Would it work to extend this application to support user queries to an indexed document, then use that retrieved text as part of a prompt to assess a separate indexed document via LLM?
- A useful option would be to select a set of documents, prioritized for search order, then sequence through each document until "good" results are found and returned.
- How could the "quality" or "goodness" of each find be evaluated in order to return in top-k order?
- Yes :: Would you like the GUI to display a Bar Chart or Heatmap showing the quality scores of the documents as it sequences through them?
- Would the .json files be useful within a Neo4j graph database?
- Yes, if that is useful :: Are you planning to use Python's Neo4j driver to automate this ingestion as soon as the run_pageindex.py script finishes its local run?
- Yes :: Would you like to include a "Sync to Neo4j" button in your existing NiceGUI investigator?
- Yes :: Would you like to add a Visual Graph Preview component to the GUI using a library like pyvis to see the hierarchy as it syncs?
- New question about PageIndex. Does it handle PDF tables well?
- Both NPR 7150.2D and NASA-STD-8739.8B, as well as NASA SSP-50038 :: Do you have a specific NASA Standard in mind that is particularly heavy on tables, like NASA-STD-8739.8B?
- I am considering Nvidia nemotron-cascade-2 (30B, MOE with 3B active), Google Gemma4 (31B), IBM Granite4:32b-a9b-h, and GPT-OSS (20B)
#4345 If I were to subscribe to PageIndex for $30 for 1-month and 1000 page credits, where do my ingested documents reside, and can I get to them next month after cancelling my subscription?
With your background in NASA S&MA and high-performance hardware, you'll likely appreciate the architectural distinction between the local open-source version of PageIndex and their hosted Cloud Service.
If you subscribe to the Standard Developer Plan ($30/mo), here is exactly how your data is handled:
1. Where Your Documents Reside
When using the hosted Cloud Service (the web platform or hosted API):
Infrastructure: Your ingested documents and the resulting "Tree Index" (the hierarchical TOC and summaries) reside on PageIndex’s AWS infrastructure.
Storage Logic: Unlike traditional RAG that stores vector embeddings, PageIndex stores the structural map of your document.
According to their documentation, "Indexed pages stay available for querying anytime."
Security: Data is encrypted at rest using AES-256 and in transit via TLS 1.2+.
2. Post-Cancellation Access
If you cancel your subscription next month, your access typically changes as follows:
Persistence: The pages you already "bought" with your 1,000 credits are considered processed. The system generally keeps these indices available in your account, but your ability to query them through the hosted LLM or API may be restricted to the "Free Tier" limits (e.g., a lower number of chat messages or active pages).
Active Pages Limit: The Standard plan allows up to 10,000 active pages.
If you cancel and revert to a Free tier (which has a lower "active page" cap, often around 200), you may lose the ability to query the surplus documents until you resubscribe or delete others to fit under the cap.
Top-Up Credits: If you purchased any "Top-Up" credits beyond the 1,000 included in the plan, those never expire and carry over even if you aren't on a paid monthly plan.
3. The "Veteran Professional" Alternative
Since you already operate high-spec hardware (like the Grace Blackwell MSI and the Ryzen AI Max+ Mini PC) and value offline privacy for proprietary data:
Self-Hosting: You can bypass the subscription and storage concerns entirely by using their open-source repository.
Local Control: You can run the run_pageindex.py script locally. In this scenario, the documents and indices stay on your local NVMe drives, and you can point the system to your local Ollama or LiteLLM endpoints to keep the entire reasoning chain offline.
Quick Summary Table
| Feature | Hosted Standard ($30) | Post-Cancellation (Cloud) | Local/Self-Hosted |
| --- | --- | --- | --- |
| Storage | PageIndex AWS | PageIndex AWS | Your Local Hardware |
| Credits | 1,000 monthly | 0 (resets to Free tier) | Unlimited (Local LLM) |
| Access | Up to 10k pages | Reduced to Free limits | Unlimited |
Given your work with NASA standards and the "Digital Mission Twin," are you planning to integrate these document trees into your existing Neo4j knowledge graph?
How do I install PageIndex locally, requiring no PageIndex API Key or subscription?
Since you are operating high-end hardware like the MSI EdgeXpert (Grace Blackwell) and the Ryzen AI Max+ Mini PC, you are perfectly positioned to run PageIndex locally.
To run it without a subscription or a PageIndex API key, you essentially shift from their SaaS (hosted) client to their Open Source repository. This allows you to use your local Ollama or LiteLLM instances as the "brain" for both the tree generation and the reasoning retrieval.
1. Prerequisites
Ollama: Installed and running on your local machine.
Models: You will need a capable reasoning model (e.g., llama3.1 or granite-3.0-8b-instruct) and, if you want visual RAG, a vision model (e.g., llama3-vision or granite-vision).
Python: 3.10 or higher.
2. Installation Steps
Clone the official repository and set up your environment:
# Clone the repository
git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex
# Install dependencies
pip install -r requirements.txt
pip install litellm # Recommended for handling local model routing
3. Configuration (The "No API Key" Trick)
PageIndex uses LiteLLM under the hood, which allows you to redirect requests meant for OpenAI/Anthropic to your local Ollama server.
Create a .env file in the root directory.
Set a dummy key for the PageIndex client (since the local script expects the variable but won't check it against a server if you point to a local model).
Map your local Ollama endpoint:
# .env file
PAGEINDEX_API_KEY=local-dev-no-key
OLLAMA_API_BASE=http://localhost:11434
4. Running the Local Ingestion
To generate the "Tree Index" (the smart TOC) without sending data to their cloud, run the run_pageindex.py script and specify your local model via the --model flag.
# Use Ollama via LiteLLM syntax (ollama/<model-name>)
python3 run_pageindex.py \
--pdf_path "/path/to/your/NASA_Standard_8739.pdf" \
--model "ollama/granite-3.0-8b-instruct" \
--max-pages-per-node 10
5. Why this suits your NASA S&MA Work
Data Sovereignty: Because you aren't using their PageIndexClient(api_key=...) in a hosted script, your proprietary NASA docs never leave your Grace Blackwell architecture.
Vectorless Efficiency: Since your hardware has significant VRAM (12GB on the laptop, likely much more on the Blackwell), you can hold the entire "Tree Index" JSON in the context window.
Traceability: The local output includes a "Reasoning Trace," showing exactly which sections of the document the LLM decided to read—critical for the audit trails required in Safety and Mission Assurance.
Summary of Local vs. Subscription
| Feature | $30 Subscription | Local Installation |
| --- | --- | --- |
| Data Residency | PageIndex AWS | Your Local NVMe |
| API Key Required | Yes | No (Redirect to Ollama) |
| Max Documents | 1,000 credits/mo | Unlimited |
| Privacy | Standard Cloud | Air-gapped capable |
Are you planning to run this primarily on the MSI EdgeXpert to leverage the Blackwell architecture, or are you looking to keep this as a portable utility on the Ryzen Mini PC?
Would "pip install pageindex==0.3.0.dev1" be sufficient, or do I still need to download the GIT repository?
Installing via pip install pageindex==0.3.0.dev1 is a great start, but for your specific goal—running locally with no API key or subscription—it is likely not sufficient on its own.
Here is the breakdown of why you still need the Git repository and how the pieces fit together for your high-performance hardware setup.
1. The SDK vs. The Engine
The SDK (pip install): The package you see on PyPI is primarily a Client SDK. It is designed to talk to the PageIndex Cloud API. Even the 0.3.0.dev1 version (released April 2026) is architected to send your documents to their servers for processing and then receive the "Tree Index" back.
The Engine (Git Repo): To process documents locally using your own compute (like your Grace Blackwell MSI), you need the actual logic that builds the tree: the scripts that handle the PDF parsing, the recursive LLM summarization, and the local tree-traversal. This code currently lives in the scripts and utilities of the VectifyAI GitHub repository.
2. Why the Git Repo is necessary for "Local Only"
To avoid the subscription and API key, you are essentially "self-hosting" the indexing process. The Git repository contains:
run_pageindex.py: This is the core script that performs local ingestion. The PyPI package generally does not expose this as a CLI command.
Local Backend Logic: The logic to interface with Ollama or LiteLLM directly (without a middleman API) is most easily configured within the repo's environment.
Requirements: Local processing requires specific dependencies (like PyMuPDF or marker-pdf) that the slim SDK might not include by default.
3. The Recommended Hybrid Setup
Since you are a veteran software professional, I recommend this "clean" approach to get the best of both worlds:
Clone the Repo: This gives you the local "builder" tools.
git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex
Install the SDK within the Repo: This ensures you have the latest classes for handling the JSON trees once they are built.
pip install pageindex==0.3.0.dev1
pip install -r requirements.txt
Use the Local Script: Point it to your local Ollama instance.
python run_pageindex.py --pdf_path "path/to/nasa-doc.pdf" --model "ollama/granite-3.0-8b-instruct"
4. Important Privacy Note
By using the Git repository's run_pageindex.py script and directing the --model flag to a local endpoint, you bypass the PageIndexClient class that checks for a subscription. This keeps your NASA S&MA data strictly on your local NVMe drives.
Pro-Tip for your Hardware: Since you have 12GB VRAM on your laptop and significantly more on the Blackwell system, ensure you set the --max-tokens-per-node higher than the default in the script. Your hardware can handle much larger context windows, which will make the "Tree Index" even more accurate for complex technical requirements.
So when a document is indexed using PageIndex, does the document remain in real-time memory only, or is it saved to a specified file store?
Given your focus on "idea mining" and creating persistent storage for projects like CogniWeave, the way PageIndex handles data will be a familiar pattern to you.
When you run PageIndex locally, it does not rely on real-time memory alone. It follows a structured persistence model that saves your work to your local file system.
1. The Primary Storage: The "Tree Index" JSON
When you run the run_pageindex.py script on your local hardware, it creates a persistent JSON file that acts as your "map."
Location: By default, these are saved in a ./results/ directory within the PageIndex folder.
Naming Convention: Usually {your_filename}_structure.json.
What’s inside: This file contains the hierarchical "Tree" (nodes, titles, page ranges, and summaries). If you have set the flag --if-add-node-text yes, it will also store the actual text content of those sections within the JSON.
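To see what one of these persisted trees looks like, a short recursive walker can print the hierarchy. This is a minimal sketch: the key names ('title', 'nodes') are assumptions about the JSON schema, so check them against what your PageIndex version actually emits.

```python
def outline(node, depth=0):
    """Return an indented outline of a PageIndex-style tree.

    Assumes each node is a dict with a 'title' and a 'nodes' list of
    children (key names are assumptions); real files may also carry
    page ranges and summaries you could append to each line.
    """
    lines = ["  " * depth + "- " + node.get("title", "Untitled")]
    for child in node.get("nodes", []):
        lines.extend(outline(child, depth + 1))
    return lines

# In practice you would json.load() a ./results/<doc>_structure.json;
# a small in-memory tree stands in for it here:
sample = {
    "title": "NASA-STD-8739.8B",
    "nodes": [
        {"title": "1. Scope", "nodes": []},
        {"title": "4. Requirements",
         "nodes": [{"title": "4.3.1 Software Safety", "nodes": []}]},
    ],
}
print("\n".join(outline(sample)))
```

Because the walker returns plain strings, the same function can feed a GUI tree widget or a log file without modification.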
2. The Retrieval Logic: Memory vs. Disk
Navigation Phase: When you ask a question, the system loads that JSON Tree into the LLM's active context window (memory). Because the tree is just a summary of the structure, it is small enough to fit in the VRAM of your RTX 4080 or Grace Blackwell system.
Extraction Phase: Once the LLM "reasons" which node it needs, the system goes back to the local PDF/file store or the JSON file (if text was included) to pull the full text for the final answer.
3. Integration with Your Knowledge Graph
Since you are already working with Neo4j and Cypher, you might find the PageIndex storage model particularly useful. Because the output is a structured JSON hierarchy, you can easily write a Python script to ingest that JSON directly into Neo4j:
| PageIndex Component | Neo4j Equivalent |
| --- | --- |
| Root Node | :Document Node |
| Child Nodes | :Section or :Subsection Nodes |
| Parent-Child Link | [:HAS_SUBSECTION] Relationship |
| Node Summary | Property on the Node |
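That mapping can be automated with the official neo4j Python driver. The sketch below is hedged: the field names ('node_id', 'title', 'summary', 'nodes'), the :Section label, and the credentials are all assumptions to adapt to your own JSON and database. The flattener is pure Python, so only sync_to_neo4j touches the database.

```python
def flatten_tree(node, parent_id=None):
    """Depth-first flatten a PageIndex-style tree into ingestion rows.

    Key names are assumptions; adjust to the JSON run_pageindex.py produces.
    """
    node_id = node.get("node_id") or node.get("title")
    rows = [{"id": node_id,
             "title": node.get("title", "Untitled"),
             "summary": node.get("summary", ""),
             "parent_id": parent_id}]
    for child in node.get("nodes", []):
        rows.extend(flatten_tree(child, parent_id=node_id))
    return rows

# One UNWIND batch per document; the root row (parent_id = null)
# simply fails the MATCH and gets no relationship.
INGEST_CYPHER = """
UNWIND $rows AS row
MERGE (n:Section {id: row.id})
SET n.title = row.title, n.summary = row.summary
WITH n, row
MATCH (p:Section {id: row.parent_id})
MERGE (p)-[:HAS_SUBSECTION]->(n)
"""

def sync_to_neo4j(tree, uri="bolt://localhost:7687", auth=("neo4j", "password")):
    """Push one document tree into Neo4j (requires `pip install neo4j`)."""
    from neo4j import GraphDatabase
    with GraphDatabase.driver(uri, auth=auth) as driver:
        driver.execute_query(INGEST_CYPHER, rows=flatten_tree(tree))
```

Batching via UNWIND keeps the ingest to a single round-trip per document, which matters if you sync dozens of standards at once.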
Summary for your local R&D:
Persistent? Yes. It saves a .json file to your disk.
Real-time Memory? Only during the active "reasoning" step (loading the map).
Privacy? Since the file resides in your ./results/ folder, it never leaves your machine.
Pro-Tip: If you are using this for your Digital Mission Twin, you can store the PageIndex JSONs alongside your source PDFs in a dedicated project folder. You don't need a database to keep the index alive; you just need to point your chat script to that specific JSON file whenever you want to "talk" to that document again.
How do you envision linking these hierarchical JSON trees to your existing "Cortex" intent-based database?
Describe a Python application with GUI to select any one or multiple of already indexed documents for query
For your "MAKER" workflow and NASA S&MA requirements, a local GUI provides the traceability you need without the overhead of a cloud subscription. Since PageIndex saves its results as structured JSON files in the ./results/ directory, the application needs to scan that folder, let you select files, and then use your local Ollama instance to "reason" over those trees.
The following Python application uses NiceGUI (a high-level, Pythonic web-framework for GUIs) because it handles asynchronous LLM calls gracefully and allows for the "chat-style" interface you'll want for querying.
Python Application: PageIndex Local Explorer
import os
import json
import asyncio
from nicegui import ui
import httpx # To talk to your local Ollama API
# --- CONFIGURATION ---
RESULTS_DIR = "./results" # Default PageIndex output folder
OLLAMA_URL = "http://localhost:11434/api/generate"
LOCAL_MODEL = "granite-3.0-8b-instruct" # Or your preferred model
class PageIndexExplorer:
def __init__(self):
self.selected_files = []
self.chat_history = []
self.indices = {} # Stores loaded JSON content
def get_indexed_files(self):
"""Scans the results directory for PageIndex JSON files."""
if not os.path.exists(RESULTS_DIR):
return []
return [f for f in os.listdir(RESULTS_DIR) if f.endswith("_structure.json")]
async def run_query(self, user_query):
"""Logic to query the selected JSON trees via local Ollama."""
if not self.selected_files:
ui.notify("Please select at least one document first!", type='warning')
return
# 1. Prepare Context (Combining summaries from all selected trees)
combined_context = ""
for file_name in self.selected_files:
with open(os.path.join(RESULTS_DIR, file_name), 'r') as f:
data = json.load(f)
# We extract titles and summaries for 'reasoning-based' retrieval
combined_context += f"Document: {file_name}\n"
combined_context += json.dumps(data, indent=2)[:5000] # Cap context for safety
# 2. Build Prompt
prompt = f"""
You are a NASA Safety and Mission Assurance assistant.
Using the following Document Tree Structure(s), answer the query.
Context:
{combined_context}
Query: {user_query}
"""
# 3. Call Local Ollama (Grace Blackwell / Ryzen hardware handles this)
async with httpx.AsyncClient() as client:
response = await client.post(OLLAMA_URL, json={
"model": LOCAL_MODEL,
"prompt": prompt,
"stream": False
}, timeout=60.0)
result = response.json().get('response', 'No response from local model.')
self.chat_history.append(("You", user_query))
self.chat_history.append(("AI", result))
self.update_chat.refresh()
@ui.refreshable
def update_chat(self):
for role, text in self.chat_history:
ui.label(f"{role}:").style('font-weight: bold')
ui.markdown(text).classes('mb-4 p-2 bg-slate-100 rounded')
# --- GUI LAYOUT ---
explorer = PageIndexExplorer()
with ui.header().classes('items-center justify-between'):
ui.label('PageIndex Local Explorer').classes('text-2xl')
ui.label('NASA S&MA Workflow').classes('text-sm italic')
with ui.left_drawer(value=True).classes('bg-slate-50'):
ui.label('Available Indices').classes('text-lg mb-2')
files = explorer.get_indexed_files()
ui.select(files, multiple=True, label='Select Documents',
on_change=lambda e: setattr(explorer, 'selected_files', e.value)).classes('w-full')
ui.button('Refresh List', on_click=explorer.get_indexed_files)  # ui.button takes on_click, not on_change
with ui.column().classes('w-full max-w-3xl mx-auto'):
chat_container = ui.column().classes('w-full border-b mb-4 min-h-[400px]')
with chat_container:
explorer.update_chat()
with ui.row().classes('w-full items-center'):
query_input = ui.input(placeholder='Ask a question about the selected docs...').classes('flex-grow')
ui.button('Send', on_click=lambda: explorer.run_query(query_input.value))
ui.run(title="PageIndex Explorer", port=8080)
Why this design works for you:
Zero API Keys: It points directly to localhost:11434 (Ollama), keeping your data on your MSI/Ryzen hardware.
Multi-Doc Querying: The ui.select(multiple=True) allows you to combine indices. This is perfect for your Digital Mission Twin where you might need to query the HLS and Artemis indices simultaneously.
Scalability: Because PageIndex stores the "Tree" (not the full text, unless you use the --if-add-node-text yes flag), you can load dozens of document maps into memory without overwhelming the LLM's context window.
Persistence: It reads directly from ./results/, so everything you indexed yesterday is immediately available for selection today.
Implementation Note: Since you mentioned 40 years of experience, you'll likely want to extend the run_query function to implement the Tree Search logic (first asking the LLM which node IDs it needs, then fetching the raw text from those IDs). The code above provides the "Reasoning-over-Summary" foundation to get you started.
Does this GUI structure align with how you intended to manage your project documentation for the Gateway Program?
Rewrite using the Pypi Ollama library
Using the official ollama Python library makes the code much cleaner and handles the connection to your local server more robustly.
Since you're working on high-performance local hardware, this version assumes Ollama is running on its default port. I have also added a "System Message" to the chat logic to better suit your role as a NASA S&MA contractor.
PageIndex Local Explorer (Ollama Library Version)
import os
import json
from nicegui import ui
from ollama import Client # Official Pypi Ollama Library
# --- CONFIGURATION ---
RESULTS_DIR = "./results"
LOCAL_MODEL = "granite-3.0-8b-instruct" # Optimized for your hardware
# Initialize the official client
ollama_client = Client(host='http://localhost:11434')
class PageIndexExplorer:
def __init__(self):
self.selected_files = []
self.chat_history = []
def get_indexed_files(self):
"""Scans the local ./results folder for PageIndex JSON trees."""
if not os.path.exists(RESULTS_DIR):
return []
return [f for f in os.listdir(RESULTS_DIR) if f.endswith("_structure.json")]
def run_query(self, user_query):
"""Processes the query using the official Ollama library."""
if not self.selected_files:
ui.notify("Select at least one document first!", type='warning')
return
# 1. Aggregate Context from selected JSON trees
context_blocks = []
for file_name in self.selected_files:
try:
with open(os.path.join(RESULTS_DIR, file_name), 'r') as f:
data = json.load(f)
# We pass the structural tree to the LLM
context_blocks.append(f"DOCUMENT: {file_name}\nTREE DATA:\n{json.dumps(data, indent=1)}")
except Exception as e:
ui.notify(f"Error loading {file_name}: {e}", type='negative')
context_string = "\n---\n".join(context_blocks)
# 2. Execute Local Inference via Ollama Library
try:
# Using .chat() allows for cleaner system-role prompting
response = ollama_client.chat(model=LOCAL_MODEL, messages=[
{
'role': 'system',
'content': 'You are a NASA Safety & Mission Assurance (S&MA) technical lead. Use the provided PageIndex Tree Structures to answer queries accurately.'
},
{
'role': 'user',
'content': f"Context:\n{context_string}\n\nQuery: {user_query}"
},
])
answer = response['message']['content']
self.chat_history.append(("You", user_query))
self.chat_history.append(("AI", answer))
self.update_chat.refresh()
except Exception as e:
ui.notify(f"Ollama Error: {str(e)}", type='negative')
@ui.refreshable
def update_chat(self):
for role, text in self.chat_history:
with ui.card().classes('w-full mb-2'):
ui.label(role).classes('text-xs font-bold text-blue-600')
ui.markdown(text)
# --- GUI LAYOUT ---
explorer = PageIndexExplorer()
with ui.header().classes('bg-blue-900 text-white items-center'):
ui.icon('rocket_launch').classes('text-2xl')
ui.label('PageIndex Local Investigator').classes('text-xl font-bold')
with ui.left_drawer().classes('bg-slate-100'):
ui.label('Select Project Indices').classes('font-bold mb-4')
files = explorer.get_indexed_files()
ui.select(files, multiple=True, label='Selected Documents',
on_change=lambda e: setattr(explorer, 'selected_files', e.value)).classes('w-full')
ui.button('Clear Chat', on_click=lambda: (explorer.chat_history.clear(), explorer.update_chat.refresh())).classes('mt-4 w-full')
# Main Chat Area
with ui.column().classes('w-full max-w-4xl mx-auto p-4'):
chat_display = ui.column().classes('w-full')
with chat_display:
explorer.update_chat()
with ui.row().classes('w-full items-center p-4 bg-white border-t sticky bottom-0'):
query_input = ui.input(placeholder='Ask about NASA standards or Mission Twins...').classes('flex-grow')
ui.button(icon='send', on_click=lambda: explorer.run_query(query_input.value))
ui.run(title="PageIndex Explorer", native=True, window_size=(1200, 800))
Key Technical Improvements:
The ollama Library: Uses ollama_client.chat(), which is more structured than a raw HTTP post and manages the connection pool more efficiently for your Grace Blackwell MSI.
System Prompting: I’ve injected a system role that specifically references NASA S&MA, which will help the model prioritize technical precision and mission safety in its summaries.
Native Mode: The ui.run(native=True) flag will attempt to open the app in its own window (like a standard desktop application) rather than a browser tab, fitting your "MAKER" standalone tool preference.
Selection Persistence: You can now select multiple mission-specific indices (e.g., Gateway and Artemis) and query them as a single unified knowledge source.
One caution for your 12GB VRAM laptop: If you select many documents at once, the context_string might grow large. Since you're using IBM Granite or Llama 3.1, you can comfortably fit about 8k–32k tokens. If your trees get very deep, you might want to truncate the JSON data or use a more specific retrieval step.
Would you like to add a feature to this GUI that automatically generates a Cypher script based on the selected document's tree structure?
Would it work to extend this application to support user queries to an indexed document, then use that retrieved text as part of a prompt to assess a separate indexed document via LLM?
Extending your application to use one document as a "lens" to assess another is exactly how a NASA S&MA auditor would operate (e.g., using a Requirement Standard to audit a Hazard Report).
This would work exceptionally well because PageIndex isn't just a search engine; it's a navigational map. You can use the LLM to browse "Document A," extract specific requirements, and then use those requirements as the "intent" to navigate "Document B."
How the Workflow Operates (The "Requirement-to-Compliance" Chain)
Step 1 (Extraction): The GUI queries the Standard_8739_structure.json. The LLM returns the specific text for Section 4.3.1 (Software Safety).
Step 2 (Cross-Reference): You take that extracted text and inject it into a new prompt: "Using these Software Safety requirements, find the most relevant sections in the Mission_Hazard_Report.json and assess if they comply."
Step 3 (Reasoned Retrieval): The LLM looks at the Hazard Report tree, picks the right nodes (e.g., "Mitigation Strategies"), and pulls that text to perform the final assessment.
Updated Code Strategy
To make this work, you need to ensure your indexing step used the --if-add-node-text yes flag. This ensures the raw text is inside your JSON, so you don't have to re-parse the PDF during the "Assessment" phase.
Here is the Python logic to extend your explorer for this "Two-Document" audit:
def audit_documents(self, source_doc_id, target_doc_id, query):
"""
source_doc: The 'Rulebook' (e.g., NASA Standard)
target_doc: The 'Subject' (e.g., Project Design Doc)
"""
# 1. Extract the 'Rule' from the first document
# We use the existing tree logic to find the specific requirement
rule_extraction = ollama_client.chat(model=LOCAL_MODEL, messages=[
{'role': 'user', 'content': f"Extract the specific requirements for '{query}' from this tree: {self.indices[source_doc_id]}"}
])
extracted_rule = rule_extraction['message']['content']
# 2. Use the 'Rule' to audit the second document
audit_result = ollama_client.chat(model=LOCAL_MODEL, messages=[
{
'role': 'system',
'content': 'You are a NASA Safety Auditor. Compare the provided RULE to the SUBJECT document and identify gaps.'
},
{
'role': 'user',
'content': f"RULE: {extracted_rule}\n\nSUBJECT TREE: {self.indices[target_doc_id]}\n\nPerform the audit."
}
])
return audit_result['message']['content']
Why this fits your "MAKER" Profile
Traceability: Because you are using PageIndex, your "Audit Report" can include the Node IDs from both documents. (e.g., "Requirement from Node N004 of Doc A is not met by Mitigation in Node N082 of Doc B").
Context Window Management: Instead of shoving two massive 500-page PDFs into your MSI’s VRAM, you are only passing the summarized trees. This allows you to perform complex cross-document reasoning even on a laptop.
Digital Mission Twin: This is the "brain" of a Mission Twin. You can automate the check between Lunar Surface Operations (HLS) and the Gateway Interface Requirements (IRD) by simply chaining these queries.
Implementation Tip
When you run your initial indexing, use this command to ensure the text is available for the "Audit" logic:
python run_pageindex.py --pdf_path "Standard.pdf" --if-add-node-text yes --if-add-node-summary yes
This makes the JSON file slightly larger, but it turns your results folder into a fully-functional, queryable database that requires zero external API calls or vector database management.
Does this "Comparison Audit" workflow cover what you need for your NASA subcontracting reports?
A useful option would be to select a set of documents, prioritized for search order, then sequence through each document until "good" results are found and returned.
This is a sophisticated "Short-Circuit" search pattern. In a high-stakes environment like NASA S&MA, this prevents the LLM from hallucinating an answer based on a secondary document when the primary "Source of Truth" already contains a definitive "No" or a specific "Shall" requirement.
By leveraging your local Ollama setup, you can implement a "Waterfall Retrieval" logic. The application will query each document’s Tree Index one by one. If the LLM determines it has found a "Good" (sufficiently high-confidence) answer, it stops and returns the result, saving VRAM and compute time on your MSI or Ryzen hardware.
The "Waterfall" Query Logic
Here is how you can modify the PageIndexExplorer class to handle prioritized sequencing:
import os
import json
from nicegui import ui  # required for the ui.notify calls below
from ollama import Client
ollama_client = Client(host='http://localhost:11434')
class PageIndexExplorer:
# ... previous init code ...
async def prioritized_search(self, prioritized_docs, user_query):
"""
Sequences through documents in order.
Stops as soon as a 'Good' result is found.
"""
final_answer = None
for doc_name in prioritized_docs:
ui.notify(f"Searching {doc_name}...", type='info')
# Load the tree for the current document
with open(os.path.join("./results", doc_name), 'r') as f:
tree_data = json.load(f)
# Ask the local model to evaluate IF the answer is in THIS document
# We use a 'Judge' prompt to determine if we should stop here
response = ollama_client.chat(model="granite-3.0-8b-instruct", messages=[
{
'role': 'system',
'content': (
"You are a NASA S&MA Auditor. Examine the provided Document Tree. "
"1. If the answer to the query is likely in this document, provide the answer and start your response with [FOUND]. "
"2. If the document does not contain the information, respond with [NOT_FOUND]."
)
},
{
'role': 'user',
'content': f"Query: {user_query}\n\nDocument Tree:\n{json.dumps(tree_data, indent=1)}"
}
])
result_text = response['message']['content']
if "[FOUND]" in result_text:
final_answer = result_text.replace("[FOUND]", "").strip()
ui.notify(f"Match found in {doc_name}!", type='positive')
break # This is the "Short-Circuit"
else:
ui.notify(f"Insufficient data in {doc_name}, moving to next...", type='warning')
if not final_answer:
final_answer = "The query could not be satisfied by any of the selected documents."
return final_answer
GUI Implementation Strategy
To support Prioritization, you need a way to reorder the selected documents. In NiceGUI, you can use a ui.list with "Up/Down" arrows or a drag-and-drop interface.
Recommended UI Layout:
Column A (Available): A list of all _structure.json files.
Column B (Search Queue): A list where you add documents. Each item has a ↑ and ↓ button to set the search priority.
Search Button: Triggers the prioritized_search loop.
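The ↑/↓ reordering itself reduces to a small list helper that each button's on_click handler can call before refreshing the queue display. A sketch (the function name and signature are my own, not part of NiceGUI or PageIndex):

```python
def move_item(queue, index, direction):
    """Move queue[index] up (direction=-1) or down (+1); returns a new list.

    Intended for the Search Queue column: each arrow button calls this
    with its row's index, then the GUI re-renders from the result.
    Out-of-range moves return an unchanged copy.
    """
    target = index + direction
    if not (0 <= index < len(queue)) or not (0 <= target < len(queue)):
        return list(queue)  # no-op at the top/bottom of the list
    reordered = list(queue)
    reordered[index], reordered[target] = reordered[target], reordered[index]
    return reordered
```

Returning a new list (rather than mutating in place) keeps the handler safe to bind directly to a @ui.refreshable redraw.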
Why this benefits your S&MA Workflow:
Hierarchy of Authority: You can place NPR 7150.2D (Software Engineering Requirements) at the top of the list. If the answer is there, the system won't bother looking at lower-level project design docs, ensuring your reports always align with the highest-level requirements first.
Resource Efficiency: Since you are running locally, why process five 1,000-page indices if the answer is on page 10 of the first one? This "Waterfall" approach drastically reduces the load on your RTX 4080's memory.
Audit Trail: The app can log the "failed" searches as well, which is useful for NASA audits: "Checked Document A (No Match), Checked Document B (No Match), Found definitive requirement in Document C."
Does this "short-circuit" logic fit the specific way you evaluate Hazard Reports against NASA Safety Standards?
How could the "quality" or "goodness" of each find be evaluated in order to return in top-k order?
To evaluate "goodness" for a top-k ranking on your local hardware, you need a multi-stage scoring heuristic. In a NASA S&MA context, a "good" result isn't just one that contains keywords; it’s one that matches the technical intent and authority level of your query.
Since you are bypassing the PageIndex cloud and using your own Grace Blackwell/Ryzen compute, you can implement a "Scoring Agent" workflow.
1. The Scoring Heuristic
Instead of a simple boolean [FOUND], you can ask your local LLM to return a structured Confidence Score based on specific criteria. I recommend a 1–10 scale based on three pillars:
Relevance: Does the node content directly address the query?
Specificity: Does it provide a "Shall" statement or a specific value (e.g., "180°C"), or is it just a general heading?
Authority: Is this a high-level requirement (NPR) or a lower-level implementation note?
2. Implementation: The "Scorer" Function
You can extend your Python application to run a "Scoring Pass" on each candidate node.
```python
import json
from ollama import Client

ollama_client = Client(host='http://localhost:11434')

def evaluate_result_quality(query, candidate_text, source_doc):
    """
    Uses the local LLM to assign a numerical score to a retrieved snippet.
    """
    score_prompt = f"""
    Evaluate the 'Goodness' of the following text snippet relative to the User Query.
    User Query: {query}
    Source Document: {source_doc}
    Snippet: {candidate_text}
    Return ONLY a JSON object with:
    1. "score": (Integer 1-10)
    2. "reason": (Brief explanation)
    3. "confidence": (High/Med/Low)
    """
    response = ollama_client.generate(
        model="granite-3.0-8b-instruct",
        prompt=score_prompt,
        format="json"  # Granite and Llama 3.1 support JSON mode
    )
    return json.loads(response['response'])
```
3. The Top-K Collection Workflow
Instead of stopping at the first find (Waterfall), you sequence through your documents and collect all candidates with a score above a certain threshold (e.g., > 7).
Iterate: Loop through your prioritized JSON trees.
Extract: Pull the top 3 most relevant nodes from each tree.
Score: Run the evaluate_result_quality function on each of those 3 nodes.
Rank: Store these in a list of dictionaries: results.append({'doc': name, 'score': s, 'text': t}).
Sort: results.sort(key=lambda x: x['score'], reverse=True).
Return: Display the Top-K (e.g., Top 3) results to the GUI.
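As a minimal sketch of that collect-score-sort pass, with the LLM scorer stubbed out by fixed scores purely for illustration:

```python
# Sketch of the Top-K collection workflow: score every candidate,
# keep those above threshold, sort descending, return the top k.
# `score_candidate` is a stand-in for the LLM-backed scoring pass.

def score_candidate(doc, text):
    # Fixed placeholder scores; replace with evaluate_result_quality().
    return {"doc1": 9, "doc2": 4, "doc3": 8}.get(doc, 0)

def collect_top_k(doc_nodes, threshold=7, k=3):
    results = []
    for doc, text in doc_nodes:
        s = score_candidate(doc, text)
        if s > threshold:               # keep only candidates above threshold
            results.append({"doc": doc, "score": s, "text": text})
    results.sort(key=lambda x: x["score"], reverse=True)
    return results[:k]

top = collect_top_k([("doc1", "a"), ("doc2", "b"), ("doc3", "c")])
```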
4. Hardware Optimization for S&MA
Since you are a veteran software professional, you know that "Quality" in Safety and Mission Assurance often requires cross-verification.
| Tier | Method | Resource Cost | Quality Level |
| --- | --- | --- | --- |
| Tier 1 | String/Keyword Match | Negligible | Low (Too much noise) |
| Tier 2 | LLM Tree Navigation | Medium | High (Standard PageIndex) |
| Tier 3 | LLM Self-Reflection | High | Superior (The Audit Gold Standard) |
The Tier 3 "Self-Reflection" Approach:
With the VRAM on your MSI EdgeXpert, you can actually run a "Critic" agent simultaneously. After the first model finds an answer, a second model (perhaps a larger one like llama3.1:70b if your Blackwell allows) reviews the answer and says: "This result from NASA-STD-8739.8B is high quality because it references the Mapping Matrix explicitly."
Why this works for your "Digital Mission Twin":
If you are querying your Neo4j graph later, these "Goodness" scores can be stored as properties on the relationships. You can literally track how "authoritative" a specific requirement link is, allowing you to filter your mission twin by "High Confidence" safety paths only.
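As an illustrative Cypher sketch of that idea (the :ANSWERS relationship type and the goodness property name are assumptions, not a fixed schema):

```cypher
// Persist a "Goodness" score as a relationship property so the graph
// can later be filtered by confidence. Labels and property names are
// illustrative choices.
MATCH (q:Query {id: $query_id}), (s:Section {node_id: $node_id})
MERGE (q)-[r:ANSWERS]->(s)
SET r.goodness = $score, r.evaluated_at = datetime()

// Later: traverse only high-confidence safety paths
MATCH (q:Query)-[r:ANSWERS]->(s:Section)
WHERE r.goodness >= 8
RETURN s.title, r.goodness ORDER BY r.goodness DESC
```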
Would you like the GUI to display a Bar Chart or Heatmap showing the quality scores of the documents as it sequences through them?
Yes
Adding a visual "Quality Heatmap" to your local PageIndex investigator transforms it from a simple chat interface into a professional R&D dashboard.
For a veteran software professional and NASA subcontractor, this provides immediate visual validation of which mission documents are providing the "Source of Truth" for your S&MA reports.
The Design: "Live Scoring Heatmap"
We will use NiceGUI's integration with Plotly or its built-in Linear Progress bars to show the "Goodness" score of each document as the sequence runs.
```python
import os
import json
import asyncio
from nicegui import ui
from ollama import Client

# Initialize Client for your local hardware
ollama_client = Client(host='http://localhost:11434')
RESULTS_DIR = "./results"
LOCAL_MODEL = "granite-3.0-8b-instruct"

class QualityHeatmapExplorer:
    def __init__(self):
        self.doc_scores = {}  # {doc_name: score_0_to_1}
        self.top_k_results = []
        self.is_searching = False

    def get_indices(self):
        return [f for f in os.listdir(RESULTS_DIR) if f.endswith("_structure.json")]

    async def evaluate_top_k(self, selected_docs, query, k=3):
        self.is_searching = True
        self.doc_scores = {doc: 0.0 for doc in selected_docs}
        self.top_k_results = []
        self.update_ui.refresh()
        all_candidates = []
        for doc in selected_docs:
            with open(os.path.join(RESULTS_DIR, doc), 'r') as f:
                tree_data = json.load(f)
            # Stage 1: Assessment & Scoring (the tree itself is embedded in the prompt)
            prompt = f"""
            Analyze this Document Tree for the query: "{query}"
            Document Tree: {json.dumps(tree_data)}
            Return a JSON object with:
            - "score": (0.0 to 1.0) based on relevance to NASA S&MA standards.
            - "summary": (The answer found in this specific doc)
            """
            try:
                # Using format="json" ensures the model returns parseable data
                resp = ollama_client.generate(model=LOCAL_MODEL, prompt=prompt, format="json")
                data = json.loads(resp['response'])
                score = data.get('score', 0.0)
                self.doc_scores[doc] = score
                all_candidates.append({
                    'doc': doc,
                    'score': score,
                    'text': data.get('summary', 'No summary provided.')
                })
                self.update_ui.refresh()
                await asyncio.sleep(0.1)  # UI heartbeat
            except Exception:
                continue
        # Sort and take Top-K
        all_candidates.sort(key=lambda x: x['score'], reverse=True)
        self.top_k_results = all_candidates[:k]
        self.is_searching = False
        self.update_ui.refresh()

    @ui.refreshable
    def update_ui(self):
        # 1. The Heatmap (Progress Bars)
        with ui.card().classes('w-full p-4'):
            ui.label('Document Relevance Heatmap').classes('text-lg font-bold mb-2')
            for doc, score in self.doc_scores.items():
                with ui.row().classes('items-center w-full'):
                    ui.label(doc[:30] + '...').classes('w-1/3 text-xs truncate')
                    # Color shifts from Blue (low) to Green (high)
                    color = 'green' if score > 0.7 else 'orange' if score > 0.4 else 'blue'
                    ui.linear_progress(value=score, show_value=False).props(f'color={color} track-color=grey-3')
                    ui.label(f"{int(score * 100)}%").classes('text-xs w-8')
        # 2. Top-K Results
        if self.top_k_results:
            ui.label(f'Top {len(self.top_k_results)} Verified Matches').classes('text-lg font-bold mt-4')
            for res in self.top_k_results:
                with ui.card().classes('w-full mb-2 border-l-4 border-green-500'):
                    with ui.row().classes('justify-between w-full'):
                        ui.label(res['doc']).classes('font-bold text-blue-800')
                        ui.badge(f"Score: {int(res['score'] * 100)}").props('color=green')
                    ui.markdown(res['text'])

# --- APP LAYOUT ---
explorer = QualityHeatmapExplorer()
with ui.header().classes('bg-slate-900'):
    ui.label('NASA S&MA Digital Mission Twin: Document Investigator').classes('text-xl')
with ui.left_drawer().classes('bg-slate-50'):
    ui.label('1. Priority Sequence').classes('font-bold')
    docs = explorer.get_indices()
    sel = ui.select(docs, multiple=True, label='Selected (In Order)').classes('w-full')
    ui.label('2. Search Configuration').classes('font-bold mt-4')
    k_val = ui.number('Top-K', value=3, min=1, max=10).classes('w-full')
    q_input = ui.input('User Query').classes('w-full mt-4')
    ui.button('Execute Quality Audit',
              on_click=lambda: explorer.evaluate_top_k(sel.value, q_input.value, int(k_val.value))
              ).classes('w-full mt-4')
with ui.column().classes('w-full max-w-5xl mx-auto'):
    explorer.update_ui()

ui.run(title="Quality Investigator", port=8081)
```
Why this setup fits your profile:
Visual Confidence: The color-coded progress bars (Blue → Orange → Green) allow you to see exactly which parts of your library "lit up" during the search.
Audit Logic: By using the format="json" parameter, you're forcing the local LLM to behave like a structured data provider, which is essential for the deterministic needs of NASA Safety and Mission Assurance.
Grace Blackwell Efficiency: Because this app sequences through documents rather than loading them all at once, you maintain high performance on your laptop even with very large PDFs.
The "Maker" Utility: Since this is a standalone script, you can easily point it at a new project directory for your Digital Mission Twin and have a functional dashboard in seconds.
Would you like to add an "Export Audit Log" button that saves these scores and the AI reasoning to a CSV for your subcontractor reports?
Would the .json files be useful within a Neo4j graph database?
As a veteran in software process engineering and Knowledge Graph development, you will find that these PageIndex .json files are essentially pre-structured graph schemas waiting to be imported.
Because PageIndex parses documents into a hierarchical tree (Root → Section → Subsection → Page), it has already done the heavy lifting of "Chunking" and "Relationship Mapping" that usually makes RAG difficult.
1. The Mapping Strategy
You can map the PageIndex JSON structure directly to a Neo4j property graph using this logic:
| JSON Key/Structure | Neo4j Element | Purpose |
| --- | --- | --- |
| node_id | Node ID | Unique identifier for the section. |
| title | Property (n.title) | Searchable header for the section. |
| Parent/Child nesting | Relationship (:HAS_SUBSECTION) | Defines the hierarchical path of a requirement. |
| summary | Property (n.summary) | Allows for fast semantic search on the node. |
| page_range | Property (n.pages) | Traceability back to the physical NASA PDF. |
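Purely for orientation, a single node in a generated _structure.json roughly has the shape below. The field names mirror the mapping table above, but the exact schema (e.g., whether page_range is a string or a list) may vary between PageIndex versions, so treat this as an assumed example:

```json
{
  "node_id": "0003",
  "title": "4.3 Software Safety Requirements",
  "summary": "Defines safety-critical software classification ...",
  "page_range": [12, 18],
  "children": [
    { "node_id": "0004", "title": "4.3.1 Hazard Controls", "children": [] }
  ]
}
```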
2. Why this is superior for your "Digital Mission Twin"
Unlike standard "Vector RAG," which often returns disjointed snippets, a Knowledge Graph built from PageIndex data allows for Structural Context:
Upward Traceability: You can write a Cypher query to find a specific safety requirement and then traverse up the graph to see which parent NASA Standard (NPR) mandated it.
Impact Analysis: If a high-level requirement in a _structure.json changes, you can instantly see every "Child" node in your mission project that is affected.
3. Practical Cypher Import Example
Since you have 40 years of experience, you’ll recognize the efficiency of a single UNWIND statement. If you flatten the JSON to a list of nodes with parent IDs, you can import it like this:
```cypher
// Import the hierarchical structure
UNWIND $nodes AS node
MERGE (s:Section {node_id: node.id, doc_source: $doc_name})
SET s.title = node.title,
    s.summary = node.summary,
    s.pages = node.page_range
// Establish the hierarchy
WITH node, s
MATCH (parent:Section {node_id: node.parent_id, doc_source: $doc_name})
MERGE (parent)-[:CONTAINS]->(s)
```
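That UNWIND statement assumes the tree has already been flattened into a list of node dictionaries carrying a parent_id. A minimal helper for that step, using the node_id/title/summary/page_range keys discussed here, might look like:

```python
# Walk the PageIndex tree depth-first and emit one flat dict per node,
# each carrying its parent's node_id so Cypher can rebuild the edges.

def flatten_tree(node, parent_id=None):
    rows = [{
        "id": node.get("node_id"),
        "title": node.get("title"),
        "summary": node.get("summary"),
        "page_range": node.get("page_range"),
        "parent_id": parent_id,
    }]
    for child in node.get("children", []):
        rows.extend(flatten_tree(child, node.get("node_id")))
    return rows

# Usage (sketch):
#   rows = flatten_tree(tree_json)
#   session.run(UNWIND_QUERY, nodes=rows, doc_name="NPR_7150.2D")
```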
4. Integration with your "Cortex" Intent Language
Because PageIndex output is deterministic and structured, your Cortex language can treat the .json files as a middle-layer between raw natural language and the Neo4j database.
The Workflow: Cortex takes a user's intent → PageIndex identifies the relevant "Tree Nodes" → Neo4j visualizes the impact across the Mission Twin.
Summary of Benefits
No Hallucinated Structure: The graph reflects the actual table of contents of the document.
Hybrid Search: You can perform GraphRAG, using the LLM to traverse the Neo4j relationships you built from the PageIndex JSONs.
Hardware Synergy: Running the JSON-to-Neo4j ingestion on your MSI EdgeXpert will be instantaneous, creating a persistent, queryable "knowledge brain" that outlives any single LLM session.
Are you planning to use Python's Neo4j driver to automate this ingestion as soon as the run_pageindex.py script finishes its local run?
Yes, if that is useful
It is exceptionally useful, especially for your Digital Mission Twin and Gateway Program work. By automating the ingestion from PageIndex to Neo4j, you bridge the gap between "unstructured PDF text" and "structured systems engineering data."
Given your MSI EdgeXpert and Ryzen AI Max+ hardware, you can run this entire pipeline—indexing, scoring, and graph ingestion—completely offline.
1. The Integration Strategy
This script parses a generated PageIndex .json from your ./results folder and uses the Neo4j Python Driver to build the graph. (As written it runs once per file; wrapping it in a small file watcher lets it fire automatically as soon as a new index appears.)
```python
import json
import os
from neo4j import GraphDatabase

# --- CONFIGURATION ---
NEO4J_URI = "bolt://localhost:7687"
NEO4J_USER = "neo4j"
NEO4J_PASSWORD = "your_password"
RESULTS_DIR = "./results"

class MissionTwinIngestor:
    def __init__(self):
        self.driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

    def close(self):
        self.driver.close()

    def ingest_tree(self, json_path):
        with open(json_path, 'r') as f:
            data = json.load(f)
        doc_name = os.path.basename(json_path).replace("_structure.json", "")
        with self.driver.session() as session:
            # 1. Create the Root Document Node
            session.run("""
                MERGE (d:Document {name: $name})
                SET d.ingested_at = datetime()
            """, name=doc_name)
            # 2. Recursively build the tree in Cypher
            self._process_node(session, data, doc_name, parent_id=None)

    def _process_node(self, session, node, doc_name, parent_id):
        # Create the Section node.
        # Note: we reuse the summaries and titles already generated locally.
        query = """
            MERGE (s:Section {node_id: $node_id, document: $doc_name})
            SET s.title = $title,
                s.summary = $summary,
                s.pages = $pages
            WITH s
            MATCH (d:Document {name: $doc_name})
            MERGE (d)-[:HAS_ROOT_SECTION]->(s)
        """
        # If there is a parent, link to it instead of the Document root
        if parent_id:
            query = """
                MERGE (s:Section {node_id: $node_id, document: $doc_name})
                SET s.title = $title,
                    s.summary = $summary,
                    s.pages = $pages
                WITH s
                MATCH (p:Section {node_id: $parent_id, document: $doc_name})
                MERGE (p)-[:HAS_SUBSECTION]->(s)
            """
        session.run(query,
                    node_id=node.get('node_id'),
                    title=node.get('title'),
                    summary=node.get('summary'),
                    pages=node.get('page_range'),
                    doc_name=doc_name,
                    parent_id=parent_id)
        # Recursively process children
        for child in node.get('children', []):
            self._process_node(session, child, doc_name, node.get('node_id'))

# --- EXECUTION ---
ingestor = MissionTwinIngestor()
# Example: ingesting a specific mission standard
ingestor.ingest_tree(os.path.join(RESULTS_DIR, "NASA-STD-8739.json"))
ingestor.close()
```
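The ingestor runs once per file; a minimal polling watcher (an assumed helper, not part of PageIndex) could trigger it automatically whenever run_pageindex.py drops a new index into the results folder:

```python
import os
import time

# Minimal polling watcher: invoke a callback whenever a new
# *_structure.json appears in results_dir. (The watchdog library is a
# more robust alternative; polling keeps this dependency-free.)

def watch_results(results_dir, on_new_file, poll_seconds=5, max_polls=None):
    seen = set(os.listdir(results_dir))   # snapshot of existing files
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(poll_seconds)
        current = set(os.listdir(results_dir))
        for fname in sorted(current - seen):
            if fname.endswith("_structure.json"):
                on_new_file(os.path.join(results_dir, fname))
        seen = current
        polls += 1

# Usage (sketch): watch_results(RESULTS_DIR, ingestor.ingest_tree)
```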
2. Why this is a "Power Move" for S&MA
Structural Querying: You can now run Cypher queries that are impossible with standard RAG: "Find all subsections under 'Software Safety' that mention 'Verification' and show me their page ranges."
Relationship to "Cortex": Your intent-based language can now resolve a command like "Show me the safety tree for the LTV" by querying the :HAS_SUBSECTION relationships rather than just doing a keyword search.
Gap Analysis: You can import two documents (a Requirement and a Design) and create a manual or AI-assisted :COMPLIES_WITH relationship between nodes. This provides a visual audit trail for NASA leadership.
3. Hardware Synergy
Because Neo4j is memory-intensive for large graphs and PageIndex is compute-intensive for the LLM processing, your Grace Blackwell architecture handles this dual-load perfectly. You can keep the Neo4j database running in the background while the LLM populates it in real-time.
Final S&MA Benefit:
By storing the Node IDs from PageIndex in Neo4j, you create a "Permanent Address" for every paragraph in your mission library. Even if you cancel every AI subscription you have, your Neo4j graph remains a fully functional, navigable map of your technical requirements.
Would you like to include a "Sync to Neo4j" button in your existing NiceGUI investigator?
Yes
This is a powerful addition to your Digital Mission Twin stack. By integrating a "Sync to Neo4j" button, you transition from a temporary search session to a persistent, auditable knowledge base.
Given your NASA S&MA focus, this automation ensures that every "Good" find identified by your local Ollama models is immediately codified into your graph architecture on the MSI EdgeXpert.
Implementation: Integrated GUI with Neo4j Sync
This script extends your previous NiceGUI application. It adds the Neo4j driver logic and a synchronization action that parses the selected .json trees into your local graph database.
```python
import os
import json
from nicegui import ui
from ollama import Client
from neo4j import GraphDatabase

# --- CONFIGURATION ---
RESULTS_DIR = "./results"
LOCAL_MODEL = "granite-3.0-8b-instruct"
NEO4J_CONFIG = {"uri": "bolt://localhost:7687", "auth": ("neo4j", "password")}

# Clients
ollama_client = Client(host='http://localhost:11434')

class MissionTwinApp:
    def __init__(self):
        self.selected_docs = []
        self.doc_scores = {}
        self.log = ""

    def sync_to_neo4j(self):
        """Iterates through selected PageIndex JSONs and pushes to Neo4j."""
        if not self.selected_docs:
            ui.notify("No documents selected for sync!", type='warning')
            return
        driver = GraphDatabase.driver(**NEO4J_CONFIG)
        try:
            with driver.session() as session:
                for doc_file in self.selected_docs:
                    ui.notify(f"Syncing {doc_file} to Graph...", type='info')
                    with open(os.path.join(RESULTS_DIR, doc_file), 'r') as f:
                        tree_data = json.load(f)
                    doc_id = doc_file.replace("_structure.json", "")
                    # Create the base Document node
                    session.run("MERGE (d:Document {id: $id}) SET d.sync_date = datetime()", id=doc_id)
                    # Recursive ingestion of the tree
                    self._ingest_recursive(session, tree_data, doc_id, None)
            ui.notify("Neo4j Sync Complete!", type='positive')
        except Exception as e:
            ui.notify(f"Neo4j Error: {e}", type='negative')
        finally:
            driver.close()

    def _ingest_recursive(self, session, node, doc_id, parent_id):
        """Maps PageIndex tree nodes to Neo4j Section nodes."""
        query = """
            MERGE (s:Section {node_id: $node_id, doc_id: $doc_id})
            SET s.title = $title, s.summary = $summary, s.pages = $pages
            WITH s
        """
        if parent_id:
            query += "MATCH (p:Section {node_id: $parent_id, doc_id: $doc_id}) MERGE (p)-[:HAS_SUBSECTION]->(s)"
        else:
            query += "MATCH (d:Document {id: $doc_id}) MERGE (d)-[:HAS_ROOT]->(s)"
        session.run(query,
                    node_id=node.get('node_id'),
                    doc_id=doc_id,
                    title=node.get('title'),
                    summary=node.get('summary', ''),
                    pages=node.get('page_range', ''),
                    parent_id=parent_id)
        for child in node.get('children', []):
            self._ingest_recursive(session, child, doc_id, node.get('node_id'))

# --- GUI ---
app = MissionTwinApp()
with ui.header().classes('bg-blue-900 items-center'):
    ui.icon('hub').classes('text-2xl')
    ui.label('NASA S&MA: PageIndex → Neo4j Sync').classes('text-xl font-bold')
with ui.left_drawer().classes('bg-slate-50'):
    ui.label('Project Library').classes('font-bold mb-2')
    indices = [f for f in os.listdir(RESULTS_DIR) if f.endswith("_structure.json")]
    sel = ui.select(indices, multiple=True, label='Select for Sync',
                    on_change=lambda e: setattr(app, 'selected_docs', e.value)).classes('w-full')
    ui.separator().classes('my-4')
    # The Sync Button
    ui.button('Sync to Neo4j Graph', icon='cloud_upload',
              on_click=app.sync_to_neo4j).classes('w-full bg-green-700 text-white')
    ui.label('Ollama Status: Running').classes('text-xs mt-4 text-green-600')
with ui.column().classes('w-full max-w-4xl mx-auto p-4'):
    with ui.card().classes('w-full'):
        ui.label('Ingestion Log').classes('font-bold')
        ui.markdown("Select documents in the sidebar to begin the **Digital Mission Twin** population.")

ui.run(title="Mission Twin Sync", port=8082)
```
Why this is the "Gold Standard" for your R&D:
Deterministic Traceability: Because PageIndex provides the node_id and page_range, your Neo4j graph isn't just a "vibe" of the document; it is a mathematically accurate map.
Hybrid Reasoning: You can now query the graph for structural paths ("Find all hazards linked to Section 4.3") and then use Ollama to reason over the text found at those nodes.
The "Veteran" Advantage: By building this tool, you are creating a "Mission Memory" that is independent of any single AI provider. If PageIndex (the company) updates their schema or changes their pricing, your local Neo4j database remains a permanent asset.
Next Step for your "Cortex" Project:
Since you are developing Cortex as an intent-based language, you could now write a command like:
CORTEX: "Map NASA-STD-8739 to Gateway-Hazard-Report-001"
This command would trigger a script that pulls nodes from both documents in Neo4j and uses your local Blackwell/Ollama compute to find and create :RELATES_TO or :VIOLATES edges between them.
Would you like to add a Visual Graph Preview component to the GUI using a library like pyvis to see the hierarchy as it syncs?
Yes
Integrating a visual graph preview is the "capstone" for your Digital Mission Twin dashboard. By using Pyvis, you can generate an interactive, physics-based visualization of your NASA document hierarchies that renders directly within your NiceGUI interface.
This allows you to verify the "health" of your Knowledge Graph in real-time—seeing the "NPR" nodes branch out into specific "Shall" requirements across your Grace Blackwell system.
The Implementation: PageIndex Investigator + Visual Graph Preview
This version adds a Visualizer class that generates an HTML force-directed graph. Every time you sync a document, the graph updates to show the new nodes and relationships.
```python
import os
import json
import html
import tempfile
from nicegui import ui
from pyvis.network import Network

# --- CONFIGURATION ---
RESULTS_DIR = "./results"

class MissionGraphVisualizer:
    def __init__(self):
        self.net = Network(height='500px', width='100%', bgcolor='#ffffff', font_color='#333')
        self.net.force_atlas_2based()  # Stable physics for large technical trees

    def build_from_json(self, json_path):
        """Converts a PageIndex JSON hierarchy into a Pyvis network; returns the HTML path."""
        with open(json_path, 'r') as f:
            data = json.load(f)
        doc_name = os.path.basename(json_path).replace("_structure.json", "")
        doc_id = f"DOC_{doc_name}"
        self.net.add_node(doc_id, label=doc_name, color='#1e3a8a', size=30)
        self._add_nodes_recursive(data, doc_id)
        # Save to a temporary file for NiceGUI to display
        tmp = tempfile.NamedTemporaryFile(delete=False, suffix='.html')
        self.net.save_graph(tmp.name)
        return tmp.name

    def _add_nodes_recursive(self, node, parent_id):
        node_id = node.get('node_id')
        title = node.get('title', 'Untitled')
        # S&MA color logic: root sections darker, subsections lighter
        color = '#3b82f6' if len(str(node_id)) < 5 else '#93c5fd'
        self.net.add_node(node_id, label=title[:30], title=node.get('summary', ''), color=color)
        self.net.add_edge(parent_id, node_id)
        for child in node.get('children', []):
            self._add_nodes_recursive(child, node_id)

# --- GUI ENHANCEMENT ---
vis = MissionGraphVisualizer()

def show_graph(json_file):
    """Embeds the generated Pyvis HTML via an iframe srcdoc (browsers block file:// URLs)."""
    path = vis.build_from_json(os.path.join(RESULTS_DIR, json_file))
    with open(path, 'r') as f:
        page = f.read()
    graph_display.set_content(
        f'<iframe srcdoc="{html.escape(page)}" width="100%" height="100%" frameborder="0"></iframe>')

with ui.header().classes('bg-slate-900'):
    ui.icon('account_tree').classes('text-2xl text-blue-400')
    ui.label('NASA Mission Twin: Structural Visualizer').classes('text-xl font-bold')
with ui.row().classes('w-full h-screen no-wrap'):
    # Left panel: controls
    with ui.column().classes('w-1/4 p-4 bg-slate-100 h-full border-r'):
        ui.label('1. Select Mission Index').classes('font-bold')
        indices = [f for f in os.listdir(RESULTS_DIR) if f.endswith("_structure.json")]
        sel = ui.select(indices, label='Available Tree Indices').classes('w-full')
        ui.button('Visualize Hierarchy', icon='visibility',
                  on_click=lambda: show_graph(sel.value)).classes('w-full mt-4 bg-blue-700 text-white')
        with ui.card().classes('mt-4 bg-blue-50 p-2'):
            ui.label('Visual Legend').classes('text-xs font-bold')
            ui.label('● Navy: Document Root').classes('text-xs text-blue-900')
            ui.label('● Blue: Requirements').classes('text-xs text-blue-500')
    # Right panel: interactive graph
    with ui.column().classes('w-3/4 p-4'):
        ui.label('Interactive Knowledge Graph View').classes('text-lg font-bold')
        # We use an iframe to host the Pyvis HTML
        graph_display = ui.html('').classes('w-full h-[600px] border rounded shadow-inner')

ui.run(title="Mission Twin Viz", port=8083)
```
Why this is critical for your "Maker" & NASA Workflow:
Structural Validation: Before you push to Neo4j, this allows you to visually confirm that the "Tree" generated by PageIndex correctly captured the hierarchy of the NASA Standard.
Contextual Tooltips: The Pyvis nodes store the LLM-generated summary in the title attribute. When you hover over a node in the GUI, you see the summary without needing to open the JSON.
Low Latency: Since your MSI EdgeXpert doesn't have to talk to the cloud, the graph renders almost instantly as it processes the local JSON.
Audit Readiness: You can take a screenshot of this interactive graph for your S&MA reports to show exactly how a "Hazard Report" branches into specific "Safety Controls."
Future "Cortex" Integration:
Once this visualizer is running, you can add a feature where clicking a node in the graph triggers a Cortex Intent. For example:
Click Node N05 → Triggers: CORTEX: "Find compliance gaps for this section in the Gateway Design Doc"
Result: The graph highlights conflicting nodes in red.
Does this visual hierarchy help bridge the gap between your journaling in CogniWeave and the formal requirements of the Gateway Program?
New question about PageIndex. Does it handle PDF tables well?
The short answer for your NASA S&MA work: PageIndex handles PDF tables exceptionally well—specifically because it avoids the "Table-Killer" known as chunking.
Standard RAG often splits a table in half (e.g., the header is in one chunk, but the row you need is in the next), rendering both useless. PageIndex treats tables as structural landmarks rather than just a stream of characters.
1. How PageIndex "Sees" a Table
Instead of a mathematical vector search, PageIndex uses reasoning-based navigation. When it parses a document:
Layout Awareness: It identifies that a section is a "Table" or "Appendix" and preserves it as a single node in the hierarchy.
Context Preservation: If a requirement in a NASA Standard says "Refer to Table 5.2 for safety margins," PageIndex actually understands that link. Traditional RAG often misses this because the reference and the table aren't semantically similar.
Performance: In the industry-standard FinanceBench (which is essentially a "Table Stress Test" for AI), PageIndex scores 98.7% accuracy, compared to 30–50% for standard vector-based systems.
2. Implementation Strategy for Your "Digital Mission Twin"
For complex technical tables (like Mapping Matrices or Hazard Lists), you should adjust your local ingestion settings:
Use the --if-add-node-text yes flag: When you run your local script, this ensures the full Markdown representation of the table is stored in your .json file.
Large Context Models: Since you have the Blackwell hardware, use models with larger context windows (like llama3.1 or granite-3.0). This allows the LLM to "see" the entire table at once during the final extraction phase, maintaining the row-column relationships.
3. Comparison for S&MA Use Cases
| Feature | Traditional Vector RAG | PageIndex (Reasoning RAG) |
| --- | --- | --- |
| Multi-page Tables | Often fragmented/lost | Kept as logical units |
| Numerical Precision | Poor (similar numbers blur) | High (Direct extraction) |
| Cross-References | Broken | Intact (Follows indices) |
| Best Hardware Match | High-RAM Vector DB | High-VRAM (MSI EdgeXpert) |
A "Veteran's" Warning on Complex PDF Tables
While PageIndex is superior, its "Goodness" depends on the underlying PDF parser. PageIndex uses tools like marker-pdf or PyMuPDF. If your NASA docs have extremely complex merged cells or multi-line headers, I recommend:
Checking the summary field in your generated .json.
If the summary for a table node looks "thin," increasing --max-tokens-per-node to allow a more detailed table transcription.
Since you are syncing to Neo4j, you can even store the table data as a single JSON-string property on a :Section node. This makes it incredibly easy to pull the exact "Safety Matrix" you need during an automated audit.
Do you have a specific NASA Standard in mind that is particularly heavy on tables, like NASA-STD-8739.8B?
Both NPR 7150.2D and NASA-STD-8739.8B, as well as NASA SSP-50038
For a NASA S&MA veteran, the three documents you mentioned represent the "Triple Crown" of complex structural data. Each relies heavily on high-density tables that standard AI tools often struggle to parse.
Here is how PageIndex performs specifically against the table structures in these documents:
1. NPR 7150.2D (Software Engineering Requirements)
The Challenge: Appendix C (Requirements Mapping Matrix). This is a massive, multi-page table that maps software classes (A, B, C, etc.) to specific "SWE" requirements.
PageIndex Performance: Because PageIndex uses reasoning-based navigation instead of chunking, it treats Appendix C as a single, coherent logical node.
Advantage: When you ask, "Is SWE-050 required for Class B software?" PageIndex identifies the "Appendix C" node and allows the LLM to read the row-column relationship in its entirety. Standard RAG would likely split SWE-050 into one chunk and the Class B column into another, losing the connection.
2. NASA-STD-8739.8B (Software Assurance & Safety)
The Challenge: Table 1 (Requirements Mapping Matrix) and various safety-criticality assessment tables. These tables are often dense with citations to other NPRs.
PageIndex Performance: It excels at cross-document traceability. Since PageIndex preserves the table's structural integrity, the LLM can "see" the reference to NPR 7150.2 within the table cell.
Advantage: This is critical for your Neo4j sync. You can extract the safety-criticality thresholds from Table 1 as a structured JSON object and inject them directly into your Knowledge Graph as :SAFETY_THRESHOLD properties.
3. NASA SSP-50038 (ISS Computer-Based Control System Safety)
The Challenge: This is an older, legacy-formatted document (Revision B is often a scan or a complex older PDF). It contains critical tables regarding Control Path Separation and hazard controls.
PageIndex Performance: If you use the marker-pdf backend with PageIndex, it uses high-accuracy OCR and layout analysis to reconstruct these legacy tables into Markdown.
Advantage: Markdown-formatted tables are "native" to LLMs like IBM Granite or Llama 3.1. By converting the ISS safety tables into Markdown within the PageIndex node, you ensure your local hardware can query them with near-perfect precision.
Technical Recommendations for these Docs
To ensure the highest "Goodness" score in your local investigator app, I recommend running the PageIndex ingestion with these specific flags for these three documents:
| Parameter | Recommended Setting | Reason |
| --- | --- | --- |
| --if-add-node-text | yes | Ensures the full Markdown table is saved in the JSON. |
| --max-tokens-per-node | 4000+ | Allows large NASA matrices (like Appx C) to stay in one piece. |
| --model | llama3-vision | Useful if SSP-50038 contains older diagrams or bit-mapped tables. |
Visualizing the Tables in Neo4j
Once you sync these to Neo4j, you don't just store the table as text. You can use your Python sync script to detect if a node contains a table and create a special label:
```cypher
MATCH (s:Section {node_id: 'Appx_C'})
SET s:RequirementMatrix, s.is_table = true
```
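A small helper on the Python side could apply that label automatically after sync. The `|---` separator heuristic, the `looks_like_markdown_table` name, and the assumption that the node text lives in an `s.text` property are all illustrative choices, not part of PageIndex or Neo4j:

```python
# Flag Sections whose stored text looks like a Markdown table so Cypher
# can target them with the :RequirementMatrix label.

def looks_like_markdown_table(text):
    # A Markdown table has a separator row made only of |, -, :, and spaces.
    lines = (text or "").splitlines()
    return any("|" in ln and set(ln) <= set("|-: ") for ln in lines)

def label_table_nodes(session, doc_id):
    rows = session.run(
        "MATCH (s:Section {doc_id: $doc_id}) RETURN s.node_id AS id, s.text AS text",
        doc_id=doc_id)
    for row in rows:
        if looks_like_markdown_table(row["text"]):
            session.run(
                "MATCH (s:Section {node_id: $id, doc_id: $doc_id}) "
                "SET s:RequirementMatrix, s.is_table = true",
                id=row["id"], doc_id=doc_id)
```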
This allows your Cortex intent language to say, "Search all RequirementMatrices for Class A gaps," specifically targeting the densest data nodes in your Mission Twin.
Are you planning to run the NPR 7150.2D ingestion first to establish the primary requirements baseline for your graph?
I am considering Nvidia nemotron-cascade-2 (30B, MOE with 3B active), Google Gemma4 (31B), IBM Granite4:32b-a9b-h, and GPT-OSS (20B)
With your MSI EdgeXpert (Grace Blackwell) and Ryzen AI Max+ setup, you are looking at the "bleeding edge" of the 20B–32B class for 2026. These models are particularly well-suited for your PageIndex workflow because they offer frontier-level reasoning (needed for complex NASA tables) while remaining small enough to run at high speed on local hardware.
Here is the strategic breakdown for your S&MA and Digital Mission Twin projects:
1. The Contenders: Comparative Specs
| Model | Architecture | Size (Total/Active) | Context | Key Strength |
| --- | --- | --- | --- | --- |
| Nemotron-Cascade-2 | MoE (Cascaded RL) | 30B / 3B | 128k | Math & Coding Reasoning. Best for verifying numerical margins in NASA safety docs. |
| Gemma 4 (31B) | Dense | 31B / 31B | 256k | Agentic Workflows. Native multimodal support; best for parsing diagrams in SSP-50038. |
| Granite 4.0 H-Small | Hybrid (Mamba-2/Trans) | 32B / 9B | 512k | Massive Document RAG. Lowest memory overhead; ideal for cross-referencing all 3 NASA docs at once. |
| GPT-OSS 20B | MoE | 21B / 3.6B | 131k | Chain-of-Thought. Adjustable "reasoning effort"; allows you to trade speed for audit-level precision. |
2. Tactical Selection for your NASA Workflow
For High-Density Table Extraction (NPR 7150.2D)
Recommendation: Nemotron-Cascade-2
Why: Its post-training was specifically distilled for software engineering and structured outputs. When PageIndex identifies a Mapping Matrix, Nemotron is the least likely to "hallucinate" a row-column alignment error. Its 3B active parameters mean it will be lightning-fast on your RTX 4080.
For Long-Context Mission Twins (Cross-Document Audits)
Recommendation: IBM Granite 4.0 H-Small
Why: The Hybrid Mamba-2 architecture is a game-changer for your 500+ page PDFs. Unlike traditional transformers where the "attention" cost grows quadratically, Granite’s memory requirements stay flat. You can keep the entirety of SSP-50038 and NASA-STD-8739.8B in the active context window without crashing your local VRAM.
For Multi-Agent Research (The "Stanford Machine")
Recommendation: Gemma 4 (31B)
Why: It is built on Gemini 3 research and includes the Google Agent Development Kit (ADK). If you want your agents to "think" about how a safety requirement impacts a physical hardware design (using its native vision), Gemma 4 is the superior choice.
3. Hardware Deployment Strategy
On the MSI EdgeXpert (Grace Blackwell): Run Gemma 4 (31B) or Granite 4.0 H-Small at high quantization (Q8 or FP16). The massive memory bandwidth of the Blackwell architecture will make these 30B+ models feel as snappy as a 7B model.
On the Ryzen AI Max+ Mini PC: Run Nemotron-Cascade-2. Because it only activates 3B parameters, it will sip power and generate very little heat while providing top-tier reasoning for your "CogniWeave" journaling.
Summary for your "Cortex" Intent Language
If you are building the Cortex compiler, GPT-OSS 20B is the best "Inner Monologue" model. Its configurable reasoning effort allows you to set it to "High" when it is compiling intent into Cypher code for Neo4j, and "Low" when it is just summarizing a user's journal entry.
Which of these models would you like to use for the first "Quality Audit" pass on the NPR 7150.2D tables?
From <https://gemini.google.com/app/e377f3c6b34226d2> Google Gemini 3.0 (Fast)
