#4136 Multiple Document Lookup Table for Name Entities --> Excel File

#4136 I have a bunch of NASA program project documents in PDF format. They include ConOps, SRS, SDD and STPs. Describe step-by-step how I can automate the creation of a lookup table with links from named entity to each reference in the document segments

Given your background in software process engineering and your work with NASA S&MA, you likely need a solution that is both structurally sound and highly traceable. Automating the mapping of ConOps, SRS, SDD, and STP documents requires a pipeline that respects the hierarchical nature of these specific document types.

Here is a step-by-step technical workflow to automate a lookup table with segment-level deep links.

Phase 1: Pre-processing and Structural Extraction

NASA documents often use standardized numbering (e.g., Section 3.2.1). Instead of treating the PDF as a "blob" of text, you need to preserve the document architecture.

PDF to Markdown/JSON Conversion: Use a library like Marker or Unstructured.io. These are superior for technical documents because they identify headers, tables, and lists rather than just raw lines of text.
Semantic Chunking: Instead of fixed-size chunks (e.g., 500 characters), chunk by Section ID. This ensures that an SRS requirement or an SDD component remains a cohesive unit.
Metadata Tagging: For every extracted segment, attach a metadata object:
- DocType: (SRS, SDD, etc.)
- SectionID: (e.g., "4.1.2")
- SourcePath: (Local or Cloud URL)
- PageNumber: (For the "Link" part of your table)

Phase 2: Named Entity Recognition (NER)

You need to identify the "Keys" for your lookup table. General-purpose NER often misses domain-specific NASA acronyms or system components.

Define Entities: Focus on System Components, Requirements IDs, Subsystems, and Operational States.
Hybrid Extraction:
- LLM-based Extraction: Use an LLM (like Gemini 1.5 Pro with its large context window) to "list all unique system components and requirements referenced in this segment."
- Regex Refinement: Since you prefer regex for definitive extraction, use a regex layer to catch standardized strings like [REQ-XXX-001] or specific NASA project codes.
Entity Normalization: Ensure that "Communication System" and "Comms" point to the same entity ID in your table.

Phase 3: Building the Relationship Graph

This step creates the "Reference" mapping.

Generate Embeddings: Convert each document segment into a vector.
The Mapping Logic: Create a script (Python/Pandas) that iterates through your extracted entities and searches the segments for occurrences.
Deep Link Generation:
- If using a PDF viewer like Adobe or Chrome, append #page=N to the URL.
- If using a web-based tool, use an anchor link based on the SectionID.

Phase 4: Constructing the Lookup Table

The final output should be a structured format (CSV, JSON, or a Knowledge Graph) that serves as your index.

Entity	Doc Type	Section	Reference Text Segment	Deep Link
GNC Subsystem	SRS	3.1.2	"The GNC subsystem shall..."	file://path/srs.pdf#page=12
Lunar Gateway	ConOps	2.4	"Initial docking for Gateway..."	file://path/conops.pdf#page=45

Recommended Tech Stack

Orchestration: Python (Pandas for the table, LangChain or LlamaIndex for the pipeline).
Parsing: PyMuPDF (for coordinates) or Unstructured (for logic).
Storage: A simple SQLite database or a Knowledge Graph (like Neo4j) if you want to see how an SRS requirement flows into an SDD component.

A Possible Next Step

Since you mentioned an interest in Knowledge Graphs and Wolfram Language, would you like me to provide a Python script that uses spaCy or an LLM to extract these entities and format them into a JSON-L file ready for a graph import?

This needs to handle multiple links to each named entity

To handle a one-to-many relationship—where a single named entity (like a specific "Propulsion System" or a Requirement ID) appears in multiple places across several documents—your lookup table needs a list-based or relational structure.

Since you mentioned a preference for regex and likely use Python for your engineering tasks, here is the architectural approach to building a "Master Index" that handles multiple links.

1. Data Structure: The "Inverted Index"

Instead of a flat table, you want an Inverted Index structure. In Python, this is essentially a dictionary where the key is the Entity and the value is a list of reference objects.

Target JSON/Dictionary Schema:

{
  "GNC-REQ-001": [
    {"doc": "SRS_v1.pdf", "section": "3.2", "page": 12, "context": "...shall provide navigation..."},
    {"doc": "STP_v1.pdf", "section": "5.1", "page": 45, "context": "Verify GNC-REQ-001 via..."}
  ]
}

2. Implementation Step-by-Step

Step A: Multi-Document Parsing

Use PyMuPDF (fitz) because it allows you to extract text with precise page numbers, which are required for your deep links.

import fitz  # PyMuPDF
import re
from collections import defaultdict

# 1. Define your Regex Patterns for NASA entities
# Example: Matches 'REQ-XXXX-000' or common NASA Acronyms
ENTITY_RE = re.compile(r'(REQ-[A-Z0-9]+-\d{3}|[A-Z]{3,}(?:\s[A-Z]{3,})*)')

master_lookup = defaultdict(list)
pdf_files = ["ConOps.pdf", "SRS.pdf", "SDD.pdf", "STP.pdf"]

for file_path in pdf_files:
    doc = fitz.open(file_path)
    for page_num, page in enumerate(doc, start=1):
        text = page.get_text("text")
        
        # Find all entities on this page
        matches = ENTITY_RE.findall(text)
        
        for entity in set(matches): # use set to avoid duplicates on the same page
            master_lookup[entity].append({
                "source": file_path,
                "page": page_num,
                "link": f"file:///{file_path}#page={page_num}"
            })

Step B: Handling "Document Segments" (Section IDs)

To get the "Section 3.2.1" metadata, you need a Stateful Parser. As you iterate through the text lines, look for lines starting with digits (e.g., ^(\d+\.\d+\.?\d*)). Store the "Current Section" in a variable so that when a regex match is found, it is tagged with that section.

Step C: Flattening for a Spreadsheet (CSV/Excel)

If you need this in a flat table for others to use, you must decide between:

Exploded View: One row per occurrence (Entity "X" appears 5 times = 5 rows). This is best for filtering.
Aggregated View: One row per entity, with all links in a single cell (separated by Newlines).

Python snippet to flatten for a table:

import pandas as pd

rows = []
for entity, references in master_lookup.items():
    for ref in references:
        rows.append({
            "Entity": entity,
            "Document": ref['source'],
            "Page": ref['page'],
            "Deep Link": ref['link']
        })

df = pd.DataFrame(rows)
df.to_excel("NASA_Program_Lookup.xlsx", index=False)

3. Verification & Refinement

The "Ambiguity" Filter: NASA docs often have "Header" mentions vs "Requirement" mentions. You can refine your regex to ignore Table of Contents by checking if the page number is < 5 or if the text contains a string of dots .....
Context Windows: To make the table useful, don't just grab the link; grab the Sentence containing the entity. Use re.finditer to find the character index and slice text[start-50 : end+50].

Would you like me to...

Refine the Regex pattern for a specific NASA project's naming convention (e.g., Artemis or Gateway)?
Show you how to render this as an interactive HTML table where the links are clickable?

Consider a first pass through all documents identifying the possible list of name entities across the documents. This list would then be cleansed and trimmed as needed to be used in the second pass using semantic search within each document to find relevant links

This two-pass strategy is highly efficient for high-stakes documentation like NASA’s. By separating "Discovery" from "Mapping," you ensure that the lookup table doesn't get cluttered with noisy OCR errors or irrelevant boilerplate text.

Here is the step-by-step automation workflow for this two-pass approach.

Pass 1: Global Entity Discovery & Cleansing

The goal of the first pass is to build a "Golden Glossary" of all unique entities across the ConOps, SRS, SDD, and STP files.

Bulk Text Extraction: Use PyMuPDF or Unstructured to dump the text from all PDFs into a unified stream or a list of "raw chunks."
High-Recall Entity Extraction: * LLM Extraction: Send document chunks to an LLM with the prompt: "Extract all technical system components, requirement IDs, and NASA-specific subsystems. List them as unique strings."
- Regex Safety Net: Simultaneously run your regex patterns (e.g., (REQ-[A-Z0-9]+-\d{3})) to ensure 100% capture of structured IDs.
The "Cleansing" Interface: * Deduplication: Use Fuzzy Matching (via RapidFuzz or thefuzz in Python) to group variations. (e.g., "GNC System" vs. "GNC Subsystem" vs. "G.N.C.").
- Trimming: Filter the list against a "Stop Word" list (e.g., common words like "Table," "Figure," "NASA," "Document" that might have been flagged as entities).
- Human-in-the-loop (Optional): Export this list to a CSV/Excel. You (the expert) quickly delete rows that are irrelevant. This curated CSV is now your Search Library.

Pass 2: Targeted Semantic Mapping

Now that you have a clean list of entities, you perform a second, high-precision pass to find every relevant link.

Document Vectorization:
- Chunk the documents by Section/Paragraph.
- Convert these chunks into vector embeddings using a model like text-embedding-3-small or a local HuggingFace model (all-MiniLM-L6-v2).
- Store these in a local Vector Database (like ChromaDB or FAISS) with metadata including: Source_File, Page_Number, Section_Title, and Deep_Link_URL.
Automated Semantic Lookup:
- For each entity in your cleansed Pass 1 list, perform a semantic search against the Vector DB.
- Thresholding: Only keep results with a similarity score > 0.85 to ensure the context is actually relevant to the entity.
- Cross-Reference Logic: If an entity is a "Requirement ID," also perform a standard keyword search to catch every literal mention, even in tables or footnotes.
Lookup Table Assembly:
- Join the search results into a single table. Because you are handling multiple links, use a "Long" format for the database and a "Nested" format for the final UI.

The Technical Execution (Python Concept)

Pass 1: Discover

# Pseudo-code for Pass 1: Entity Discovery
unique_entities = set()
for chunk in doc_chunks:
    # Use LLM to get 'candidate' entities
    candidates = get_llm_entities(chunk) 
    unique_entities.update(candidates)

# Export for manual 'Trim'
save_to_csv(list(unique_entities), "candidate_entities.csv")

Pass 2: Map

# Pseudo-code for Pass 2: Semantic Mapping
search_library = load_cleansed_csv("candidate_entities.csv")
final_table = []

for entity in search_library:
    # Semantic Search returns multiple matches
    results = vector_db.query(query_text=entity, n_results=10) 
    
    for match in results:
        final_table.append({
            "Entity": entity,
            "Document": match.metadata['source'],
            "Link": f"{match.metadata['url']}#page={match.metadata['page']}",
            "Context": match.text
        })

Why this works for NASA S&MA:

Traceability: By using semantic search in Pass 2, you find "Implied References" (where the entity is discussed without being named explicitly).
Precision: Pass 1 ensures you aren't searching for "garbage" entities, which keeps your vector database queries focused.

Would you like a specific Python script to handle the Fuzzy Matching and deduplication for Pass 1?

Yes

To implement the first pass effectively, we’ll use a combination of High-Recall Extraction (finding everything), followed by Fuzzy Matching (grouping the "slop").

Since you are a software engineer, this script is designed to be modular. It uses RapidFuzz for performance (it's significantly faster than FuzzyWuzzy for large document sets) and Sentence-Transformers for the semantic grouping.

Prerequisites

pip install rapidfuzz sentence-transformers pandas

Pass 1: Entity Discovery and Fuzzy Deduplication

This script takes your raw list of candidates (from your LLM/Regex extraction) and collapses them into a "Golden List."

import pandas as pd
from rapidfuzz import process, fuzz
from sentence_transformers import SentenceTransformer, util
import torch

def cleanse_entities(raw_entities, similarity_threshold=90):
    """
    Step 1: Fuzzy String Matching (Syntactic)
    Groups 'GNC System' and 'GNC Systems' or 'Propulsion Sys.'
    """
    unique_list = sorted(list(set(raw_entities)))
    cleansed = []
    already_processed = set()

    for item in unique_list:
        if item in already_processed:
            continue
        
        # Find all items similar to this one
        matches = process.extract(item, unique_list, scorer=fuzz.token_sort_ratio, score_cutoff=similarity_threshold)
        
        # Pick the longest string among matches as the 'Master' name
        group = [m[0] for m in matches]
        master_name = max(group, key=len)
        
        cleansed.append(master_name)
        already_processed.update(group)
    
    return cleansed

def semantic_grouping(entities):
    """
    Step 2: Semantic Deduplication (Optional)
    Groups 'Guidance and Navigation' with 'GNC' if the embedding is close.
    """
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(entities, convert_to_tensor=True)
    
    # Calculate cosine similarity matrix
    cosine_scores = util.cos_sim(embeddings, embeddings)
    
    clusters = []
    visited = set()

    for i in range(len(entities)):
        if i in visited:
            continue
        
        # Find indices where similarity is high (e.g., > 0.85)
        similar_indices = torch.where(cosine_scores[i] > 0.85)[0].tolist()
        
        # Keep the first one as the cluster representative
        clusters.append(entities[i])
        visited.update(similar_indices)
        
    return clusters

# --- Execution ---
# Imagine this list came from your LLM/Regex pass across ConOps/SRS
raw_candidates = [
    "GNC Subsystem", "GNC System", "Guidance, Navigation, and Control",
    "Liquid Propulsion Assembly", "Propulsion Assy", "LPA",
    "Thermal Control Unit", "TCU", "Thermal Unit"
]

# Pass 1.1: Standardize strings
fuzzy_cleansed = cleanse_entities(raw_candidates)

# Pass 1.2: Deep semantic grouping (handling acronyms/synonyms)
final_golden_list = semantic_grouping(fuzzy_cleansed)

print("Golden List for Pass 2 Search:")
for entity in final_golden_list:
    print(f"- {entity}")

How to use this for your NASA docs:

The "Candidate" CSV: Run your first pass using a broad LLM prompt. Don't worry about duplicates; just get every possible noun/acronym into a CSV column.
Adjusting the Threshold:
- In cleanse_entities, a similarity_threshold of 90 is strict (catches typos). 80 is loose (catches "Propulsion" vs "Propulsion System").
- In semantic_grouping, a score of 0.85 is usually the sweet spot for finding synonyms without getting "hallucinated" connections.
The Trimming Phase: Before moving to Pass 2, I recommend outputting the final_golden_list to an Excel sheet. Add a column called Include? (Y/N). This allows you to manually "trim" the noise that automation inevitably catches (like "Appendix A" or "Table 4").

Phase 2 Preview: Automated Search

Once you have this clean list, the "second pass" script will iterate through this list and use a Vector Database (like ChromaDB) to perform:

results = collection.query(query_texts=["GNC Subsystem"], n_results=5)

This will return the top 5 document segments (with links) for every entity in your golden list.

Would you like me to generate the Phase 2 script that takes this Golden List and creates the final multi-link Excel report?

Yes

To complete the automation, this Phase 2 script takes your "Golden List" and maps it back to the source PDFs.

Given your preference for structured engineering and your work with NASA S&MA data, I’ve designed this to use ChromaDB (a lightweight, local vector store). This allows you to perform semantic "fuzzy" searches across thousands of document segments simultaneously to find relevant links, even if the phrasing differs slightly from your entity list.

Prerequisites

Bash

pip install chromadb pypdf pandas

Phase 2: Mapping Entities to Document Segments

This script builds a temporary local index of your PDFs and then queries that index for every entity in your cleansed list.

Python
import chromadb
from chromadb.utils import embedding_functions
import pandas as pd
from pypdf import PdfReader
import uuid

# 1. Initialize Local Vector Store
client = chromadb.Client()
# Using a default light model; you can swap for OpenAI/Gemini embeddings if preferred
emb_fn = embedding_functions.DefaultEmbeddingFunction()
collection = client.get_or_create_collection(name="nasa_docs", embedding_function=emb_fn)

def index_documents(pdf_paths):
    """Chunks PDFs by page/section and adds them to the vector store."""
    for path in pdf_paths:
        reader = PdfReader(path)
        for i, page in enumerate(reader.pages):
            text = page.extract_text()
            if not text.strip(): continue
            
            # Metadata is key for your lookup table 'links'
            collection.add(
                documents=[text],
                metadatas=[{"source": path, "page": i + 1, "link": f"{path}#page={i+1}"}],
                ids=[str(uuid.uuid4())]
            )

def build_lookup_table(golden_entities, n_results=3):
    """Queries the index for each entity and flattens into a multi-link table."""
    final_data = []

    for entity in golden_entities:
        # Semantic search: finds the most relevant segments across all docs
        results = collection.query(
            query_texts=[entity],
            n_results=n_results
        )
        
        # results['documents'][0] contains the text, results['metadatas'][0] contains links
        for doc_text, metadata in zip(results['documents'][0], results['metadatas'][0]):
            final_data.append({
                "Entity": entity,
                "Document": metadata['source'],
                "Page": metadata['page'],
                "Link": metadata['link'],
                "Context Snippet": doc_text[:200].replace("\n", " ") + "..."
            })
            
    return pd.DataFrame(final_data)

# --- Execution ---
pdf_list = ["ConOps.pdf", "SRS.pdf", "SDD.pdf", "STP.pdf"] # Replace with your files
golden_list = ["GNC Subsystem", "Liquid Propulsion Assembly", "Thermal Control Unit"] # From Pass 1

print("Indexing documents...")
index_documents(pdf_list)

print("Mapping entities...")
df_report = build_lookup_table(golden_list)

# Export to Excel for your final review
df_report.to_excel("NASA_Entity_Reference_Table.xlsx", index=False)
print("Done! Lookup table generated.")

Features for the NASA Context

One-to-Many Handling: The n_results parameter in the query ensures that for a single entity (e.g., "GNC Subsystem"), you retrieve the top $N$ most relevant mentions across all documents.
Deep Linking: The metadata dictionary automatically constructs the #page=N syntax. When this Excel file is opened, those strings are often clickable or can be converted to hyperlinks easily.
Context Snippets: I included a Context Snippet column. This is vital for S&MA because it allows you to verify why the link was created without opening the PDF immediately.

Engineering Refinement

Since you are a NASA S&MA contractor, you might want to ensure "Requirement Traceability." You can modify the index_documents function to use a Regex that detects section headers (like 3.2.1) and adds that specific string to the metadata. This would allow your table to show the Section Number alongside the Page Number.

Would you like me to show you how to add a "Section Detection" regex to that indexing loop?

Yes

To implement Section Detection, we need to modify the indexing loop to act as a "stateful" parser. As the script iterates through the text of each page, it will use a regex to look for standard NASA hierarchical numbering (e.g., 3.2.1 Analysis) and maintain that as the "Current Section" for all entities found until the next header appears.

Updated Indexing Script with Section Detection

This version of the function replaces the previous index_documents logic. It splits the page text into lines to more accurately capture the transition from one section to another.

Python
import re

# Standard NASA Section Regex: Matches 1.0, 3.2.1, A.1, etc. at the start of a line
SECTION_RE = re.compile(r'^(\d+(?:\.\d+)+|[A-Z]\.\d+)\s+([A-Z].*)')

def index_documents_with_sections(pdf_paths):
    """Chunks PDFs and tracks the active section for better traceability."""
    for path in pdf_paths:
        reader = PdfReader(path)
        current_section_id = "Front Matter"
        current_section_title = ""

        for i, page in enumerate(reader.pages):
            page_num = i + 1
            text = page.extract_text()
            if not text.strip(): continue

            lines = text.split('\n')
            page_segments = []
            
            # Sub-page tracking to keep section context accurate
            for line in lines:
                header_match = SECTION_RE.match(line.strip())
                if header_match:
                    current_section_id = header_match.group(1)
                    current_section_title = header_match.group(2).strip()

                # Add text to the vector store tagged with the 'latest' section found
                # We chunk by paragraph or significant line breaks here
                if line.strip():
                    collection.add(
                        documents=[line],
                        metadatas=[{
                            "source": path,
                            "page": page_num,
                            "section": f"{current_section_id} {current_section_title}",
                            "link": f"{path}#page={page_num}"
                        }],
                        ids=[str(uuid.uuid4())]
                    )

# Update build_lookup_table to include the new 'section' metadata
def build_lookup_table_v2(golden_entities, n_results=5):
    final_data = []
    for entity in golden_entities:
        results = collection.query(query_texts=[entity], n_results=n_results)
        
        for doc_text, metadata in zip(results['documents'][0], results['metadatas'][0]):
            final_data.append({
                "Entity": entity,
                "Document": metadata['source'],
                "Section": metadata.get('section', 'N/A'), # New column
                "Page": metadata['page'],
                "Link": metadata['link'],
                "Context": doc_text.strip()
            })
    return pd.DataFrame(final_data)

How this enhances your Workflow

Requirement Mapping: If you are mapping an SRS requirement to an SDD component, having the Section metadata (e.g., Section 3.2.1 Software Interfaces) is much more useful than just a page number, as it aligns with the document's internal logic.
Breadcrumbs: By tracking current_section_id across lines, the script handles the common PDF issue where a section header appears at the bottom of Page 10, but the relevant content is at the top of Page 11.
Traceability Matrix: This output can be directly imported into a requirements management tool or a Knowledge Graph as "Reference" nodes.

Strategic Tip for NASA S&MA

Since you are dealing with ConOps through STP, you may find that the same entity has different "meanings" in different phases. For example, a "Heater" in the ConOps is a capability, but in the SDD it's a specific hardware component.

Would you like me to add a "Document Type" classifier to the script so it can automatically bucket your links by lifecycle phase (e.g., "Requirement Mention" vs "Design Mention")?

Yes

To bucket your links effectively, we can implement a Lifecycle Classifier. Since you are working with the standard NASA sequence—ConOps (Concept), SRS (Requirements), SDD (Design), and STP (Test)—we can use the filename or internal metadata to map each reference to its specific lifecycle phase.

This allows your final lookup table to show exactly where an entity is defined, where it is designed, and where it is verified.

Step 1: Define the Lifecycle Mapping

First, we create a mapping dictionary that translates your document types into their respective S&MA phases.

Python
LIFECYCLE_PHASES = {
    "ConOps": "Phase A: Concept & Operational Logic",
    "SRS": "Phase B: Requirements & Functional Baseline",
    "SDD": "Phase C/D: Design & Implementation",
    "STP": "Phase D/E: Verification & Testing"
}

def classify_doc_type(filename):
    """Detects the NASA document type from the filename or path."""
    filename_upper = filename.upper()
    for key, phase in LIFECYCLE_PHASES.items():
        if key in filename_upper:
            return phase
    return "General / Support"

Step 2: Integrated Pass 2 Script

This version integrates Section Detection, Multi-link handling, and Lifecycle Classification into one clean output.

Python
import pandas as pd

def build_final_nasa_index(golden_entities, n_results=10):
    final_data = []

    for entity in golden_entities:
        # Search the vector store populated in the previous steps
        results = collection.query(query_texts=[entity], n_results=n_results)
        
        for doc_text, metadata in zip(results['documents'][0], results['metadatas'][0]):
            filename = metadata['source']
            phase = classify_doc_type(filename)
            
            final_data.append({
                "Entity": entity,
                "Lifecycle Phase": phase,
                "Document Type": filename.split('/')[-1].split('.')[0],
                "Section": metadata.get('section', 'N/A'),
                "Page": metadata['page'],
                "Link": metadata['link'],
                "Context Snippet": doc_text.strip()[:250]
            })

    # Convert to DataFrame
    df = pd.DataFrame(final_data)
    
    # Sort by Entity and then by Phase order to show the 'story' of the entity
    # (Optional: You can define a custom sort order for the Lifecycle Phases)
    return df.sort_values(by=["Entity", "Lifecycle Phase"])

# --- Final Execution ---
# Assuming 'final_golden_list' from Pass 1 is available
nasa_report_df = build_final_nasa_index(final_golden_list)

# Save with formatting
with pd.ExcelWriter("NASA_Traceability_Index.xlsx", engine='xlsxwriter') as writer:
    nasa_report_df.to_excel(writer, sheet_name='Lookup Table', index=False)
    # The 'xlsxwriter' allows you to auto-filter and format column widths programmatically

How to use this for S&MA Analysis:

Gap Analysis: If an entity (e.g., "Main Engine Controller") appears in the SRS and SDD but has zero hits in the STP, you’ve just automated the identification of a testing gap.
Traceability Logic: You can now filter your Excel sheet by "Entity" to see the chronological "evolution" of that component from a concept (ConOps) to a verified system (STP).
Contextual Sourcing: By including the Context Snippet, you can distinguish between a "Shallow Mention" (the entity is just listed in an acronym list) and a "Deep Mention" (the entity is the subject of a requirement).

A Final Thought on "Precision"

Since you are a Maker and an AI applier, you might notice that some semantic results are "close but not quite." You can add a Validation Pass where you ask the LLM: "Does this text segment actually define or describe [Entity]? Answer Yes/No." This removes any semantic "hallucinations" before the table is finalized.

Would you like me to add that "LLM Validation" step to ensure the links are 100% accurate before exporting?

Yes

To ensure 100% accuracy—critical for S&MA—this final step adds a Verification Gate. It uses the LLM as a "Reasoning Filter" to check the semantic search results. This eliminates "false positives" where an entity name might appear in a footer, a table of contents, or an unrelated context.

The LLM Validation Logic

We will wrap the mapping loop with a prompt that asks the model to confirm if the document segment provides meaningful information about the entity.

def validate_reference(entity, context_segment):
    """
    Uses a small, fast prompt to verify the relevance of the link.
    Returns True if the segment is a valid reference, False if it's noise.
    """
    prompt = f"""
    Entity: {entity}
    Document Segment: {context_segment}
    
    Task: Is this segment a substantive reference to the entity? 
    Ignore: Table of contents, page headers, or lists of acronyms without definitions.
    Focus: Requirements, design descriptions, or operational steps.
    
    Answer only with 'YES' or 'NO'.
    """
    # Assuming you are using the Gemini API or a local equivalent
    response = model.generate_content(prompt).text.strip().upper()
    return "YES" in response

def build_validated_index(golden_entities, n_results=10):
    validated_data = []

    for entity in golden_entities:
        results = collection.query(query_texts=[entity], n_results=n_results)
        
        for doc_text, metadata in zip(results['documents'][0], results['metadatas'][0]):
            # THE VALIDATION GATE
            if validate_reference(entity, doc_text):
                filename = metadata['source']
                phase = classify_doc_type(filename)
                
                validated_data.append({
                    "Entity": entity,
                    "Status": "Verified",
                    "Lifecycle Phase": phase,
                    "Document": filename,
                    "Section": metadata.get('section', 'N/A'),
                    "Page": metadata['page'],
                    "Link": metadata['link'],
                    "Context": doc_text.strip()[:300]
                })
            else:
                # Optional: Log the 'Rejected' links for a separate audit trail
                pass

    return pd.DataFrame(validated_data)

The Final Step-by-Step Pipeline Recap

Discovery (Pass 1): Broad extraction via Regex and LLM across all PDFs to find every possible entity.
Cleanse: Use the Fuzzy/Semantic script to deduplicate "GNC System" vs "GNC Subsystem" into a single "Golden List."
Index: Chunk the PDFs into a ChromaDB vector store, tagging each line with the Active Section Header (e.g., Section 3.1) and Lifecycle Phase.
Map (Pass 2): Run each "Golden Entity" through the vector store to find the top $N$ potential links.
Validate: Pass the search results through the LLM Validation Gate to strip out noise (headers/footers).
Export: Generate the final Excel/Markdown Lookup Table with deep links and lifecycle bucketing.

Why this fits your "Maker" approach:

This pipeline is essentially a Document Intelligence Agent. By the time you open the Excel file, the "dirty work" of sorting through thousands of pages of NASA documentation has been reduced to a curated list of verified, high-value connections.

Suggested Next Step

Since you are interested in Knowledge Graphs, would you like me to show you how to transform this final validated_data DataFrame into a Cypher script or a JSON-LD format? This would allow you to visualize the "Requirement $\rightarrow$ Design --> Test" flow in a graph tool like Neo4j or Wolfram Mathematica.

Combine all that we have discussed into one multi-pass Python application

This integrated application combines your two-pass strategy, regex-driven section detection, semantic search, NASA lifecycle classification, and LLM verification.

I have structured this as a modular Python class. You can swap out the llm_query function to use whichever API (Gemini, OpenAI) or local model you are currently running in your environment.

Prerequisites

pip install chromadb rapidfuzz sentence-transformers pypdf pandas xlsxwriter

The Integrated NASA Document Intelligence Application

import os
import re
import uuid
import pandas as pd
from pypdf import PdfReader
from rapidfuzz import process, fuzz
from sentence_transformers import SentenceTransformer, util
import chromadb
from chromadb.utils import embedding_functions

# --- Configuration & Mock LLM Interface ---
# Replace this with your actual LLM client (e.g., Gemini or OpenAI)
def llm_query(prompt):
    """Placeholder for your LLM call."""
    # Logic: If it's the 'Discovery' prompt, return a list. 
    # If it's the 'Validation' prompt, return YES/NO.
    return "YES" 

class NASADocManager:
    def __init__(self, db_path="./nasa_vectordb"):
        self.client = chromadb.PersistentClient(path=db_path)
        self.emb_fn = embedding_functions.DefaultEmbeddingFunction()
        self.collection = self.client.get_or_create_collection(
            name="nasa_docs", embedding_function=self.emb_fn
        )
        self.section_re = re.compile(r'^(\d+(?:\.\d+)+|[A-Z]\.\d+)\s+([A-Z].*)')
        self.lifecycle_map = {
            "CONOPS": "Phase A: Concept",
            "SRS": "Phase B: Requirements",
            "SDD": "Phase C/D: Design",
            "STP": "Phase D/E: Test"
        }

    # --- Pass 1: Discovery & Cleansing ---
    def discover_entities(self, pdf_paths):
        print("[Pass 1] Discovering unique entities...")
        raw_entities = []
        for path in pdf_paths:
            reader = PdfReader(path)
            # Sample first 20 pages to find entities (or full doc for thoroughness)
            text = " ".join([p.extract_text() for p in reader.pages[:20]])
            # In a real run, use the LLM to extract names/IDs from this text
            # raw_entities.extend(llm_extract_entities(text))
        
        # Mock results for demonstration
        raw_candidates = ["GNC System", "GNC Subsystem", "Propulsion Assy", "LPA", "Main Controller"]
        
        # Fuzzy Deduplication
        cleansed = []
        already_processed = set()
        for item in sorted(list(set(raw_candidates))):
            if item in already_processed: continue
            matches = process.extract(item, raw_candidates, scorer=fuzz.token_sort_ratio, score_cutoff=85)
            group = [m[0] for m in matches]
            master_name = max(group, key=len)
            cleansed.append(master_name)
            already_processed.update(group)
        return cleansed

    # --- Intermediate: Indexing ---
    def index_documents(self, pdf_paths):
        print("[Indexing] Building vector store with section tracking...")
        for path in pdf_paths:
            reader = PdfReader(path)
            doc_type = next((v for k, v in self.lifecycle_map.items() if k in path.upper()), "General")
            curr_sec_id, curr_sec_title = "0.0", "Front Matter"

            for i, page in enumerate(reader.pages):
                lines = page.extract_text().split('\n')
                for line in lines:
                    if not line.strip(): continue
                    # Update section if header found
                    match = self.section_re.match(line.strip())
                    if match:
                        curr_sec_id, curr_sec_title = match.groups()

                    self.collection.add(
                        documents=[line],
                        metadatas=[{
                            "source": os.path.basename(path),
                            "phase": doc_type,
                            "section": f"{curr_sec_id} {curr_sec_title}",
                            "page": i + 1,
                            "link": f"{path}#page={i+1}"
                        }],
                        ids=[str(uuid.uuid4())]
                    )

    # --- Pass 2: Mapping & Validation ---
    def build_validated_lookup(self, golden_entities):
        print("[Pass 2] Mapping and Validating references...")
        results_list = []
        for entity in golden_entities:
            # Semantic search
            query_results = self.collection.query(query_texts=[entity], n_results=5)
            
            for doc, meta in zip(query_results['documents'][0], query_results['metadatas'][0]):
                # LLM Verification Gate
                is_valid = llm_query(f"Is '{entity}' meaningfully mentioned in: {doc}?")
                
                if is_valid == "YES":
                    results_list.append({
                        "Entity": entity,
                        "Lifecycle Phase": meta['phase'],
                        "Section": meta['section'],
                        "Page": meta['page'],
                        "Context": doc.strip(),
                        "Link": meta['link']
                    })
        return pd.DataFrame(results_list)

# --- Execution Workflow ---
if __name__ == "__main__":
    manager = NASADocManager()
    docs = ["Artemis_ConOps.pdf", "Gateway_SRS.pdf", "GNC_SDD.pdf", "System_STP.pdf"]

    # 1. Discover
    golden_list = manager.discover_entities(docs)

    # 2. Index (Only needs to run once)
    manager.index_documents(docs)

    # 3. Map & Validate
    final_df = manager.build_validated_lookup(golden_list)

    # 4. Export
    final_df.to_excel("NASA_Automated_Lookup_Table.xlsx", index=False)
    print("Application complete. Results saved to Excel.")

Key Workflow Advantages for Your Role:

Traceability: Every link is tagged with its Lifecycle Phase and Section, allowing you to see the "evolution" of a requirement into a design.
Reduced Noise: The llm_query verification gate ensures that your lookup table doesn't get flooded with page headers or index listings.
Deep Links: The #page=N syntax in the Link column allows for immediate navigation from your spreadsheet to the technical evidence.

Would you like me to help you refine the llm_query prompt specifically for NASA Safety & Mission Assurance (S&MA) criteria?

From <https://gemini.google.com/app/> Google Gemini 3.0 Pro (Fast)