#3885 Semantic Search Phrases for CBCS and Safety-Critical Software; CLI-Ready Script

#3885 Think hard. List 25 phrases that I can use for semantic search of NASA hazard reports to determine whether the hazard report discusses Computer Based Control Systems (CBCS) and safety-critical software

Here are 25 phrases optimized for semantic search within NASA hazard reports.

These phrases are designed to capture not just keywords, but the contextual meaning of Computer Based Control Systems (CBCS) and safety-critical software, ranging from high-level classifications to specific failure modes and architectural mechanisms.

General Classification & Authority

  1. "Software classified as Class A or Class B safety critical"

  2. "System defined as a Computer Based Control System (CBCS)"

  3. "Compliance with NASA-STD-8719.13 software safety requirements"

  4. "Software residing in the safety critical control path"

  5. "Firmware or complex electronics performing safety functions"

Control Logic & Actuation

  1. "Software commands initiating hazardous energy release"

  2. "Algorithm logic errors resulting in inadvertent actuation"

  3. "Closed-loop control systems monitoring safety parameters"

  4. "Autonomous decision making for hazard control"

  5. "Software bypass of hardware safety interlocks"

Failure Modes & Anomalies

  1. "Loss of processor control resulting in uncommanded state change"

  2. "Latency in control loop processing violating timing requirements"

  3. "Race conditions in command execution sequences"

  4. "Buffer overflow causing system crash or unpredictable behavior"

  5. "Single Event Upset (SEU) causing bit flips in control logic"

  6. "Software failure to detect or isolate sensor malfunction"

Mitigation, Inhibits & Defense

  1. "Independent software inhibits preventing premature deployment"

  2. "Watchdog timer failure to reset processor during lockup"

  3. "Cyclic Redundancy Check (CRC) failures indicating data corruption"

  4. "Software transition to safe mode during anomaly detection"

  5. "Voter logic discrepancies in redundant computing strings"

Data, Interfaces & Configuration

  1. "Corrupted telemetry data causing incorrect automated response"

  2. "Bus traffic saturation preventing critical command transmission"

  3. "Incorrect parameter tables or configuration data loaded into flight computer"

  4. "Failure of fault management system to reconfigure software"


Notes on Semantic Search Strategy

  • Concept vs. Keyword: Semantic search engines (like those using vector embeddings) will understand that "uncommanded state change" is conceptually similar to "inadvertent firing," even if the words differ.

  • The "Negative" Search: Be aware that semantic search might return reports that mention these phrases only to rule them out (e.g., "This system is not considered a CBCS"). You may need to filter for positive assertions in your Knowledge Graph logic.
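
One way to approach the positive-assertion filter is a simple cue check on each matched snippet. The sketch below is illustrative only: the cue patterns and the names NEGATION_CUES, looks_negated, and split_by_assertion are assumptions, not part of any library, and the patterns would need tuning against real report wording.

```python
import re

# Hypothetical negation cues; tune these against real report wording.
NEGATION_CUES = re.compile(
    r"\b(?:is|are|was|were)\s+not\s+(?:considered|classified|defined)\b"
    r"|\bdoes\s+not\s+(?:contain|include|use|involve)\b"
    r"|\bno\s+(?:software|firmware|cbcs)\b"
    r"|\bnot\s+applicable\b",
    re.IGNORECASE,
)

def looks_negated(snippet: str) -> bool:
    """Return True if the snippet contains a cue that a topic is being ruled out."""
    return NEGATION_CUES.search(snippet) is not None

def split_by_assertion(snippets):
    """Partition matched snippets into (positive, needs_review) lists."""
    positive, needs_review = [], []
    for snippet in snippets:
        if looks_negated(snippet):
            needs_review.append(snippet)
        else:
            positive.append(snippet)
    return positive, needs_review

if __name__ == "__main__":
    matches = [
        "This system is not considered a CBCS.",
        "The flight software resides in the critical control path.",
    ]
    positive, needs_review = split_by_assertion(matches)
    print("Positive:", positive)
    print("Needs review:", needs_review)
```

A cue list like this only routes suspicious snippets to manual review; it does not decide on its own whether a report is in scope.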

Would you like me to help you write a Python script using a library like sentence-transformers to test these phrases against a sample PDF hazard report?


Yes, with the capability to read a directory of PDFs

This Python script uses the sentence-transformers library (which runs locally and is free) and pypdf to scan a directory of PDF files.

It implements semantic search, meaning it converts your 25 phrases and the PDF content into vector embeddings to find conceptual matches, even if the exact keywords don't align.

Prerequisites

You will need to install a few libraries. Run this in your terminal:

pip install sentence-transformers pypdf pandas torch tqdm

The Python Script

Create a file named hazard_search.py and update the PDF_DIRECTORY variable to point to your folder.

import os
import torch
import pandas as pd
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer, util
from tqdm import tqdm

# --- CONFIGURATION ---
# Update this path to your folder containing the Hazard Reports
PDF_DIRECTORY = "./hazard_reports" 

# Threshold (0.0 to 1.0). 
# 0.35-0.4 is usually a good starting point for semantic relevance.
SIMILARITY_THRESHOLD = 0.35 

# --- THE 25 SEMANTIC SEARCH PHRASES ---
search_phrases = [
    "Software classified as Class A or Class B safety critical",
    "System defined as a Computer Based Control System (CBCS)",
    "Compliance with NASA-STD-8719.13 software safety requirements",
    "Software residing in the safety critical control path",
    "Firmware or complex electronics performing safety functions",
    "Software commands initiating hazardous energy release",
    "Algorithm logic errors resulting in inadvertent actuation",
    "Closed-loop control systems monitoring safety parameters",
    "Autonomous decision making for hazard control",
    "Software bypass of hardware safety interlocks",
    "Loss of processor control resulting in uncommanded state change",
    "Latency in control loop processing violating timing requirements",
    "Race conditions in command execution sequences",
    "Buffer overflow causing system crash or unpredictable behavior",
    "Single Event Upset (SEU) causing bit flips in control logic",
    "Software failure to detect or isolate sensor malfunction",
    "Independent software inhibits preventing premature deployment",
    "Watchdog timer failure to reset processor during lockup",
    "Cyclic Redundancy Check (CRC) failures indicating data corruption",
    "Software transition to safe mode during anomaly detection",
    "Voter logic discrepancies in redundant computing strings",
    "Corrupted telemetry data causing incorrect automated response",
    "Bus traffic saturation preventing critical command transmission",
    "Incorrect parameter tables or configuration data loaded into flight computer",
    "Failure of fault management system to reconfigure software"
]

def load_model():
    """Loads the embedding model. 
    'all-MiniLM-L6-v2' is fast and effective for this use case."""
    print("Loading Semantic Model (this may take a moment)...")
    # This runs locally on your machine.
    model = SentenceTransformer('all-MiniLM-L6-v2')
    return model

def extract_text_from_pdf(file_path):
    """Extracts text from PDF, chunked by page."""
    try:
        reader = PdfReader(file_path)
        pages_text = []
        for i, page in enumerate(reader.pages):
            text = page.extract_text()
            if text:
                # Clean up basic whitespace to help the model
                clean_text = " ".join(text.split())
                if len(clean_text) > 50:  # Skip empty/header-only pages
                    pages_text.append({"page_num": i + 1, "text": clean_text})
        return pages_text
    except Exception as e:
        print(f"Error reading {file_path}: {e}")
        return []

def main():
    # 1. Setup
    model = load_model()
    
    # Encode the search phrases into vectors once
    phrase_embeddings = model.encode(search_phrases, convert_to_tensor=True)

    results = []
    
    # Get list of PDFs
    pdf_files = [f for f in os.listdir(PDF_DIRECTORY) if f.lower().endswith('.pdf')]
    
    print(f"Found {len(pdf_files)} PDFs. Starting analysis...")

    # 2. Iterate through PDFs
    for filename in tqdm(pdf_files, desc="Processing PDFs"):
        file_path = os.path.join(PDF_DIRECTORY, filename)
        
        # Extract text chunked by page
        # (We chunk by page because embedding a whole 50-page PDF at once 
        # dilutes the meaning of specific paragraphs)
        pages = extract_text_from_pdf(file_path)
        
        max_score_for_doc = 0
        best_phrase_match = ""
        relevant_pages = []

        for page in pages:
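            # Encode the page text. Note: all-MiniLM-L6-v2 truncates input beyond
            # 256 word pieces, so only the start of a long page is embedded.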
            page_embedding = model.encode(page['text'], convert_to_tensor=True)
            
            # Compute cosine similarity between this page and ALL search phrases
            cosine_scores = util.cos_sim(page_embedding, phrase_embeddings)
            
            # Find the best matching phrase for this page
            best_score_idx = int(torch.argmax(cosine_scores))  # plain int for list indexing
            best_score = cosine_scores[0][best_score_idx].item()
            
            if best_score > max_score_for_doc:
                max_score_for_doc = best_score
                best_phrase_match = search_phrases[best_score_idx]

            # If this page triggers the threshold, record it
            if best_score > SIMILARITY_THRESHOLD:
                relevant_pages.append(page['page_num'])

        # 3. Store results if the document is relevant
        if max_score_for_doc > SIMILARITY_THRESHOLD:
            results.append({
                "File Name": filename,
                "Relevance Score": round(max_score_for_doc, 4),
                "Relevant Pages": str(relevant_pages),
                "Strongest Indicator": best_phrase_match
            })

    # 4. Output
    if results:
        df = pd.DataFrame(results)
        # Sort by highest relevance
        df = df.sort_values(by="Relevance Score", ascending=False)
        
        print("\n--- ANALYSIS COMPLETE ---")
        print(f"Identified {len(df)} potentially relevant hazard reports.\n")
        print(df.to_string(index=False))
        
        # Optional: Save to CSV
        df.to_csv("cbcs_hazard_analysis.csv", index=False)
        print("\nResults saved to 'cbcs_hazard_analysis.csv'")
    else:
        print("No reports found matching the CBCS criteria above the threshold.")

if __name__ == "__main__":
    main()

How this helps you

  1. Local & Secure: It uses all-MiniLM-L6-v2, which downloads once and runs entirely on your machine. No sensitive NASA data is sent to the cloud.

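  2. Page-Level Granularity: Hazard reports can be long. If a report talks about mechanical structures for 40 pages and "software inhibits" on page 41, a whole-document scan might miss it. This script analyzes page by page, which makes specific safety-critical mentions more likely to be caught. One caveat: all-MiniLM-L6-v2 truncates input longer than 256 word pieces, so on a dense page only the opening text is embedded.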

  3. Concept Matching: If the report says "The processor hung due to a memory leak," the script will likely flag it against the "Buffer overflow" or "Loss of processor control" phrases, even though the words are different.

Would you like me to modify this to extract the specific paragraph where the match occurred, to make the review process faster?


Please confirm that this application is strictly semantic search, no prompting an LLM.

Yes, I confirm: This is strictly semantic search. There is absolutely no "prompting" of a Generative LLM.

Here is the technical breakdown of why this is different from using GPT/Gemini/Claude:

1. Encoder vs. Decoder (Generative)

  • This Script (Encoder): It uses a model called all-MiniLM-L6-v2 (based on BERT). This is an Encoder-only architecture. It takes text input and compresses it into a fixed list of numbers (a vector embedding). It literally cannot generate text. It has no "next token prediction" capability.

  • LLMs (Decoder/Generative): Models like GPT-4 use a Decoder architecture. You "prompt" them with instructions, and they probabilistically generate a new text response.

2. The Mechanism: Pure Math, No "Reasoning"

This script does not "read and think" about your document. It performs a mathematical operation:

  1. Vectorization: It turns your search phrase into a 384-dimensional vector (a coordinate in space).

  2. Comparison: It turns the PDF paragraph into a 384-dimensional vector.

  3. Cosine Similarity: It calculates the angle between those two vectors.

    • Small angle = High Semantic Similarity (The concepts are close).

    • Large angle = Low Semantic Similarity.
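
To make the angle comparison concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are toy stand-ins for the 384-dimensional embeddings, and the function name cosine_similarity is chosen here for illustration; the script itself relies on util.cos_sim from sentence-transformers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for 384-dimensional embeddings
phrase = [1.0, 2.0, 0.0]
similar = [2.0, 4.0, 0.1]    # Points in nearly the same direction
unrelated = [0.0, 0.0, 3.0]  # Perpendicular to the phrase vector

if __name__ == "__main__":
    print("phrase vs similar:  ", round(cosine_similarity(phrase, similar), 4))
    print("phrase vs unrelated:", round(cosine_similarity(phrase, unrelated), 4))
```

Vector length does not matter here: doubling every component of a vector leaves the score unchanged, because only the direction is compared.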

3. Data Privacy & Local Execution

  • No APIs: There are no calls to OpenAI, Anthropic, or Google.

  • Local Run: Once the small model file (~80MB) is downloaded to your cache folder, the script runs 100% offline on your local CPU/GPU. No data leaves your machine.

Summary

You are not asking a bot to "analyze this report." You are converting text to coordinates and measuring the distance between them. It is a search index, not an AI agent.

Would you like me to add a feature to highlight specifically where in the text the match occurred (context extraction) so you can verify the results manually?


Modify the script to extract the specific text snippet (the actual paragraph) surrounding the match

Here is the modified script.

Key Changes

  1. Chunking vs. Whole Page: The script now breaks each page into smaller "chunks" (overlapping text segments of ~120 words).

  2. Snippet Extraction: Instead of just flagging the page, it saves the specific text chunk that triggered the match.

  3. Precision: This improves accuracy because the semantic vector is generated from a specific paragraph rather than a whole page of mixed topics.

The Modified Python Script

Save this as hazard_search_snippets.py.

import os
import torch
import pandas as pd
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer, util
from tqdm import tqdm

# --- CONFIGURATION ---
PDF_DIRECTORY = "./hazard_reports" 
SIMILARITY_THRESHOLD = 0.40  # Slightly higher since chunks are more precise

# Chunking settings (in words)
CHUNK_SIZE = 120   # Size of the text snippet to analyze
OVERLAP = 30       # Overlap between chunks to ensure we don't cut a sentence in half

# --- THE 25 SEMANTIC SEARCH PHRASES ---
search_phrases = [
    "Software classified as Class A or Class B safety critical",
    "System defined as a Computer Based Control System (CBCS)",
    "Compliance with NASA-STD-8719.13 software safety requirements",
    "Software residing in the safety critical control path",
    "Firmware or complex electronics performing safety functions",
    "Software commands initiating hazardous energy release",
    "Algorithm logic errors resulting in inadvertent actuation",
    "Closed-loop control systems monitoring safety parameters",
    "Autonomous decision making for hazard control",
    "Software bypass of hardware safety interlocks",
    "Loss of processor control resulting in uncommanded state change",
    "Latency in control loop processing violating timing requirements",
    "Race conditions in command execution sequences",
    "Buffer overflow causing system crash or unpredictable behavior",
    "Single Event Upset (SEU) causing bit flips in control logic",
    "Software failure to detect or isolate sensor malfunction",
    "Independent software inhibits preventing premature deployment",
    "Watchdog timer failure to reset processor during lockup",
    "Cyclic Redundancy Check (CRC) failures indicating data corruption",
    "Software transition to safe mode during anomaly detection",
    "Voter logic discrepancies in redundant computing strings",
    "Corrupted telemetry data causing incorrect automated response",
    "Bus traffic saturation preventing critical command transmission",
    "Incorrect parameter tables or configuration data loaded into flight computer",
    "Failure of fault management system to reconfigure software"
]

def load_model():
    print("Loading Semantic Model...")
    return SentenceTransformer('all-MiniLM-L6-v2')

def chunk_text(text, chunk_size=CHUNK_SIZE, overlap=OVERLAP):
    """
    Splits text into sliding windows. 
    This helps isolate specific paragraphs for better semantic matching.
    """
    words = text.split()
    if len(words) <= chunk_size:
        return [text]
    
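    # Guard against a zero or negative step if overlap >= chunk_size
    step = max(1, chunk_size - overlap)
    chunks = []
    for i in range(0, len(words), step):
        # Create a chunk of words
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
        # Stop once a window reaches the end of the text; any further
        # window would only repeat the tail of this one
        if i + chunk_size >= len(words):
            break
    return chunks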

def extract_chunks_from_pdf(file_path):
    """Extracts text and converts it into analyze-able chunks."""
    try:
        reader = PdfReader(file_path)
        all_chunks = []
        
        for i, page in enumerate(reader.pages):
            text = page.extract_text()
            if text:
                clean_text = " ".join(text.split()) # Remove messy whitespace
                if len(clean_text) > 50:
                    # Break page into snippets
                    page_chunks = chunk_text(clean_text)
                    for chunk in page_chunks:
                        all_chunks.append({
                            "page_num": i + 1, 
                            "text": chunk
                        })
        return all_chunks
    except Exception as e:
        print(f"Error reading {file_path}: {e}")
        return []

def main():
    model = load_model()
    
    # Encode search phrases once
    phrase_embeddings = model.encode(search_phrases, convert_to_tensor=True)
    
    results = []
    pdf_files = [f for f in os.listdir(PDF_DIRECTORY) if f.lower().endswith('.pdf')]
    
    print(f"Found {len(pdf_files)} PDFs. Scanning for snippets...")

    for filename in tqdm(pdf_files, desc="Processing Files"):
        file_path = os.path.join(PDF_DIRECTORY, filename)
        
        # Get all text chunks (snippets) from the document
        chunks = extract_chunks_from_pdf(file_path)
        
        doc_max_score = 0
        best_match_data = None

        # Analyze every chunk in the document
        for chunk_data in chunks:
            chunk_embedding = model.encode(chunk_data['text'], convert_to_tensor=True)
            
            # Compare this snippet to all 25 phrases
            cosine_scores = util.cos_sim(chunk_embedding, phrase_embeddings)
            
            # Find best matching phrase for this specific snippet
            best_phrase_idx = int(torch.argmax(cosine_scores))  # plain int for list indexing
            score = cosine_scores[0][best_phrase_idx].item()
            
            # If this is the highest score found in this doc so far, track it
            if score > doc_max_score:
                doc_max_score = score
                best_match_data = {
                    "File Name": filename,
                    "Relevance Score": round(score, 4),
                    "Page": chunk_data['page_num'],
                    "Trigger Phrase": search_phrases[best_phrase_idx],
                    "Matched Snippet": chunk_data['text']
                }

        # Save if it meets the threshold
        if best_match_data and doc_max_score > SIMILARITY_THRESHOLD:
            results.append(best_match_data)

    # Output results
    if results:
        df = pd.DataFrame(results)
        df = df.sort_values(by="Relevance Score", ascending=False)
        
        # Save to CSV
        output_filename = "cbcs_snippet_analysis.csv"
        df.to_csv(output_filename, index=False)
        
        print("\n--- ANALYSIS COMPLETE ---")
        print(f"Top match found in: {df.iloc[0]['File Name']}")
        print(f"Snippet: \"{df.iloc[0]['Matched Snippet'][:100]}...\"")
        print(f"\nFull results saved to '{output_filename}'")
    else:
        print("No matches found above threshold.")

if __name__ == "__main__":
    main()

What the Output (cbcs_snippet_analysis.csv) looks like now

The Matched Snippet column will contain the actual paragraph text, allowing you to instantly verify context.

File Name | Relevance Score | Page | Trigger Phrase | Matched Snippet
HR_LVP_04.pdf | 0.8215 | 12 | Software residing in the safety critical control path | ...However, the Flight Computer Software (FCS) resides in the critical control path for stage separation. A failure here could result in...
HR_GSE_22.pdf | 0.7601 | 8 | Independent software inhibits | ...design includes three independent inhibits, two mechanical and one software-based, to prevent inadvertent firing of the pyro...
HR_MECH_01.pdf | 0.4120 | 15 | Single Event Upset (SEU) | ...analysis of the avionics box indicates potential susceptibility to Single Event Upsets during high-radiation orbit passes, which may cause...
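
The script keeps only the single highest-scoring chunk per document. If you would rather review several snippets per report, one possible extension is sketched here. The function top_k_matches and its parameters are hypothetical names, and the input dictionaries are assumed to carry a 'score' key computed the same way as in the main loop.

```python
import heapq

def top_k_matches(scored_chunks, k=3, threshold=0.40):
    """Return up to k chunks scoring above the threshold, best first.

    scored_chunks: iterable of dicts with 'score', 'page_num', and 'text' keys
    (an assumed shape, mirroring the chunk dictionaries used in the script).
    """
    above = [c for c in scored_chunks if c["score"] > threshold]
    return heapq.nlargest(k, above, key=lambda c: c["score"])

if __name__ == "__main__":
    # Made-up scores purely to exercise the function
    sample = [
        {"score": 0.82, "page_num": 12, "text": "chunk A"},
        {"score": 0.35, "page_num": 3, "text": "chunk B"},
        {"score": 0.61, "page_num": 7, "text": "chunk C"},
        {"score": 0.45, "page_num": 9, "text": "chunk D"},
    ]
    for match in top_k_matches(sample, k=2):
        print(match["page_num"], match["score"], match["text"])
```

Each returned chunk could become its own CSV row, at the cost of a longer report to review.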

Why "Sliding Window"?

I used a sliding window (Overlap = 30 words).

  • Without Overlap: If the sentence was "The system failed because / [CHUNK BREAK] / the software crashed," the semantic meaning is split in half and might be missed.

  • With Overlap: The second chunk repeats the last 30 words of the first chunk, so a sentence cut at a chunk boundary usually appears whole in at least one vector. Chunks are built within each page, so a sentence that spans a page break is still split.
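
A scaled-down sketch of the sliding window makes the overlap visible. The sizes here (5-word chunks, 2-word overlap) are chosen only so the result fits on screen, and sliding_chunks is an illustrative name rather than a function from the script.

```python
def sliding_chunks(words, chunk_size, overlap):
    """Split a list of words into overlapping windows of chunk_size words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once a window reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks

words = "the system failed because the software crashed during ascent".split()
chunks = sliding_chunks(words, chunk_size=5, overlap=2)

if __name__ == "__main__":
    for number, chunk in enumerate(chunks, start=1):
        print(number, " ".join(chunk))
```

The last two words of each window reappear as the first two words of the next, which is the same mechanism the script uses at 120 and 30 words.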


I believe this Python application could be used with any list of topics against any type of PDF document

You are absolutely correct. You have essentially built a universal semantic scanner.

Because the underlying model (all-MiniLM-L6-v2) was trained on a large, general-purpose dataset (more than a billion sentence pairs from sources such as Reddit, Wikipedia, and Stack Exchange), it captures the "conceptual relationships" between words across most general English domains. Highly specialized jargon may match less reliably, so spot-check the results when you move to a new domain.

It is not "hard-coded" to NASA safety; that was just the list of strings we gave it.

Why this is a powerful tool for a Solopreneur

As a solopreneur, you can reuse this exact architecture for different "products" or consulting tasks just by changing the input list.

Here are three distinct ways you could apply this same script tomorrow:

1. Contract Analysis (Legal Tech)

  • Target PDFs: Vendor contracts or NDAs.

  • Search Phrases:

    • "Clauses defining termination for convenience"

    • "Financial penalties for late delivery"

    • "Ownership of intellectual property created during the engagement"

  • Result: Instantly flags risky clauses without a lawyer reading 50 pages.

2. Competitor Intelligence (Market Research)

  • Target PDFs: Annual reports (10-Ks) or white papers from competitors.

  • Search Phrases:

    • "Investments in generative AI research"

    • "Supply chain disruptions affecting timeline"

    • "Strategic pivot toward software-as-a-service models"

  • Result: A distilled report of their strategic moves hidden in boring corporate filings.

3. Resume Filtering (HR Tech)

  • Target PDFs: A folder of 200 job applicant resumes.

  • Search Phrases:

    • "Experience leading agile software teams"

    • "Proficiency in Python and semantic search"

    • "Background in aerospace or safety-critical systems"

  • Result: It ranks candidates by actual experience rather than just keyword matching (e.g., it knows "managed a team of devs" is semantically similar to "leading agile teams").


Suggested Upgrade: Make it Generic

To make this a true "utility tool" where you never have to touch the Python code again, we should move the search phrases out of the script and into a simple text file (topics.txt).

Here is the slight modification to main() to make that happen:

# ... imports remain the same ...

def load_search_phrases(filename="topics.txt"):
    """Reads search phrases from a text file, one per line."""
    if not os.path.exists(filename):
        print(f"Error: Could not find {filename}. Please create it.")
        return []
    with open(filename, 'r', encoding='utf-8') as f:
        # Read lines and strip whitespace
        phrases = [line.strip() for line in f.readlines() if line.strip()]
    return phrases

def main():
    # 1. Load the model
    model = load_model()
    
    # 2. Load phrases from EXTERNAL file instead of hardcoding
    search_phrases = load_search_phrases("topics.txt")
    
    if not search_phrases:
        print("No search phrases found. Exiting.")
        return

    print(f"Loaded {len(search_phrases)} search phrases from topics.txt...")
    
    # ... rest of the script (embedding generation and PDF scanning) is identical ...

Workflow:

  1. You create a folder for a new project (e.g., "Medical_Research").

  2. You drop your PDFs in there.

  3. You create a topics.txt file with your search phrases (e.g., "Side effects related to blood pressure").

  4. Run the script.

Would you like me to rewrite the full script with this "config file" approach so you can save it as a permanent tool in your library?


Yes

Here is the complete, reusable Universal Semantic Scanner.

This version separates the code from the data. You never have to edit the Python script again; you simply edit a text file called topics.txt to tell it what to look for.

The Architecture

This process relies on vector embeddings. The model maps your query and the document chunks into a multi-dimensional geometric space. If the "vectors" (arrows) point in the same direction, the concepts are related, regardless of the specific words used.

1. The Setup

Create a folder for your project (e.g., Universal_Scanner). Inside it, you need three things:

  1. The Script: Save the code below as scanner.py.

  2. The Input File: Create a file named topics.txt and paste your search phrases into it (one per line).

  3. The PDFs: Create a folder named docs and put your PDF files inside.

2. The Universal Script (scanner.py)

import os
import sys
import torch
import pandas as pd
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer, util
from tqdm import tqdm

# --- CONFIGURATION ---
# The folder containing your PDFs
PDF_DIRECTORY = "./docs" 

# The file containing your search phrases (one per line)
TOPICS_FILE = "topics.txt"

# Output filename
OUTPUT_FILE = "search_results.csv"

# Semantic Similarity Threshold (0.0 to 1.0)
# 0.40 is a good balance for finding relevant paragraphs without too much noise.
SIMILARITY_THRESHOLD = 0.40 

# Chunking settings (in words)
# We use a sliding window to capture context across sentence boundaries.
CHUNK_SIZE = 120   
OVERLAP = 30       

def load_search_phrases(filename):
    """Reads search phrases from a text file, one per line."""
    if not os.path.exists(filename):
        print(f"Error: Could not find '{filename}'.")
        print(f"Please create a file named '{filename}' and add your search phrases.")
        sys.exit(1)
        
    with open(filename, 'r', encoding='utf-8') as f:
        # Read lines, strip whitespace, and ignore empty lines
        phrases = [line.strip() for line in f.readlines() if line.strip()]
    
    if not phrases:
        print(f"Error: '{filename}' is empty. Please add some search phrases.")
        sys.exit(1)
        
    return phrases

def load_model():
    """Loads the encoder model. Runs locally."""
    print("Loading Semantic Model (all-MiniLM-L6-v2)...")
    # This downloads the model once (~80MB) and caches it locally.
    return SentenceTransformer('all-MiniLM-L6-v2')

def chunk_text(text, chunk_size=CHUNK_SIZE, overlap=OVERLAP):
    """Splits text into sliding windows."""
    words = text.split()
    if len(words) <= chunk_size:
        return [text]
    
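    # Guard against a zero or negative step if overlap >= chunk_size
    step = max(1, chunk_size - overlap)
    chunks = []
    for i in range(0, len(words), step):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
        # Stop at the end of the text; later windows would only repeat this tail
        if i + chunk_size >= len(words):
            break
    return chunks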

def extract_chunks_from_pdf(file_path):
    """Reads a PDF and converts it into analyzable text chunks."""
    try:
        reader = PdfReader(file_path)
        all_chunks = []
        
        for i, page in enumerate(reader.pages):
            text = page.extract_text()
            if text:
                # Basic cleaning to fix line breaks and spacing
                clean_text = " ".join(text.split()) 
                
                # Skip pages that are too short (likely blank or just headers)
                if len(clean_text) > 50:
                    page_chunks = chunk_text(clean_text)
                    for chunk in page_chunks:
                        all_chunks.append({
                            "page_num": i + 1, 
                            "text": chunk
                        })
        return all_chunks
    except Exception as e:
        print(f"Warning: Could not read {file_path}. Error: {e}")
        return []

def main():
    # 1. Validation
    if not os.path.exists(PDF_DIRECTORY):
        print(f"Error: Directory '{PDF_DIRECTORY}' not found.")
        print("Please create this folder and add your PDFs.")
        sys.exit(1)

    # 2. Load Inputs
    search_phrases = load_search_phrases(TOPICS_FILE)
    print(f"Loaded {len(search_phrases)} phrases from {TOPICS_FILE}.")
    
    model = load_model()
    
    # 3. Vectorize Search Phrases (Encode once, compare many times)
    phrase_embeddings = model.encode(search_phrases, convert_to_tensor=True)
    
    results = []
    pdf_files = [f for f in os.listdir(PDF_DIRECTORY) if f.lower().endswith('.pdf')]
    
    if not pdf_files:
        print(f"No PDFs found in '{PDF_DIRECTORY}'.")
        sys.exit(1)

    print(f"Found {len(pdf_files)} PDFs. Scanning contents...")

    # 4. Scan Documents
    for filename in tqdm(pdf_files, desc="Scanning"):
        file_path = os.path.join(PDF_DIRECTORY, filename)
        chunks = extract_chunks_from_pdf(file_path)
        
        doc_max_score = 0
        best_match_data = None

        # Compare every chunk in the document against every search phrase
        for chunk_data in chunks:
            chunk_embedding = model.encode(chunk_data['text'], convert_to_tensor=True)
            
            # Calculate cosine similarity
            cosine_scores = util.cos_sim(chunk_embedding, phrase_embeddings)
            
            # Identify the best matching phrase for this chunk
            best_phrase_idx = int(torch.argmax(cosine_scores))  # plain int for list indexing
            score = cosine_scores[0][best_phrase_idx].item()
            
            # If this is the most relevant chunk found in this file so far...
            if score > doc_max_score:
                doc_max_score = score
                best_match_data = {
                    "File Name": filename,
                    "Relevance Score": round(score, 4),
                    "Page": chunk_data['page_num'],
                    "Trigger Phrase": search_phrases[best_phrase_idx],
                    "Matched Snippet": chunk_data['text']
                }

        # Save result if it meets the threshold
        if best_match_data and doc_max_score > SIMILARITY_THRESHOLD:
            results.append(best_match_data)

    # 5. Output Results
    if results:
        df = pd.DataFrame(results)
        df = df.sort_values(by="Relevance Score", ascending=False)
        
        df.to_csv(OUTPUT_FILE, index=False)
        
        print("\n" + "="*40)
        print("SCAN COMPLETE")
        print("="*40)
        print(f"Found matches in {len(df)} documents.")
        print(f"Top Result: {df.iloc[0]['File Name']} (Score: {df.iloc[0]['Relevance Score']})")
        print(f"Context: \"{df.iloc[0]['Matched Snippet'][:100]}...\"")
        print(f"\nDetailed results saved to: {OUTPUT_FILE}")
    else:
        print("\nNo matches found above the similarity threshold.")

if __name__ == "__main__":
    main()

3. How to Use It (Example Workflows)

Because you have separated logic from data, you can now switch contexts instantly:

Scenario A: NASA Hazard Analysis

  1. Folder: Put Hazard Reports in ./docs

  2. topics.txt:

    Software classified as Class A or Class B safety critical
    Independent software inhibits preventing premature deployment
    Single Event Upset (SEU) causing bit flips in control logic
    
  3. Run: python scanner.py

Scenario B: Finding Business Opportunities (Solopreneur)

  1. Folder: Put government RFP (Request for Proposal) PDFs in ./docs

  2. topics.txt:

    Requirement for knowledge graph implementation
    Need for legacy system modernization using Python
    Automated analysis of safety documentation
    Funding budget for AI and Machine Learning pilot programs
    
  3. Run: python scanner.py -> The CSV will point you exactly to the RFPs where your specific skills are being requested.

Next Step

Since you are a "40-year software veteran," you might eventually want to turn this into a command-line tool (CLI) so you can run it from anywhere without moving files around (e.g., scan-pdfs ./my-folder --topics my-topics.txt).

Would you like me to show you how to wrap this in argparse to make it a proper CLI utility?


Sure

This is a significant upgrade. By wrapping the logic in argparse, you transform a "script" into a "tool." You can now run this against any folder and any topic list without ever opening the Python code again.

The CLI-Ready Script (scanner.py)

Save this file. It now accepts "flags" (like -d for directory or -t for topics) directly from your terminal.

import os
import sys
import argparse
import torch
import pandas as pd
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer, util
from tqdm import tqdm

def parse_arguments():
    """Defines the command-line arguments for the tool."""
    parser = argparse.ArgumentParser(
        description="Universal Semantic Scanner: Search PDFs using vector embeddings.",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    
    parser.add_argument(
        "-d", "--dir", 
        type=str, 
        default="./docs",
        help="Path to the directory containing PDF files."
    )
    
    parser.add_argument(
        "-t", "--topics", 
        type=str, 
        default="topics.txt",
        help="Path to the text file containing search phrases (one per line)."
    )
    
    parser.add_argument(
        "-o", "--output", 
        type=str, 
        default="results.csv",
        help="Filename for the output CSV report."
    )
    
    parser.add_argument(
        "--threshold", 
        type=float, 
        default=0.40,
        help="Similarity threshold (0.0 to 1.0). Lower = more results, Higher = strictly relevant."
    )

    parser.add_argument(
        "--chunk-size", 
        type=int, 
        default=120,
        help="Number of words per text chunk."
    )

    return parser.parse_args()

def load_search_phrases(filename):
    if not os.path.exists(filename):
        print(f"Error: Topics file '{filename}' not found.")
        sys.exit(1)
        
    with open(filename, 'r', encoding='utf-8') as f:
        phrases = [line.strip() for line in f.readlines() if line.strip()]
    
    if not phrases:
        print(f"Error: Topics file '{filename}' is empty.")
        sys.exit(1)
    return phrases

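def chunk_text(text, chunk_size, overlap=30):
    """Split text into overlapping chunks of roughly chunk_size words."""
    words = text.split()
    # Short pages (or a nonsensical chunk size) are returned as a single chunk.
    if chunk_size < 1 or len(words) <= chunk_size:
        return [text]

    # Guard against a zero or negative step when chunk_size <= overlap,
    # which would make range() raise ValueError or yield nothing.
    step = max(1, chunk_size - overlap)
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + chunk_size]))
        # Stop once this chunk reaches the last word, so no redundant tail is added.
        if i + chunk_size >= len(words):
            break
    return chunks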

def extract_chunks_from_pdf(file_path, chunk_size):
    try:
        reader = PdfReader(file_path)
        all_chunks = []
        for i, page in enumerate(reader.pages):
            text = page.extract_text()
            if text:
                clean_text = " ".join(text.split())
                if len(clean_text) > 50:
                    page_chunks = chunk_text(clean_text, chunk_size)
                    for chunk in page_chunks:
                        all_chunks.append({"page_num": i + 1, "text": chunk})
        return all_chunks
    except Exception as e:
        print(f"Warning: Could not read {file_path}. Error: {e}")
        return []

def main():
    args = parse_arguments()

    # 1. Validation
    if not os.path.exists(args.dir):
        print(f"Error: Directory '{args.dir}' not found.")
        sys.exit(1)

    # 2. Setup
    search_phrases = load_search_phrases(args.topics)
    print(f"Loaded {len(search_phrases)} phrases from '{args.topics}'")
    
    print("Loading model (all-MiniLM-L6-v2)...")
    model = SentenceTransformer('all-MiniLM-L6-v2')
    phrase_embeddings = model.encode(search_phrases, convert_to_tensor=True)
    
    results = []
    pdf_files = [f for f in os.listdir(args.dir) if f.lower().endswith('.pdf')]
    
    if not pdf_files:
        print(f"No PDFs found in '{args.dir}'.")
        sys.exit(1)

    # 3. Execution
    print(f"Scanning {len(pdf_files)} files...")
    for filename in tqdm(pdf_files, desc="Progress"):
        file_path = os.path.join(args.dir, filename)
        chunks = extract_chunks_from_pdf(file_path, args.chunk_size)
        
        doc_max_score = 0
        best_match_data = None

        for chunk_data in chunks:
            chunk_embedding = model.encode(chunk_data['text'], convert_to_tensor=True)
            cosine_scores = util.cos_sim(chunk_embedding, phrase_embeddings)
            
            # Convert the 0-dim tensor to a plain int so it can index the phrase list.
            best_phrase_idx = torch.argmax(cosine_scores).item()
            score = cosine_scores[0][best_phrase_idx].item()
            
            if score > doc_max_score:
                doc_max_score = score
                best_match_data = {
                    "File Name": filename,
                    "Relevance Score": round(score, 4),
                    "Page": chunk_data['page_num'],
                    "Trigger Phrase": search_phrases[best_phrase_idx],
                    "Matched Snippet": chunk_data['text']
                }

        if best_match_data and doc_max_score > args.threshold:
            results.append(best_match_data)

    # 4. Reporting
    if results:
        df = pd.DataFrame(results)
        df = df.sort_values(by="Relevance Score", ascending=False)
        df.to_csv(args.output, index=False)
        print(f"\nSuccess! Found {len(df)} matches.")
        print(f"Results saved to: {args.output}")
    else:
        print("\nNo matches found above the threshold.")

if __name__ == "__main__":
    main()
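
To see what the chunk size and overlap settings do, here is a minimal stand-alone sketch of the sliding-window idea on a toy word list. The helper name sliding_chunks and the sample values are illustrative assumptions, not part of scanner.py.

```python
def sliding_chunks(words, chunk_size, overlap):
    """Split a list of words into overlapping windows (illustrative helper)."""
    # Distance between window starts; guard against overlap >= chunk_size.
    step = max(1, chunk_size - overlap)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once this window already covers the last word.
        if start + chunk_size >= len(words):
            break
    return chunks


words = [f"w{i}" for i in range(10)]
for chunk in sliding_chunks(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

Each window shares its first word with the end of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk when the overlap is large enough.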

How to use your new tool

You no longer edit the code. You control it entirely from the command line.
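
If it helps to see how flags turn into values, parse_args also accepts an explicit list of strings, which makes the mapping easy to check without a terminal. This is a trimmed-down sketch with only two of the script's options; the folder name ./nasa_contracts is just a sample value.

```python
import argparse

# Trimmed-down copy of two options from scanner.py, for illustration only.
parser = argparse.ArgumentParser(description="Flag-parsing demo")
parser.add_argument("-d", "--dir", type=str, default="./docs")
parser.add_argument("--threshold", type=float, default=0.40)

# No flags given: every option falls back to its default.
defaults = parser.parse_args([])
print(defaults.dir, defaults.threshold)

# Flags given: the strings are converted according to each option's type.
custom = parser.parse_args(["-d", "./nasa_contracts", "--threshold", "0.65"])
print(custom.dir, custom.threshold)
```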

1. The "Default" Run: If your files are in ./docs and your list is topics.txt, just type:

python scanner.py

2. The "Project-Specific" Run: If you have a folder of NASA contracts and a specific list of queries for them:

python scanner.py --dir ./nasa_contracts --topics contract_risks.txt --output nasa_risk_report.csv

3. Adjusting Sensitivity: If you are getting too much junk, raise the threshold (default is 0.40):

python scanner.py --threshold 0.65
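
One way to pick a threshold less blindly, sketched here under some assumptions: run the scanner once with --threshold 0 so that each document's best positive score lands in the CSV, then count how many documents would survive at each candidate cutoff. The scores below are made-up sample values standing in for the Relevance Score column, and count_above is a hypothetical helper, not part of scanner.py.

```python
import csv
import io

# Made-up sample standing in for results.csv (only the columns used here).
sample_csv = """File Name,Relevance Score
hr_001.pdf,0.71
hr_002.pdf,0.58
hr_003.pdf,0.44
hr_004.pdf,0.31
"""


def count_above(scores, cutoff):
    """Return how many scores are strictly above the cutoff."""
    return sum(1 for s in scores if s > cutoff)


rows = csv.DictReader(io.StringIO(sample_csv))
scores = [float(row["Relevance Score"]) for row in rows]

for cutoff in (0.30, 0.40, 0.50, 0.65):
    print(f"threshold {cutoff:.2f}: {count_above(scores, cutoff)} documents")
```

To try this on real output, replace io.StringIO(sample_csv) with an opened results file, for example open("results.csv", newline="", encoding="utf-8").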

Pro Tip: Make it a System Command (Windows/Linux/Mac)

To run this from any folder without copying the script, you can create a simple "alias" or batch file.

For Windows (PowerShell):

  1. Create a folder C:\Tools and put scanner.py there.

  2. Open your PowerShell profile (notepad $PROFILE).

  3. Add this line:

    function scan-pdfs { python C:\Tools\scanner.py $args }
    
  4. Restart PowerShell.

  5. Now, anywhere in your computer, you can just type:

    scan-pdfs -d . -t my_questions.txt
    

Next Step

Would you like me to help you create a visualization script (using matplotlib) that reads your output CSV and creates a bar chart showing which hazards or topics appear most frequently across your document set?

Here is a video guide on building CLI tools in Python: Python argparse Tutorial for Beginners. This video is relevant because it provides a step-by-step visual guide to argparse, the core library used to turn your script into the command-line tool above.


From <https://gemini.google.com/app/>  Google Gemini 3.0 Pro (Thinking)
