#4226 PDF-to-ChromaDB with Tkinter GUI

#4226 Provide a Python application to read a PDF and convert it to text, chunk and index the text chunks, and save it in a ChromaDB database

Building a local RAG (Retrieval-Augmented Generation) pipeline is a great way to make your documents searchable by meaning. To get this running, we'll use LangChain for orchestration, PyPDF for parsing, and ChromaDB as our vector store.

🛠️ Prerequisites

You'll need to install a few libraries. I recommend using a virtual environment:

pip install langchain langchain-community langchain-chroma pypdf sentence-transformers

🐍 The Python Application

This script handles the entire pipeline: loading the PDF, breaking it into manageable pieces, and storing it in a persistent database.

import os
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_chroma import Chroma

def process_pdf_to_chroma(pdf_path, db_directory="./chroma_db"):
    # 1. Load the PDF
    print(f"--- Loading: {pdf_path} ---")
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()

    # 2. Chunk the Text
    # We use chunks of 1000 characters with a 10% overlap to maintain context
    print("--- Splitting text into chunks ---")
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000, 
        chunk_overlap=100,
        add_start_index=True
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Created {len(chunks)} text chunks.")

    # 3. Initialize Embedding Model
    # Using a reliable, local open-source model from HuggingFace
    print("--- Initializing Embeddings ---")
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

    # 4. Create and Save to ChromaDB
    print(f"--- Indexing to ChromaDB at {db_directory} ---")
    vector_db = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=db_directory
    )
    
    print("--- Success! Database is ready. ---")
    return vector_db

if __name__ == "__main__":
    # Replace with your actual PDF file path
    MY_PDF = "example_document.pdf"
    
    if os.path.exists(MY_PDF):
        db = process_pdf_to_chroma(MY_PDF)
        
        # Quick test: Search the database
        query = "What is the main topic of this document?"
        results = db.similarity_search(query, k=2)
        
        print(f"\nTop Search Result for '{query}':")
        print(results[0].page_content[:200] + "...")
    else:
        print(f"File {MY_PDF} not found. Please provide a valid PDF path.")
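The size/overlap arithmetic in step 2 can be sketched in plain Python. This is a simplified model — the real RecursiveCharacterTextSplitter also prefers paragraph and sentence boundaries — but it shows how a 100-character overlap stitches neighboring chunks together:

```python
# Simplified model of the chunking step: fixed-size windows with overlap.
def naive_chunk(text, chunk_size=1000, chunk_overlap=100):
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "".join(str(i % 10) for i in range(2500))
chunks = naive_chunk(doc)

print(len(chunks))                          # 3 chunks for 2,500 characters
print(chunks[0][-100:] == chunks[1][:100])  # True: 100 shared characters
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which is why retrieval doesn't "lose the plot" at the seams.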

💡 Key Components Explained

| Component | Purpose |
| --- | --- |
| PyPDFLoader | Handles the heavy lifting of extracting raw strings from PDF pages. |
| RecursiveCharacterTextSplitter | Smarter than a basic split; it tries to keep paragraphs and sentences together so the AI doesn't lose the "plot." |
| HuggingFaceEmbeddings | Runs locally on your CPU/GPU. It converts text into a vector (a list of numbers) that represents the semantic meaning. |
| Chroma | The vector database. It stores those vectors so you can search by meaning rather than just keywords. |

⚠️ A Quick Note on Persistence

In the code above, I set a persist_directory. This ensures that your data stays on your hard drive even after the script finishes. If you want to query the database later without re-processing the PDF, you can simply load it like this:

db = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

Would you like me to show you how to add a Large Language Model (like GPT-4 or Llama 3) to this script so you can actually "chat" with your PDF?


Add a Tkinter GUI with prompt and top_k inputs to search the ChromaDB

Adding a GUI is a great move—it turns a script into a real tool. We'll use Tkinter for the interface and wrap our previous logic so you can search the database with a single click.

I’ve updated the script to include a search bar, a "Top K" (number of results) selector, and a scrollable text area for the answers.

🐍 The Integrated Application

import tkinter as tk
from tkinter import ttk, messagebox
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_chroma import Chroma

class ChromaSearchApp:
    def __init__(self, root, db_path="./chroma_db"):
        self.root = root
        self.root.title("PDF Semantic Search (ChromaDB)")
        self.root.geometry("600x500")

        # Initialize Embeddings and Load DB
        # Note: Ensure the DB exists before running this, or point to your path
        try:
            self.embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
            self.db = Chroma(persist_directory=db_path, embedding_function=self.embeddings)
        except Exception as e:
            messagebox.showerror("Error", f"Could not load database: {e}")
            raise  # don't continue with an undefined self.db
        self.setup_ui()

    def setup_ui(self):
        # --- Search Frame ---
        search_frame = ttk.Frame(self.root, padding="10")
        search_frame.pack(fill="x")

        ttk.Label(search_frame, text="Ask your PDF:").pack(side="left")
        self.query_entry = ttk.Entry(search_frame, width=50)
        self.query_entry.pack(side="left", padx=5, expand=True, fill="x")
        self.query_entry.bind("<Return>", lambda e: self.perform_search())

        # --- Top K Input ---
        ttk.Label(search_frame, text="Top K:").pack(side="left", padx=(10, 2))
        self.top_k_var = tk.StringVar(value="3")
        self.top_k_entry = ttk.Entry(search_frame, textvariable=self.top_k_var, width=5)
        self.top_k_entry.pack(side="left")

        # --- Search Button ---
        self.search_btn = ttk.Button(search_frame, text="Search", command=self.perform_search)
        self.search_btn.pack(side="left", padx=10)

        # --- Results Area ---
        ttk.Label(self.root, text="Results:", font=("Arial", 10, "bold")).pack(anchor="w", padx=10)
        
        self.result_text = tk.Text(self.root, wrap="word", padx=10, pady=10)
        self.result_text.pack(expand=True, fill="both", padx=10, pady=(0, 10))

    def perform_search(self):
        query = self.query_entry.get()
        if not query:
            return

        try:
            k = int(self.top_k_var.get())
        except ValueError:
            messagebox.showwarning("Warning", "Top K must be a number!")
            return

        # Clear previous results
        self.result_text.delete("1.0", tk.END)
        self.result_text.insert(tk.END, f"Searching for: {query}...\n\n")

        # Query ChromaDB
        results = self.db.similarity_search(query, k=k)

        if not results:
            self.result_text.insert(tk.END, "No matching documents found.")
            return

        for i, doc in enumerate(results):
            header = f"--- Result {i+1} (Source: {doc.metadata.get('source', 'Unknown')}) ---\n"
            self.result_text.insert(tk.END, header, "bold")
            self.result_text.insert(tk.END, f"{doc.page_content}\n\n")
        
        # Adding a bit of style
        self.result_text.tag_configure("bold", font=("Arial", 10, "bold"))

if __name__ == "__main__":
    root = tk.Tk()
    # Path should match where you saved your DB in the previous step
    app = ChromaSearchApp(root, db_path="./chroma_db")
    root.mainloop()

🕹️ How it Works

  1. HuggingFaceEmbeddings: This is initialized when the app starts. It must be the same model you used to create the database; otherwise, the vectors won't line up and the search will return gibberish.

  2. Chroma Initialization: Instead of .from_documents(), we initialize it by passing the persist_directory. This tells Chroma, "Don't create a new index, just open the one already sitting on my drive."

  3. Similarity Search: When you click "Search," it converts your query into a vector and finds the k closest matches in the database.

  4. Top K: This control lets you decide if you want just the single best answer or a broader list of relevant snippets.
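Under the hood, "closest" is just vector math. Here is a minimal sketch with toy 2-dimensional vectors (real all-MiniLM-L6-v2 embeddings have 384 dimensions) showing the two metrics Chroma can use, and why they are interchangeable for normalized vectors:

```python
import math

def cosine_sim(a, b):
    # Dot product divided by the product of vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def squared_l2(a, b):
    # Chroma's default "l2" metric is the squared Euclidean distance
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy unit-length vectors standing in for real embeddings
query = [1.0, 0.0]
doc = [0.8, 0.6]  # unit length: 0.8**2 + 0.6**2 == 1

# For unit vectors the metrics are linked: d^2 = 2 * (1 - cos)
assert abs(cosine_sim(query, doc) - 0.8) < 1e-9
assert abs(squared_l2(query, doc) - 2 * (1 - cosine_sim(query, doc))) < 1e-9
```

This identity is why, with a normalizing model like MiniLM, a lower L2 distance always corresponds to a higher cosine similarity.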

💡 Pro Tip for Performance

Even with a massive PDF, the search itself is fast, but loading the HuggingFaceEmbeddings model for the first time can take a few seconds. If the window feels "frozen" during startup, that’s just the model downloading or loading into RAM.

Would you like me to add a button to the GUI that allows you to select and upload a new PDF file directly through the interface?


Add both a copy to clipboard and clear results buttons to the Tkinter GUI

Great idea—those small quality-of-life features make a huge difference when you're actually using the tool to extract data.

I've updated the UI to include a button bar below the search area. The "Copy" button will grab whatever text is currently visible in the results window, and "Clear" will wipe the slate clean.

🐍 Updated Python Application

import tkinter as tk
from tkinter import ttk, messagebox
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_chroma import Chroma

class ChromaSearchApp:
    def __init__(self, root, db_path="./chroma_db"):
        self.root = root
        self.root.title("PDF Semantic Search (ChromaDB)")
        self.root.geometry("700x600")

        # Initialize Embeddings and Load DB
        try:
            # Note: Using the same model used for indexing is crucial
            self.embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
            self.db = Chroma(persist_directory=db_path, embedding_function=self.embeddings)
        except Exception as e:
            messagebox.showerror("Error", f"Could not load database: {e}")
            raise  # don't continue with an undefined self.db

        self.setup_ui()

    def setup_ui(self):
        # --- Top Search Frame ---
        search_frame = ttk.Frame(self.root, padding="15")
        search_frame.pack(fill="x")

        ttk.Label(search_frame, text="Ask your PDF:", font=("Arial", 10, "bold")).pack(side="left")
        self.query_entry = ttk.Entry(search_frame, width=40)
        self.query_entry.pack(side="left", padx=5, expand=True, fill="x")
        self.query_entry.bind("<Return>", lambda e: self.perform_search())

        ttk.Label(search_frame, text="Top K:").pack(side="left", padx=(10, 2))
        self.top_k_var = tk.StringVar(value="3")
        self.top_k_entry = ttk.Entry(search_frame, textvariable=self.top_k_var, width=5)
        self.top_k_entry.pack(side="left")

        self.search_btn = ttk.Button(search_frame, text="🔍 Search", command=self.perform_search)
        self.search_btn.pack(side="left", padx=10)

        # --- Action Button Bar (Copy/Clear) ---
        button_bar = ttk.Frame(self.root, padding="5")
        button_bar.pack(fill="x")
        
        self.copy_btn = ttk.Button(button_bar, text="📋 Copy to Clipboard", command=self.copy_to_clipboard)
        self.copy_btn.pack(side="right", padx=5)
        
        self.clear_btn = ttk.Button(button_bar, text="🗑️ Clear Results", command=self.clear_results)
        self.clear_btn.pack(side="right", padx=5)

        # --- Results Area ---
        ttk.Label(self.root, text="Results:", font=("Arial", 10, "italic")).pack(anchor="w", padx=15)
        
        # Adding a Scrollbar to the Text Area
        text_frame = ttk.Frame(self.root, padding="10")
        text_frame.pack(expand=True, fill="both")
        
        self.scrollbar = ttk.Scrollbar(text_frame)
        self.scrollbar.pack(side="right", fill="y")

        self.result_text = tk.Text(
            text_frame, 
            wrap="word", 
            padx=10, 
            pady=10, 
            yscrollcommand=self.scrollbar.set,
            font=("Segoe UI", 10)
        )
        self.result_text.pack(expand=True, fill="both")
        self.scrollbar.config(command=self.result_text.yview)
        
        # Style Tag for Headers
        self.result_text.tag_configure("header", font=("Arial", 10, "bold"), foreground="#2c3e50")

    def perform_search(self):
        query = self.query_entry.get()
        if not query:
            return

        try:
            k = int(self.top_k_var.get())
        except ValueError:
            messagebox.showwarning("Warning", "Top K must be an integer!")
            return

        self.clear_results()
        
        # Query ChromaDB
        results = self.db.similarity_search(query, k=k)

        if not results:
            self.result_text.insert(tk.END, "No matching document chunks found.")
            return

        for i, doc in enumerate(results):
            source = doc.metadata.get('source', 'Unknown')
            page = doc.metadata.get('page', '?')
            
            header = f"--- Result {i+1} | Source: {source} | Page: {page} ---\n"
            self.result_text.insert(tk.END, header, "header")
            self.result_text.insert(tk.END, f"{doc.page_content}\n\n")

    def copy_to_clipboard(self):
        content = self.result_text.get("1.0", tk.END).strip()
        if content:
            self.root.clipboard_clear()
            self.root.clipboard_append(content)
            # A message box is the simplest feedback; a transient
            # status-bar label would be a less intrusive alternative.
            messagebox.showinfo("Copied", "Results copied to clipboard!")

    def clear_results(self):
        self.result_text.delete("1.0", tk.END)

if __name__ == "__main__":
    root = tk.Tk()
    # Make sure this matches the directory where your PDF was indexed
    app = ChromaSearchApp(root, db_path="./chroma_db")
    root.mainloop()

✨ What's New

  • Clipboard Integration: Uses self.root.clipboard_append() to push the current text view directly into your OS clipboard.

  • Results Clearing: A simple method that clears the text widget from index 1.0 (start) to END.

  • Scrollbar: Added a proper ttk.Scrollbar because if your Top K is high, you'll definitely need to scroll through those chunks.

  • Metadata Display: It now looks for "Page" numbers in the metadata, which LangChain's PyPDFLoader usually includes automatically.

Would you like me to help you package this into a standalone .exe file so you can use it without needing a Python terminal open?


How do I display either the cosine_similarity score or distance to each of the Top_K results?

To display the scores, we need to change the method from similarity_search to similarity_search_with_score.

A quick heads-up: ChromaDB uses L2 distance (Squared Euclidean) by default, where a lower score means the results are more similar (closer together in vector space). If you configured your collection for Cosine Similarity, the returned score is actually a cosine distance ranging from 0 to 2, where 0 is an exact match.

🐍 The Updated Logic

I have updated the perform_search method to unpack the score and display it in the header of each result.

    def perform_search(self):
        query = self.query_entry.get()
        if not query:
            return

        try:
            k = int(self.top_k_var.get())
        except ValueError:
            messagebox.showwarning("Warning", "Top K must be an integer!")
            return

        self.clear_results()
        
        # CHANGED: Use similarity_search_with_score instead of similarity_search
        # This returns a list of tuples: (Document, Score)
        results_with_scores = self.db.similarity_search_with_score(query, k=k)

        if not results_with_scores:
            self.result_text.insert(tk.END, "No matching document chunks found.")
            return

        for i, (doc, score) in enumerate(results_with_scores):
            source = doc.metadata.get('source', 'Unknown')
            page = doc.metadata.get('page', '?')
            
            # Format the score to 4 decimal places
            # Note: Lower score = higher similarity in default L2 distance
            header = f"--- Result {i+1} | Score (Dist): {score:.4f} | Page: {page} ---\n"
            
            self.result_text.insert(tk.END, header, "header")
            self.result_text.insert(tk.END, f"{doc.page_content}\n\n")

🧠 Understanding the Score

Depending on how you initialized your Chroma collection, the "Score" represents different things:

  • L2 (Euclidean) Distance (Default): The score is the distance between vectors. 0.0 is a perfect match. Numbers typically range from 0.0 to 1.5+.

  • Cosine Similarity: If you set collection_metadata={"hnsw:space": "cosine"} during creation, the "score" returned is actually Cosine Distance (1 − similarity).

    • 0.0 = Identical

    • 1.0 = Orthogonal (no similarity)

    • 2.0 = Opposite

🛠️ Optional: Convert Distance to "Certainty"

If you want to show a "Similarity %" instead of a raw distance (which can be confusing for users), you can use a simple normalization. For L2 distance, a common (though rough) approximation for display is:

Similarity = max(0, 1 − score)
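A minimal helper along these lines (the function name is mine, not part of any library, and the conversion is a display heuristic rather than a calibrated probability):

```python
def distance_to_percent(score):
    # Rough display-only conversion for L2 distance on normalized
    # vectors; clamps at 0% for large distances.
    return max(0.0, 1.0 - score) * 100

print(f"{distance_to_percent(0.3421):.1f}%")  # 65.8%
print(f"{distance_to_percent(1.7):.1f}%")     # 0.0%
```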

Would you like me to add a "Threshold" slider to the GUI so it automatically hides results that have a distance score higher than a certain value?


What are considered good distance scores?

Because ChromaDB can use different "math" (distance metrics) to compare text, a "good" score depends entirely on which metric your database is using.

Since we are using HuggingFace all-MiniLM-L6-v2, these vectors are normalized (they have a length of 1.0). This makes interpreting the scores much easier.

📏 1. If you are using the Default (L2 / Squared Euclidean)

In Chroma, the default is l2. Because the vectors are normalized, the distance and similarity are mathematically linked.

  • 0.0 to 0.5: Excellent. This is a very strong semantic match. The text likely contains the exact answer or very similar phrasing.

  • 0.5 to 1.0: Good/Fair. The content is related to your topic but might be a partial match or contain "noisy" surrounding text.

  • 1.0 to 1.4: Poor. These are likely "stretches." The database is showing you the least-bad options it could find.

  • > 1.5: Irrelevant. Usually just random noise.

📐 2. If you switched to Cosine Similarity

If you initialized your collection with {"hnsw:space": "cosine"}, Chroma returns Cosine Distance (1 − similarity).

  • 0.0 to 0.2: Strong Match. (Similarity of 80% to 100%).

  • 0.2 to 0.4: Relevant. (Similarity of 60% to 80%).

  • > 0.5: Weak/Irrelevant.


🛠️ Pro-Tip: Adding a "Certainty" Label to your GUI

Instead of making the user guess if 0.3421 is good, you can add a simple helper function in your Python code to label the results.

def get_relevance_label(score):
    # Assuming default L2 distance with normalized vectors
    if score < 0.4:
        return "🔥 High Match"
    elif score < 0.8:
        return "✅ Relevant"
    elif score < 1.2:
        return "❓ Partial"
    else:
        return "🚫 Low Quality"

# In your perform_search loop:
label = get_relevance_label(score)
header = f"--- Result {i+1} | {label} | Dist: {score:.4f} ---\n"

⚠️ The "Golden Rule" of Scores

Distance scores are relative, not absolute.

  • A "0.4" score for a specific query like "What is the revenue in 2023?" is much more reliable than a "0.4" score for a vague query like "Tell me stuff."

  • Always look at the gap between Result 1 and Result 2. If Result 1 is 0.2 and Result 2 is 0.9, you can be very confident in the first result.
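That gap check is easy to automate. Here is a hypothetical helper — the name and the 0.3 threshold are illustrative choices of mine, not anything from Chroma or LangChain:

```python
def confidence_from_gap(scores, gap_threshold=0.3):
    # Hypothetical heuristic: a large jump in distance after the top
    # hit suggests it is clearly the best match.
    if len(scores) < 2:
        return "only one result"
    gap = scores[1] - scores[0]
    return "high confidence" if gap >= gap_threshold else "ambiguous"

print(confidence_from_gap([0.2, 0.9]))    # high confidence
print(confidence_from_gap([0.40, 0.45]))  # ambiguous
```

You would call this with the sorted distances returned by similarity_search_with_score before deciding how strongly to present the top result.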

Would you like me to show you how to filter out results automatically if they fall below a certain "good" score?


Tkinter GUI Query:  Tell me about safety-critical software, top-k = 5

--- Result 1 | Score (Dist): 0.6513 | Page: 63 ---

3.7.3 134 If a project has safety-critical

software or mission-critical

software, the project manager

shall implement the following

items in the software: 

a. The software is initialized, at

first start and restarts, to a

known safe state.

b. The software safely

transitions between all

predefined known states.

c. Termination performed by

software functions is

performed to a known safe

state.

d. Operator overrides of

software functions require at

least two independent actions

by an operator.

e. Software rejects commands

received out of sequence when

execution of those commands

out of sequence can cause a

hazard.

f. The software detects

inadvertent memory

modification and recovers to a

known safe state.

g. The software performs

integrity checks on inputs and

outputs to/from the software

system.

h. The software performs

prerequisite checks prior to the

execution of safety-critical

software commands.

i. No single software event or

action is allowed to initiate an

identified hazard.


--- Result 2 | Score (Dist): 0.6785 | Page: 25 ---

implement the following items in the software: [SWE-134] 

a. The software is initialized, at first start and restarts, to a known safe state. 

b. The software safely transitions between all predefined known states. 

c. Termination performed by software functions is performed to a known safe state. 

d. Operator overrides of software functions require at least two independent actions by an operator. 

e. Software rejects commands received out of sequence when execution of those commands out of

sequence can cause a hazard. 

f. The software detects inadvertent memory modification and recovers to a known safe state. 

g. The software performs integrity checks on inputs and outputs to/from the software system. 

h. The software performs prerequisite checks prior to the execution of safety-critical software

commands. 

i. No single software event or action is allowed to initiate an identified hazard.


--- Result 3 | Score (Dist): 0.7141 | Page: 25 ---

non-pdf) and, where possible, direct read-only electronic access to project document

repositories and data stores should be provided. Appropriate security products should be

completed and transferred as part of the overall package. 

3.6.5 If software IV&V is performed on a project, the project manager shall provide responses to

IV&V submitted issues and risks and track these issues and risks to closure. [SWE-179] 

3.7 Safety-Critical Software

3.7.1 The project manager, in conjunction with the SMA organization, shall determine if each

software component is considered to be safety-critical per the criteria defined in

NASA-STD-8739.8. [SWE-205] 

3.7.2 If a project has safety-critical software, the project manager shall implement the safety-critical

software requirements contained in NASA-STD-8739.8. [SWE-023] 

3.7.3 If a project has safety-critical software or mission-critical software, the project manager shall

implement the following items in the software: [SWE-134]


--- Result 4 | Score (Dist): 0.7260 | Page: 64 ---

handling. 

l. The software can place the

system into a safe state.

3.7.4 219 If a project has safety-critical

software, the project manager

shall ensure that there is 100

percent code test coverage

using the Modified

Condition/Decision Coverage

(MC/DC) criterion for all

identified safety-critical

software components.

Center X X X X

3.7.5 220 If a project has safety-critical

software, the project manager

shall ensure all identified

safety-critical software

components have a cyclomatic

complexity value of 15 or

lower. Any exceedance shall

be reviewed and waived with

rationale by the project

manager or technical approval

authority.

Center X X X X

3.8 Automatic Generation of Software Source Code

3.8.1 146 The project manager shall

define the approach to the

automatic generation of

software source code including:

a. Validation and verification

of auto-generation tools.

b. Configuration management

of the auto-generation tools

and associated data.

c. Description of the limits and


--- Result 5 | Score (Dist): 0.8136 | Page: 34 ---

Note: For safety critical software, the unit testing should follow the requirement

established in 3.7.4 of this document. 

4.4.6 The project manager shall assure that the unit test results are repeatable. [SWE-186] 

4.4.7 The project manager shall provide a software version description for each software release.

[SWE-063] 

4.4.8 The project manager shall validate and accredit the software tool(s) required to develop or

maintain software. [SWE-136] 

Note: All software development tools contain some number of software defects. Validation

and accreditation of the critical software development and maintenance tools ensure that

the tools being used during the software development life cycle do not generate or insert

errors in the software executable components. Software tool accreditation is the

certification that a software tool is acceptable for use for a specific purpose. Accreditation

is conferred by the organization best positioned to make the judgment that the software tool


Change gears. Let's try some search results. How well did the following search for "Tell me about safety-critical software" against NASA NPR 7150.2D perform? """--- Result 1 | Score (Dist): 0.6513 | Page: 63 --- 3.7.3 134 If a project has safety-critical......"""


These results are a perfect case study for how RAG (Retrieval-Augmented Generation) behaves with technical documentation like NASA standards.

Here is an analysis of how your search performed, based on the distance scores and the content quality:

📊 The "Goodness" of the Scores

Your scores are in the 0.65 to 0.81 range.

  • Results 1 & 2 (0.65–0.67): These are "Solid Hits." In a normalized L2 space, anything under 0.7 is generally considered a high-confidence match. You can see why: they contain the actual checklist for safety-critical implementation.

  • Result 5 (0.81): This is a "Weak/Marginal Hit." It captures the phrase "safety critical software," but the actual content is about unit testing and version descriptions. The higher score correctly signals that this is less relevant to your general inquiry.


🔍 Content Observations

1. The "Redundancy" Effect (Results 1 & 2)

Notice that Result 1 and Result 2 are nearly identical.

  • Why? NASA documents often have a "Requirements Traceability Matrix" or an appendix that repeats the main body text.

  • The Fix: If you were building a chatbot, you’d want to use Maximal Marginal Relevance (MMR) in your search. MMR penalizes results that are too similar to ones already found, which would have skipped Result 2 to give you more "diverse" info instead.
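To see why MMR skips near-duplicates, here is a toy, self-contained sketch of the greedy MMR loop. (LangChain exposes the real thing as max_marginal_relevance_search on its vector stores; the vectors and the lam weight below are illustrative.)

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def mmr(query, docs, k=2, lam=0.5):
    # Greedy Maximal Marginal Relevance: each pick balances relevance
    # to the query against redundancy with already-selected docs.
    selected, candidates = [], list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cos_sim(query, docs[i])
            redundancy = max((cos_sim(docs[i], docs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 0.0]
docs = [[0.9, 0.1], [0.9, 0.11], [0.7, -0.7]]  # docs 0 and 1 are near-duplicates
print(mmr(query, docs))  # [0, 2] — the near-duplicate doc 1 is skipped
```

Plain similarity search would have returned docs 0 and 1 (the two best matches); MMR trades a little relevance for diversity and surfaces doc 2 instead.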

2. The "Context Gap" (Result 3)

Result 3 actually contains the definitions and the "who/how" (SMA organization, NASA-STD-8739.8).

  • Even though it has a higher distance score (0.71) than the checklist (0.65), it is arguably more "informative" for a high-level question.

  • Lesson: This is why we usually pull Top 3 or Top 5. The "best" vector match is often a specific list, while the "second-best" often provides the necessary context.

3. The MC/DC Gem (Result 4)

Result 4 brings in the "meat" of the technical requirements: MC/DC test coverage and Cyclomatic Complexity < 15.

  • This is a "Good" result (0.72). It tells us that safety-critical isn't just about safe states (Result 1), but also about strict code quality metrics.


✅ Verdict: How well did it do?

Extremely well. If you were to summarize these results, you would have a complete picture:

  1. Logic: Must have safe states, input checks, and no single-point failures (Results 1/2).

  2. Compliance: Must follow NASA-STD-8739.8 and work with SMA (Result 3).

  3. Metrics: Must have 100% MC/DC coverage and low complexity (Result 4).

🛠️ One Suggested Tweak

You might notice some "junk" characters or weird line breaks (e.g., 3.7.3 134). This happens because PDF loaders sometimes struggle with table layouts or sidebar numbers in NASA docs.

Would you like me to show you a "Clean-up" function to add to your Python script that removes these extra line breaks and artifacts before they get indexed?


Now use this and only this text """--- Result 1 | Score (Dist): 0.6513 | Page: 63 ---

3.7.3 134 If a project has safety-critical software or mission-critical software, the project manager ... at the software tool""" to summarize safety-critical software requirements

Based on the provided NASA NPR 7150.2D search results, the requirements for safety-critical software are categorized into identification, implementation logic, and quality metrics:

1. Identification and Governance

  • Classification: The project manager and the Safety and Mission Assurance (SMA) organization must determine if a component is safety-critical based on NASA-STD-8739.8 [SWE-205].

  • Compliance: All safety-critical software must implement the specific requirements found in NASA-STD-8739.8 [SWE-023].

  • IV&V: If Independent Verification and Validation is performed, the project manager must track submitted issues/risks to closure [SWE-179].

2. Mandatory Implementation Items [SWE-134]

If software is deemed safety-critical, the following functional behaviors must be implemented:

  • Safe States: The software must initialize, transition, and terminate (including during restarts) into known safe states.

  • Operator Control: Any operator override of software functions requires at least two independent actions.

  • Command Integrity: The system must reject out-of-sequence commands that could cause a hazard and perform prerequisite checks before executing safety-critical commands.

  • Fault Tolerance: The software must detect inadvertent memory modifications and perform integrity checks on all inputs and outputs.

  • Hazard Prevention: No single software event or action is permitted to initiate an identified hazard.

3. Technical Quality & Testing Metrics

  • Test Coverage: Safety-critical components must achieve 100 percent code test coverage using the Modified Condition/Decision Coverage (MC/DC) criterion [3.7.4].

  • Code Complexity: Components should maintain a cyclomatic complexity value of 15 or lower. Any exceedance requires a formal review and waiver with rationale [3.7.5].

  • Unit Testing: Testing must strictly follow the requirements established in section 3.7.4 of the document.


Would you like me to extract the specific requirements for the "Automatic Generation of Software Source Code" mentioned in Result 4?


If I run the PDF to ChromaDB code again against the same PDF file, will it replace the original in ChromaDB or create another copy?

By default, the code I provided will create a second copy (duplicate) of the data within the same database.

ChromaDB doesn't automatically "de-duplicate" based on file content. Every time you run Chroma.from_documents(), it generates new unique IDs (usually UUIDs) for every text chunk and adds them to the index. If you run it three times, a search for "Safety-Critical" will return the same paragraph three times.

Here is how you can handle this depending on your goal:

1. The "Clear and Start Over" Approach

If you want to ensure the database only contains the latest version of the PDF, you should delete the existing collection (or the folder) before processing.

import shutil
import os

# Before processing the PDF:
if os.path.exists("./chroma_db"):
    shutil.rmtree("./chroma_db") 
    print("Old database cleared.")

# Then run your process_pdf_to_chroma() function...

2. The "Add Only if New" Approach (Upserting)

If you want to manage multiple PDFs without duplicates, you need to provide explicit IDs to Chroma. LangChain's Chroma class has an .add_documents() method that accepts an ids parameter.

A common trick is to hash the content of the chunk to create a unique ID. If the content is exactly the same, the ID will be the same, and Chroma will simply overwrite the existing entry instead of duplicating it.
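A minimal sketch of that trick using a SHA-256 content hash (the helper name is mine; ids is the real parameter accepted by add_documents):

```python
import hashlib

def chunk_id(content, source):
    # Deterministic ID: identical source + content always hashes to the
    # same value, so re-indexing upserts instead of duplicating.
    return hashlib.sha256(f"{source}:{content}".encode("utf-8")).hexdigest()

# Hypothetical usage with the names from the earlier script:
#   ids = [chunk_id(c.page_content, c.metadata["source"]) for c in chunks]
#   vector_db.add_documents(chunks, ids=ids)

print(chunk_id("known safe state", "example_document.pdf")[:12])
```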

3. Checking for Existing Data

If you just want to avoid re-processing a file that is already indexed, you can check if the collection has data before running the loader:

vector_db = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
# Check if the database is empty
if len(vector_db.get()['ids']) > 0:
    print("Database already contains data. Skipping indexing.")
else:
    db = process_pdf_to_chroma("example_document.pdf")  # run the PDF loader

Summary of Behavior

| Action | Result in ChromaDB |
| --- | --- |
| Run the script again | Data is duplicated (search results will show doubles). |
| Delete the ./chroma_db folder | Database is reset; the next run starts clean. |
| Use the ids parameter | Data is "upserted" (updated/inserted) without duplicates. |

Would you like me to update the original script to include an "Overwrite" toggle or a check to see if the database already exists?


Can I add additional PDFs to the ChromaDB?

Yes, you absolutely can! ChromaDB is designed to be a growing repository. You can think of it like a digital library—you don't have to bring all your books on the first day; you can add new ones whenever you like.

However, to do this correctly, you need to change one line of code.

🔄 The Key Change: add_documents vs from_documents

In the original script, we used Chroma.from_documents(). That method is designed to initialize a database from scratch; if you call it repeatedly against the same folder, it simply appends new (possibly duplicate) chunks with no awareness of what is already stored.

To append new PDFs, you should load the existing database first and then use the .add_documents() method.

📝 Updated "Add to Database" Function

Here is a clean way to handle adding a new PDF to your existing collection:

Python
def add_pdf_to_existing_db(new_pdf_path, db_directory="./chroma_db"):
    # 1. Initialize the same embedding model
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

    # 2. Load the EXISTING database
    vector_db = Chroma(persist_directory=db_directory, embedding_function=embeddings)

    # 3. Load and split the NEW PDF
    loader = PyPDFLoader(new_pdf_path)
    new_docs = loader.load()
    
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    new_chunks = text_splitter.split_documents(new_docs)

    # 4. ADD to the database (instead of creating a new one)
    print(f"--- Adding {len(new_chunks)} chunks from {new_pdf_path} ---")
    vector_db.add_documents(new_chunks)
    
    print("--- Successfully added to the library! ---")
    return vector_db

🗂️ How to Manage Multiple PDFs

When you add multiple PDFs, the Metadata becomes your best friend. Because we used PyPDFLoader, every chunk automatically gets a source tag (the file path).

When you search in your GUI, you will now see results from different files:

  • Result 1 | Source: NASA_7150.pdf | Page: 63

  • Result 2 | Source: FAA_Safety_Manual.pdf | Page: 12

⚠️ Two Things to Watch Out For:

  1. Duplicate Bloat: If you run the "Add" function twice on the same file, you will have two copies of that file in your database.

  2. Consistency: You must use the exact same embedding model (all-MiniLM-L6-v2) for every PDF you add. Vectors produced by different models live in incompatible vector spaces, so if you switch models mid-library, similarity comparisons between old and new chunks become meaningless and your search results will be unreliable.
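To guard against the duplicate-bloat problem, you can check whether a file was already indexed before adding it. This sketch relies on the `source` metadata that PyPDFLoader attaches to each chunk, and on the `where` metadata filter of the LangChain `Chroma.get()` method:

```python
def already_indexed(db, pdf_path):
    """Return True if any chunk from pdf_path is already in the Chroma store.

    Uses Chroma's metadata filtering: fetch any entry whose 'source'
    metadata matches the file path we are about to add.
    """
    existing = db.get(where={"source": pdf_path})
    return len(existing["ids"]) > 0

# Sketch of usage (vector_db is the Chroma instance from add_pdf_to_existing_db):
# if already_indexed(vector_db, "C:/Docs/FAA_Manual.pdf"):
#     print("Skipping: this PDF is already in the library.")
```

Note that the stored `source` is the exact path you loaded from, so the same file loaded via two different paths would still be treated as new.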

Would you like me to update the Tkinter GUI to include a "File Select" button so you can browse your computer and add new PDFs to the database via the interface?


If I have five documents in Chroma DB and I run a query, will the source reference be provided for each of the search results?

Yes, absolutely. As long as you are using the PyPDFLoader from my earlier code, every single search result will come with its own "ID card" (metadata) that tells you exactly where it came from.

When you query a ChromaDB containing five different documents, the database doesn't just return the text; it returns a Document Object. This object has two parts:

  1. page_content: The actual text snippet.

  2. metadata: A dictionary containing the source (file path) and the page number.

🕵️ How the GUI handles multiple sources

In the perform_search method we wrote earlier, this line is doing the heavy lifting for you:

source = doc.metadata.get('source', 'Unknown')
page = doc.metadata.get('page', '?')

header = f"--- Result {i+1} | Source: {source} | Page: {page} ---\n"

If your five documents are named NASA_1.pdf, NASA_2.pdf, etc., your results window will look like this:

--- Result 1 | Source: C:/Docs/NASA_7150.pdf | Page: 63 --- "The software is initialized to a known safe state..."

--- Result 2 | Source: C:/Docs/FAA_Manual.pdf | Page: 12 --- "Redundant power systems must be verified every 24 hours..."


💡 Three "Gotchas" to keep in mind:

  1. The "Source" is a Path: By default, PyPDFLoader saves the full absolute path (e.g., C:\Users\Name\Documents\file.pdf) as the source. If the paths are very long, they might clutter your GUI.

    • Fix: You can use os.path.basename(source) in your Python code to show just the filename (e.g., file.pdf).

  2. Page Numbers start at 0: Most PDF loaders index the first page as Page 0. If your result says "Page 63," it’s actually the 64th page of the PDF.

    • Fix: In your display code, you can use page = doc.metadata.get('page', 0) + 1.

  3. Cross-Document Ranking: ChromaDB will rank the results based on relevance, not by document.

    • If Document A is much more relevant than the others, all 5 of your "Top K" results might come from Document A, and you won't see anything from the other 4 documents.
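The first two fixes (trimming the path and shifting to 1-based pages) can live in one small helper; this is a sketch, with `metadata` being the dictionary on each returned Document:

```python
import os

def result_header(i, metadata):
    """Format a result header showing just the filename and a 1-based page."""
    filename = os.path.basename(metadata.get("source", "Unknown"))
    page = metadata.get("page", 0) + 1  # most loaders count pages from 0
    return f"--- Result {i + 1} | Source: {filename} | Page: {page} ---"

# Example:
# result_header(0, {"source": "C:/Docs/NASA_7150.pdf", "page": 62})
# -> "--- Result 1 | Source: NASA_7150.pdf | Page: 63 ---"
```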

Would you like me to show you how to "clean" those long file paths so they look nicer in your GUI results?


Will the source references and distance scores combined with the response text become noise to an LLM evaluating the results?

This is a fantastic question that gets to the heart of RAG (Retrieval-Augmented Generation) architecture.

The short answer: Yes, if you just "dump" everything in, it becomes noise. However, if handled correctly, that "noise" becomes the provenance (proof) the LLM needs to be accurate.

Here is how to handle the different elements when passing them to an LLM:

1. The Distance Score (The Noise)

Verdict: Keep it OUT.

The LLM does not understand what a "Distance Score of 0.6513" means in the context of a vector database. To the LLM, it's just a random number that might be mistaken for a version number, a date, or a requirement ID.

  • Recommendation: Use the score to filter results in your Python code, but do not send the raw score to the LLM.
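A minimal sketch of that filtering step (the 0.8 cutoff is an arbitrary example; tune it for your embedding model and distance metric):

```python
def filter_by_distance(results_with_scores, max_distance=0.8):
    """Keep only (doc, score) pairs at or below the distance cutoff.

    With Chroma's default distance metric, smaller = more similar, so this
    drops weak matches before they ever reach the LLM prompt.
    """
    return [(doc, score) for doc, score in results_with_scores if score <= max_distance]

# Sketch of usage:
# results = db.similarity_search_with_score(query, k=5)
# strong = filter_by_distance(results, max_distance=0.8)
```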

2. The Source Reference (The Signal)

Verdict: Keep it IN, but format it.

LLMs are prone to "hallucinating" (making things up). By providing the source and page number, you can instruct the LLM to cite its sources. This turns the "noise" into an auditing tool.

  • Good Prompting: "Using the provided context, answer the question. At the end of your answer, list the Source and Page numbers used."

3. The Response Text (The Core)

Verdict: The most important part.

This is the "Context" the LLM uses to answer.


🛠️ The "Clean" Way to Feed an LLM

Instead of sending the raw string you see in your GUI, you should create a Context String in your Python background. Here is how that looks in code:

# 1. Get results from Chroma
results = db.similarity_search_with_score(query, k=3)

# 2. Build a "Clean" context for the LLM
context_blocks = []
for doc, score in results:
    # We skip the score here!
    filename = os.path.basename(doc.metadata.get('source', 'Unknown'))
    page = doc.metadata.get('page', 0) + 1
    
    # Format each chunk clearly for the LLM
    block = f"SOURCE: {filename} (Page {page})\nCONTENT: {doc.page_content}"
    context_blocks.append(block)

all_context = "\n\n---\n\n".join(context_blocks)

# 3. Create the final Prompt
prompt = f"""
You are a technical assistant. Answer the question using ONLY the context provided below.
If the answer isn't in the context, say you don't know.

### CONTEXT ###
{all_context}

### QUESTION ###
{query}

### ANSWER ###
"""

⚖️ Balancing "Noise" vs. "Context"

  • Raw text → Essential. Pass as-is.

  • Source/Page → Helpful. Pass it with a clear label like [Source: X, Page: Y].

  • Distance score → Harmful. Discard; use it only for your own debugging or GUI display.

  • PDF artifacts (e.g., "3.7.3 134") → Minor noise. Most modern LLMs (GPT-4o, Claude 3.5) are very good at ignoring these.

🚀 Pro Tip: "The Lost in the Middle" Phenomenon

If you set Top_K to a very high number (like 20), the LLM might struggle to find the needle in the haystack. Most LLMs are best at processing information at the beginning and end of the provided context. Keeping your Top_K between 3 and 7 is usually the "sweet spot" for accuracy.

Would you like me to show you how to integrate a free local LLM (like Ollama) so your Python app can actually generate a summary using these cleaned-up results?


Yes Granite-4-h-small

Combining a fast local LLM with your PDF search tool turns it from a simple "keyword finder" into a true AI Research Assistant.

The IBM Granite-3.1 series (and the newer Granite 4 hybrid models) are particularly good for this because they are "enterprise-ready"—meaning they are trained to be concise, follow instructions strictly, and handle long technical contexts (up to 128k tokens).

⚙️ Step 1: Setup Ollama

First, ensure you have Ollama installed. Then, open your terminal and pull the model:

# Using Granite 3.1 8B (Reliable for RAG)
ollama pull granite3.1:8b

# OR the ultra-fast Granite 4 hybrid "Small"
ollama pull granite4:small-h

🐍 Step 2: The Combined App (RAG + GUI)

This script merges our previous ChromaDB logic with a call to the local LLM. It "cleans" the context before sending it to the AI so the distance scores don't confuse it.

import tkinter as tk
from tkinter import ttk, messagebox
import os
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate

class GraniteRAGApp:
    def __init__(self, root, db_path="./chroma_db"):
        self.root = root
        self.root.title("NASA Standards AI Assistant (Granite-4)")
        self.root.geometry("800x700")

        # 1. Initialize Components
        try:
            self.embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
            self.db = Chroma(persist_directory=db_path, embedding_function=self.embeddings)
            
            # Initialize Granite via Ollama
            self.llm = ChatOllama(model="granite4:small-h", temperature=0)
        except Exception as e:
            messagebox.showerror("Error", f"Initialization failed: {e}")
            return  # don't build the UI if the database or LLM failed to load
        self.setup_ui()

    def setup_ui(self):
        # --- Input Section ---
        input_frame = ttk.Frame(self.root, padding="15")
        input_frame.pack(fill="x")

        ttk.Label(input_frame, text="Search Query:", font=("Arial", 10, "bold")).pack(anchor="w")
        self.query_entry = ttk.Entry(input_frame, width=70)
        self.query_entry.pack(fill="x", pady=5)
        self.query_entry.bind("<Return>", lambda e: self.run_rag())

        btn_frame = ttk.Frame(input_frame)
        btn_frame.pack(fill="x")

        self.search_btn = ttk.Button(btn_frame, text="🚀 Ask AI", command=self.run_rag)
        self.search_btn.pack(side="left", padx=5)
        
        ttk.Button(btn_frame, text="🗑️ Clear", command=self.clear_all).pack(side="right")

        # --- Output Section ---
        output_frame = ttk.Frame(self.root, padding="10")
        output_frame.pack(expand=True, fill="both")

        self.result_text = tk.Text(output_frame, wrap="word", padx=15, pady=15, font=("Consolas", 10))
        self.result_text.pack(side="left", expand=True, fill="both")
        
        scroller = ttk.Scrollbar(output_frame, command=self.result_text.yview)
        scroller.pack(side="right", fill="y")
        self.result_text.config(yscrollcommand=scroller.set)

    def run_rag(self):
        query = self.query_entry.get()
        if not query: return

        self.result_text.delete("1.0", tk.END)
        self.result_text.insert(tk.END, "🤖 Granite is thinking...\n" + "-"*30 + "\n")
        self.root.update()

        # 1. RETRIEVE
        # We get Top 4 chunks to give Granite enough context
        results_with_scores = self.db.similarity_search_with_score(query, k=4)

        # 2. CLEAN CONTEXT (Removing scores/noise for the LLM)
        context_list = []
        sources_list = []
        for doc, score in results_with_scores:
            fname = os.path.basename(doc.metadata.get('source', 'Unknown'))
            page = doc.metadata.get('page', 0) + 1
            context_list.append(f"[File: {fname}, Pg: {page}]\n{doc.page_content}")
            sources_list.append(f"- {fname} (Page {page}) [Dist: {score:.4f}]")

        formatted_context = "\n\n---\n\n".join(context_list)

        # 3. GENERATE (The Prompt)
        prompt = f"""You are a NASA Safety Standards expert. 
        Use the provided context to answer the user's question accurately.
        If the context doesn't contain the answer, say you don't know based on these documents.
        
        CONTEXT:
        {formatted_context}

        QUESTION: {query}
        
        ANSWER:"""

        try:
            response = self.llm.invoke(prompt)
            
            # Display Answer
            self.result_text.delete("1.0", tk.END)
            self.result_text.insert(tk.END, "💡 AI RESPONSE:\n", "bold")
            self.result_text.insert(tk.END, response.content + "\n\n")
            
            # Display Sources for Verification
            self.result_text.insert(tk.END, "📂 SOURCES USED:\n", "bold")
            self.result_text.insert(tk.END, "\n".join(sources_list))
            self.result_text.tag_configure("bold", font=("Arial", 11, "bold"))
            
        except Exception as e:
            messagebox.showerror("LLM Error", f"Could not reach Ollama: {e}")

    def clear_all(self):
        self.query_entry.delete(0, tk.END)
        self.result_text.delete("1.0", tk.END)

if __name__ == "__main__":
    root = tk.Tk()
    app = GraniteRAGApp(root)
    root.mainloop()

🧠 Why this works so well with Granite

  1. Zero Hallucination: By giving Granite the specific "Only use this context" instruction, you prevent it from using its general training data to guess NASA requirements.

  2. Distance Filtering: In the background, you still see the [Dist: 0.XXXX] scores in your source list. If you see a source with a score over 1.2, you know Granite had to "stretch" to include that information.

  3. Privacy: Everything stays on your machine. No NASA data ever leaves your computer.

🚀 Performance Tip

If the response feels a bit slow, the Granite 4 Small model (small-h) is specifically optimized for "Hybrid" inference, meaning it can run very efficiently on standard consumer laptops while maintaining high accuracy for RAG tasks.

Would you like me to show you how to save these AI-generated summaries into a text file or log for your records?

💡 AI RESPONSE:

Based on the provided context from NASA Procedural Requirements document N_PR_7150_002D_, here are the key safety-critical software requirements:


1. Determine if each software component is safety-critical per the criteria in NASA-STD-8739.8 (SWE-205).


2. If a project has safety-critical software, implement the safety-critical software requirements in NASA-STD-8739.8 (SWE-023).


3. Specific requirements for safety-critical software include:

   a. Initialization to a known safe state at first start and restarts (3.7.3a) 

   b. Safe state transitions between predefined states (3.7.3b)

   c. Safe termination of software functions (3.7.3c)

   d. At least two independent actions required for operator overrides (3.7.3d)

   e. Rejection of out-of-sequence commands that could cause hazards (3.7.3e)

   f. Detection and recovery from inadvertent memory modification (3.7.3f)

   g. Integrity checks on software inputs and outputs (3.7.3g)

   h. Prerequisite checks prior to executing safety-critical commands (3.7.3h)

   i. No single event or action allowed to initiate an identified hazard (3.7.3i)


4. For safety-critical software, ensure 100% code test coverage using Modified Condition/Decision Coverage (MC/DC) criterion (3.7.4).


5. Ensure all safety-critical software components have a cyclomatic complexity value of 15 or lower (3.7.5). Any exceedances must be reviewed and waived with rationale.


These requirements aim to ensure the safe and reliable operation of safety-critical software systems in NASA projects.


📂 SOURCES USED:

- N_PR_7150_002D_.pdf (Page 26) [Dist: 0.6031]

- N_PR_7150_002D_.pdf (Page 65) [Dist: 0.6095]

- N_PR_7150_002D_.pdf (Page 64) [Dist: 0.6528]

- N_PR_7150_002D_.pdf (Page 35) [Dist: 0.6877]


From <https://gemini.google.com/app/728df01fbe9af51a>  Google Gemini 3.0 (Fast)
