#3767 Two-Pass Knowledge Extraction :: Named Entity Recognition (NER) to Knowledge Graph

#3767 I want to create a 2-pass knowledge extraction application that parses a PDF, finds named entities and relationships, and generates a list to be selected from for the second pass. The second pass then re-parses the document to extract those entities and relationships to produce a knowledge graph.

Of course. Creating a 2-pass knowledge extraction application is an excellent approach for combining automated discovery with human-in-the-loop curation to generate high-quality knowledge graphs. This method allows you to cast a wide net in the first pass and then apply domain-specific focus in the second, improving the final output's relevance and accuracy.

Here’s a breakdown of the architecture, technology, and workflow.


Conceptual Workflow

The core idea is to separate the process into two distinct stages: a Discovery Pass that finds potential entities and relationships, and a Targeted Extraction Pass that uses a user's selection to build the final, refined knowledge graph.

Pass 1: Discovery and Candidate Generation 🧠

The goal of this pass is high recall—finding as many potential facts as possible, even if some are irrelevant.

  1. PDF Ingestion and Text Extraction: The first step is to get clean, structured text from the PDF. The quality of this step is critical.

    • Action: The application ingests a PDF file.

    • Process: A parsing library extracts raw text, paying attention to reading order, tables, and headers. For scanned documents, Optical Character Recognition (OCR) is used.

    • Output: A clean, plain text version of the document.

  2. Broad Named Entity and Relation Extraction: Using a general-purpose NLP model, you'll perform a broad analysis of the text.

    • Action: The extracted text is processed by a pre-trained NLP model.

    • Process: The model identifies all named entities it recognizes (e.g., PERSON, ORGANIZATION, DATE, LOCATION) and predicts relationships between them (e.g., (Person, works_at, Organization)).

    • Output: A collection of all unique entity types and relationship types found in the document.

  3. Generate a Selectable List: The results from the previous step are aggregated and presented to the user.

    • Action: The system generates a user interface (UI) element.

    • Process: It creates a unique list of all entity labels (like PERSON, PRODUCT) and relation labels (like founded_by, located_in) that were discovered.

    • Output: A checklist or multi-select dropdown for the user (e.g., ☐ PERSON (found 15), ☑ ORGANIZATION (found 8), ☑ founded_by (found 2)).
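
Aggregating the Pass 1 results into those per-label counts can be sketched with a `Counter`; the entity and relation lists below are hypothetical stand-ins for a real model's output:

```python
from collections import Counter

# Hypothetical Pass 1 output: entity mentions as (text, label) pairs and
# one relation label per extracted triple
entity_mentions = [("Steve Jobs", "PERSON"), ("Apple Inc.", "ORG"),
                   ("Steve Wozniak", "PERSON"), ("Cupertino", "GPE")]
relation_mentions = ["co_founded", "co_founded", "located_in"]

entity_counts = Counter(label for _, label in entity_mentions)
relation_counts = Counter(relation_mentions)

# Render the checklist entries shown to the user
for label, n in (entity_counts + relation_counts).most_common():
    print(f"[ ] {label} (found {n})")
```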


Pass 2: Targeted Extraction and Graph Generation 🎯

The goal of this pass is high precision—extracting only the information the user has deemed relevant.

  1. Receive User Selections: The application takes the user's choices as the "schema" for the final knowledge graph.

    • Action: The user selects the desired entity and relationship types from the list generated in Pass 1.

    • Output: A filter list (e.g., ['ORGANIZATION', 'PERSON', 'founded_by']).

  2. Filter and Re-process: The system now re-processes the document's extracted information, but this time with a narrow focus. Instead of re-parsing the entire document with the NLP model (which is computationally expensive), you can simply filter the results from Pass 1.

    • Action: The system iterates through all the entities and relationships discovered in Pass 1.

    • Process: It keeps only the entities and relationships whose labels match the user's selection.

    • Output: A filtered list of entities and relationship triples.

  3. Construct and Store the Knowledge Graph: The filtered data is now structured as a formal knowledge graph.

    • Action: The system converts the filtered data into nodes (entities) and edges (relationships).

    • Process: Each entity becomes a node with properties (e.g., Node: Apple, Label: ORGANIZATION). Each relationship becomes a directed edge between two nodes (e.g., (Apple) -[:FOUNDED_BY]-> (Steve Jobs)). This data is then loaded into a graph database.

    • Output: A queryable knowledge graph, ready for visualization and analysis.
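
As a rough sketch of steps 2 and 3, the filtered triples can be reshaped into node and edge records ready for a graph loader; the entity dictionary and triple used here are illustrative:

```python
from typing import Dict, List, Tuple

def triples_to_graph(entities: Dict[str, str],
                     triples: List[Tuple[str, str, str]]):
    """Convert typed entities and (subject, predicate, object) triples
    into node and edge records for loading into a graph database."""
    nodes = [{"id": text, "label": label} for text, label in entities.items()]
    edges = [{"source": s, "type": p.upper(), "target": o}
             for s, p, o in triples
             if s in entities and o in entities]  # skip dangling references
    return nodes, edges

nodes, edges = triples_to_graph(
    {"Steve Jobs": "PERSON", "Apple Inc.": "ORG"},
    [("Steve Jobs", "co_founded", "Apple Inc.")],
)
print(edges)  # [{'source': 'Steve Jobs', 'type': 'CO_FOUNDED', 'target': 'Apple Inc.'}]
```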


Recommended Technology Stack ⚙️

Here is a typical stack for building such an application, primarily using Python:

  • PDF Parsing:

    • PyMuPDF (or fitz): Very fast and excellent for extracting text, metadata, and images from native PDFs.

    • pdfplumber: Great for extracting data from tables.

    • Pytesseract: An OCR engine needed for scanned, image-based PDFs.

  • NLP for Extraction (Pass 1):

    • spaCy: An excellent all-in-one library. Its pre-trained models are fast and effective for general NER. You can also train its relation extraction component.

    • Hugging Face Transformers: Provides access to state-of-the-art models (like BERT, RoBERTa) for more powerful and accurate NER and relation extraction, though they are more computationally intensive.

  • Knowledge Graph (Pass 2):

    • NetworkX: A Python library for creating and manipulating graph structures in memory. Perfect for smaller projects or intermediate processing.

    • Neo4j: A robust, native graph database for storing and querying the final knowledge graph. You can connect to it with Python drivers.

  • Backend & Frontend:

    • FastAPI or Flask: To build a web API that handles PDF uploads and serves the selection UI.

    • Streamlit: A fantastic choice for quickly building an interactive web application for this purpose without needing heavy frontend development. You can easily create file uploaders, display checklists, and visualize the graph.


Example Walkthrough

Let's say your PDF contains the sentence: "In 1976, Steve Wozniak and Steve Jobs co-founded Apple Inc. in Cupertino."

  1. Pass 1 (Discovery):

    • The NLP model runs and extracts:

      • Entities: 1976 (DATE), Steve Wozniak (PERSON), Steve Jobs (PERSON), Apple Inc. (ORGANIZATION), Cupertino (LOCATION).

      • Relationships: (Steve Wozniak, co_founded, Apple Inc.), (Steve Jobs, co_founded, Apple Inc.), (Apple Inc., located_in, Cupertino).

    • The UI shows these candidate types: ☐ DATE, ☐ PERSON, ☐ ORGANIZATION, ☐ LOCATION, ☐ co_founded, ☐ located_in.

  2. User Selection:

    • The user is only interested in founders and companies. They check the boxes for PERSON, ORGANIZATION, and co_founded.

  3. Pass 2 (Targeted Extraction):

    • The system filters the results from Pass 1 based on the user's selection.

    • The final triples for the knowledge graph are:

      • (Steve Wozniak, type, PERSON)

      • (Steve Jobs, type, PERSON)

      • (Apple Inc., type, ORGANIZATION)

      • (Steve Wozniak, co_founded, Apple Inc.)

      • (Steve Jobs, co_founded, Apple Inc.)

    • These triples are loaded into a graph database like Neo4j. The DATE and LOCATION entities and the located_in relationship are ignored.
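
To make the final loading step concrete, here is a minimal sketch that turns the surviving triples into Cypher MERGE statements for Neo4j. It is for illustration only; a production loader should send parameterized queries through the official Python driver rather than formatting strings.

```python
def triples_to_cypher(triples):
    """Build Cypher MERGE statements from (subject, predicate, object)
    triples. Illustrative only: real code should use parameterized
    queries to avoid injection and quoting issues."""
    stmts = []
    for subj, pred, obj in triples:
        rel = pred.upper()  # Cypher relationship types are conventionally uppercase
        stmts.append(
            f"MERGE (a:Entity {{name: '{subj}'}}) "
            f"MERGE (b:Entity {{name: '{obj}'}}) "
            f"MERGE (a)-[:{rel}]->(b)"
        )
    return stmts

stmts = triples_to_cypher([("Steve Jobs", "co_founded", "Apple Inc.")])
print(stmts[0])
```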


Is spaCy applied in both passes, or just the first? Is anything passed directly to an LLM for additional analysis?

Excellent questions that get to the heart of designing an efficient and powerful knowledge extraction pipeline.

Here’s a detailed breakdown of how spaCy and LLMs fit into the 2-pass architecture.

Is spaCy applied in both passes, or just the first?

Primarily, spaCy is applied comprehensively in the first pass, and its results are simply filtered in the second pass.

  • Pass 1: The Heavy Lifting. This is where spaCy does all its work. You would create a spaCy pipeline that processes the clean text extracted from the PDF. This pipeline would typically include:

    1. Tokenizer: Breaks text into words, punctuation, etc.

    2. Part-of-Speech Tagger (tagger): Identifies nouns, verbs, etc.

    3. Parser: Determines syntactic dependencies.

    4. Named Entity Recognizer (ner): This is the core component that identifies entities like PERSON, ORG, GPE (Geopolitical Entity).

    5. Custom Components: You would add a custom component for Relation Extraction. This component would use the dependency parse and named entities to find potential relationships (e.g., identifying that a PERSON is the subject of the verb "founded" and an ORG is the object).

    The output of this pass is a rich, annotated Doc object containing all the discovered entities and relationships. This is your "data lake" of potential knowledge. Running this can be computationally intensive, which is why you only want to do it once.

  • Pass 2: The Fast Filtering. The second pass should be fast and lightweight. Since you already have all the extracted information from Pass 1, there is no need to re-run the spaCy NLP pipeline.

    Instead, you simply iterate through the stored results (the entities and relations from the Doc object) and filter them based on the user's selections.

    # Conceptual Python Code
    # Pass 1
    nlp = spacy.load("en_core_web_lg")
    # Add your custom relation extraction component to the pipeline
    # nlp.add_pipe("my_relation_extractor", last=True)
    doc = nlp(text_from_pdf)
    
    # Store results
    all_entities = doc.ents
    all_relations = doc._.relations # Assuming relations are stored in a custom attribute
    
    # --- User makes their selections here ---
    user_selected_entities = ["PERSON", "ORG"]
    user_selected_relations = ["founded_by"]
    
    # Pass 2
    final_entities = [ent for ent in all_entities if ent.label_ in user_selected_entities]
    final_relations = [rel for rel in all_relations if rel.label_ in user_selected_relations]
    
    # Now, build the knowledge graph from final_entities and final_relations
    

Re-running the NLP pipeline in Pass 2 would be highly inefficient and defeat the purpose of the 2-pass design.

Is anything passed directly to an LLM for additional analysis?

Yes, this is an increasingly common and powerful pattern. You can strategically use a Large Language Model (LLM) like one from Google's Gemini family or OpenAI's GPT series to augment or even replace parts of the spaCy pipeline.

Here are the most effective ways to integrate an LLM:

  1. Zero-Shot or Few-Shot Relation Extraction (Pass 1)

    • Challenge: Building a robust, rule-based or trained relation extraction model in spaCy can be complex and time-consuming.

    • LLM Solution: LLMs are excellent at this. After spaCy identifies entities, you can feed the sentence containing those entities to an LLM with a carefully crafted prompt.

    • Example Prompt:

      Context: "In 1976, Steve Wozniak and Steve Jobs co-founded Apple Inc. in Cupertino."
      Entities: "Steve Jobs" (PERSON), "Apple Inc." (ORGANIZATION)
      Instruction: What is the relationship between "Steve Jobs" and "Apple Inc."? Use one of the following labels: [works_for, founded_by, is_ceo_of, competitor_of]. If no relationship exists, return "None".
      
      LLM Output: founded_by
      

    This approach can often yield higher accuracy for complex or subtly stated relationships than a traditional NLP model, making your Pass 1 discovery much richer.
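
    The prompt above can be assembled programmatically. This sketch only builds the prompt string; the actual LLM call is left to whichever client library (OpenAI, Gemini, etc.) you choose:

```python
def build_relation_prompt(sentence, ent1, ent2, labels):
    """Assemble a zero-shot relation-extraction prompt for a pair of
    entities found by spaCy. ent1/ent2 are (text, label) tuples."""
    return (
        f'Context: "{sentence}"\n'
        f'Entities: "{ent1[0]}" ({ent1[1]}), "{ent2[0]}" ({ent2[1]})\n'
        f'Instruction: What is the relationship between "{ent1[0]}" and '
        f'"{ent2[0]}"? Use one of the following labels: '
        f'[{", ".join(labels)}]. If no relationship exists, return "None".'
    )

prompt = build_relation_prompt(
    "In 1976, Steve Wozniak and Steve Jobs co-founded Apple Inc. in Cupertino.",
    ("Steve Jobs", "PERSON"), ("Apple Inc.", "ORGANIZATION"),
    ["works_for", "founded_by", "is_ceo_of", "competitor_of"],
)
print(prompt)
```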

  2. Entity Normalization/Canonicalization (Pass 1 or 2)

    • Challenge: Your document might mention "Apple," "Apple Inc.," and "Apple Computer Inc." These are all the same entity.

    • LLM Solution: You can ask an LLM to normalize these entities to a single, canonical form.

    • Example Prompt:

      Instruction: Resolve the entity "Apple Computer Inc." to its canonical name and provide its Wikidata ID.
      
      LLM Output: {"name": "Apple Inc.", "id": "Q312"}
      

    This adds a layer of standardization to your knowledge graph.
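
    A lightweight alternative (or pre-filter) for the LLM call is a plain alias table. The mappings below are hypothetical; in practice they could be seeded from LLM responses or a knowledge base such as Wikidata:

```python
# Hypothetical alias table mapping surface forms to canonical names
CANONICAL = {
    "Apple": "Apple Inc.",
    "Apple Computer Inc.": "Apple Inc.",
    "Apple Inc.": "Apple Inc.",
}

def canonicalize(name: str) -> str:
    """Map an entity mention to its canonical form, falling back to
    the mention itself when no alias is known."""
    return CANONICAL.get(name, name)

print(canonicalize("Apple Computer Inc."))  # Apple Inc.
print(canonicalize("Stellar Space Corp"))   # unchanged: no alias known
```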

  3. Complex Query Answering and Summarization (Post-Pass 2)

    • Challenge: A knowledge graph stores structured facts but doesn't interpret them.

    • LLM Solution: Once your KG is built, you can use an LLM as a natural language interface to it. A system can convert a user's question ("Who were the key people at Apple in the early days?") into a graph query (e.g., Cypher for Neo4j), retrieve the facts, and then use the LLM to synthesize the results into a human-readable summary.

Summary Table:

| Task | spaCy (Traditional NLP) | Large Language Model (LLM) | Recommendation |
| --- | --- | --- | --- |
| PDF Text Extraction | N/A (uses libraries like PyMuPDF) | N/A | Use dedicated PDF libraries. |
| Named Entity Recognition | Excellent and very fast; pre-trained models are highly optimized. | Good, but can be slower and more expensive. | Use spaCy in Pass 1 for speed and efficiency. |
| Relation Extraction | Good, but requires a custom-trained or rule-based component. | Excellent with zero-shot prompting; can uncover more nuanced relationships. | Combine both: use spaCy for NER, then feed sentences with entities to an LLM for relation extraction in Pass 1. |
| Filtering (Pass 2) | N/A (simple code logic) | N/A | Use simple filtering logic in your code. |
| Entity Normalization | Possible with custom rules or gazetteers. | Excellent for resolving to canonical forms using its vast world knowledge. | Use an LLM for this as a refinement step. |


Provide a python application to do this for a NASA software hazard report. Assume that the user selects all identified entities and relationships from the first pass.

Of course. Here is a complete Python application that performs the 2-pass knowledge extraction on a sample NASA software hazard report.

This script uses spaCy for entity and relationship discovery (Pass 1) and NetworkX with pyvis to build and visualize the knowledge graph (Pass 2). Since the prompt assumes the user selects everything, Pass 2 simply processes all findings from Pass 1.

Prerequisites

You'll need to install the required libraries and download a spaCy model.

# Install libraries
pip install spacy PyMuPDF networkx pyvis

# Download a spaCy English model
python -m spacy download en_core_web_lg

Python Application Code

This application is structured into a single script for simplicity. It includes:

  1. Text Extraction: A function to read text from a PDF.

  2. Knowledge Extractor Class: A class to handle both extraction passes.

  3. Graph Visualization: A function to create an interactive HTML graph.

  4. Main Execution Block: An example demonstrating how to use the code.

import spacy
import fitz  # PyMuPDF
import networkx as nx
from pyvis.network import Network
from typing import List, Tuple, Dict, Any

# --- Sample text from a hypothetical NASA Software Hazard Report ---
# In a real application, this text would be extracted from a PDF.
SAMPLE_TEXT = """
Hazard Report ID: HZ-2025-042
System: Orion Multi-Purpose Crew Vehicle (MPCV)
Subsystem: Guidance, Navigation, and Control (GNC)
Hazard Title: Erroneous State Vector Calculation
Description: The Flight Control Unit (FCU) software, module 'calc_pos.c', may produce an erroneous state vector if the primary GPS unit, designated GPS-1, fails to provide a timely data packet. Software lead Dr. Evelyn Reed noted that this issue is exacerbated by a race condition in the real-time operating system (RTOS). The backup system, an Inertial Measurement Unit (IMU), is designed to take over, but a software flag 'is_gps_primary' is not correctly cleared by the error handling routine developed by contractor Stellar Space Corp. This could lead to the GNC system trusting a stale vector, causing an incorrect orbital adjustment burn. A similar issue was noted in the Ares V program. The fix requires patching the RTOS scheduler.
"""

def extract_text_from_pdf(pdf_path: str) -> str:
    """Extracts text from a given PDF file."""
    try:
        doc = fitz.open(pdf_path)
        text = ""
        for page in doc:
            text += page.get_text()
        return text
    except Exception as e:
        print(f"Error reading PDF {pdf_path}: {e}")
        return ""

class KnowledgeExtractor:
    """
    Manages the 2-pass knowledge extraction process.
    """
    def __init__(self, model: str = "en_core_web_lg"):
        """Initializes the extractor with a spaCy model."""
        self.nlp = spacy.load(model)
        self.entities: List[Dict[str, Any]] = []
        self.relations: List[Dict[str, Any]] = []

    def _extract_relationships(self, doc: spacy.tokens.Doc) -> List[Tuple[str, str, str]]:
        """A simple rule-based method to find subject-verb-object triples."""
        triples = []
        for token in doc:
            # Look for a verb
            if "VB" in token.tag_:
                subject = None
                obj = None
                # Find the verb's subject (nsubj) and direct object (dobj)
                for child in token.children:
                    if "nsubj" in child.dep_ and child.ent_type_:
                        subject = child.text
                    if "dobj" in child.dep_ and child.ent_type_:
                        obj = child.text
                
                if subject and obj:
                    # Basic cleaning: use entity text if available
                    subject_ent = next((ent for ent in doc.ents if subject in ent.text), None)
                    obj_ent = next((ent for ent in doc.ents if obj in ent.text), None)
                    
                    if subject_ent and obj_ent:
                         triples.append((subject_ent.text.strip(), token.lemma_, obj_ent.text.strip()))
        return triples

    def perform_pass_1(self, text: str):
        """
        Performs the discovery pass to find all potential entities and relationships.
        """
        print("--- Starting Pass 1: Discovery ---")
        doc = self.nlp(text)

        # 1. Extract Named Entities
        found_entities = {}
        for ent in doc.ents:
            # Avoid adding duplicate entities
            if ent.text.strip() not in found_entities:
                self.entities.append({"text": ent.text.strip(), "label": ent.label_})
                found_entities[ent.text.strip()] = ent.label_
        
        print(f"Found {len(self.entities)} unique entities.")
        
        # 2. Extract Relationships (simple Subject-Verb-Object triples for demonstration)
        extracted_relations = self._extract_relationships(doc)
        for subj, pred, obj in extracted_relations:
             self.relations.append({"subject": subj, "predicate": pred, "object": obj})
        
        print(f"Found {len(self.relations)} potential relationships.")
        print("--- Pass 1 Complete ---")

    def perform_pass_2(self) -> Tuple[List[Dict], List[Dict]]:
        """
        Performs the targeted extraction pass.
        
        For this problem, we assume the user selects EVERYTHING found in Pass 1.
        Therefore, this pass simply returns all the data collected.
        """
        print("\n--- Starting Pass 2: Targeted Extraction ---")
        print("User selected all entities and relationships.")
        
        # In a real app, this step would filter based on user input.
        # Here, we just pass everything through.
        final_entities = self.entities
        final_relations = self.relations

        print(f"Finalizing {len(final_entities)} entities and {len(final_relations)} relationships for KG.")
        print("--- Pass 2 Complete ---")
        return final_entities, final_relations

def create_knowledge_graph(entities: List[Dict], relations: List[Dict], output_filename: str = "knowledge_graph.html"):
    """Creates and saves an interactive knowledge graph visualization."""
    print("\n--- Building Knowledge Graph ---")
    
    # Define colors for different entity types for better visualization
    entity_colors = {
        "ORG": "#FFA07A",      # Light Salmon
        "PERSON": "#20B2AA",   # Light Sea Green
        "PRODUCT": "#778899",  # Light Slate Gray
        "GPE": "#9370DB",      # Medium Purple
        "FAC": "#3CB371",      # Medium Sea Green
        "EVENT": "#F0E68C",    # Khaki
        "DEFAULT": "#D3D3D3"   # Light Grey
    }

    # Create a pyvis network
    net = Network(height="800px", width="100%", bgcolor="#222222", font_color="white", notebook=False, directed=True)
    
    # Add nodes (entities) to the graph
    for entity in entities:
        label = entity["label"]
        color = entity_colors.get(label, entity_colors["DEFAULT"])
        net.add_node(entity["text"], label=entity["text"], title=f"Entity: {label}", color=color)

    # Add edges (relationships) to the graph
    for rel in relations:
        try:
            net.add_edge(rel["subject"], rel["object"], title=rel["predicate"], label=rel["predicate"])
        except Exception as e:
            print(f"Could not add edge for relation {rel}: {e}. Node might be missing.")

    net.show_buttons(filter_=['physics'])
    net.save_graph(output_filename)
    print(f"Successfully created knowledge graph: {output_filename}")


if __name__ == "__main__":
    # In a real scenario, you would use the PDF function:
    # report_text = extract_text_from_pdf("path/to/your/report.pdf")
    # For this example, we use the sample text directly.
    report_text = SAMPLE_TEXT

    # Initialize and run the extractor
    extractor = KnowledgeExtractor()
    
    # Pass 1: Discover everything
    extractor.perform_pass_1(report_text)
    
    # This is where a UI would present the findings to the user for selection.
    print("\n[Simulating User Interaction: User reviews and selects all items...]\n")
    
    # Pass 2: Filter and finalize based on user selection
    final_entities, final_relations = extractor.perform_pass_2()
    
    # Create and save the knowledge graph from the results of Pass 2
    create_knowledge_graph(final_entities, final_relations)

How to Run the Code

  1. Save the code above as a Python file (e.g., hazard_parser.py).

  2. Run it from your terminal: python hazard_parser.py

Output

The script will print the progress of each pass to the console:

--- Starting Pass 1: Discovery ---
Found 10 unique entities.
Found 2 potential relationships.
--- Pass 1 Complete ---

[Simulating User Interaction: User reviews and selects all items...]

--- Starting Pass 2: Targeted Extraction ---
User selected all entities and relationships.
Finalizing 10 entities and 2 relationships for KG.
--- Pass 2 Complete ---

--- Building Knowledge Graph ---
Successfully created knowledge graph: knowledge_graph.html

Most importantly, it will generate an interactive HTML file named knowledge_graph.html. Open this file in your web browser.

You will see a graph where:

  • Nodes are the entities like "Orion Multi-Purpose Crew Vehicle", "Dr. Evelyn Reed", and "Stellar Space Corp". Nodes are color-coded by their entity type (e.g., PERSON, ORG).

  • Edges are the relationships connecting them, with the verb as the label (e.g., "develop").

You can drag the nodes around, zoom in, and hover over them to see more details. This provides an intuitive way to explore the connections and potential failure points described in the hazard report.


If the user had actually selected a subset of entities and relationships, how would the selected subsets be used by extractor.perform_pass_2()?

If the user selected a subset, the perform_pass_2 method would use those selections to filter the complete list of entities and relationships discovered in Pass 1.

The method's signature would be modified to accept the user's choices, and its internal logic would change from passing everything through to actively selecting the desired items.


Modified perform_pass_2() Method

Here’s how you would rewrite the perform_pass_2() method inside the KnowledgeExtractor class to handle this filtering.

# (Inside the KnowledgeExtractor class from the previous example)

def perform_pass_2(self, 
                   selected_entity_labels: List[str], 
                   selected_relation_predicates: List[str]
                   ) -> Tuple[List[Dict], List[Dict]]:
    """
    Performs the targeted extraction pass by filtering results from Pass 1
    based on the user's selections.
    """
    print("\n--- Starting Pass 2: Targeted Extraction ---")
    print(f"Filtering based on user selection: {selected_entity_labels} & {selected_relation_predicates}")

    # 1. Filter entities based on their LABEL
    final_entities = [
        entity for entity in self.entities 
        if entity["label"] in selected_entity_labels
    ]

    # 2. Filter relationships based on their PREDICATE (the verb)
    final_relations = [
        relation for relation in self.relations 
        if relation["predicate"] in selected_relation_predicates
    ]
    
    # 3. An additional cleanup step: Ensure all entities in the final
    #    relationships are also in the final entities list.
    final_entity_texts = {entity['text'] for entity in final_entities}
    final_relations = [
        rel for rel in final_relations
        if rel['subject'] in final_entity_texts and rel['object'] in final_entity_texts
    ]


    print(f"Finalizing {len(final_entities)} entities and {len(final_relations)} relationships for KG.")
    print("--- Pass 2 Complete ---")
    return final_entities, final_relations

How It Works: Step-by-Step

  1. Accept User Input: The method now takes two arguments:

    • selected_entity_labels: A list of strings like ['PERSON', 'ORG'].

    • selected_relation_predicates: A list of strings (the verbs/predicates) like ['develop', 'note'].

  2. Filter Entities: It iterates through self.entities (the full list from Pass 1) and keeps only those whose "label" key is present in the selected_entity_labels list.

  3. Filter Relationships: It does the same for self.relations, keeping only the relationships whose "predicate" key matches one of the selected_relation_predicates.

  4. Ensure Consistency: An important final step is to make sure that the subjects and objects of the final relationships actually exist in the final entity list. For example, if the user deselects PERSON but keeps a relationship like (Dr. Evelyn Reed, noted, issue), this cleanup step would remove that relationship because its subject (Dr. Evelyn Reed) is no longer a valid entity in the final set.


Example Usage

Here’s how you would call this new method in the main execution block.

if __name__ == "__main__":
    report_text = SAMPLE_TEXT
    extractor = KnowledgeExtractor()
    
    # Pass 1: Discover everything
    extractor.perform_pass_1(report_text)
    
    # [Simulating User Interaction]
    # The user is presented with all found labels: 
    # Entities: ['ORG', 'PERSON', 'PRODUCT', 'GPE']
    # Relations: ['produce', 'note']
    # The user decides they only care about organizations, products, and who "noted" what.
    
    user_entity_selection = ["ORG", "PRODUCT"]
    user_relation_selection = ["note"]
    
    print(f"\nUser selected entity types: {user_entity_selection}")
    print(f"User selected relation types: {user_relation_selection}\n")

    # Pass 2: Call the method with the user's specific selections
    final_entities, final_relations = extractor.perform_pass_2(
        selected_entity_labels=user_entity_selection,
        selected_relation_predicates=user_relation_selection
    )
    
    # The knowledge graph will now be much more focused
    create_knowledge_graph(final_entities, final_relations, "focused_knowledge_graph.html")

The resulting focused_knowledge_graph.html would be a smaller, more targeted graph containing only nodes for organizations and products, and only the edges representing the "note" relationship, effectively filtering out all the noise the user deemed irrelevant.


From <https://gemini.google.com/app/>  Google Gemini 2.5 Pro
