#3723 Pre-Processing KG Documents for Entities and Relationships

I want to apply RAG within a knowledge graph that includes NASA Software Hazard Reports. Do the hazard reports need to be pre-processed to identify entities and relationships to add to the graph, or can Neo4j perform semantic search to get this information?

Yes, the NASA Software Hazard Reports need to be pre-processed to identify entities and relationships before you can effectively use them in a knowledge graph for Retrieval-Augmented Generation (RAG) with Neo4j. However, Neo4j and its ecosystem of tools can significantly streamline this process.

While Neo4j has powerful semantic search capabilities, these are primarily used to query the data after it has been structured into a graph. The initial extraction of entities (like "software component," "hazard type," or "mitigation measure") and their relationships from the unstructured text of the reports is a critical pre-processing step.

Here's a breakdown of the process and Neo4j's role:

The Necessity of Pre-processing

Think of your hazard reports as a collection of books. Before you can ask complex questions about the characters and their connections across the entire library, you first need to read the books and identify who the characters are and how they relate to each other. Pre-processing your reports serves the same function.

This pre-processing step involves:

  • Entity Extraction: Identifying key nouns and concepts in the text. For your use case, this could include:

    • Software systems and components

    • Specific hazards or failures

    • Root causes

    • Corrective actions and mitigation strategies

    • Dates and report numbers

  • Relationship Extraction: Defining how these entities are connected. For instance:

    • A software component exhibits a specific hazard.

    • A root cause leads to a failure.

    • A mitigation strategy addresses a specific hazard.

How Neo4j Facilitates This Process

Neo4j doesn't perform this extraction out-of-the-box on raw text, but it provides robust tools and integrations, often leveraging Large Language Models (LLMs), to accomplish this. Tools like the Neo4j LLM Knowledge Graph Builder and integrations with frameworks such as LangChain can automate much of the heavy lifting.
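
For example, with LangChain the extraction-and-load step can be reduced to a few calls. The following is a sketch only, assuming recent langchain-experimental, langchain-community, and langchain-openai packages; module paths and class names shift between releases, so check the current LangChain documentation before relying on it.

from langchain_core.documents import Document
from langchain_openai import ChatOpenAI
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_community.graphs import Neo4jGraph

# Connect to Neo4j and configure an LLM-backed graph transformer.
graph = Neo4jGraph(url="neo4j+s://your-db-uri", username="neo4j", password="your-password")
transformer = LLMGraphTransformer(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    allowed_nodes=["Hazard", "SoftwareComponent", "Cause", "Mitigation"],
    allowed_relationships=["IDENTIFIED_IN", "CAUSED_BY", "MITIGATED_BY"],
)

# Extract entities/relationships from the report text and load them as a graph.
docs = [Document(page_content="...text of one hazard report...")]
graph_documents = transformer.convert_to_graph_documents(docs)
graph.add_graph_documents(graph_documents)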

The typical workflow looks like this:

  1. Ingestion: The hazard reports (in formats like PDF or text) are ingested into a processing pipeline.

  2. Chunking: The documents are broken down into smaller, manageable pieces of text (a minimal chunking sketch follows this list).

  3. Entity and Relationship Extraction: This is where LLMs come into play. You can provide the LLM with a schema of the types of entities and relationships you're interested in. The LLM then reads the text chunks and extracts instances of these, often outputting them in a structured format like JSON.

  4. Graph Construction: The extracted entities and relationships are then loaded into Neo4j as nodes and edges, forming your knowledge graph.

  5. Semantic Search and RAG: Once the knowledge graph is built, you can use Neo4j's Cypher query language, vector search, and graph algorithms to perform complex queries. When integrated into a RAG pipeline, a user's question can be used to retrieve relevant subgraphs and context from your hazard reports, which is then fed to an LLM to generate a comprehensive and accurate answer.
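
As a minimal illustration of the chunking step, a naive fixed-window splitter with overlap can look like the sketch below; production pipelines usually split on document structure (sections, paragraphs) or use a library text splitter instead.

def chunk_text(text: str, max_chars: int = 1500, overlap: int = 200) -> list[str]:
    """Naive fixed-window chunker with character overlap between chunks."""
    chunks = []
    start = 0
    step = max_chars - overlap
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += step
    return chunks

# Example: chunks = chunk_text(open("hazard_report.txt").read())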

In essence, you are not manually reading each report and drawing out the connections, but rather orchestrating a process where an LLM does the reading and structuring, and Neo4j provides the framework for storing and querying that structured knowledge. This "GraphRAG" approach combines the contextual understanding of a knowledge graph with the semantic search capabilities of vector databases, leading to more accurate and explainable results from your RAG application.

For a more in-depth understanding of creating knowledge graphs from text, you might find this video helpful: Langchain & Neo4j: Creating Knowledge Graphs from Text


I expect that this would apply to other documents like IRDs, ICDs, SRSs and ConOps.

Yes, absolutely. Your expectation is correct. The process of using a knowledge graph for Retrieval-Augmented Generation (RAG) is not only applicable but is arguably even more powerful when applied to a collection of documents like IRDs (Interface Requirements Documents), ICDs (Interface Control Documents), SRSs (Software Requirements Specifications), and ConOps (Concept of Operations).


Why This Approach is Ideal for Your Documents

These types of engineering documents are dense with interconnected information and are often difficult to analyze collectively. Creating a unified knowledge graph from them can transform your project's information landscape.

  • Breaking Down Silos: Each document provides a different view of the same system. An SRS details what the system must do, an ICD specifies how it communicates, and a ConOps explains why and how it's used. A knowledge graph can merge these views, creating a single, queryable model of your entire system.

  • Enabling Powerful Traceability: The core value of these documents lies in the relationships between them. By extracting entities (e.g., Requirement, SystemComponent, Interface, OperationalScenario) and their relationships (e.g., ALLOCATED_TO, VERIFIED_BY, CONNECTS_TO), you can ask complex questions that are nearly impossible to answer by reading documents manually:

    • "Show me all software requirements from the SRS that are affected by the 'Telemetry Data' interface defined in the ICD."

    • "Which operational scenarios in the ConOps depend on the 'Attitude Control System' component, and what are its associated requirements in the SRS?"

    • "If we deprecate this interface in the IRD, what are all the upstream and downstream impacts across all documented systems?"

  • Enhancing RAG with Deep Context: When you ask a question, the RAG system can now retrieve a highly relevant, interconnected subgraph of information from across the entire document set. Instead of just getting a paragraph from the SRS, the system can provide a complete answer synthesized from the requirement itself, the interfaces it uses (from the ICD), and the operational context it supports (from the ConOps).

The pre-processing workflow remains the same: you would define a graph schema that includes entities and relationships from all these document types, use an LLM to perform the extraction, and load it into Neo4j. The result is a comprehensive "digital twin" of your project documentation, ready for sophisticated analysis and querying.
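
As an illustration, the first traceability question above might translate into a query like the following sketch. The node labels, the id property, and the USES_INTERFACE relationship are hypothetical and must match whatever schema you define during extraction.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j+s://your-database-uri", auth=("neo4j", "your-password"))

query = """
MATCH (r:Requirement)-[:USES_INTERFACE]->(i:Interface {id: $interface_id})
RETURN r.id AS requirement, r.description AS text
"""

with driver.session() as session:
    for record in session.run(query, interface_id="Telemetry_Data"):
        print(record["requirement"], "->", record["text"])

driver.close()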


Provide python code to pre-process a document with an example schema to identify entities and relationships and add them to the knowledge graph

Of course. Here is a complete Python script that demonstrates how to pre-process a document using an LLM, define a schema with Pydantic, and load the extracted entities and relationships into a Neo4j knowledge graph.

This example uses the openai library for entity extraction and the neo4j driver to connect to the database.


Step 1: Prerequisites

First, you need to install the necessary Python libraries.

pip install neo4j openai pydantic python-dotenv

You'll also need access to a Neo4j database (like a free AuraDB instance) and an OpenAI API key. Store your credentials in a .env file in the same directory as your script for security.

.env file:

# Neo4j Credentials
NEO4J_URI="neo4j+s://your-database-uri.databases.neo4j.io"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="your-database-password"

# OpenAI API Key
OPENAI_API_KEY="sk-your-openai-api-key"

Step 2: Python Script

This script defines a schema for requirements and system components, processes an example document against that schema, and uploads the result to your Neo4j instance. 🧑‍💻

import os
from typing import List, Literal
from dotenv import load_dotenv

import openai
from pydantic import BaseModel, Field
from neo4j import GraphDatabase

# --- 1. Load Credentials ---
load_dotenv()

# --- 2. Define the Knowledge Graph Schema with Pydantic ---
# This schema tells the LLM exactly what kind of information to extract.

class Node(BaseModel):
    """A node in the knowledge graph."""
    id: str = Field(..., description="A unique identifier for the node.")
    label: Literal["Requirement", "Component", "Function"] = Field(..., description="The type or label of the node.")

class Relationship(BaseModel):
    """A relationship between two nodes in the knowledge graph."""
    source_id: str = Field(..., description="The ID of the source node.")
    target_id: str = Field(..., description="The ID of the target node.")
    type: Literal["ALLOCATED_TO", "PERFORMS", "VERIFIES"] = Field(..., description="The type of the relationship.")
    description: str = Field(..., description="A description of the relationship.")

class KnowledgeGraph(BaseModel):
    """The full knowledge graph extracted from the text."""
    nodes: List[Node] = Field(..., description="A list of all nodes in the graph.")
    relationships: List[Relationship] = Field(..., description="A list of all relationships in the graph.")

# --- 3. Example Document to Process ---
# This is a sample text from a fictional Software Requirements Specification (SRS).
document_text = """
Software Requirements Specification for the Griffin Rover

1. Introduction
The Attitude Control System (ACS) is a critical component responsible for orientation.

2. Functional Requirements
REQ-001: The ACS shall calculate the rover's current orbital position. This function, named 'Calculate_Orbit', is allocated to the Flight Computer component.
REQ-002: The Flight Computer shall execute attitude adjustments. The 'Execute_Adjustment' function performs this.
REQ-003: The ACS shall verify the successful completion of attitude adjustments. The 'Verify_Adjustment' function, part of the ACS component, handles this verification.
"""

# --- 4. Function to Extract Graph using LLM ---
def extract_graph_from_text(text: str) -> KnowledgeGraph:
    """
    Uses OpenAI's function calling feature to extract a knowledge graph
    from a given text based on the Pydantic schema.
    """
    print("🤖 Calling OpenAI to extract knowledge graph...")
    client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    response = client.chat.completions.create(
        model="gpt-4o", # Or another capable model like gpt-4-turbo
        messages=[
            {
                "role": "system",
                "content": "You are an expert system engineer. Your task is to extract a knowledge graph from the provided text. Identify entities as nodes (Requirements, Components, Functions) and the relationships between them (ALLOCATED_TO, PERFORMS, VERIFIES). The ID for a requirement should be its identifier like 'REQ-001'. For other nodes, use a concise, descriptive name (e.g., 'Attitude_Control_System')."
            },
            {
                "role": "user",
                "content": f"Here is the document text:\n\n{text}"
            }
        ],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "graph_extractor",
                    "description": "Extracts a knowledge graph from the text.",
                    "parameters": KnowledgeGraph.model_json_schema()
                }
            }
        ],
        tool_choice={"type": "function", "function": {"name": "graph_extractor"}}
    )

    # The LLM's response with the graph data is in the tool_calls
    tool_call = response.choices[0].message.tool_calls[0]
    graph_data = tool_call.function.arguments
    
    # Validate the data with Pydantic
    return KnowledgeGraph.model_validate_json(graph_data)


# --- 5. Class to Upload the Graph to Neo4j ---
class Neo4jUploader:
    def __init__(self, uri, user, password):
        self._driver = GraphDatabase.driver(uri, auth=(user, password))
        print("🔗 Connected to Neo4j database.")

    def close(self):
        self._driver.close()
        print("🚪 Connection to Neo4j closed.")

    def upload_graph(self, graph: KnowledgeGraph):
        """Uploads the Pydantic KnowledgeGraph object to Neo4j."""
        print("Uploading graph to Neo4j...")
        with self._driver.session() as session:
            # Use a transaction to ensure all or nothing is written
            session.execute_write(self._create_graph, graph)
        print("✅ Graph upload complete!")

    @staticmethod
    def _create_graph(tx, graph: KnowledgeGraph):
        # Create all nodes first, using MERGE to avoid duplicates.
        # Cypher cannot parameterize node labels, so the label is interpolated
        # with str.format(); the Literal type on Node.label keeps the values constrained.
        for node in graph.nodes:
            tx.run(
                "MERGE (n:{label} {{id: $id}})".format(label=node.label),
                id=node.id
            )
        
        # Create all relationships
        for rel in graph.relationships:
            tx.run(
                """
                MATCH (source {{id: $source_id}})
                MATCH (target {{id: $target_id}})
                MERGE (source)-[r:{type} {{description: $desc}}]->(target)
                """.format(type=rel.type),
                source_id=rel.source_id,
                target_id=rel.target_id,
                desc=rel.description
            )


# --- 6. Main Execution ---
if __name__ == "__main__":
    # Extract the graph from the document
    extracted_graph = extract_graph_from_text(document_text)
    
    # Print the extracted data
    print("\n--- Extracted Knowledge Graph ---")
    print(extracted_graph.model_dump_json(indent=2))
    print("---------------------------------\n")

    # Initialize the uploader and upload the graph
    uploader = None
    try:
        uploader = Neo4jUploader(
            uri=os.getenv("NEO4J_URI"),
            user=os.getenv("NEO4J_USERNAME"),
            password=os.getenv("NEO4J_PASSWORD")
        )
        uploader.upload_graph(extracted_graph)
        
        print("\n🎉 Success! You can now query your graph in the Neo4j Browser.")
        print("Try this Cypher query to see your graph:")
        print("MATCH (n) RETURN n")

    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        if uploader:
            uploader.close()

Step 3: How It Works

  1. Schema Definition (Pydantic): The Node, Relationship, and KnowledgeGraph classes define the strict structure of the data you want to extract. This is crucial for getting reliable, structured JSON from the LLM.

  2. Extraction (OpenAI): The extract_graph_from_text function sends the document and the Pydantic schema to the OpenAI API. The tool_choice parameter forces the LLM to respond with a JSON object that matches your KnowledgeGraph schema.

  3. Validation: The line KnowledgeGraph.model_validate_json(graph_data) parses the LLM's JSON output and validates it against your schema. If the LLM returns malformed data, this will raise an error.

  4. Upload (Neo4j):

    • The Neo4jUploader class handles the connection to your database.

    • The upload_graph method uses Cypher's MERGE command. MERGE is idempotent, meaning it will create the node or relationship if it doesn't exist, or match it if it already does. This prevents creating duplicate data if you run the script multiple times.

    • Using a transaction (session.execute_write) ensures that the entire graph is uploaded successfully, or nothing is uploaded at all, maintaining data integrity.
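
When the script runs, the printed "Extracted Knowledge Graph" for the sample SRS might look roughly like this (illustrative and truncated; the exact output varies by model and run):

{
  "nodes": [
    { "id": "REQ-001", "label": "Requirement" },
    { "id": "Calculate_Orbit", "label": "Function" },
    { "id": "Flight_Computer", "label": "Component" }
  ],
  "relationships": [
    {
      "source_id": "Calculate_Orbit",
      "target_id": "Flight_Computer",
      "type": "ALLOCATED_TO",
      "description": "The Calculate_Orbit function is allocated to the Flight Computer."
    }
  ]
}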


Repeat for an Ollama supported offline LLM like llama3.2

Excellent question. Adapting the workflow for a local, offline LLM using Ollama is a great way to ensure data privacy and avoid API costs. The core logic remains the same, but the method of interacting with the LLM changes.

The main difference is that tool calling (OpenAI-style "function calling") is less consistently supported, and often less reliable, across the open-source models served through Ollama. Instead, we must be very explicit in our prompt, asking the model to generate a raw JSON string that matches our schema, which we then parse and validate.

Here is the complete, modified script for llama3.2 or another capable Ollama model.


Step 1: Prerequisites

  1. Install Ollama: Follow the instructions on ollama.com to download and install Ollama on your system.

  2. Pull the LLM: Open your terminal and pull the Llama 3.2 model.

    ollama pull llama3.2
    

    (Note: Llama 3.2 was released in late 2024. If it is not available in your Ollama installation, llama3 or mistral are excellent alternatives for this task.)

  3. Install Python Libraries: The openai library is replaced with ollama.

    pip install neo4j ollama pydantic python-dotenv
    
  4. Set up .env file: You only need your Neo4j credentials now. The Ollama client runs locally and doesn't require an API key.

    .env file:

    # Neo4j Credentials
    NEO4J_URI="neo4j+s://your-database-uri.databases.neo4j.io"
    NEO4J_USERNAME="neo4j"
    NEO4J_PASSWORD="your-database-password"
    

Step 2: Python Script for Ollama

This script is adapted to use the local Ollama API. Notice the significant changes in the extract_graph_from_text function, particularly in the prompt and how the response is handled.

import os
import json
from typing import List, Literal
from dotenv import load_dotenv

import ollama
from pydantic import BaseModel, Field, ValidationError
from neo4j import GraphDatabase

# --- 1. Load Credentials ---
load_dotenv()

# --- 2. Define the Knowledge Graph Schema with Pydantic (No changes needed) ---
class Node(BaseModel):
    """A node in the knowledge graph."""
    id: str = Field(..., description="A unique identifier for the node.")
    label: Literal["Requirement", "Component", "Function"] = Field(..., description="The type or label of the node.")

class Relationship(BaseModel):
    """A relationship between two nodes in the knowledge graph."""
    source_id: str = Field(..., description="The ID of the source node.")
    target_id: str = Field(..., description="The ID of the target node.")
    type: Literal["ALLOCATED_TO", "PERFORMS", "VERIFIES"] = Field(..., description="The type of the relationship.")
    description: str = Field(..., description="A description of the relationship.")

class KnowledgeGraph(BaseModel):
    """The full knowledge graph extracted from the text."""
    nodes: List[Node] = Field(..., description="A list of all nodes in the graph.")
    relationships: List[Relationship] = Field(..., description="A list of all relationships in the graph.")

# --- 3. Example Document to Process (No changes needed) ---
document_text = """
Software Requirements Specification for the Griffin Rover

1. Introduction
The Attitude Control System (ACS) is a critical component responsible for orientation.

2. Functional Requirements
REQ-001: The ACS shall calculate the rover's current orbital position. This function, named 'Calculate_Orbit', is allocated to the Flight Computer component.
REQ-002: The Flight Computer shall execute attitude adjustments. The 'Execute_Adjustment' function performs this.
REQ-003: The ACS shall verify the successful completion of attitude adjustments. The 'Verify_Adjustment' function, part of the ACS component, handles this verification.
"""

# --- 4. Function to Extract Graph using Ollama ---
def extract_graph_from_text(text: str, model_name: str = "llama3.2") -> KnowledgeGraph:
    """
    Uses a local Ollama model to extract a knowledge graph from a given text
    by prompting for a JSON output.
    """
    print(f"🤖 Calling local model '{model_name}' via Ollama to extract knowledge graph...")

    # We provide the Pydantic schema as a JSON object in the prompt
    # to guide the model's output.
    schema_json = KnowledgeGraph.model_json_schema()

    # Craft a detailed prompt that asks for JSON output
    prompt_messages = [
        {
            "role": "system",
            "content": f"""You are an expert system engineer. Your task is to extract a knowledge graph from the provided text.
            Identify entities as nodes (Requirements, Components, Functions) and the relationships between them (ALLOCATED_TO, PERFORMS, VERIFIES).
            The ID for a requirement should be its identifier like 'REQ-001'. For other nodes, use a concise, descriptive name (e.g., 'Attitude_Control_System').
            You MUST respond with a single, valid JSON object that strictly adheres to the following JSON schema. Do not include any other text, explanations, or markdown formatting.
            
            JSON Schema:
            {json.dumps(schema_json, indent=2)}
            """
        },
        {
            "role": "user",
            "content": f"Here is the document text:\n\n{text}"
        }
    ]

    try:
        response = ollama.chat(
            model=model_name,
            messages=prompt_messages,
            format='json' # This is a key parameter that tells Ollama to ensure the output is valid JSON
        )
        
        # The response content should be a JSON string
        response_content = response['message']['content']
        
        # Validate the data with Pydantic
        return KnowledgeGraph.model_validate_json(response_content)

    except ValidationError as e:
        print(f"Error: The model did not return JSON matching the schema. {e}")
        print(f"Model output:\n{response_content}")
        raise
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        raise


# --- 5. Class to Upload the Graph to Neo4j (No changes needed) ---
class Neo4jUploader:
    def __init__(self, uri, user, password):
        self._driver = GraphDatabase.driver(uri, auth=(user, password))
        print("🔗 Connected to Neo4j database.")

    def close(self):
        self._driver.close()
        print("🚪 Connection to Neo4j closed.")

    def upload_graph(self, graph: KnowledgeGraph):
        print("Uploading graph to Neo4j...")
        with self._driver.session() as session:
            session.execute_write(self._create_graph, graph)
        print("✅ Graph upload complete!")

    @staticmethod
    def _create_graph(tx, graph: KnowledgeGraph):
        for node in graph.nodes:
            tx.run("MERGE (n:{label} {{id: $id}})".format(label=node.label), id=node.id)
        for rel in graph.relationships:
            tx.run(
                "MATCH (source {{id: $source_id}}) MATCH (target {{id: $target_id}}) "
                "MERGE (source)-[r:{type} {{description: $desc}}]->(target)".format(type=rel.type),
                source_id=rel.source_id, target_id=rel.target_id, desc=rel.description
            )

# --- 6. Main Execution ---
if __name__ == "__main__":
    try:
        extracted_graph = extract_graph_from_text(document_text, model_name="llama3.2")
        
        print("\n--- Extracted Knowledge Graph ---")
        print(extracted_graph.model_dump_json(indent=2))
        print("---------------------------------\n")

        uploader = None
        try:
            uploader = Neo4jUploader(
                uri=os.getenv("NEO4J_URI"),
                user=os.getenv("NEO4J_USERNAME"),
                password=os.getenv("NEO4J_PASSWORD")
            )
            uploader.upload_graph(extracted_graph)
            
            print("\n🎉 Success! You can now query your graph in the Neo4j Browser.")
            print("Try this Cypher query to see your graph:")
            print("MATCH (n) RETURN n")

        finally:
            if uploader:
                uploader.close()

    except Exception as e:
        print(f"Script failed: {e}")

Key Differences and Important Notes

  • Prompt Engineering is Crucial: We embed the desired JSON schema directly into the system prompt. This, combined with explicit instructions to only output JSON, is the most effective way to get structured data from local models.

  • Using format='json': The ollama.chat function includes a format='json' parameter. This is a powerful feature that instructs the model to strictly output valid JSON, significantly improving reliability.

  • Reliability: While powerful, local models can sometimes be less consistent than state-of-the-art commercial models like GPT-4o. You may need to experiment with different models (llama3, mistral, etc.) or refine the prompt if you get inconsistent or poorly formatted JSON. The error handling in the script will help you debug this.

  • Performance: The extraction process will run on your local hardware. A machine with a powerful GPU is recommended for faster processing times.
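
If your Ollama installation supports structured outputs (an assumption: a recent Ollama release, roughly 0.5 or later, and an up-to-date ollama Python client), you can pass the JSON schema itself to the format parameter instead of the generic 'json' mode, which constrains the output even further:

# Sketch: requires an Ollama release with structured-output support.
response = ollama.chat(
    model="llama3.2",
    messages=prompt_messages,
    format=KnowledgeGraph.model_json_schema()  # schema-constrained decoding instead of format='json'
)
extracted = KnowledgeGraph.model_validate_json(response['message']['content'])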


Provide the system prompt for an example schema for a Software Hazard Report

Certainly. Here is a system prompt designed to extract a structured knowledge graph from a Software Hazard Report, complete with an example schema defined for clarity. This prompt is intended for a local LLM using a tool like Ollama where you need to be very explicit about the desired JSON output format.


System Prompt for Software Hazard Report Extraction

You are an expert safety engineer specializing in software systems. Your task is to read the provided Software Hazard Report and extract a detailed knowledge graph.

Identify all relevant entities as nodes and the connections between them as relationships. The nodes can be Hazards, Software Components, Causes, or Mitigations. The relationships can be IDENTIFIED_IN, CAUSED_BY, or MITIGATED_BY.

You MUST respond with a single, valid JSON object that strictly adheres to the following JSON schema. Do not include any other text, explanations, or markdown formatting like ```json. Your entire response must be the raw JSON object.

Instructions for Populating the JSON:

  • IDs: For a Hazard with an official ID (e.g., "SHR-073"), use that as the id. For all other nodes, create a concise, descriptive, snake_case ID (e.g., avionics_display_unit, timing_error, add_redundancy_check).

  • Labels: Use only the four specified node labels: Hazard, SoftwareComponent, Cause, Mitigation.

  • Relationships: Only use the three specified relationship types: IDENTIFIED_IN, CAUSED_BY, MITIGATED_BY.

JSON Schema:

{
  "$defs": {
    "Node": {
      "description": "A node in the knowledge graph.",
      "properties": {
        "id": {
          "description": "A unique identifier for the node.",
          "title": "Id",
          "type": "string"
        },
        "label": {
          "description": "The type or label of the node.",
          "enum": [
            "Hazard",
            "SoftwareComponent",
            "Cause",
            "Mitigation"
          ],
          "title": "Label",
          "type": "string"
        },
        "description": {
            "description": "A brief description of the node, extracted from the text.",
            "title": "Description",
            "type": "string"
        }
      },
      "required": [
        "id",
        "label",
        "description"
      ],
      "title": "Node",
      "type": "object"
    },
    "Relationship": {
      "description": "A relationship between two nodes in the knowledge graph.",
      "properties": {
        "source_id": {
          "description": "The ID of the source node.",
          "title": "Source Id",
          "type": "string"
        },
        "target_id": {
          "description": "The ID of the target node.",
          "title": "Target Id",
          "type": "string"
        },
        "type": {
          "description": "The type of the relationship.",
          "enum": [
            "IDENTIFIED_IN",
            "CAUSED_BY",
            "MITIGATED_BY"
          ],
          "title": "Type",
          "type": "string"
        }
      },
      "required": [
        "source_id",
        "target_id",
        "type"
      ],
      "title": "Relationship",
      "type": "object"
    }
  },
  "properties": {
    "nodes": {
      "description": "A list of all nodes in the graph.",
      "items": {
        "$ref": "#/$defs/Node"
      },
      "title": "Nodes",
      "type": "array"
    },
    "relationships": {
      "description": "A list of all relationships in the graph.",
      "items": {
        "$ref": "#/$defs/Relationship"
      },
      "title": "Relationships",
      "type": "array"
    }
  },
  "required": [
    "nodes",
    "relationships"
  ],
  "title": "KnowledgeGraph",
  "type": "object"
}
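
For reference, a Pydantic definition along the following lines would generate essentially the schema above, mirroring the models used in the earlier scripts (a sketch; adjust names and labels to your own reports):

from typing import List, Literal
from pydantic import BaseModel, Field

class Node(BaseModel):
    """A node in the knowledge graph."""
    id: str = Field(..., description="A unique identifier for the node.")
    label: Literal["Hazard", "SoftwareComponent", "Cause", "Mitigation"] = Field(..., description="The type or label of the node.")
    description: str = Field(..., description="A brief description of the node, extracted from the text.")

class Relationship(BaseModel):
    """A relationship between two nodes in the knowledge graph."""
    source_id: str = Field(..., description="The ID of the source node.")
    target_id: str = Field(..., description="The ID of the target node.")
    type: Literal["IDENTIFIED_IN", "CAUSED_BY", "MITIGATED_BY"] = Field(..., description="The type of the relationship.")

class KnowledgeGraph(BaseModel):
    """The full knowledge graph extracted from the text."""
    nodes: List[Node] = Field(..., description="A list of all nodes in the graph.")
    relationships: List[Relationship] = Field(..., description="A list of all relationships in the graph.")

# json.dumps(KnowledgeGraph.model_json_schema(), indent=2) reproduces the schema above.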


Simplify this to ignore the knowledge graph (and Neo4j). Just apply the LLM to parse and extract information from the document text

Of course. Here is a simplified script that removes the knowledge graph and Neo4j components. It focuses solely on using a local LLM with Ollama to parse a document and extract key information into a structured format using Pydantic.


Prerequisites

You'll need Ollama installed and a model like llama3.2 pulled (ollama pull llama3.2). The only Python libraries required are ollama and pydantic.

pip install ollama pydantic

Simplified Python Script for Parsing

This script defines a simple HazardReport data structure, provides an example document, and uses an Ollama model to parse the text and populate the structure.

import json
from typing import List, Literal

import ollama
from pydantic import BaseModel, Field

# --- 1. Define the Target Data Structure with Pydantic ---
# This class defines the specific pieces of information we want to extract.

class HazardReport(BaseModel):
    """A structured representation of a software hazard report."""
    hazard_id: str = Field(..., description="The unique identifier of the hazard, like 'SHR-073'.")
    description: str = Field(..., description="A clear, concise summary of the hazard.")
    severity: Literal["Critical", "High", "Medium", "Low"] = Field(..., description="The assessed severity level of the hazard.")
    affected_components: List[str] = Field(..., description="A list of software components or modules affected by the hazard.")
    root_causes: List[str] = Field(..., description="A list of identified root causes for the hazard.")
    mitigations: List[str] = Field(..., description="A list of actions or measures to mitigate the hazard.")

# --- 2. Example Software Hazard Report Document ---

document_text = """
**Software Hazard Report: SHR-073**

**1. Hazard Description:**
Incorrect landing trajectory calculated by the Guidance, Navigation, and Control (GNC) system may lead to a hard landing or mission failure.

**2. Affected Systems:**
The primary software component affected is the 'Trajectory Calculator' module within the GNC system. The 'Landing Sequencer' is also impacted.

**3. Assessed Severity:** Critical

**4. Causal Analysis:**
A root cause has been identified as a floating-point precision error under high-velocity calculations. A secondary cause is the lack of redundant validation checks for the calculated trajectory vector.

**5. Recommended Mitigations:**
- Implement a fixed-point arithmetic library for critical trajectory calculations.
- Add a cross-check validation routine where a simplified model verifies the primary calculation.
- Increase unit test coverage for the 'Trajectory Calculator' module.
"""

# --- 3. Function to Extract Information using Ollama ---

def extract_info_from_text(text: str, model_name: str = "llama3.2") -> HazardReport:
    """
    Uses a local Ollama model to parse a document and extract structured
    information based on the HazardReport Pydantic schema.
    """
    print(f"🤖 Calling local model '{model_name}' to parse document...")

    # We provide the Pydantic schema in the prompt to guide the model's output.
    schema_json = HazardReport.model_json_schema()

    # Craft a detailed prompt asking for a single JSON object.
    prompt_messages = [
        {
            "role": "system",
            "content": f"""You are an expert parsing engine. Your task is to read the provided Software Hazard Report and extract key information.
            You MUST respond with a single, valid JSON object that strictly adheres to the following JSON schema.
            Do not include any other text, explanations, or markdown formatting. Your entire response must be the raw JSON object.

            JSON Schema:
            {json.dumps(schema_json, indent=2)}
            """
        },
        {
            "role": "user",
            "content": f"Here is the document to parse:\n\n{text}"
        }
    ]

    try:
        response = ollama.chat(
            model=model_name,
            messages=prompt_messages,
            format='json'  # Ensures the output from Ollama is a valid JSON string
        )
        
        response_content = response['message']['content']
        
        # Validate the extracted data using our Pydantic model
        return HazardReport.model_validate_json(response_content)

    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        # In a real application, you might want to retry or log this error.
        raise

# --- 4. Main Execution ---

if __name__ == "__main__":
    try:
        # Call the function to parse the document
        extracted_data = extract_info_from_text(document_text)
        
        # Print the structured, validated data
        print("\n--- ✅ Successfully Parsed Information ---")
        print(extracted_data.model_dump_json(indent=2))
        print("------------------------------------------")

    except Exception as e:
        print(f"\n--- ❌ Script failed ---")
        print(f"Could not process the document. Error: {e}")

How It Works

This simplified script follows a straightforward Extract, Structure, Validate pattern:

  1. Define a Schema: The HazardReport Pydantic class acts as a template for the information you want. It clearly defines the required fields (like hazard_id, severity, etc.) and their data types.

  2. Craft a Prompt: The system prompt explicitly tells the LLM its job: act like a parser and fill in the JSON template provided. The format='json' parameter in the ollama.chat call is a powerful hint for the model to comply.

  3. Extract and Validate: The LLM processes the text and generates a JSON string. The script then uses HazardReport.model_validate_json() to parse that string and load it into a Python object. This step is crucial because it validates the LLM's output against your schema, ensuring the data is clean and structured as you expect before you use it.

  4. Print the Result: The final output is a clean, predictable Python object, which you can now easily use in other parts of your application, save to a file, or send to an API.
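
For example, once the HazardReport object is validated, its fields are ordinary Python attributes. The commented values below reflect the sample document; actual output depends on the model.

# Continuing from the script above, after extract_info_from_text() returns:
report = extracted_data

print(report.severity)             # e.g. "Critical"
print(report.affected_components)  # e.g. ["Trajectory Calculator", "Landing Sequencer"]

# Persist the structured result for downstream use.
with open(f"{report.hazard_id}.json", "w") as f:
    f.write(report.model_dump_json(indent=2))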


From <https://gemini.google.com/app/>  Google Gemini 2.5 Pro


Personal Note: Prototype by identifying entities and relationships. Do not add to KG.
