#3723 Pre-Processing KG Documents for Entities and Relationships
#3723 I want to apply RAG within a knowledge graph that includes NASA Software Hazard Reports. Do the hazard reports need to be pre-processed to identify entities and relationships to add to the graph, or can Neo4j perform semantic search to get this information?
Yes, the NASA Software Hazard Reports need to be pre-processed to identify entities and relationships before you can effectively use them in a knowledge graph for Retrieval-Augmented Generation (RAG) with Neo4j.
While Neo4j has powerful semantic search capabilities, these are primarily used to query the data after it has been structured into a graph. The initial extraction of entities (like "software component," "hazard type," or "mitigation measure") and their relationships from the unstructured text of the reports is a critical pre-processing step.
Here's a breakdown of the process and Neo4j's role:
The Necessity of Pre-processing
Think of your hazard reports as a collection of books. Before you can ask complex questions about the characters and their connections across the entire library, you first need to read the books and identify who the characters are and how they relate to each other. Pre-processing your reports serves the same function.
This pre-processing step involves:
Entity Extraction: Identifying key nouns and concepts in the text.
For your use case, this could include:
- Software systems and components
- Specific hazards or failures
- Root causes
- Corrective actions and mitigation strategies
- Dates and report numbers
Relationship Extraction: Defining how these entities are connected.
For instance:
- A software component exhibits a specific hazard.
- A root cause leads to a failure.
- A mitigation strategy addresses a specific hazard.
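Concretely, the output of this pre-processing step can be represented as plain entity and relationship records before anything touches a graph database. A minimal sketch, using hypothetical IDs and relationship types for illustration:

```python
# Hypothetical extraction output for one hazard report. The IDs, labels,
# and relationship types here are illustrative, not a fixed schema.
entities = [
    {"id": "attitude_control_software", "label": "SoftwareComponent"},
    {"id": "memory_overflow", "label": "Hazard"},
    {"id": "add_bounds_checking", "label": "Mitigation"},
]

relationships = [
    {"source": "attitude_control_software", "target": "memory_overflow", "type": "EXHIBITS"},
    {"source": "add_bounds_checking", "target": "memory_overflow", "type": "ADDRESSES"},
]

for rel in relationships:
    print(f"({rel['source']})-[:{rel['type']}]->({rel['target']})")
```

Each record later becomes a node or edge in the graph, which is why getting this intermediate structure right matters more than any particular database choice.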
How Neo4j Facilitates This Process
Neo4j doesn't perform this extraction out-of-the-box on raw text, but it provides robust tools and integrations, often leveraging Large Language Models (LLMs), to accomplish this. Tools like the Neo4j LLM Knowledge Graph Builder and integrations with frameworks such as LangChain can automate much of the heavy lifting.
The typical workflow looks like this:
Ingestion: The hazard reports (in formats like PDF or text) are ingested into a processing pipeline.
Chunking: The documents are broken down into smaller, manageable pieces of text.
Entity and Relationship Extraction: This is where LLMs come into play. You can provide the LLM with a schema of the types of entities and relationships you're interested in.
The LLM then reads the text chunks and extracts instances of these, often outputting them in a structured format like JSON.
Graph Construction: The extracted entities and relationships are then loaded into Neo4j as nodes and edges, forming your knowledge graph.
Semantic Search and RAG: Once the knowledge graph is built, you can use Neo4j's Cypher query language, vector search, and graph algorithms to perform complex queries.
When integrated into a RAG pipeline, a user's question can be used to retrieve relevant subgraphs and context from your hazard reports, which is then fed to an LLM to generate a comprehensive and accurate answer.
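Of these steps, chunking is easy to prototype without any LLM at all. A simplified character-based sketch; real pipelines usually split on token or sentence boundaries (for example with a text-splitter library) rather than raw character counts:

```python
# Naive overlapping chunker: a sketch of the "Chunking" step above.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping character-based chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some overlap
    return chunks

sample = "The ACS shall calculate the rover's current orbital position. " * 40
print(len(chunk_text(sample)))  # number of overlapping chunks produced
```

The overlap keeps entities that straddle a chunk boundary visible to the LLM in at least one chunk.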
In essence, you are not manually reading each report and drawing out the connections, but rather orchestrating a process where an LLM does the reading and structuring, and Neo4j provides the framework for storing and querying that structured knowledge. This "GraphRAG" approach combines the contextual understanding of a knowledge graph with the semantic search capabilities of vector databases, leading to more accurate and explainable results from your RAG application.
I expect that this would apply to other documents like IRDs, ICDs, SRSs and ConOps.
Yes, absolutely. Your expectation is correct. The process of using a knowledge graph for Retrieval-Augmented Generation (RAG) is not only applicable but is arguably even more powerful when applied to a collection of documents like IRDs (Interface Requirements Documents), ICDs (Interface Control Documents), SRSs (Software Requirements Specifications), and ConOps (Concept of Operations).
Why This Approach is Ideal for Your Documents
These types of engineering documents are dense with interconnected information and are often difficult to analyze collectively. Creating a unified knowledge graph from them can transform your project's information landscape.
Breaking Down Silos: Each document provides a different view of the same system. An SRS details what the system must do, an ICD specifies how it communicates, and a ConOps explains why and how it's used. A knowledge graph can merge these views, creating a single, queryable model of your entire system.
Enabling Powerful Traceability: The core value of these documents lies in the relationships between them. By extracting entities (e.g., Requirement, SystemComponent, Interface, OperationalScenario) and their relationships (e.g., ALLOCATED_TO, VERIFIED_BY, CONNECTS_TO), you can ask complex questions that are nearly impossible to answer by reading documents manually:
"Show me all software requirements from the SRS that are affected by the 'Telemetry Data' interface defined in the ICD."
"Which operational scenarios in the ConOps depend on the 'Attitude Control System' component, and what are its associated requirements in the SRS?"
"If we deprecate this interface in the IRD, what are all the upstream and downstream impacts across all documented systems?"
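The first of these questions could be expressed as a short Cypher query, assuming a hypothetical schema in which extraction produced Requirement and Interface nodes connected by a USES relationship (these names are illustrative, not fixed):

```cypher
// Hypothetical schema: (:Requirement)-[:USES]->(:Interface)
MATCH (r:Requirement)-[:USES]->(i:Interface {name: 'Telemetry Data'})
RETURN r.id, r.text
```

The other questions follow the same pattern: variable-length path matches over whatever relationship types your extraction schema defines.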
Enhancing RAG with Deep Context: When you ask a question, the RAG system can now retrieve a highly relevant, interconnected subgraph of information from across the entire document set. Instead of just getting a paragraph from the SRS, the system can provide a complete answer synthesized from the requirement itself, the interfaces it uses (from the ICD), and the operational context it supports (from the ConOps).
The pre-processing workflow remains the same: you would define a graph schema that includes entities and relationships from all these document types, use an LLM to perform the extraction, and load it into Neo4j. The result is a comprehensive "digital twin" of your project documentation, ready for sophisticated analysis and querying.
Provide python code to pre-process a document with an example schema to identify entities and relationships and add them to the knowledge graph
Of course. Here is a complete Python script that demonstrates how to pre-process a document using an LLM, define a schema with Pydantic, and load the extracted entities and relationships into a Neo4j knowledge graph.
This example uses the openai library for entity extraction and the neo4j driver to connect to the database.
Step 1: Prerequisites
First, you need to install the necessary Python libraries.
pip install neo4j openai pydantic python-dotenv
You'll also need access to a Neo4j database (like a free AuraDB instance) and an OpenAI API key. Store your credentials in a .env
file in the same directory as your script for security.
.env file:
# Neo4j Credentials
NEO4J_URI="neo4j+s://your-database-uri.databases.neo4j.io"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="your-database-password"
# OpenAI API Key
OPENAI_API_KEY="sk-your-openai-api-key"
Step 2: Python Script
This script defines a schema for requirements and system components, processes an example document against that schema, and uploads the result to your Neo4j instance. 🧑‍💻
import os
from typing import List, Literal
from dotenv import load_dotenv
import openai
from pydantic import BaseModel, Field
from neo4j import GraphDatabase

# --- 1. Load Credentials ---
load_dotenv()

# --- 2. Define the Knowledge Graph Schema with Pydantic ---
# This schema tells the LLM exactly what kind of information to extract.
class Node(BaseModel):
    """A node in the knowledge graph."""
    id: str = Field(..., description="A unique identifier for the node.")
    label: Literal["Requirement", "Component", "Function"] = Field(..., description="The type or label of the node.")

class Relationship(BaseModel):
    """A relationship between two nodes in the knowledge graph."""
    source_id: str = Field(..., description="The ID of the source node.")
    target_id: str = Field(..., description="The ID of the target node.")
    type: Literal["ALLOCATED_TO", "PERFORMS", "VERIFIES"] = Field(..., description="The type of the relationship.")
    description: str = Field(..., description="A description of the relationship.")

class KnowledgeGraph(BaseModel):
    """The full knowledge graph extracted from the text."""
    nodes: List[Node] = Field(..., description="A list of all nodes in the graph.")
    relationships: List[Relationship] = Field(..., description="A list of all relationships in the graph.")

# --- 3. Example Document to Process ---
# This is a sample text from a fictional Software Requirements Specification (SRS).
document_text = """
Software Requirements Specification for the Griffin Rover
1. Introduction
The Attitude Control System (ACS) is a critical component responsible for orientation.
2. Functional Requirements
REQ-001: The ACS shall calculate the rover's current orbital position. This function, named 'Calculate_Orbit', is allocated to the Flight Computer component.
REQ-002: The Flight Computer shall execute attitude adjustments. The 'Execute_Adjustment' function performs this.
REQ-003: The ACS shall verify the successful completion of attitude adjustments. The 'Verify_Adjustment' function, part of the ACS component, handles this verification.
"""

# --- 4. Function to Extract Graph using LLM ---
def extract_graph_from_text(text: str) -> KnowledgeGraph:
    """
    Uses OpenAI's function calling feature to extract a knowledge graph
    from a given text based on the Pydantic schema.
    """
    print("🤖 Calling OpenAI to extract knowledge graph...")
    client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    response = client.chat.completions.create(
        model="gpt-4o",  # Or another capable model like gpt-4-turbo
        messages=[
            {
                "role": "system",
                "content": "You are an expert system engineer. Your task is to extract a knowledge graph from the provided text. Identify entities as nodes (Requirements, Components, Functions) and the relationships between them (ALLOCATED_TO, PERFORMS, VERIFIES). The ID for a requirement should be its identifier like 'REQ-001'. For other nodes, use a concise, descriptive name (e.g., 'Attitude_Control_System')."
            },
            {
                "role": "user",
                "content": f"Here is the document text:\n\n{text}"
            }
        ],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "graph_extractor",
                    "description": "Extracts a knowledge graph from the text.",
                    "parameters": KnowledgeGraph.model_json_schema()
                }
            }
        ],
        tool_choice={"type": "function", "function": {"name": "graph_extractor"}}
    )
    # The LLM's response with the graph data is in the tool_calls
    tool_call = response.choices[0].message.tool_calls[0]
    graph_data = tool_call.function.arguments
    # Validate the data with Pydantic
    return KnowledgeGraph.model_validate_json(graph_data)

# --- 5. Class to Upload the Graph to Neo4j ---
class Neo4jUploader:
    def __init__(self, uri, user, password):
        self._driver = GraphDatabase.driver(uri, auth=(user, password))
        print("🔗 Connected to Neo4j database.")

    def close(self):
        self._driver.close()
        print("🚪 Connection to Neo4j closed.")

    def upload_graph(self, graph: KnowledgeGraph):
        """Uploads the Pydantic KnowledgeGraph object to Neo4j."""
        print("Uploading graph to Neo4j...")
        with self._driver.session() as session:
            # Use a transaction to ensure all or nothing is written
            session.execute_write(self._create_graph, graph)
        print("✅ Graph upload complete!")

    @staticmethod
    def _create_graph(tx, graph: KnowledgeGraph):
        # Create all nodes first, using MERGE to avoid duplicates.
        # Labels and relationship types cannot be query parameters in Cypher,
        # so they are interpolated with .format(); this is safe here because
        # both come from closed Pydantic Literal enums, not free text.
        for node in graph.nodes:
            tx.run(
                "MERGE (n:{label} {{id: $id}})".format(label=node.label),
                id=node.id
            )
        # Create all relationships
        for rel in graph.relationships:
            tx.run(
                """
                MATCH (source {{id: $source_id}})
                MATCH (target {{id: $target_id}})
                MERGE (source)-[r:{type} {{description: $desc}}]->(target)
                """.format(type=rel.type),
                source_id=rel.source_id,
                target_id=rel.target_id,
                desc=rel.description
            )

# --- 6. Main Execution ---
if __name__ == "__main__":
    # Extract the graph from the document
    extracted_graph = extract_graph_from_text(document_text)

    # Print the extracted data
    print("\n--- Extracted Knowledge Graph ---")
    print(extracted_graph.model_dump_json(indent=2))
    print("---------------------------------\n")

    # Initialize the uploader and upload the graph
    uploader = None
    try:
        uploader = Neo4jUploader(
            uri=os.getenv("NEO4J_URI"),
            user=os.getenv("NEO4J_USERNAME"),
            password=os.getenv("NEO4J_PASSWORD")
        )
        uploader.upload_graph(extracted_graph)
        print("\n🎉 Success! You can now query your graph in the Neo4j Browser.")
        print("Try this Cypher query to see your graph:")
        print("MATCH (n) RETURN n")
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        if uploader:
            uploader.close()
Step 3: How It Works
Schema Definition (Pydantic): The Node, Relationship, and KnowledgeGraph classes define the strict structure of the data you want to extract. This is crucial for getting reliable, structured JSON from the LLM.
Extraction (OpenAI): The extract_graph_from_text function sends the document and the Pydantic schema to the OpenAI API. The tool_choice parameter forces the LLM to respond with a JSON object that matches your KnowledgeGraph schema.
Validation: The line KnowledgeGraph.model_validate_json(graph_data) parses the LLM's JSON output and validates it against your schema. If the LLM returns malformed data, this will raise an error.
Upload (Neo4j): The Neo4jUploader class handles the connection to your database. The upload_graph method uses Cypher's MERGE command. MERGE is idempotent: it creates the node or relationship if it doesn't exist, or matches it if it already does, which prevents creating duplicate data if you run the script multiple times. Using a transaction (session.execute_write) ensures that either the entire graph is uploaded successfully or nothing is, maintaining data integrity.
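The validation step can be exercised in isolation: if the LLM emits a label outside the schema's allowed values, model_validate_json rejects it rather than silently loading bad data into the graph. A minimal sketch using a trimmed-down version of the schema above:

```python
from typing import List, Literal
from pydantic import BaseModel, ValidationError

# Trimmed-down schema, just enough to demonstrate validation.
class Node(BaseModel):
    id: str
    label: Literal["Requirement", "Component", "Function"]

class KnowledgeGraph(BaseModel):
    nodes: List[Node]

good = '{"nodes": [{"id": "REQ-001", "label": "Requirement"}]}'
bad = '{"nodes": [{"id": "REQ-001", "label": "Widget"}]}'  # "Widget" is not an allowed label

graph = KnowledgeGraph.model_validate_json(good)
print(graph.nodes[0].id)  # REQ-001

try:
    KnowledgeGraph.model_validate_json(bad)
except ValidationError as e:
    print(f"Rejected malformed output ({e.error_count()} error)")
```

In a pipeline, catching ValidationError is the natural place to retry the extraction or log the offending chunk.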
Repeat for an Ollama supported offline LLM like llama3.2
Excellent question. Adapting the workflow for a local, offline LLM using Ollama is a great way to ensure data privacy and avoid API costs. The core logic remains the same, but the method of interacting with the LLM changes.
The main difference is that most open-source models available through Ollama don't have the sophisticated "tool calling" or "function calling" features that OpenAI provides. Instead, we must be very explicit in our prompt, asking the model to generate a raw JSON string, which we will then parse.
Here is the complete, modified script for llama3.2 or another capable Ollama model.
Step 1: Prerequisites
Install Ollama: Follow the instructions at ollama.com to download and install Ollama on your system.
Pull the LLM: Open your terminal and pull the Llama 3.2 model:
ollama pull llama3.2
(If llama3.2 is not available in your Ollama installation, llama3 or mistral are excellent alternatives for this task.)
Install Python Libraries: The openai library is replaced with ollama.
pip install neo4j ollama pydantic python-dotenv
Set up the .env file: You only need your Neo4j credentials now. The Ollama client runs locally and doesn't require an API key.
.env file:
# Neo4j Credentials
NEO4J_URI="neo4j+s://your-database-uri.databases.neo4j.io"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="your-database-password"
Step 2: Python Script for Ollama
This script is adapted to use the local Ollama API. Notice the significant changes in the extract_graph_from_text function, particularly in the prompt and how the response is handled.
import os
import json
from typing import List, Literal
from dotenv import load_dotenv
import ollama
from pydantic import BaseModel, Field, ValidationError
from neo4j import GraphDatabase

# --- 1. Load Credentials ---
load_dotenv()

# --- 2. Define the Knowledge Graph Schema with Pydantic (No changes needed) ---
class Node(BaseModel):
    """A node in the knowledge graph."""
    id: str = Field(..., description="A unique identifier for the node.")
    label: Literal["Requirement", "Component", "Function"] = Field(..., description="The type or label of the node.")

class Relationship(BaseModel):
    """A relationship between two nodes in the knowledge graph."""
    source_id: str = Field(..., description="The ID of the source node.")
    target_id: str = Field(..., description="The ID of the target node.")
    type: Literal["ALLOCATED_TO", "PERFORMS", "VERIFIES"] = Field(..., description="The type of the relationship.")
    description: str = Field(..., description="A description of the relationship.")

class KnowledgeGraph(BaseModel):
    """The full knowledge graph extracted from the text."""
    nodes: List[Node] = Field(..., description="A list of all nodes in the graph.")
    relationships: List[Relationship] = Field(..., description="A list of all relationships in the graph.")

# --- 3. Example Document to Process (No changes needed) ---
document_text = """
Software Requirements Specification for the Griffin Rover
1. Introduction
The Attitude Control System (ACS) is a critical component responsible for orientation.
2. Functional Requirements
REQ-001: The ACS shall calculate the rover's current orbital position. This function, named 'Calculate_Orbit', is allocated to the Flight Computer component.
REQ-002: The Flight Computer shall execute attitude adjustments. The 'Execute_Adjustment' function performs this.
REQ-003: The ACS shall verify the successful completion of attitude adjustments. The 'Verify_Adjustment' function, part of the ACS component, handles this verification.
"""

# --- 4. Function to Extract Graph using Ollama ---
def extract_graph_from_text(text: str, model_name: str = "llama3.2") -> KnowledgeGraph:
    """
    Uses a local Ollama model to extract a knowledge graph from a given text
    by prompting for a JSON output.
    """
    print(f"🤖 Calling local model '{model_name}' via Ollama to extract knowledge graph...")

    # We provide the Pydantic schema as a JSON object in the prompt
    # to guide the model's output.
    schema_json = KnowledgeGraph.model_json_schema()

    # Craft a detailed prompt that asks for JSON output
    prompt_messages = [
        {
            "role": "system",
            "content": f"""You are an expert system engineer. Your task is to extract a knowledge graph from the provided text.
Identify entities as nodes (Requirements, Components, Functions) and the relationships between them (ALLOCATED_TO, PERFORMS, VERIFIES).
The ID for a requirement should be its identifier like 'REQ-001'. For other nodes, use a concise, descriptive name (e.g., 'Attitude_Control_System').
You MUST respond with a single, valid JSON object that strictly adheres to the following JSON schema. Do not include any other text, explanations, or markdown formatting.
JSON Schema:
{json.dumps(schema_json, indent=2)}
"""
        },
        {
            "role": "user",
            "content": f"Here is the document text:\n\n{text}"
        }
    ]

    response_content = None  # initialized so the error handler can print it safely
    try:
        response = ollama.chat(
            model=model_name,
            messages=prompt_messages,
            format='json'  # Key parameter: tells Ollama to constrain the output to valid JSON
        )
        # The response content should be a JSON string
        response_content = response['message']['content']
        # Validate the data with Pydantic. model_validate_json raises
        # ValidationError both for invalid JSON and for JSON that does
        # not match the schema.
        return KnowledgeGraph.model_validate_json(response_content)
    except ValidationError as e:
        print(f"Error: The model did not return valid JSON matching the schema. {e}")
        print(f"Model output:\n{response_content}")
        raise
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        raise

# --- 5. Class to Upload the Graph to Neo4j (No changes needed) ---
class Neo4jUploader:
    def __init__(self, uri, user, password):
        self._driver = GraphDatabase.driver(uri, auth=(user, password))
        print("🔗 Connected to Neo4j database.")

    def close(self):
        self._driver.close()
        print("🚪 Connection to Neo4j closed.")

    def upload_graph(self, graph: KnowledgeGraph):
        print("Uploading graph to Neo4j...")
        with self._driver.session() as session:
            session.execute_write(self._create_graph, graph)
        print("✅ Graph upload complete!")

    @staticmethod
    def _create_graph(tx, graph: KnowledgeGraph):
        for node in graph.nodes:
            tx.run("MERGE (n:{label} {{id: $id}})".format(label=node.label), id=node.id)
        for rel in graph.relationships:
            # Parenthesize the concatenated string so .format() applies to the
            # whole query, not just the second literal.
            cypher = (
                "MATCH (source {{id: $source_id}}) MATCH (target {{id: $target_id}}) "
                "MERGE (source)-[r:{type} {{description: $desc}}]->(target)"
            ).format(type=rel.type)
            tx.run(cypher, source_id=rel.source_id, target_id=rel.target_id, desc=rel.description)

# --- 6. Main Execution ---
if __name__ == "__main__":
    try:
        extracted_graph = extract_graph_from_text(document_text, model_name="llama3.2")

        print("\n--- Extracted Knowledge Graph ---")
        print(extracted_graph.model_dump_json(indent=2))
        print("---------------------------------\n")

        uploader = None
        try:
            uploader = Neo4jUploader(
                uri=os.getenv("NEO4J_URI"),
                user=os.getenv("NEO4J_USERNAME"),
                password=os.getenv("NEO4J_PASSWORD")
            )
            uploader.upload_graph(extracted_graph)
            print("\n🎉 Success! You can now query your graph in the Neo4j Browser.")
            print("Try this Cypher query to see your graph:")
            print("MATCH (n) RETURN n")
        finally:
            if uploader:
                uploader.close()
    except Exception as e:
        print(f"Script failed: {e}")
Key Differences and Important Notes
Prompt Engineering is Crucial: We embed the desired JSON schema directly into the system prompt. This, combined with explicit instructions to only output JSON, is the most effective way to get structured data from local models.
Using format='json': The ollama.chat call includes a format='json' parameter, a powerful feature that instructs the model to strictly output valid JSON, significantly improving reliability.
Reliability: While powerful, local models can sometimes be less consistent than state-of-the-art commercial models like GPT-4o. You may need to experiment with different models (llama3, mistral, etc.) or refine the prompt if you get inconsistent or poorly formatted JSON. The error handling in the script will help you debug this.
Performance: The extraction process runs on your local hardware. A machine with a powerful GPU is recommended for faster processing times.
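Because local models occasionally emit output that fails validation, a small retry loop around the extraction call is a common mitigation. A sketch with the model call stubbed out as call_model (a hypothetical stand-in for ollama.chat, so the logic can be shown without a running model):

```python
import json

ATTEMPTS = []  # records each call to the stubbed model

def call_model() -> str:
    """Stand-in for the real ollama.chat call: returns malformed output
    on the first attempt and valid JSON afterwards."""
    ATTEMPTS.append(1)
    if len(ATTEMPTS) == 1:
        return "sorry, here is your graph: {..."
    return json.dumps({"nodes": [], "relationships": []})

def extract_with_retries(max_attempts: int = 3) -> dict:
    last_error = None
    for _ in range(max_attempts):
        raw = call_model()
        try:
            # In the full script this would be KnowledgeGraph.model_validate_json(raw)
            return json.loads(raw)
        except json.JSONDecodeError as e:
            last_error = e  # could also feed the error back into the next prompt
    raise RuntimeError(f"Model never produced valid JSON: {last_error}")

graph = extract_with_retries()
print(graph)  # {'nodes': [], 'relationships': []}
```

A refinement worth trying is appending the validation error to the next prompt so the model can correct itself.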
Provide the system prompt for an example schema for a Software Hazard Report
Certainly. Here is a system prompt designed to extract a structured knowledge graph from a Software Hazard Report, complete with an example schema defined for clarity. This prompt is intended for a local LLM using a tool like Ollama where you need to be very explicit about the desired JSON output format.
System Prompt for Software Hazard Report Extraction
You are an expert safety engineer specializing in software systems. Your task is to read the provided Software Hazard Report and extract a detailed knowledge graph.
Identify all relevant entities as nodes and the connections between them as relationships. The nodes can be Hazards, Software Components, Causes, or Mitigations. The relationships can be IDENTIFIED_IN, CAUSED_BY, or MITIGATED_BY.
You MUST respond with a single, valid JSON object that strictly adheres to the following JSON schema. Do not include any other text, explanations, or markdown formatting such as ```json fences. Your entire response must be the raw JSON object.
Instructions for Populating the JSON:
IDs: For a Hazard with an official ID (e.g., "SHR-073"), use that as the id. For all other nodes, create a concise, descriptive, snake_case ID (e.g., avionics_display_unit, timing_error, add_redundancy_check).
Labels: Use only the four specified node labels: Hazard, SoftwareComponent, Cause, Mitigation.
Relationships: Only use the three specified relationship types: IDENTIFIED_IN, CAUSED_BY, MITIGATED_BY.
JSON Schema:
{
"$defs": {
"Node": {
"description": "A node in the knowledge graph.",
"properties": {
"id": {
"description": "A unique identifier for the node.",
"title": "Id",
"type": "string"
},
"label": {
"description": "The type or label of the node.",
"enum": [
"Hazard",
"SoftwareComponent",
"Cause",
"Mitigation"
],
"title": "Label",
"type": "string"
},
"description": {
"description": "A brief description of the node, extracted from the text.",
"title": "Description",
"type": "string"
}
},
"required": [
"id",
"label",
"description"
],
"title": "Node",
"type": "object"
},
"Relationship": {
"description": "A relationship between two nodes in the knowledge graph.",
"properties": {
"source_id": {
"description": "The ID of the source node.",
"title": "Source Id",
"type": "string"
},
"target_id": {
"description": "The ID of the target node.",
"title": "Target Id",
"type": "string"
},
"type": {
"description": "The type of the relationship.",
"enum": [
"IDENTIFIED_IN",
"CAUSED_BY",
"MITIGATED_BY"
],
"title": "Type",
"type": "string"
}
},
"required": [
"source_id",
"target_id",
"type"
],
"title": "Relationship",
"type": "object"
}
},
"properties": {
"nodes": {
"description": "A list of all nodes in the graph.",
"items": {
"$ref": "#/$defs/Node"
},
"title": "Nodes",
"type": "array"
},
"relationships": {
"description": "A list of all relationships in the graph.",
"items": {
"$ref": "#/$defs/Relationship"
},
"title": "Relationships",
"type": "array"
}
},
"required": [
"nodes",
"relationships"
],
"title": "KnowledgeGraph",
"type": "object"
}
Simplify this to ignore the knowledge graph (and Neo4j). Just apply the LLM to parse and extract information from the document text
Of course. Here is a simplified script that removes the knowledge graph and Neo4j components. It focuses solely on using a local LLM with Ollama to parse a document and extract key information into a structured format using Pydantic.
Prerequisites
You'll need Ollama installed and a model like llama3.2 pulled (ollama pull llama3.2). The only Python libraries required are ollama and pydantic.
pip install ollama pydantic
Simplified Python Script for Parsing
This script defines a simple HazardReport data structure, provides an example document, and uses an Ollama model to parse the text and populate the structure.
import json
from typing import List, Literal
import ollama
from pydantic import BaseModel, Field

# --- 1. Define the Target Data Structure with Pydantic ---
# This class defines the specific pieces of information we want to extract.
class HazardReport(BaseModel):
    """A structured representation of a software hazard report."""
    hazard_id: str = Field(..., description="The unique identifier of the hazard, like 'SHR-073'.")
    description: str = Field(..., description="A clear, concise summary of the hazard.")
    severity: Literal["Critical", "High", "Medium", "Low"] = Field(..., description="The assessed severity level of the hazard.")
    affected_components: List[str] = Field(..., description="A list of software components or modules affected by the hazard.")
    root_causes: List[str] = Field(..., description="A list of identified root causes for the hazard.")
    mitigations: List[str] = Field(..., description="A list of actions or measures to mitigate the hazard.")

# --- 2. Example Software Hazard Report Document ---
document_text = """
**Software Hazard Report: SHR-073**
**1. Hazard Description:**
Incorrect landing trajectory calculated by the Guidance, Navigation, and Control (GNC) system may lead to a hard landing or mission failure.
**2. Affected Systems:**
The primary software component affected is the 'Trajectory Calculator' module within the GNC system. The 'Landing Sequencer' is also impacted.
**3. Assessed Severity:** Critical
**4. Causal Analysis:**
A root cause has been identified as a floating-point precision error under high-velocity calculations. A secondary cause is the lack of redundant validation checks for the calculated trajectory vector.
**5. Recommended Mitigations:**
- Implement a fixed-point arithmetic library for critical trajectory calculations.
- Add a cross-check validation routine where a simplified model verifies the primary calculation.
- Increase unit test coverage for the 'Trajectory Calculator' module.
"""

# --- 3. Function to Extract Information using Ollama ---
def extract_info_from_text(text: str, model_name: str = "llama3.2") -> HazardReport:
    """
    Uses a local Ollama model to parse a document and extract structured
    information based on the HazardReport Pydantic schema.
    """
    print(f"🤖 Calling local model '{model_name}' to parse document...")

    # We provide the Pydantic schema in the prompt to guide the model's output.
    schema_json = HazardReport.model_json_schema()

    # Craft a detailed prompt asking for a single JSON object.
    prompt_messages = [
        {
            "role": "system",
            "content": f"""You are an expert parsing engine. Your task is to read the provided Software Hazard Report and extract key information.
You MUST respond with a single, valid JSON object that strictly adheres to the following JSON schema.
Do not include any other text, explanations, or markdown formatting. Your entire response must be the raw JSON object.
JSON Schema:
{json.dumps(schema_json, indent=2)}
"""
        },
        {
            "role": "user",
            "content": f"Here is the document to parse:\n\n{text}"
        }
    ]

    try:
        response = ollama.chat(
            model=model_name,
            messages=prompt_messages,
            format='json'  # Ensures the output from Ollama is a valid JSON string
        )
        response_content = response['message']['content']
        # Validate the extracted data using our Pydantic model
        return HazardReport.model_validate_json(response_content)
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        # In a real application, you might want to retry or log this error.
        raise

# --- 4. Main Execution ---
if __name__ == "__main__":
    try:
        # Call the function to parse the document
        extracted_data = extract_info_from_text(document_text)

        # Print the structured, validated data
        print("\n--- ✅ Successfully Parsed Information ---")
        print(extracted_data.model_dump_json(indent=2))
        print("------------------------------------------")
    except Exception as e:
        print(f"\n--- ❌ Script failed ---")
        print(f"Could not process the document. Error: {e}")
How It Works
This simplified script follows a straightforward Extract, Structure, Validate pattern:
Define a Schema: The HazardReport Pydantic class acts as a template for the information you want. It clearly defines the required fields (like hazard_id, severity, etc.) and their data types.
Craft a Prompt: The system prompt explicitly tells the LLM its job: act like a parser and fill in the JSON template provided. The format='json' parameter in the ollama.chat call is a powerful hint for the model to comply.
Extract and Validate: The LLM processes the text and generates a JSON string. The script then uses HazardReport.model_validate_json() to parse that string and load it into a Python object. This step is crucial because it validates the LLM's output against your schema, ensuring the data is clean and structured as you expect before you use it.
Print the Result: The final output is a clean, predictable Python object, which you can now easily use in other parts of your application, save to a file, or send to an API.
From <https://gemini.google.com/app/> Google Gemini 2.5 Pro
Personal Note: Prototype by identifying entities and relationships. Do not add to KG.