#3933 Instructions for Hazard Report Summary
Provide instructions to an LLM to respond with this content:
Hazard Report = {
    'Title':'title',
    'Description':'description',
    'Causes': [
        'Cause':'cause', 'Controls': [],
        'Cause':'cause', 'Controls': []
    ]
}
Here are precise instructions you can paste into an LLM (like Gemini, ChatGPT, or Claude) to force it to output that specific structure.
I have designed this as a System Prompt. You should paste this in first to define the AI's behavior, and then provide your scenario.
The System Prompt
Copy and paste the text in the block below:
You are a Safety and Mission Assurance automated assistant. Your sole purpose is to analyze hazard reports and output a structured Hazard Report Summary.
Output Format Rules:
You must respond only with a Python dictionary assigned to the variable 'Hazard Report'.
Do not use Markdown formatting (no ``` code blocks).
Do not include conversational filler (e.g., "Here is the report").
Follow this exact schema:
Hazard Report = {
    'Title': 'Short summary of the hazard',
    'Description': 'Detailed description of the unsafe condition',
    'Causes': [
        {'Cause': 'Specific cause or contributing factor', 'Controls': ['Control 1', 'Control 2']},
        {'Cause': 'Another cause', 'Controls': ['Control A', 'Control B']}
    ]
}
Instruction: Analyze the hazard report provided by the user. Extract the hazard title, description, distinct causes, and any existing or suggested controls. Map them into the schema above.
Example Usage
Your Prompt:
[Paste the System Prompt above first]
[Paste or Upload/Attach hazard report]
LLM Response:
Hazard Report = {
    'Title': 'Fall Hazard at Cryo-Tank Valve Assembly',
    'Description': 'Technician required to lean over unguarded railing to access pressure gauge on slick surface.',
    'Causes': [
        {'Cause': 'Unguarded railing near maintenance point', 'Controls': ['Install permanent guardrails', 'Relocate pressure gauge to accessible height']},
        {'Cause': 'Slick floor due to condensation', 'Controls': ['Apply non-slip coating to floor', 'Manage condensation drainage']},
        {'Cause': 'Lack of fall protection equipment', 'Controls': ['Mandatory harness usage for valve access', 'Install anchor points']}
    ]
}
Why this works
Role Definition: It assigns the AI a "Safety and Mission Assurance" role, which primes it to look for technical causes and controls rather than generic advice.
Negative Constraints: Explicitly forbidding Markdown (```) and conversational filler ensures you get raw text that is ready to be pasted directly into a code editor or parsed by a script.
One-shot Example: By including the schema in the prompt, the LLM understands exactly how to nest the Causes list and the Controls list.
Would you like me to write a Python script to parse this string output into a usable object?
Yes
Here is a Python script designed to ingest the raw string output from the LLM and convert it into a usable dictionary object.
I have used the ast (Abstract Syntax Tree) library rather than json or eval().
Why not json? The LLM format uses single quotes ('Title') and Python-specific syntax, which will cause json.loads to fail.
Why not eval()? eval() poses a security risk if the LLM hallucinates malicious code. ast.literal_eval safely evaluates only Python literals (strings, numbers, tuples, lists, dicts).
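To see the difference concretely, here is a quick illustrative snippet (the sample string is hypothetical, purely for demonstration):

import ast
import json

sample = "{'Title': 'Test'}"  # single-quoted, Python-style dict string

try:
    json.loads(sample)  # JSON requires double-quoted strings, so this raises
except json.JSONDecodeError as e:
    print(f"json.loads failed as expected: {e}")

print(ast.literal_eval(sample))  # works: {'Title': 'Test'}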
Python Parser Script
import ast

def parse_hazard_report(llm_raw_output):
    """
    Parses a string containing a Python dictionary assignment
    into a native Python dictionary object.
    """
    # 1. Clean the string
    # The LLM outputs "Hazard Report = { ... }"
    # We need to strip the variable assignment to get just the dict structure: "{ ... }"
    target_prefix = "Hazard Report ="
    clean_str = llm_raw_output.strip()

    if clean_str.startswith(target_prefix):
        # Slice off the prefix
        clean_str = clean_str[len(target_prefix):].strip()

    try:
        # 2. Safely evaluate the string
        # ast.literal_eval parses the string as a Python literal structure
        # without executing functions or arbitrary code.
        data_object = ast.literal_eval(clean_str)
        return data_object
    except SyntaxError as e:
        print(f"Parsing Failed: The LLM output was not valid Python syntax.\nError: {e}")
        return None
    except ValueError as e:
        print(f"Parsing Failed: The LLM output contained unsafe or complex constructs.\nError: {e}")
        return None

# --- Example Usage ---
# This represents the raw string returned by the LLM
llm_response_string = """
Hazard Report = {
    'Title': 'Fall Hazard at Cryo-Tank Valve Assembly',
    'Description': 'Technician required to lean over unguarded railing to access pressure gauge.',
    'Causes': [
        {'Cause': 'Unguarded railing near maintenance point', 'Controls': ['Install permanent guardrails']},
        {'Cause': 'Slick floor due to condensation', 'Controls': ['Apply non-slip coating', 'Drainage mgmt']}
    ]
}
"""

# Parse the string
report_dict = parse_hazard_report(llm_response_string)

# Verify it works by accessing keys programmatically
if report_dict:
    print(f"Hazard Report: {report_dict['Title']}\n")
    print(f"Description: {report_dict['Description']}\n")
    for i, entry in enumerate(report_dict['Causes'], 1):
        print(f"Cause {i}: {entry['Cause']}")
        print(f"  Controls: {', '.join(entry['Controls'])}")
Output of the Script:
Hazard Report: Fall Hazard at Cryo-Tank Valve Assembly
Description: Technician required to lean over unguarded railing to access pressure gauge.
Cause 1: Unguarded railing near maintenance point
Controls: Install permanent guardrails
Cause 2: Slick floor due to condensation
Controls: Apply non-slip coating, Drainage mgmt
Next Steps for Implementation
Since you are likely integrating this into a larger pipeline:
Validation: You might want to add a Pydantic model step immediately after parsing to ensure the dictionary strictly adheres to your schema (e.g., ensuring Causes is always a list).
Storage: Once parsed into report_dict, you can easily dump this to a JSON file (json.dump), a SQL database, or a CSV row (see the sketch below).
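For example, a minimal storage sketch (assuming report_dict is the dictionary returned by parse_hazard_report above; hazard_report.json is just an illustrative file name):

import json

# Assumes 'report_dict' was returned by parse_hazard_report().
# 'hazard_report.json' is an illustrative output path.
if report_dict:
    with open('hazard_report.json', 'w', encoding='utf-8') as f:
        json.dump(report_dict, f, indent=2)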
Would you like me to create a Pydantic model to validate the data structure after parsing?
Add python code to read each PDF file from a user-specified directory and send the file, system instructions and user prompt to Google Gemini LLM, then pass the response to parse_hazard_report()
Here is the complete Python workflow. This script uses the google-generativeai SDK to process the PDFs and integrates the parser we built earlier.
Prerequisites
You will need to install the Google Generative AI library:
pip install google-generativeai
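As an aside, if you prefer not to hardcode the API key in the script below, a minimal sketch that reads it from an environment variable instead (assuming you have exported GOOGLE_API_KEY in your shell) would be:

import os
import google.generativeai as genai

# Assumes GOOGLE_API_KEY was exported in your shell beforehand.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])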
Complete Python Script
import os
import glob
import time
import ast
import google.generativeai as genai
# --- CONFIGURATION ---
API_KEY = "YOUR_GOOGLE_API_KEY"  # Replace with your actual key
MODEL_NAME = "gemini-1.5-flash"  # Flash is faster/cheaper for text extraction tasks

# Configure the SDK
genai.configure(api_key=API_KEY)

# --- SYSTEM INSTRUCTION ---
# This matches the prompt we designed earlier to force the specific Python dict format.
SYSTEM_PROMPT = """
You are a Safety and Mission Assurance automated assistant. Your sole purpose is to analyze input PDF documents and output a structured Hazard Report.

Output Format Rules:
1. You must respond ONLY with a Python dictionary assigned to the variable 'Hazard Report'.
2. Do not use Markdown formatting (no ``` code blocks).
3. Do not include conversational filler.
4. Follow this exact schema:

Hazard Report = {
    'Title': 'Short summary of the hazard',
    'Description': 'Detailed description of the unsafe condition',
    'Causes': [
        {'Cause': 'Specific root cause or contributing factor', 'Controls': ['Control 1', 'Control 2']},
        {'Cause': 'Another root cause', 'Controls': ['Control A', 'Control B']}
    ]
}

Instruction:
Analyze the provided PDF document. Extract the hazard title, description, distinct causes, and any existing or suggested controls. Map them into the schema above.
"""
def parse_hazard_report(llm_raw_output):
    """
    Parses a string containing a Python dictionary assignment
    into a native Python dictionary object.
    """
    target_prefix = "Hazard Report ="
    clean_str = llm_raw_output.strip()

    # Strip markdown code blocks if the LLM ignores instructions and adds them anyway.
    # Re-strip whitespace afterwards so the prefix check below still matches.
    if clean_str.startswith("```python"):
        clean_str = clean_str.replace("```python", "").replace("```", "").strip()
    elif clean_str.startswith("```"):
        clean_str = clean_str.replace("```", "").strip()

    if clean_str.startswith(target_prefix):
        clean_str = clean_str[len(target_prefix):].strip()

    try:
        data_object = ast.literal_eval(clean_str)
        return data_object
    except (SyntaxError, ValueError) as e:
        print(f"  [!] Parsing Error: {e}")
        return None
def process_pdfs_in_directory(directory_path):
    # Check if directory exists
    if not os.path.exists(directory_path):
        print(f"Error: Directory '{directory_path}' not found.")
        return

    # Find all PDF files
    pdf_files = glob.glob(os.path.join(directory_path, "*.pdf"))
    if not pdf_files:
        print(f"No PDF files found in {directory_path}")
        return

    print(f"Found {len(pdf_files)} PDF files. Processing...\n")

    # Initialize the model
    model = genai.GenerativeModel(
        model_name=MODEL_NAME,
        system_instruction=SYSTEM_PROMPT
    )

    for pdf_path in pdf_files:
        file_name = os.path.basename(pdf_path)
        print(f"--- Processing: {file_name} ---")

        try:
            # 1. Upload the file to Gemini
            # Note: File API uploads act as temporary resources for the LLM
            print(f"  Uploading {file_name}...")
            sample_file = genai.upload_file(path=pdf_path, display_name=file_name)

            # Wait for the file to become active (usually instant for small PDFs, but good practice)
            while sample_file.state.name == "PROCESSING":
                print("  Waiting for file processing...")
                time.sleep(1)
                sample_file = genai.get_file(sample_file.name)

            if sample_file.state.name == "FAILED":
                print("  File processing failed.")
                continue

            # 2. Generate Content
            # We pass the uploaded file object directly to the model
            print("  Analyzing with Gemini...")
            response = model.generate_content(
                [sample_file, "Generate the Hazard Report for this document."],
                generation_config=genai.types.GenerationConfig(
                    temperature=0.1  # Low temp for deterministic formatting
                )
            )

            # 3. Parse the Response
            raw_text = response.text
            structured_data = parse_hazard_report(raw_text)

            if structured_data:
                print(f"  [SUCCESS] Title Extracted: {structured_data.get('Title')}")
                # Here you would typically save 'structured_data' to a DB or JSON file
                # Example: print(structured_data)
            else:
                print("  [FAIL] Could not parse LLM response.")
                print(f"  Raw Output: {raw_text[:100]}...")  # Print preview of failure

            # 4. Clean up (Optional but recommended)
            # Delete the file from Google's server to keep your project clean
            genai.delete_file(sample_file.name)

        except Exception as e:
            print(f"  [ERROR] An exception occurred: {e}")

        print("-" * 30)

if __name__ == "__main__":
    # Ask the user for the PDF directory path at runtime
    user_specified_dir = input("Enter the path to your PDF directory: ")
    process_pdfs_in_directory(user_specified_dir)
Key Implementation Details
genai.upload_file: Gemini 1.5 allows you to upload PDFs directly. The model can "see" the document content (text and images) natively, without needing a local library like PyPDF2 to extract text first. This is much more accurate for complex layouts.
temperature=0.1: I set the generation config temperature to a low value. This reduces "creativity" and strictly enforces the formatting rules defined in the system prompt.
Markdown Cleaning: I added a small check in parse_hazard_report to strip ```python blocks. Even with strict system instructions, LLMs sometimes add markdown wrappers out of habit; this makes the parser robust against that.
Would you like to add a step to save these parsed results into a JSON file or a Pandas DataFrame?
Pandas DataFrame
Here is the updated script. It aggregates all the parsed reports into a list and creates a Pandas DataFrame at the end.
I have also added a step to save the DataFrame to a CSV file (hazard_reports.csv) so you can open it in Excel.
Complete Script with Pandas Integration
import os
import glob
import time
import ast
import pandas as pd
import google.generativeai as genai

# --- CONFIGURATION ---
API_KEY = "YOUR_GOOGLE_API_KEY"  # Replace with your actual key
MODEL_NAME = "gemini-2.5-flash"
OUTPUT_CSV_NAME = "hazard_reports.csv"

# Configure the SDK
genai.configure(api_key=API_KEY)

# --- SYSTEM INSTRUCTION ---
SYSTEM_PROMPT = """
You are a Safety and Mission Assurance automated assistant. Your sole purpose is to analyze hazard report PDF documents and output a structured Hazard Report Summary.

Output Format Rules:
1. You must respond ONLY with a Python dictionary assigned to the variable 'Hazard_Report'.
2. Do not use Markdown formatting (no ``` code blocks).
3. Do not include conversational filler.
4. Follow this exact schema:

Hazard_Report = {
    'Title': 'Short summary of the hazard',
    'Description': 'Detailed description of the unsafe condition',
    'Causes': [
        {'Cause': 'Specific root cause or contributing factor', 'Controls': ['Control 1', 'Control 2']},
        {'Cause': 'Another root cause', 'Controls': ['Control A', 'Control B']}
    ],
    'Transfers': [
        {'Transfer': 'Specific transfer', 'Type': 'In'},
        {'Transfer': 'Another transfer', 'Type': 'Out'}
    ]
}

Instruction:
Analyze the provided hazard report PDF document. Extract the hazard title, description, distinct causes, any existing or suggested controls, and transfers in or out. Map them into the schema above.
"""
def parse_hazard_report(llm_raw_output):
    """Parses LLM string output into a Python dictionary."""
    target_prefix = "Hazard_Report ="
    clean_str = llm_raw_output.strip()

    # Strip markdown code blocks if present, then re-strip whitespace
    # so the prefix check below still matches.
    if clean_str.startswith("```python"):
        clean_str = clean_str.replace("```python", "").replace("```", "").strip()
    elif clean_str.startswith("```"):
        clean_str = clean_str.replace("```", "").strip()

    if clean_str.startswith(target_prefix):
        clean_str = clean_str[len(target_prefix):].strip()

    try:
        return ast.literal_eval(clean_str)
    except (SyntaxError, ValueError) as e:
        print(f"  [!] Parsing Error: {e}")
        return None
def process_pdfs_in_directory(directory_path):
    if not os.path.exists(directory_path):
        print(f"Error: Directory '{directory_path}' not found.")
        return

    pdf_files = glob.glob(os.path.join(directory_path, "*.pdf"))
    if not pdf_files:
        print(f"No PDF files found in {directory_path}")
        return

    print(f"Found {len(pdf_files)} PDF files. Processing...\n")

    model = genai.GenerativeModel(model_name=MODEL_NAME, system_instruction=SYSTEM_PROMPT)

    # List to hold all report dictionaries
    all_reports = []

    for pdf_path in pdf_files:
        file_name = os.path.basename(pdf_path)
        print(f"--- Processing: {file_name} ---")

        try:
            # 1. Upload File
            sample_file = genai.upload_file(path=pdf_path, display_name=file_name)
            while sample_file.state.name == "PROCESSING":
                time.sleep(1)
                sample_file = genai.get_file(sample_file.name)

            if sample_file.state.name == "FAILED":
                print("  File processing failed.")
                continue

            # 2. Generate Content
            response = model.generate_content(
                [sample_file, "Generate the Hazard Report Summary for this document."],
                generation_config=genai.types.GenerationConfig(temperature=0.1)
            )

            # 3. Parse Response
            structured_data = parse_hazard_report(response.text)

            if structured_data:
                print(f"  [SUCCESS] Title Extracted: {structured_data.get('Title')}")
                # Add the source file name to the data for traceability
                structured_data['Source_File'] = file_name
                # Append to our list
                all_reports.append(structured_data)
            else:
                print("  [FAIL] Could not parse LLM response.")

            # 4. Clean up
            genai.delete_file(sample_file.name)

        except Exception as e:
            print(f"  [ERROR] Exception: {e}")

        print("-" * 30)

    # --- PANDAS DATAFRAME CREATION ---
    if all_reports:
        df = pd.DataFrame(all_reports)

        # Reorder columns to put 'Source_File' and 'Title' first for readability
        cols = ['Source_File', 'Title', 'Description', 'Causes', 'Transfers']
        # Handle cases where columns might be missing if the LLM failed partially
        existing_cols = [c for c in cols if c in df.columns]
        df = df[existing_cols]

        print("\n--- DataFrame Summary ---")
        print(df.head())

        # Save to CSV
        df.to_csv(OUTPUT_CSV_NAME, index=False)
        print(f"\nSuccessfully saved reports to '{OUTPUT_CSV_NAME}'")
    else:
        print("\nNo reports were successfully generated.")

if __name__ == "__main__":
    user_specified_dir = input("Enter the path to your PDF directory: ")
    process_pdfs_in_directory(user_specified_dir)
How the Data will look
Since Causes is a list of dictionaries, the CSV output for that specific column will look like a string representation of the list (e.g., "[ {'Cause': '...', 'Controls': [...]}, ... ]").
If you need to "flatten" the data (so that every single Cause has its own row in the spreadsheet), you can use df.explode('Causes') immediately after creating the DataFrame, as in the sketch below.
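This is a rough illustration only (assuming df is the DataFrame built from all_reports above, and that every report has at least one cause dict):

import pandas as pd

# Explode the 'Causes' list so each cause dict gets its own row,
# then expand each dict into separate 'Cause' and 'Controls' columns.
flat = df.explode('Causes', ignore_index=True)
cause_cols = pd.json_normalize(flat['Causes'].tolist())
flat = pd.concat([flat.drop(columns=['Causes']), cause_cols], axis=1)

print(flat[['Source_File', 'Title', 'Cause']].head())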
From <https://gemini.google.com/app/> Google Gemini 3.0 Pro (Thinking)
