#3933 Instructions for Hazard Report Summary
Provide instructions to an LLM to respond with this content:
Hazard Report = {
    'Title':'title',
    'Description':'description',
    'Causes': [
        'Cause':'cause', 'Controls': [],
        'Cause':'cause', 'Controls': []
    ]
}
Here are precise instructions you can paste into an LLM (like Gemini, ChatGPT, or Claude) to force it to output that specific structure.
I have designed this as a System Prompt. You should paste this in first to define the AI's behavior, and then provide your scenario.
The System Prompt
Copy and paste the text in the block below:
You are a Safety and Mission Assurance automated assistant. Your sole purpose is to analyze hazard reports and output a structured Hazard Report Summary.
Output Format Rules:
You must respond only with a Python dictionary assigned to the variable 'Hazard Report'.
Do not use Markdown formatting (no ``` code blocks).
Do not include conversational filler (e.g., "Here is the report").
Follow this exact schema:
Hazard Report = {
    'Title': 'Short summary of the hazard',
    'Description': 'Detailed description of the unsafe condition',
    'Causes': [
        {'Cause': 'Specific cause or contributing factor', 'Controls': ['Control 1', 'Control 2']},
        {'Cause': 'Another cause', 'Controls': ['Control A', 'Control B']}
    ]
}
Instruction: Analyze the hazard report provided by the user. Extract the hazard title, description, distinct causes, and any existing or suggested controls. Map them into the schema above.
Example Usage
Your Prompt:
[Paste the System Prompt above first]
[Paste or Upload/Attach hazard report]
LLM Response:
Hazard Report = {
    'Title': 'Fall Hazard at Cryo-Tank Valve Assembly',
    'Description': 'Technician required to lean over unguarded railing to access pressure gauge on slick surface.',
    'Causes': [
        {'Cause': 'Unguarded railing near maintenance point', 'Controls': ['Install permanent guardrails', 'Relocate pressure gauge to accessible height']},
        {'Cause': 'Slick floor due to condensation', 'Controls': ['Apply non-slip coating to floor', 'Manage condensation drainage']},
        {'Cause': 'Lack of fall protection equipment', 'Controls': ['Mandatory harness usage for valve access', 'Install anchor points']}
    ]
}
Why this works
Role Definition: It assigns the AI a "Safety and Mission Assurance" role, which primes it to look for technical causes and controls rather than generic advice.
Negative Constraints: Explicitly forbidding Markdown (```) and conversational filler ensures you get raw text that is ready to be pasted directly into a code editor or parsed by a script.
One-shot Example: By including the schema in the prompt, the LLM understands exactly how to nest the Causes list and the Controls list.
Would you like me to write a Python script to parse this string output into a usable object?
Yes
Here is a Python script designed to ingest the raw string output from the LLM and convert it into a usable dictionary object.
I have used the ast (Abstract Syntax Tree) library rather than json or eval().
Why not json? The LLM format uses single quotes ('Title') and Python-specific syntax, which will cause json.loads to fail.
Why not eval()? eval() poses a security risk if the LLM hallucinates malicious code. ast.literal_eval safely evaluates only Python literals (strings, numbers, tuples, lists, dicts).
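To see the difference concretely, here is a quick illustrative snippet (the sample string is hypothetical, purely for demonstration):

import ast
import json

sample = "{'Title': 'Test'}"  # single-quoted, Python-style dict string

try:
    json.loads(sample)  # JSON requires double-quoted strings, so this raises
except json.JSONDecodeError as e:
    print(f"json.loads failed as expected: {e}")

print(ast.literal_eval(sample))  # works: {'Title': 'Test'}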
Python Parser Script
import ast

def parse_hazard_report(llm_raw_output):
    """
    Parses a string containing a Python dictionary assignment
    into a native Python dictionary object.
    """
    # 1. Clean the string
    # The LLM outputs "Hazard Report = { ... }"
    # We need to strip the variable assignment to get just the dict structure: "{ ... }"
    target_prefix = "Hazard Report ="
    clean_str = llm_raw_output.strip()

    if clean_str.startswith(target_prefix):
        # Slice off the prefix
        clean_str = clean_str[len(target_prefix):].strip()

    try:
        # 2. Safely evaluate the string
        # ast.literal_eval parses the string as a Python literal structure
        # without executing functions or arbitrary code.
        data_object = ast.literal_eval(clean_str)
        return data_object
    except SyntaxError as e:
        print(f"Parsing Failed: The LLM output was not valid Python syntax.\nError: {e}")
        return None
    except ValueError as e:
        print(f"Parsing Failed: The LLM output contained unsafe or complex constructs.\nError: {e}")
        return None

# --- Example Usage ---
# This represents the raw string returned by the LLM
llm_response_string = """
Hazard Report = {
    'Title': 'Fall Hazard at Cryo-Tank Valve Assembly',
    'Description': 'Technician required to lean over unguarded railing to access pressure gauge.',
    'Causes': [
        {'Cause': 'Unguarded railing near maintenance point', 'Controls': ['Install permanent guardrails']},
        {'Cause': 'Slick floor due to condensation', 'Controls': ['Apply non-slip coating', 'Drainage mgmt']}
    ]
}
"""

# Parse the string
report_dict = parse_hazard_report(llm_response_string)

# Verify it works by accessing keys programmatically
if report_dict:
    print(f"Hazard Report: {report_dict['Title']}\n")
    print(f"Description: {report_dict['Description']}\n")
    for i, entry in enumerate(report_dict['Causes'], 1):
        print(f"Cause {i}: {entry['Cause']}")
        print(f"  Controls: {', '.join(entry['Controls'])}")
Output of the Script:
Hazard Report: Fall Hazard at Cryo-Tank Valve Assembly
Description: Technician required to lean over unguarded railing to access pressure gauge.
Cause 1: Unguarded railing near maintenance point
Controls: Install permanent guardrails
Cause 2: Slick floor due to condensation
Controls: Apply non-slip coating, Drainage mgmt
Next Steps for Implementation
Since you are likely integrating this into a larger pipeline:
Validation: You might want to add a Pydantic model step immediately after parsing to ensure the dictionary strictly adheres to your schema (e.g., ensuring Causes is always a list).
Storage: Once parsed into report_dict, you can easily dump this to a JSON file (json.dump), a SQL database, or a CSV row (see the sketch below).
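For example, a minimal storage sketch (assuming report_dict is the dictionary returned by parse_hazard_report above; hazard_report.json is just an illustrative file name):

import json

# Assumes 'report_dict' was returned by parse_hazard_report().
# 'hazard_report.json' is an illustrative output path.
if report_dict:
    with open('hazard_report.json', 'w', encoding='utf-8') as f:
        json.dump(report_dict, f, indent=2)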
Would you like me to create a Pydantic model to validate the data structure after parsing?
Add python code to read each PDF file from a user-specified directory and send the file, system instructions and user prompt to Google Gemini LLM, then pass the response to parse_hazard_report()
Here is the complete Python workflow. This script uses the google-generativeai SDK to process the PDFs and integrates the parser we built earlier.
Prerequisites
You will need to install the Google Generative AI library:
pip install google-generativeai
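As an aside, if you prefer not to hardcode the API key in the script below, a minimal sketch that reads it from an environment variable instead (assuming you have exported GOOGLE_API_KEY in your shell) would be:

import os
import google.generativeai as genai

# Assumes GOOGLE_API_KEY was exported in your shell beforehand.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])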
Complete Python Script
import os
import glob
import time
import ast
import google.generativeai as genai
# --- CONFIGURATION ---
API_KEY = "YOUR_GOOGLE_API_KEY"  # Replace with your actual key
MODEL_NAME = "gemini-1.5-flash"  # Flash is faster/cheaper for text extraction tasks

# Configure the SDK
genai.configure(api_key=API_KEY)

# --- SYSTEM INSTRUCTION ---
# This matches the prompt we designed earlier to force the specific Python dict format.
SYSTEM_PROMPT = """
You are a Safety and Mission Assurance automated assistant. Your sole purpose is to analyze input PDF documents and output a structured Hazard Report.

Output Format Rules:
1. You must respond ONLY with a Python dictionary assigned to the variable 'Hazard Report'.
2. Do not use Markdown formatting (no ``` code blocks).
3. Do not include conversational filler.
4. Follow this exact schema:

Hazard Report = {
    'Title': 'Short summary of the hazard',
    'Description': 'Detailed description of the unsafe condition',
    'Causes': [
        {'Cause': 'Specific root cause or contributing factor', 'Controls': ['Control 1', 'Control 2']},
        {'Cause': 'Another root cause', 'Controls': ['Control A', 'Control B']}
    ]
}

Instruction:
Analyze the provided PDF document. Extract the hazard title, description, distinct causes, and any existing or suggested controls. Map them into the schema above.
"""
def parse_hazard_report(llm_raw_output):
    """
    Parses a string containing a Python dictionary assignment
    into a native Python dictionary object.
    """
    target_prefix = "Hazard Report ="
    clean_str = llm_raw_output.strip()

    # Strip markdown code blocks if the LLM ignores instructions and adds them anyway.
    # Re-strip whitespace afterwards so the prefix check below still matches.
    if clean_str.startswith("```python"):
        clean_str = clean_str.replace("```python", "").replace("```", "").strip()
    elif clean_str.startswith("```"):
        clean_str = clean_str.replace("```", "").strip()

    if clean_str.startswith(target_prefix):
        clean_str = clean_str[len(target_prefix):].strip()

    try:
        data_object = ast.literal_eval(clean_str)
        return data_object
    except (SyntaxError, ValueError) as e:
        print(f"  [!] Parsing Error: {e}")
        return None
def process_pdfs_in_directory(directory_path):
    # Check if directory exists
    if not os.path.exists(directory_path):
        print(f"Error: Directory '{directory_path}' not found.")
        return

    # Find all PDF files
    pdf_files = glob.glob(os.path.join(directory_path, "*.pdf"))
    if not pdf_files:
        print(f"No PDF files found in {directory_path}")
        return

    print(f"Found {len(pdf_files)} PDF files. Processing...\n")

    # Initialize the model
    model = genai.GenerativeModel(
        model_name=MODEL_NAME,
        system_instruction=SYSTEM_PROMPT
    )

    for pdf_path in pdf_files:
        file_name = os.path.basename(pdf_path)
        print(f"--- Processing: {file_name} ---")

        try:
            # 1. Upload the file to Gemini
            # Note: File API uploads act as temporary resources for the LLM
            print(f"  Uploading {file_name}...")
            sample_file = genai.upload_file(path=pdf_path, display_name=file_name)

            # Wait for the file to become active (usually instant for small PDFs, but good practice)
            while sample_file.state.name == "PROCESSING":
                print("  Waiting for file processing...")
                time.sleep(1)
                sample_file = genai.get_file(sample_file.name)

            if sample_file.state.name == "FAILED":
                print("  File processing failed.")
                continue

            # 2. Generate Content
            # We pass the uploaded file object directly to the model
            print("  Analyzing with Gemini...")
            response = model.generate_content(
                [sample_file, "Generate the Hazard Report for this document."],
                generation_config=genai.types.GenerationConfig(
                    temperature=0.1  # Low temp for deterministic formatting
                )
            )

            # 3. Parse the Response
            raw_text = response.text
            structured_data = parse_hazard_report(raw_text)

            if structured_data:
                print(f"  [SUCCESS] Title Extracted: {structured_data.get('Title')}")
                # Here you would typically save 'structured_data' to a DB or JSON file
                # Example: print(structured_data)
            else:
                print("  [FAIL] Could not parse LLM response.")
                print(f"  Raw Output: {raw_text[:100]}...")  # Print preview of failure

            # 4. Clean up (Optional but recommended)
            # Delete the file from Google's server to keep your project clean
            genai.delete_file(sample_file.name)

        except Exception as e:
            print(f"  [ERROR] An exception occurred: {e}")

        print("-" * 30)

if __name__ == "__main__":
    # Ask the user for the PDF directory path at runtime
    user_specified_dir = input("Enter the path to your PDF directory: ")
    process_pdfs_in_directory(user_specified_dir)
Key Implementation Details
genai.upload_file: Gemini 1.5 allows you to upload PDFs directly. The model can "see" the document content (text and images) natively, without needing a local library like PyPDF2 to extract text first. This is much more accurate for complex layouts.
temperature=0.1: I set the generation config temperature to a low value. This reduces "creativity" and strictly enforces the formatting rules defined in the system prompt.
Markdown Cleaning: I added a small check in parse_hazard_report to strip ```python blocks. Even with strict system instructions, LLMs sometimes add markdown wrappers out of habit; this makes the parser robust against that.
Would you like to add a step to save these parsed results into a JSON file or a Pandas DataFrame?
Pandas DataFrame
Here is the updated script. It aggregates all the parsed reports into a list and creates a Pandas DataFrame at the end.
I have also added a step to save the DataFrame to a CSV file (hazard_reports.csv) so you can open it in Excel.
Complete Script with Pandas Integration
import os
import glob
import time
import ast
import pandas as pd
import google.generativeai as genai

# --- CONFIGURATION ---
API_KEY = "YOUR_GOOGLE_API_KEY"  # Replace with your actual key
MODEL_NAME = "gemini-2.5-flash"
OUTPUT_CSV_NAME = "hazard_reports.csv"

# Configure the SDK
genai.configure(api_key=API_KEY)

# --- SYSTEM INSTRUCTION ---
SYSTEM_PROMPT = """
You are a Safety and Mission Assurance automated assistant. Your sole purpose is to analyze hazard report PDF documents and output a structured Hazard Report Summary.

Output Format Rules:
1. You must respond ONLY with a Python dictionary assigned to the variable 'Hazard_Report'.
2. Do not use Markdown formatting (no ``` code blocks).
3. Do not include conversational filler.
4. Follow this exact schema:

Hazard_Report = {
    'Title': 'Short summary of the hazard',
    'Description': 'Detailed description of the unsafe condition',
    'Causes': [
        {'Cause': 'Specific root cause or contributing factor', 'Controls': ['Control 1', 'Control 2']},
        {'Cause': 'Another root cause', 'Controls': ['Control A', 'Control B']}
    ],
    'Transfers': [
        {'Transfer': 'Specific transfer', 'Type': 'In'},
        {'Transfer': 'Another transfer', 'Type': 'Out'}
    ]
}

Instruction:
Analyze the provided hazard report PDF document. Extract the hazard title, description, distinct causes, any existing or suggested controls, and transfers in or out. Map them into the schema above.
"""
def parse_hazard_report(llm_raw_output):
    """Parses LLM string output into a Python dictionary."""
    target_prefix = "Hazard_Report ="
    clean_str = llm_raw_output.strip()

    # Strip markdown code blocks if present, then re-strip whitespace
    # so the prefix check below still matches.
    if clean_str.startswith("```python"):
        clean_str = clean_str.replace("```python", "").replace("```", "").strip()
    elif clean_str.startswith("```"):
        clean_str = clean_str.replace("```", "").strip()

    if clean_str.startswith(target_prefix):
        clean_str = clean_str[len(target_prefix):].strip()

    try:
        return ast.literal_eval(clean_str)
    except (SyntaxError, ValueError) as e:
        print(f"  [!] Parsing Error: {e}")
        return None
def process_pdfs_in_directory(directory_path):
    if not os.path.exists(directory_path):
        print(f"Error: Directory '{directory_path}' not found.")
        return

    pdf_files = glob.glob(os.path.join(directory_path, "*.pdf"))
    if not pdf_files:
        print(f"No PDF files found in {directory_path}")
        return

    print(f"Found {len(pdf_files)} PDF files. Processing...\n")

    model = genai.GenerativeModel(model_name=MODEL_NAME, system_instruction=SYSTEM_PROMPT)

    # List to hold all report dictionaries
    all_reports = []

    for pdf_path in pdf_files:
        file_name = os.path.basename(pdf_path)
        print(f"--- Processing: {file_name} ---")

        try:
            # 1. Upload File
            sample_file = genai.upload_file(path=pdf_path, display_name=file_name)
            while sample_file.state.name == "PROCESSING":
                time.sleep(1)
                sample_file = genai.get_file(sample_file.name)

            if sample_file.state.name == "FAILED":
                print("  File processing failed.")
                continue

            # 2. Generate Content
            response = model.generate_content(
                [sample_file, "Generate the Hazard Report Summary for this document."],
                generation_config=genai.types.GenerationConfig(temperature=0.1)
            )

            # 3. Parse Response
            structured_data = parse_hazard_report(response.text)

            if structured_data:
                print(f"  [SUCCESS] Title Extracted: {structured_data.get('Title')}")
                # Add the source file name to the data for traceability
                structured_data['Source_File'] = file_name
                # Append to our list
                all_reports.append(structured_data)
            else:
                print("  [FAIL] Could not parse LLM response.")

            # 4. Clean up
            genai.delete_file(sample_file.name)

        except Exception as e:
            print(f"  [ERROR] Exception: {e}")

        print("-" * 30)

    # --- PANDAS DATAFRAME CREATION ---
    if all_reports:
        df = pd.DataFrame(all_reports)

        # Reorder columns to put 'Source_File' and 'Title' first for readability
        cols = ['Source_File', 'Title', 'Description', 'Causes', 'Transfers']
        # Handle cases where columns might be missing if the LLM failed partially
        existing_cols = [c for c in cols if c in df.columns]
        df = df[existing_cols]

        print("\n--- DataFrame Summary ---")
        print(df.head())

        # Save to CSV
        df.to_csv(OUTPUT_CSV_NAME, index=False)
        print(f"\nSuccessfully saved reports to '{OUTPUT_CSV_NAME}'")
    else:
        print("\nNo reports were successfully generated.")

if __name__ == "__main__":
    user_specified_dir = input("Enter the path to your PDF directory: ")
    process_pdfs_in_directory(user_specified_dir)
How the Data will look
Since Causes is a list of dictionaries, the CSV output for that specific column will look like a string representation of the list (e.g., "[ {'Cause': '...', 'Controls': [...]}, ... ]").
If you need to "flatten" the data (so that every single Cause has its own row in the spreadsheet), you can use df.explode('Causes') immediately after creating the DataFrame, as in the sketch below.
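This is a rough illustration only (assuming df is the DataFrame built from all_reports above, and that every report has at least one cause dict):

import pandas as pd

# Explode the 'Causes' list so each cause dict gets its own row,
# then expand each dict into separate 'Cause' and 'Controls' columns.
flat = df.explode('Causes', ignore_index=True)
cause_cols = pd.json_normalize(flat['Causes'].tolist())
flat = pd.concat([flat.drop(columns=['Causes']), cause_cols], axis=1)

print(flat[['Source_File', 'Title', 'Cause']].head())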
From <https://gemini.google.com/app/> Google Gemini 3.0 Pro (Thinking)
