#4121 Hazard Report Named Entities

#4121 I have a directory of hazard reports in PDF format for which I want to create lists of named entities for each hazard report. A second pass would cross-check the hazard lists for commonality.

That sounds like a classic Information Extraction (IE) pipeline. Dealing with PDFs adds a layer of "fun" (read: formatting headaches), but once you have the text cleaned up, the entity extraction and cross-referencing are very manageable.

Here is a streamlined approach to building this workflow.


Phase 1: Text Extraction & Normalization

PDFs are notorious for messy layouts. For hazard reports, which often contain tables or diagrams, you need a robust parser.

  • Tools: Use PyMuPDF (fitz) for speed or unstructured for high-quality layout detection (a quick sketch of the unstructured route follows this list).

  • The Goal: Convert each PDF into a clean string or a structured JSON object.
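If you go the unstructured route, a minimal sketch might look like the following (this assumes the library's partition_pdf entry point; pdf_to_record is a hypothetical helper name, and element handling can vary by version and install extras):

from unstructured.partition.pdf import partition_pdf

def pdf_to_record(pdf_path):
    """Convert one PDF into a simple, JSON-ready dict (illustrative helper)."""
    # Each element is a detected layout block (title, narrative text, table, ...)
    elements = partition_pdf(filename=pdf_path)
    return {
        "filename": pdf_path,
        "text": "\n".join(str(el) for el in elements),
    }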

Phase 2: Named Entity Recognition (NER)

You’ll want to extract specific categories like Chemicals, Equipment, Locations, Job Titles, or Regulations.

  • The "Off-the-Shelf" Way: Use spaCy or Hugging Face transformers. They have pre-trained models that find people and places easily, but hazard reports often need custom labels.

  • The LLM Way: If the volume isn't massive (thousands vs. millions), passing chunks of text to an LLM with a specific prompt (e.g., "Extract all hazardous materials and safety gear mentioned") is often more accurate than a generic NER model.
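As a rough sketch of the LLM route (assuming the OpenAI Python client; the model name and prompt are placeholders you would swap for your provider of choice):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_extract(chunk):
    """Ask the model for a JSON array of hazard-relevant entities (illustrative prompt)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "Extract all hazardous materials, equipment, and safety gear. Return a JSON array of strings."},
            {"role": "user", "content": chunk},
        ],
    )
    return response.choices[0].message.content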

Phase 3: Commonality Cross-Check

Once you have lists for each file, you need to find the overlap.

  • Entity Linking: You might run into "HCl" in one report and "Hydrochloric Acid" in another. To cross-check accurately, you should use a normalization step (converting synonyms to a single "canonical" name).

  • Intersection Analysis: Using Python’s set() logic or a pandas DataFrame, you can quickly see which entities appear in 80% of reports versus which are unique outliers.
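Here is a minimal sketch of both steps, assuming you maintain your own synonym map (the aliases below are illustrative):

# Canonicalization: collapse known aliases to a single canonical name
SYNONYMS = {
    "hcl": "hydrochloric acid",
    "ppe": "personal protective equipment",
}

def canonicalize(entities):
    """Lower-case each entity and map known synonyms to one canonical form."""
    return {SYNONYMS.get(e.lower(), e.lower()) for e in entities}

# Intersection analysis with plain set logic
report_a = canonicalize(["HCl", "Forklift", "PPE"])
report_b = canonicalize(["Hydrochloric Acid", "Scaffolding"])
shared = report_a & report_b   # entities common to both reports
only_a = report_a - report_b   # outliers unique to report A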


Suggested Tech Stack

  • Parsing: PyMuPDF or Marker. Best at handling multi-column safety layouts.

  • Extraction: spaCy (GLiNER). GLiNER allows for "zero-shot" entity extraction of custom labels (a quick sketch follows).

  • Analysis: pandas. Perfect for counting occurrences across the directory.

  • Storage: SQLite or JSON. Keeps the extracted data portable for the second pass.
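A short sketch of the GLiNER zero-shot approach (assuming the gliner package; the checkpoint name, labels, and threshold are placeholders):

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")  # placeholder checkpoint

labels = ["chemical", "equipment", "ppe", "regulation"]
text = "Workers handling hydrochloric acid must wear nitrile gloves per OSHA 1910.132."

# predict_entities returns dicts with "text", "label", and a confidence "score"
for ent in model.predict_entities(text, labels, threshold=0.5):
    print(ent["text"], "->", ent["label"])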

How I can help right now

To give you a more tailored recommendation, I need to know the scale of your project:

  1. Volume: Are we talking about 50 reports or 5,000?

  2. Format: Are these mostly text-heavy, or do they rely on tables/forms?

  3. Specific Entities: Beyond names and dates, are you looking for specific "Safety" entities (e.g., CAS numbers, PPE types)?

Would you like me to write a Python script template that extracts text from a directory of PDFs and identifies basic entities?


Yes

This script uses a combination of PyMuPDF (fastest for text extraction) and spaCy (best for NLP) to process your directory.

To handle the "second pass," I’ve included a Pandas-based commonality aggregator that identifies which entities (chemicals, parts, etc.) appear across multiple hazard reports; a short Jaccard Similarity sketch for comparing individual report pairs follows the workflow notes below.

Prerequisites

You will need to install the following and download the English language model:

pip install pymupdf spacy pandas
python -m spacy download en_core_web_sm

The Solution Script

import fitz  # PyMuPDF
import spacy
import pandas as pd
import os
from collections import Counter

# 1. Setup NER Model
# Note: 'en_core_web_sm' is standard. For technical hazard reports, 
# you may later want to train a custom model for 'CHEMICAL' or 'EQUIPMENT'.
nlp = spacy.load("en_core_web_sm")

def extract_text_from_pdf(pdf_path):
    """Extracts raw text from a PDF file."""
    text = ""
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text += page.get_text()
    return text

def get_entities(text):
    """Extracts unique entities, focusing on Orgs, Products, and GPE (Locations)."""
    doc = nlp(text)
    # We filter for specific labels often found in safety/hazard reports
    # ORG = Companies/Agencies, PRODUCT = Vehicles/Equipment, GPE = Locations
    entities = {ent.text.strip().lower() for ent in doc.ents if ent.label_ in ["ORG", "PRODUCT", "GPE", "LAW"]}
    return list(entities)

def process_directory(directory_path):
    all_report_data = []
    
    print(f"--- Phase 1: Extracting Entities from {directory_path} ---")
    for filename in os.listdir(directory_path):
        if filename.lower().endswith(".pdf"):
            path = os.path.join(directory_path, filename)
            text = extract_text_from_pdf(path)
            entities = get_entities(text)
            
            all_report_data.append({
                "filename": filename,
                "entities": entities
            })
            print(f"Processed: {filename} (Found {len(entities)} entities)")
            
    return all_report_data

def analyze_commonality(data):
    """Second Pass: Cross-check lists for commonality."""
    print("\n--- Phase 2: Analyzing Commonality ---")
    
    # Flatten the per-report entity lists. Entities are already unique within each
    # report, so each count below equals the number of reports mentioning that entity.
    all_entities_flat = [ent for report in data for ent in report['entities']]
    entity_counts = Counter(all_entities_flat)
    
    # Create a summary DataFrame
    common_df = pd.DataFrame(entity_counts.items(), columns=['Entity', 'Report_Count'])
    common_df = common_df.sort_values(by='Report_Count', ascending=False)
    
    return common_df

# --- Execution ---
# Replace 'reports' with your actual folder path
folder_path = "./hazard_reports" 
if not os.path.exists(folder_path):
    os.makedirs(folder_path)
    print(f"Please place your PDFs in '{folder_path}' and run again.")
else:
    results = process_directory(folder_path)
    commonality_report = analyze_commonality(results)
    
    print("\nTop 10 Most Common Entities Across Reports:")
    print(commonality_report.head(10))

    # Optional: Save to CSV
    # commonality_report.to_csv("hazard_commonality_analysis.csv", index=False)

How this addresses your workflow:

  1. First Pass (NER Extraction): It loops through your directory, opens each PDF, and uses spaCy to "read" it. It specifically looks for Organizations, Products, Locations, and Laws, the typical building blocks of a hazard report.

  2. Normalization: The script uses .lower().strip() to ensure that "Oxygen" and "oxygen " are treated as the same entity.

  3. Second Pass (Cross-check): The Counter object acts as a pivot table, showing you exactly how many reports share a specific entity.
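For a pairwise view (rather than the global Counter), a Jaccard similarity check is a small addition. This sketch assumes the results list produced by process_directory above; the 0.3 threshold is illustrative:

from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two entity lists: |intersection| / |union|."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

# Compare every pair of reports and flag the most similar ones
for r1, r2 in combinations(results, 2):
    score = jaccard(r1["entities"], r2["entities"])
    if score > 0.3:
        print(f"{r1['filename']} <-> {r2['filename']}: {score:.2f}")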

Why this might need a "Step 1.5":

Hazard reports often use highly specific jargon (e.g., "CAS 64-17-5" for Ethanol). The standard spaCy model might not recognize these out of the box.
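As a minimal illustration of that idea (assuming spaCy's EntityRuler component; the labels and regex are placeholders, and hyphenated codes may need multi-token patterns depending on how the tokenizer splits them):

# Add a rule-based pass before the statistical NER component
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    # CAS registry numbers look like "64-17-5": 2-7 digits, 2 digits, 1 digit
    {"label": "CAS_NUMBER", "pattern": [{"TEXT": {"REGEX": r"^\d{2,7}-\d{2}-\d$"}}]},
    # Exact-match phrases for known chemicals or PPE
    {"label": "CHEMICAL", "pattern": "hydrochloric acid"},
    {"label": "PPE", "pattern": "nitrile gloves"},
])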

Would you like me to show you how to add a "Ruler" to this script to catch specific hazard codes or chemicals that generic AI might miss?


From <https://gemini.google.com/app/>  Google Gemini 3.0 Pro (Thinking)
