#3936 Deterministic PDF Queries

#3936 Adobe PDF. Describe a method to use deterministic queries on PDF files.

To perform deterministic queries on a PDF, you must bypass probabilistic methods (like AI/LLMs) and instead rely on rule-based extraction. Because PDFs are essentially instructions on "where to paint text," deterministic querying relies on anchoring your search to specific coordinates (spatial) or text patterns (regex).

The most robust method for this is using Python with a library that exposes layout-aware text extraction. Below is a method using pdfplumber, which is highly effective for deterministic extraction because it allows you to query the PDF based on physical layout.

The Method: Layout-Anchored Extraction

This method treats the PDF page like a coordinate grid. You define "regions of interest" (like a database schema) and programmatically extract data only from those specific boxes.

Prerequisites

You will need Python and the library pdfplumber.

pip install pdfplumber

Step-by-Step Implementation

1. Define the "Query" (The Bounding Box)

Unlike a SQL query which selects by column name, a PDF deterministic query selects by geometry. You must define the area where the data lives.

  • Top/Bottom: The vertical coordinates (y axis).

  • Left/Right: The horizontal coordinates (x axis).
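
In practice you rarely know these coordinates up front. A quick way to discover them is to dump every word on the page along with its position and read off the bounds you need. Below is a minimal sketch (the file name is a placeholder):

import pdfplumber

with pdfplumber.open("invoice_001.pdf") as pdf:
    page = pdf.pages[0]
    for word in page.extract_words():
        # Each word carries its bounding box: x0/x1 (horizontal), top/bottom (vertical)
        print(word["text"], round(word["x0"]), round(word["top"]))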

2. Execute the Extraction

Load the PDF, navigate to the specific page, crop the page to your bounding box, and extract the text.

3. Apply Pattern Validation (Regex)

Once you have the text from the specific box, use Regular Expressions (Regex) to ensure the data matches the expected format (e.g., a date YYYY-MM-DD or an Invoice ID).

Code Example: The "Spatial Query"

Here is a Python script that acts as a deterministic query engine. It looks for a specific value (e.g., "Total Amount") in a fixed location on the page.

import pdfplumber
import re

def execute_deterministic_query(pdf_path, page_num, bbox, regex_pattern=None):
    """
    Args:
        pdf_path (str): Path to the PDF file.
        page_num (int): The page number to query (0-indexed).
        bbox (tuple): The spatial query (x0, top, x1, bottom).
        regex_pattern (str): Optional regex to validate/clean the result.
    """
    results = []
    
    with pdfplumber.open(pdf_path) as pdf:
        # 1. Select the Page
        page = pdf.pages[page_num]
        
        # 2. Execute Spatial Query (Crop)
        # We crop the page to the specific "cell" we care about
        cropped_area = page.crop(bbox)
        
        # 3. Extract Text (and layout info if needed)
        extracted_text = cropped_area.extract_text()
        
        # 4. Post-Processing (Pattern Validation)
        if regex_pattern and extracted_text:
            match = re.search(regex_pattern, extracted_text)
            if match:
                results.append(match.group(0))
            else:
                results.append(None) # Pattern failed: the region was found, but its text did not match the expected format
        else:
            results.append(extracted_text)
            
    return results

# --- EXAMPLE USAGE ---

# "Query": Select the area where the 'Total' is usually located on an invoice.
# Coordinates: (Left: 400, Top: 600, Right: 550, Bottom: 650)
query_box = (400, 600, 550, 650)

# "Constraint": The result must look like a currency (e.g., $1,234.56)
money_pattern = r"\$\d{1,3}(,\d{3})*(\.\d{2})?"

# Run
data = execute_deterministic_query("invoice_001.pdf", 0, query_box, money_pattern)
print(f"Extracted Value: {data}")

Why This is Deterministic

  1. Input Stability: If you feed the exact same PDF into this script, you will get the exact same output 100% of the time. There is no "temperature" or inference involved.

  2. Hard Boundaries: The query fails explicitly if the data is not in the bbox (bounding box), rather than hallucinating a plausible answer.

  3. Verifiable: You can visually map the bbox coordinates to the PDF page to audit exactly where the data came from.
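
To make that audit easy, pdfplumber can render the page with your bounding box drawn on top. A minimal sketch (requires Pillow; file names are placeholders):

import pdfplumber

with pdfplumber.open("invoice_001.pdf") as pdf:
    page = pdf.pages[0]
    # Render the page and overlay the query region to confirm the bbox
    # actually covers the field you intend to extract.
    image = page.to_image(resolution=150)
    image.draw_rect((400, 600, 550, 650))
    image.save("bbox_audit.png")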

Alternative: pdfquery (Selector-Based)

If you prefer a syntax closer to CSS or jQuery rather than calculating raw coordinates, you can use the pdfquery library. It converts the PDF into an XML tree, allowing you to run queries like:

"Find the text element that is visually below the text element 'Total:' and aligned to the right."

import pdfquery

pdf = pdfquery.PDFQuery("document.pdf")
pdf.load()

# Query: Find text with label "Total", then grab the text immediately to its right
label = pdf.pq('LTTextLineHorizontal:contains("Total")')
value = pdf.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % (
    label.attr('x1'), label.attr('y0'), float(label.attr('x1')) + 100, label.attr('y1')
)).text()

Next Step

Would you like me to generate a script that helps you visualize the coordinates of a PDF so you can easily define these bounding boxes?


Provide a prompt I can submit to an LLM to identify query-able labels in a PDF and generate REGEX for each

Here is a robust prompt you can use. This prompt is designed to turn an LLM into a "Regex Architect" that analyzes your document's structure and outputs precise, Python-compatible regular expressions.

You should use this prompt with an LLM that allows file uploads (like Gemini 1.5 Pro, Claude 3.5 Sonnet, or GPT-4o) so it can "see" the visual layout, or you can paste the raw text if the layout is simple.

The Prompt

Act as a Data Extraction Engineer specializing in Python and Regular Expressions.

Your Goal: Analyze the provided PDF content and identify "query-able" fields (Key-Value pairs). For each identified field, generate a precise, deterministic Python re pattern to extract the value based on the label.

Instructions:

  1. Scan the Document: Look for standard document labels (e.g., "Invoice Number:", "Date:", "Total:", "Part #").

  2. Identify Structure: Determine if the value is to the right of the label, below the label, or part of a table.

  3. Generate Regex: Create a regex pattern that uses the label as an anchor (a lookbehind) to capture the value.

  • Constraint: The regex must be specific to the data type (e.g., use \d{2}/\d{2}/\d{4} for dates, not .*).

  • Constraint: Account for variable whitespace between the label and the value (\s+).

  • Constraint: Handle potential OCR noise if applicable.

  4. Output Format: Provide the result as a JSON list with the following keys:

  • field_name: A standardized variable name (e.g., invoice_date).

  • anchor_text: The literal label text found in the PDF.

  • sample_value: The actual value found in this specific document.

  • regex_pattern: The raw Python regex string.

Example Output Structure:

[
  {
    "field_name": "invoice_total",
    "anchor_text": "Total Due:",
    "sample_value": "$4,500.00",
    "regex_pattern": "(?<=Total Due:)\\s*\\$[\\d,]+\\.\\d{2}"
  }
]

Here is the document content: [UPLOAD PDF OR PASTE TEXT HERE]


How to use this effectively

  1. For best results (Visual Anchoring): If you use a model like Gemini 1.5 Pro or GPT-4o, upload the PDF file directly. These models can "see" the spatial relationship (e.g., the "Total" is at the bottom right) and generate regex that accounts for newlines if the layout is tricky.

  2. For text-only models: If you are pasting raw text (extracted via pdfplumber or similar), the layout might be flattened. In this case, add this line to the prompt:

    "Note: The text provided is a raw dump. Please account for newlines \n that might separate a label from its value."

Next Step

Would you like a Python script that takes this JSON output and automatically runs it against a folder of PDFs to test if the Regex patterns hold up?


Yes

Here is the Validation Script.

This script acts as a "unit test" for your extraction logic. It iterates through a folder of PDFs, applies the regex patterns defined in your JSON, and generates a CSV report detailing exactly which files matched and which failed.

This approach aligns with Verification & Validation (V&V) principles—essential for reliable systems engineering.

1. The Configuration (query_rules.json)

First, save the JSON output from the LLM into a file named query_rules.json. It should look like this:

[
  {
    "field_name": "invoice_date",
    "regex_pattern": "(?<=Date:)\\s*\\d{2}/\\d{2}/\\d{4}"
  },
  {
    "field_name": "total_amount",
    "regex_pattern": "(?<=Total:)\\s*\\$[\\d,]+\\.\\d{2}"
  }
]

2. The Validation Script (validate_extraction.py)

import pdfplumber
import re
import json
import csv
from pathlib import Path

# --- CONFIGURATION ---
RULES_FILE = "query_rules.json"
PDF_FOLDER = "./test_pdfs"  # Folder containing your sample PDFs
REPORT_FILE = "extraction_validation_report.csv"

def load_rules(filepath):
    """Loads the extraction rules from JSON."""
    try:
        with open(filepath, 'r') as f:
            return json.load(f)
    except FileNotFoundError:
        print(f"Error: Rules file '{filepath}' not found.")
        return []

def extract_text_from_pdf(pdf_path):
    """Extracts raw text from a PDF for regex scanning."""
    text_content = ""
    try:
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                text_content += (page.extract_text() or "") + "\n"  # extract_text() can return None for empty pages
    except Exception as e:
        print(f"Error reading {pdf_path}: {e}")
        return None
    return text_content

def validate_patterns():
    rules = load_rules(RULES_FILE)
    if not rules:
        return

    pdf_files = list(Path(PDF_FOLDER).glob("*.pdf"))
    print(f"Found {len(pdf_files)} PDFs to process...")

    results_log = []

    for pdf_file in pdf_files:
        print(f"Processing: {pdf_file.name}")
        full_text = extract_text_from_pdf(pdf_file)
        
        if full_text is None:
            continue

        # Create a row for this file
        row = {"filename": pdf_file.name}
        
        # Apply each rule
        for rule in rules:
            field = rule['field_name']
            pattern = rule['regex_pattern']
            
            # Execute Regex
            match = re.search(pattern, full_text, re.MULTILINE)
            
            if match:
                row[field] = match.group(0).strip() # Success: Store the value
            else:
                row[field] = "MISSING" # Failure: Pattern didn't match
        
        results_log.append(row)

    # --- GENERATE REPORT ---
    if results_log:
        # Dynamically determine CSV headers based on keys in the first result
        headers = results_log[0].keys()
        
        with open(REPORT_FILE, 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=headers)
            writer.writeheader()
            writer.writerows(results_log)
            
        print(f"\n✅ Validation Complete. Report saved to: {REPORT_FILE}")
    else:
        print("\n⚠️ No results to report.")

if __name__ == "__main__":
    validate_patterns()

How to Analyze the Results

Open extraction_validation_report.csv. You will see a matrix:

filename | invoice_date | total_amount
inv_001.pdf | 12/01/2024 | $500.00
inv_002.pdf | MISSING | $1,250.00
inv_003.pdf | 12/03/2024 | $75.00

  • Success: The column contains data.

  • Failure: The column says MISSING.

  • Action: If inv_002.pdf failed on "invoice_date," you open that specific PDF, look at the date format, and refine the regex in your JSON (e.g., perhaps that specific file used "Dec 1, 2024" instead of "12/01/2024").
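
For example, if the failing file spells the date out, one option is to widen the pattern with an alternation. A sketch (the exact formats in your documents may differ):

import re

# Accept either "12/01/2024" or "Dec 1, 2024" after the "Date:" label.
date_pattern = r"(?<=Date:)\s*(?:\d{2}/\d{2}/\d{4}|[A-Z][a-z]{2} \d{1,2}, \d{4})"

print(re.search(date_pattern, "Date: 12/01/2024").group(0).strip())  # 12/01/2024
print(re.search(date_pattern, "Date: Dec 1, 2024").group(0).strip()) # Dec 1, 2024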

Next Step

Would you like me to add a "Confidence Score" feature to this script, where it flags values that match the regex technically but look suspicious (e.g., a date that is 50 years in the past)?


From <https://gemini.google.com/app/>  Google Gemini 3.0 Pro (Thinking)



Create a python app to parse a PDF file, identify labeled content (e.g. Title: title, Description: description, etc.) and generate a regex table for the list of labels

This is a common task in data extraction/ETL pipelines. To achieve this, we need a solution that does three things:

  1. Extracts raw text from a PDF.

  2. Applies heuristics to identify what looks like a "Label" (usually text at the start of a line ending in a colon).

  3. Constructs a Regex pattern dynamically for each found label to capture the value associated with it.

Here is a complete, self-contained Python solution.

Prerequisites

You will need to install pypdf for reading files and pandas for formatting the output table.

pip install pypdf pandas

The Python Application (pdf_regex_generator.py)

This script includes a PDFRegexGenerator class you can use in your own projects, plus a "Demo Mode" that creates a dummy PDF to test the logic immediately.

import re
import pandas as pd
from pypdf import PdfReader
import sys
import os

class PDFRegexGenerator:
    def __init__(self):
        # Heuristic: Looks for lines starting with words and ending with a colon
        # e.g., "Invoice Number:", "Date of Birth:", "Total Cost:"
        self.label_candidate_pattern = re.compile(r"(?m)^\s*([A-Za-z][\w\s\(\)\/\-]+):")

    def extract_text(self, pdf_path):
        """Reads the PDF and returns the full text string."""
        try:
            reader = PdfReader(pdf_path)
            full_text = ""
            for page in reader.pages:
                text = page.extract_text()
                if text:
                    full_text += text + "\n"
            return full_text
        except Exception as e:
            return f"Error reading PDF: {e}"

    def generate_schema(self, text):
        """
        Scans text for labels and generates a regex table.
        Returns a Pandas DataFrame.
        """
        matches = self.label_candidate_pattern.findall(text)
        
        # Deduplicate matches while preserving order
        unique_labels = list(dict.fromkeys(matches))
        
        data = []
        for label in unique_labels:
            # We assume the label is literal text, so we escape it for safety
            safe_label = re.escape(label)
            
            # 1. Capture Value: Captures everything after the label until end of line
            capture_pattern = fr"{safe_label}:\s*(.*)"
            
            # 2. Named Group (Optional): Advanced pattern for programmatic access
            # Clean label to make it a valid regex group name (alphanumeric only)
            clean_group_name = re.sub(r'\W|^(?=\d)', '_', label).strip('_')
            named_pattern = fr"{safe_label}:\s*(?P<{clean_group_name}>.*)"

            data.append({
                "Detected Label": label,
                "Generated Regex (Standard)": capture_pattern,
                "Generated Regex (Named Group)": named_pattern
            })

        df = pd.DataFrame(data)
        return df

# ==========================================
# DEMO UTILITIES (To create a dummy file)
# ==========================================
def create_dummy_pdf(filename="sample_form.pdf"):
    """Creates a simple PDF with labeled data for testing."""
    from reportlab.pdfgen import canvas
    
    try:
        c = canvas.Canvas(filename)
        c.drawString(100, 800, "NASA SAFETY REPORT")
        c.drawString(100, 770, "Mission ID: ARTEMIS-II-001")
        c.drawString(100, 750, "Date Reported: 2025-11-21")
        c.drawString(100, 730, "Hazard Level: Critical")
        c.drawString(100, 710, "Description: Sensor malfunction in primary thruster valve.")
        c.drawString(100, 690, "Reporter: J. Doe (Subcontractor)")
        c.save()
        print(f"[Info] Created dummy PDF at: {filename}")
    except ImportError:
        print("[Error] ReportLab not installed. Cannot generate dummy PDF.")
        print("Run: pip install reportlab")

# ==========================================
# MAIN EXECUTION
# ==========================================
if __name__ == "__main__":
    # 1. Check if user provided a file, otherwise generate a demo
    target_file = "sample_form.pdf"
    
    if len(sys.argv) > 1:
        target_file = sys.argv[1]
    elif not os.path.exists(target_file):
        # Generate a sample if no file exists and no argument provided
        # (Requires reportlab: pip install reportlab)
        try:
            create_dummy_pdf(target_file)
        except Exception:
            print("Please provide a path to an existing PDF.")
            sys.exit()

    # 2. Run the Generator
    app = PDFRegexGenerator()
    print(f"--- Parsing: {target_file} ---")
    
    raw_text = app.extract_text(target_file)
    
    if raw_text.startswith("Error"):
        print(raw_text)
    else:
        df = app.generate_schema(raw_text)
        
        if not df.empty:
            print("\n--- Identified Labels & Regex Schema ---")
            # Adjust pandas display options for better visibility in terminal
            pd.set_option('display.max_colwidth', None)
            pd.set_option('display.expand_frame_repr', False)
            
            print(df)
            
            # Optional: Save to CSV
            # df.to_csv("regex_schema.csv", index=False)
            # print("\nSaved table to regex_schema.csv")
        else:
            print("No labels detected using the current heuristic (Text followed by ':').")

How it works

  1. Text Extraction (pypdf): It loads the PDF and scrapes all text from every page.

  2. Label Discovery (Heuristic):

    • The script uses this regex: (?m)^\s*([A-Za-z][\w\s\(\)\/\-]+):

    • (?m): Multiline mode (checks line by line).

    • ^\s*: Matches start of the line (ignoring indentation).

    • [A-Za-z]: Ensures the label starts with a letter (avoids matching timestamps like "12:00").

    • [\w\s\(\)\/\-]+: Allows the label to contain word characters, spaces, parentheses, slashes, or dashes.

    • :: The anchor that tells us this is likely a key-value pair.

  3. Regex Generation:

    • It takes the detected label (e.g., Mission ID) and wraps it into a capture pattern: Mission ID:\s*(.*).

    • This pattern tells Python to find the literal phrase "Mission ID:", ignore whitespace \s*, and capture everything else on that line (.*).
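
Applying one of the generated patterns then looks like this (a short sketch using the dummy report's labels):

import re

report_text = "Mission ID: ARTEMIS-II-001\nHazard Level: Critical"

# The standard pattern captures everything after the label on that line.
print(re.search(r"Mission ID:\s*(.*)", report_text).group(1))  # ARTEMIS-II-001

# The named-group variant exposes the value by field name.
print(re.search(r"Mission ID:\s*(?P<Mission_ID>.*)", report_text).group("Mission_ID"))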

Sample Output

If you run the script as is (it creates the dummy NASA report), the output will look like this:

--- Parsing: sample_form.pdf ---

--- Identified Labels & Regex Schema ---
  Detected Label           Generated Regex (Standard)                    Generated Regex (Named Group)
0     Mission ID           Mission ID:\s*(.*)                            Mission ID:\s*(?P<Mission_ID>.*)
1  Date Reported           Date Reported:\s*(.*)                         Date Reported:\s*(?P<Date_Reported>.*)
2   Hazard Level           Hazard Level:\s*(.*)                          Hazard Level:\s*(?P<Hazard_Level>.*)
3    Description           Description:\s*(.*)                           Description:\s*(?P<Description>.*)
4       Reporter           Reporter:\s*(.*)                              Reporter:\s*(?P<Reporter>.*)

Would you like me to...

Add a feature to immediately test the generated regex against the file content to show a preview of the extracted values in a third column?


What are various examples of regex that would be useful for parsing PDFs?

Parsing PDF text is often messy because the "visual" layout is lost during extraction; headers, footers, and multi-column text can get interleaved.

To handle this, robust regex strategies for PDFs fall into three main categories: Anchors (finding data relative to a label), Entities (finding specific formats like dates), and Cleanup (fixing extraction artifacts).

Here are the most useful regex patterns for parsing PDFs, ranging from simple to advanced.

1. Key-Value Extraction (The "Anchor" Strategy)

This is the most common use case for forms (e.g., Invoices, NASA Reports). You rely on a known "Label" to find the unknown "Value."

Goal | Regex Pattern | Explanation
Basic Label | `(?m)^Title:\s*(.*)` | Finds "Title:" at the start of a line and captures the rest of that line. (?m) ensures ^ matches the start of lines, not just the string.
Flexible Separator | `(?i)Invoice\s*[:#\.-]\s*(\S+)` | Handles variations like "Invoice:", "Invoice #", "Invoice-". (?i) makes it case-insensitive. Captures the first word after the separator.
Variable Spacing | `Total\s+Due\s*[:]\s*(\$[\d,\.]+)` | Handles extra spaces between words (common in PDFs) like "Total   Due". Captures currency values.
Lookbehind (Clean) | `(?<=Order ID:\s)(\w+)` | Uses a Positive Lookbehind to find the ID without consuming the "Order ID: " part in the match result.

2. Entity Extraction (The "Pattern" Strategy)

Use these when you don't know where the data is, but you know what it looks like (e.g., finding every date or email on a page).

Goal | Regex Pattern | Explanation
Dates (ISO) | `\b\d{4}-\d{2}-\d{2}\b` | Matches standard YYYY-MM-DD. The \b (word boundary) prevents matching inside longer numbers.
Dates (US) | `\b(0?[1-9]|1[0-2])/(0?[1-9]|[12]\d|3[01])/\d{4}\b` | Matches MM/DD/YYYY dates; the alternations constrain the month to 1-12 and the day to 1-31.
Email Addresses | `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}` | Standard email extraction. Essential for contact lists.
URLs | `https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+` | Matches http/https links; extend the character class if you need full paths with slashes and query strings.
Prices/Currency | `\$\s?[\d,]+(?:\.\d{2})?` | Matches "$1,000.00" or "$ 50" (handling the space that sometimes appears between symbol and number in PDFs).
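
Because these patterns do not depend on a label, they are typically applied with re.findall to sweep an entire page. A short sketch (the sample string is illustrative):

import re

page_text = "Contact jane.doe@example.com by 2025-03-14. Invoice total: $1,250.00"

emails = re.findall(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", page_text)
dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", page_text)
prices = re.findall(r"\$\s?[\d,]+(?:\.\d{2})?", page_text)

print(emails)  # ['jane.doe@example.com']
print(dates)   # ['2025-03-14']
print(prices)  # ['$1,250.00']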

3. Advanced: Multiline & Block Extraction

Text in PDFs often wraps to the next line unexpectedly. These patterns help capture "blocks" of text, like a "Description" field.

The "Stop at Next Label" Pattern

This is the holy grail of parsing descriptions. It captures everything after "Description:" until it hits a new line that looks like a new label (e.g., starts with a capital letter).

# Regex:
# 1. Match "Description:"
# 2. Match any character (including newlines) non-greedily
# 3. Stop when we see a newline followed by a Capital Letter (Lookahead)
pattern = r"(?s)Description:\s*(.*?)(?=\n[A-Z][a-z])"

  • (?s): Dot-matches-newline mode (critical for multiline blocks).

  • (?=\n[A-Z][a-z]): Positive Lookahead. It asserts that the next thing is a newline followed by a capitalized word, but it doesn't consume it.

4. PDF Cleanup (Sanitization)

Raw PDF text often contains "artifacts"—weird hyphens, excessive spaces, or page numbers breaking up sentences.

Artifact | Regex to Fix | Python Replacement
Broken Words | `(\w)-\n(\w)` | r"\1\2" (joins "soft- ware" back into "software")
Excessive Space | `[ \t]+` | " " (replaces multiple tabs/spaces with a single space)
Header/Footer Noise | `(?m)^Page\s+\d+\s+of\s+\d+.*$` | "" (deletes lines like "Page 1 of 10")
Leading/Trailing Whitespace | `(?m)^[ \t]+|[ \t]+$` | "" (strips spaces/tabs at the start and end of each line)
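
A brief sketch applying a few of these fixes in sequence (the sample string is illustrative):

import re

raw = "The soft-\nware shall not fail.\nPage 3 of 10\n   Trailing   spaces here.   "

text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)                # rejoin broken words
text = re.sub(r"(?m)^Page\s+\d+\s+of\s+\d+.*$", "", text)  # drop footer lines
text = re.sub(r"[ \t]+", " ", text)                        # collapse runs of spaces/tabs
text = re.sub(r"(?m)^[ \t]+|[ \t]+$", "", text)            # trim each line's edges

print(text)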

Python Example: Applying the "Flexible" Strategy

This script demonstrates how to be "lazy" with regex (using .*?) to handle the unpredictable whitespace of PDFs.

import re

pdf_text = """
INVOICE  # 1024
DATE:    
2025-10-27
TOTAL AMOUNT: $ 4,500.00
NOTES: This is a multiline note
that wraps to the next line.
TERMS: Net 30
"""

# 1. Capture Invoice (Handle spaces between 'Invoice' and '#')
invoice_match = re.search(r"(?i)invoice\s*#\s*(\d+)", pdf_text)

# 2. Capture Date (Handle value being on the NEXT line)
# \s* matches the newline after "DATE:" automatically
date_match = re.search(r"DATE:\s*(\d{4}-\d{2}-\d{2})", pdf_text)

# 3. Capture Multiline Notes (Stop at 'TERMS')
# (?s) allows . to match newlines
notes_match = re.search(r"(?s)NOTES:\s*(.*?)(?=TERMS:)", pdf_text)

print(f"Invoice: {invoice_match.group(1)}")  # 1024
print(f"Date:    {date_match.group(1)}")     # 2025-10-27
print(f"Notes:   {notes_match.group(1).strip().replace(chr(10), ' ')}") 
# Result: "This is a multiline note that wraps to the next line."

Next Step

Would you like to integrate the "Stop at Next Label" logic into the generator app I wrote previously, so it can automatically detect and grab multiline descriptions?


When converting PDF to text, how can I cleanup the text file in Python

Here is a step-by-step guide to cleaning up text files generated from PDFs using Python, focusing on common issues like bad encoding, weird whitespace, and artifact characters.

1. Basic Cleaning with Standard Libraries

Often, PDF extractions result in excessive whitespace, \n characters in the middle of sentences, and non-ASCII characters.

Key Techniques:

  • Encoding fixes: Convert text to UTF-8 to handle special characters.

  • Whitespace normalization: Replace multiple spaces/newlines with a single space.

import re

def clean_text_basic(text):
    # 1. Fix encoding issues (if your text has mojibake like ’)
    # This step is sometimes needed if the extraction didn't handle encoding well
    try:
        text = text.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass # Skip if text is already handled

    # 2. Remove non-printable characters (control chars)
    # Keep printable characters plus whitespace so word boundaries survive
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())

    # 3. Normalize whitespace
    # Replaces tabs, newlines, and multiple spaces with a single space
    text = re.sub(r'\s+', ' ', text)

    return text.strip()

# Example usage
raw_text = "This is  a \n PDF text   with \t weird   spacing."
cleaned = clean_text_basic(raw_text)
print(cleaned)
# Output: "This is a PDF text with weird spacing."

2. Advanced Cleaning (Regex Patterns)

PDFs often contain headers, footers, page numbers, and hyphenated words broken across lines.

Common Patterns to Fix:

  • Hyphenation: Join words split by a hyphen and a newline (e.g., "environ-\nment" -> "environment").

  • CID codes: Remove mapping artifacts like (cid:123).

  • URLs/Emails: Sometimes you want to remove or normalize these.

def clean_text_advanced(text):
    # 1. Fix hyphenation at line breaks (e.g. "pro-\ncess" becomes "process")
    # We look for a hyphen, optional whitespace/newline, and the rest of the word
    text = re.sub(r'(\w+)-\s*\n\s*(\w+)', r'\1\2', text)
    
    # 2. Remove specific PDF artifacts like (cid:xx)
    text = re.sub(r'\(cid:\d+\)', '', text)

    # 3. Remove Page Numbers (simple heuristic: standalone numbers at start/end of lines)
    # This looks for a number on its own line
    text = re.sub(r'^\s*\d+\s*$', '', text, flags=re.MULTILINE)

    # 4. Collapse multiple newlines into paragraph breaks
    # This preserves paragraphs but removes single line breaks inside sentences
    text = re.sub(r'\n{2,}', 'PARAGRAPH_BREAK', text) # Mark paragraphs
    text = re.sub(r'\n', ' ', text)                    # Join lines within paragraph
    text = text.replace('PARAGRAPH_BREAK', '\n\n')     # Restore paragraphs

    return text

3. Using ftfy (Fix Text For You)

If your PDF text is full of "garbage" characters (like é instead of é), the library ftfy is the gold standard for automatically fixing broken unicode.

Installation:

pip install ftfy

Usage:

import ftfy

raw_text = "The text has “smart quotes†and mojibake."
cleaned = ftfy.fix_text(raw_text)

print(cleaned)
# Output: "The text has “smart quotes” and mojibake."

4. Cleaning with NLP (Spacy / NLTK)

If you need the text for machine learning or analysis, you might want to remove stopwords or lemmatize words.

import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

def clean_with_nlp(text):
    doc = nlp(text)
    
    # Keep only words that are NOT stop words, punctuation, or numbers
    tokens = [
        token.lemma_.lower() 
        for token in doc 
        if not token.is_stop and not token.is_punct and not token.like_num
    ]
    
    return " ".join(tokens)

Summary of Tools

Tool | Best Used For
re (Regex) | Removing specific patterns (page numbers, headers, unwanted symbols).
ftfy | Fixing broken unicode (mojibake) and encoding errors automatically.
unicodedata | Normalizing characters (e.g., converting fancy quotes to straight quotes).
spacy / nltk | Preparing text for AI/ML (lemmatization, stopword removal).
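
The one tool in that table not demonstrated above is unicodedata. Here is a minimal sketch: NFKC normalization folds compatibility characters (ligatures, fullwidth forms, non-breaking spaces), and a small translation table handles curly quotes, which NFKC does not straighten.

import unicodedata

def normalize_unicode(text):
    # Fold compatibility characters: ligatures (e.g., ﬁ -> fi), fullwidth
    # forms, and non-breaking spaces become their plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Curly quotes and long dashes are untouched by NFKC, so map them directly.
    quote_map = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'", "\u2013": "-", "\u2014": "-"})
    return text.translate(quote_map)

print(normalize_unicode("\u201cSoft\ufb01le\u201d costs 5\u00a0USD"))
# Output: "Softfile" costs 5 USD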

Next Step

Would you like me to write a script that specifically targets removing headers and footers, which are notoriously difficult in PDF text extraction?


Yes

Removing headers and footers from an already extracted text file is challenging because they are often interleaved with the main content.

Since you are a software veteran, you likely prefer a deterministic solution. However, because flat text lacks coordinate data, we must use Frequency Analysis (statistical probability) and Pattern Matching.

Here is a robust Python approach that targets headers/footers based on the "First Principle" that headers repeat, while body text does not.

Method 1: The Statistical Cleaner (For existing text files)

This script analyzes the entire text file to find lines that repeat frequently (static headers) or match page-number patterns (dynamic footers).

import re
from collections import Counter

def clean_headers_footers(text_content, repetition_threshold=3):
    """
    Removes lines that repeat often (headers) or look like page numbers.
    
    Args:
        text_content (str): The full text string.
        repetition_threshold (int): If a line appears more than this many times, 
                                    assume it's a header/footer artifact.
    """
    
    lines = text_content.split('\n')
    
    # 1. Identify Repetitive Lines (Static Headers/Footers)
    # We strip whitespace to catch headers that might have slight indentation variations
    line_counts = Counter(line.strip() for line in lines if line.strip())
    
    # Identify lines that appear frequently (likely headers like "NASA-STD-8719.13")
    # We ignore very short lines (len < 4) to avoid deleting common bullets or numbers.
    repetitive_lines = {
        line for line, count in line_counts.items() 
        if count >= repetition_threshold and len(line) > 4
    }

    cleaned_lines = []
    
    # 2. Define Regex for Dynamic Footers (Page X of Y)
    # Matches: "Page 1", "pg. 2", "Page 1 of 20", and standalone numbers like "5"
    page_num_pattern = re.compile(
        r'^\s*(-?\s*(page|pg)\.?\s*\d+(\s*(of|/)\s*\d+)?\s*-?|^\s*\d+\s*$)', 
        re.IGNORECASE
    )

    for line in lines:
        stripped = line.strip()
        
        # Check A: Is it a static repeating header?
        if stripped in repetitive_lines:
            continue 
            
        # Check B: Is it a dynamic page number?
        if page_num_pattern.match(stripped):
            continue
            
        cleaned_lines.append(line)

    # Reassemble text
    return '\n'.join(cleaned_lines)

# --- Example Usage ---
sample_text = """
NASA SAFETY REPORT
Page 1 of 5
The system shall perform a self-check on startup.
NASA SAFETY REPORT
Page 2 of 5
If the check fails, the system shall abort.
NASA SAFETY REPORT
Page 3 of 5
End of report.
"""

clean_text = clean_headers_footers(sample_text, repetition_threshold=2)
print(clean_text)
# Output will only contain the body text lines.

Method 2: The Geometric Crop (The "Upstream" Fix)

If you have the ability to re-run the extraction (rather than just cleaning the output file), the most accurate method is to ignore the top and bottom 10% of the PDF page entirely. This bypasses the text processing guesswork.

You can use pdfplumber for this, which is excellent for table and layout preservation.

import pdfplumber

def extract_body_text_only(pdf_path):
    full_text = []
    
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            width = page.width
            height = page.height
            
            # Define crop box: (x0, top, x1, bottom)
            # This crops out the top 10% and bottom 10% of the page
            header_cutoff = height * 0.10
            footer_cutoff = height * 0.90
            
            # Crop the page before extraction
            cropped_page = page.crop((0, header_cutoff, width, footer_cutoff))
            
            # Extract text from the "safe zone"
            text = cropped_page.extract_text()
            if text:
                full_text.append(text)
                
    return "\n".join(full_text)

Which one should you use?

Scenario | Recommended Method
You only have the .txt file | Use Method 1 (Statistical). Since you are working with NASA reports, the headers are likely standardized (e.g., "NPR 7150.2"), making them easy for the Counter to catch.
You have the raw .pdf | Use Method 2 (Geometric). It is safer because it won't accidentally delete a sentence that happens to repeat twice in the text.

Note on Semantic Search

Since you are building a semantic search tool for hazard reports, Method 1 is safer for preserving context if you can't re-extract. Be careful with the repetition_threshold—if you set it too low (e.g., 2), you might accidentally delete a common hazard warning phrase that appears multiple times in the body text (e.g., "Catastrophic failure mode"). Setting it to 4 or 5 is usually the "sweet spot" for document processing.

Would you like to see how to "chunk" this cleaned text specifically for the semantic search embedding process you are building?


What I want is sequential parsing of the document while looping over a list of regex strings

This approach aligns perfectly with a pipeline architecture. By treating your cleaning process as a sequential series of transformations, you gain deterministic control and can debug specific steps without breaking the whole chain.

Since you are a software veteran, I have structured this as a configurable engine where the "logic" is separated from the "rules" (the list of regex strings).

The Sequential Regex Pipeline

Here is a Python implementation that iterates over a list of cleaning rules. I have included a verbose mode so you can see exactly how many times each rule triggers—essential for debugging aggressive regex patterns.

import re

def clean_text_pipeline(text, cleaning_rules, verbose=False):
    """
    Sequentially applies a list of regex substitution rules to the text.
    
    Args:
        text (str): The raw text to clean.
        cleaning_rules (list): A list of tuples: (Description, Pattern, Replacement, Flags).
        verbose (bool): If True, prints stats about replacements made.
        
    Returns:
        str: The cleaned text.
    """
    
    total_subs = 0
    
    for name, pattern, replacement, flags in cleaning_rules:
        # Pre-compile for efficiency if processing large/multiple files
        regex = re.compile(pattern, flags)
        
        if verbose:
            # Count matches before replacing (for logging)
            match_count = len(regex.findall(text))
            if match_count > 0:
                print(f"[HIT] {name}: Found {match_count} matches.")
            total_subs += match_count

        # Apply the substitution
        text = regex.sub(replacement, text)

    if verbose:
        print(f"--- Pipeline Complete: {total_subs} total changes made ---")
        
    return text

# --- Configuration: The Rules List ---
# Format: (Name, Regex Pattern, Replacement String, Regex Flags)
# ORDER MATTERS: Rules are applied from top to bottom.

pdf_cleaning_rules = [
    # 1. Remove Page Numbers (e.g., "Page 1 of 5" or standalone numbers on a line)
    ("Remove Page X of Y", r'Page \d+ of \d+', '', re.IGNORECASE),
    ("Remove Standalone Page Nums", r'^\s*\d+\s*$', '', re.MULTILINE),

    # 2. Remove Specific Headers (e.g., NASA Standard Headers)
    ("Remove NASA Headers", r'^NASA-STD-[\w\.]+', '', re.MULTILINE),

    # 3. Fix Hyphenated Words split across lines (e.g., "equip-\nment")
    # We replace "-\n" with empty string to join the word parts
    ("Fix Hyphenation", r'(\w)-\n(\w)', r'\1\2', 0),

    # 4. Remove CID Artifacts (common in PDF extraction)
    ("Remove CID Codes", r'\(cid:\d+\)', '', 0),

    # 5. Collapse excessive newlines into a standard paragraph break
    # Collapse runs of blank lines into one paragraph break, then join single newlines with a space
    ("Normalize Paragraphs", r'\n{2,}', '\n\n', 0),
    ("Join Broken Lines", r'(?<!\n)\n(?!\n)', ' ', 0), # Join lines that aren't paragraph breaks
]

# --- Execution ---
raw_text_sample = """
NASA-STD-8719.13
Page 1 of 10
1.0 Introduction
The soft-
ware must be safe.
(cid:124)
It shall not fail.
2
"""

cleaned_text = clean_text_pipeline(raw_text_sample, pdf_cleaning_rules, verbose=True)

print("\n--- Final Output ---")
print(cleaned_text)

Why Order Matters (First Principles)

When defining your list, the sequence is critical. Here is the logic for the order used above:

  1. Destruction First (Headers/Footers): You must remove page numbers and headers before you mess with newlines. If you join lines first, a page number like "42" might get merged into the middle of a sentence (e.g., "The system 42 shall fail").

  2. Repair Second (Hyphenation): Fix words broken by newlines (e.g., "environ-\nment") before you convert newlines to spaces, or you will end up with "environ- ment".

  3. Normalization Last (Whitespace): Once the artifacts are gone and words are repaired, you can safely collapse the remaining whitespace into readable paragraphs.

Debugging Your Regex

Since you are dealing with NASA documents, you might run into "greedy" regex problems where a pattern deletes too much.

If you suspect a rule is deleting valuable text, you can temporarily swap the replacement string with a marker to see what is being hit:

# Temporary debug rule
("Remove Headers", r'^NASA-STD-[\w\.]+', '[[DELETED_HEADER]]', re.MULTILINE)

Next Step

Would you like me to wrap this into a Python Class that allows you to load these regex rules from an external JSON or YAML file? This would essentially create a lightweight DSL for your cleaning process, keeping your code separate from your data.


From <https://gemini.google.com/app/>  Google Gemini 3.0 Pro (Thinking)


Example Completed Tax Form: https://extension.usu.edu/ruraltax/files/joe-farmer-1040-2021.pdf
