#4027 Python REGEX Generator

#4027 Once I have used Docling to convert a PDF file to a text file, including tables, how can I use Python to identify useful regex to extract data?

Since you are using Docling, you likely have a relatively clean Markdown or JSON representation of your PDF. Moving from raw text to structured data via Regex (Regular Expressions) can be tedious, but you can use Python to "mine" the text for patterns and then validate them.

Here is a workflow to identify and refine Regex patterns for your data extraction.


1. Pattern Discovery with re.finditer

Instead of guessing patterns, use Python to scan for common anchors (like "Table 1" or specific NASA SWE codes) and look at the context around them.

import re

# Load your Docling output
with open("converted_doc.txt", "r") as f:
    content = f.read()

# Strategy: Find a known anchor and see what the surrounding text looks like
# Example: Looking for requirement IDs like SWE-001 or NPR sections
anchor = r"SWE-\d+" 

for match in re.finditer(anchor, content):
    # Print the match plus 100 characters of trailing context to see the structure
    start, end = match.span()
    print(f"Found: {match.group()} | Context: {content[start:end+100].strip()}")

2. Using Pre-Built Pattern Libraries like CommonRegex

Before writing custom expressions, check if the data (dates, version numbers, emails, or coordinates) can be caught by pre-built libraries.

  • CommonRegex: Great for standard formats (dates, emails, phone numbers); a short usage sketch follows this list.

  • re.findall with Character Classes: If you notice that your tables use a specific delimiter (like | in Markdown), use that as your primary structural regex.
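
For CommonRegex specifically, a minimal sketch might look like this (assuming the commonregex package from PyPI, installed with pip install commonregex; the dates and emails attributes follow that library's documented API):

from commonregex import CommonRegex

sample = "Revision 2.1 released 5/12/2021. Contact: swe-help@example.gov"
parsed = CommonRegex(sample)

print(parsed.dates)   # date strings detected in the sample
print(parsed.emails)  # e-mail addresses detected in the sample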


3. The "Regex Prototyping" Script

To identify the best regex, you want to test multiple variations against a sample of your text and see which one has the highest "hit rate" without false positives.

def test_patterns(patterns, text):
    results = {}
    for name, p in patterns.items():
        matches = re.findall(p, text)
        results[name] = {"count": len(matches), "samples": matches[:3]}
    return results

# Define potential patterns based on your Docling text observation
candidate_patterns = {
    "NPR_Section": r"Section\s\d+\.\d+\.\d+",
    "Requirement_ID": r"\[SWE-\d+[a-z]?\]",
    "Table_Row": r"\|.*\|.*\|.*\|" # Typical Markdown table row
}

analysis = test_patterns(candidate_patterns, content)
for p_name, data in analysis.items():
    print(f"Pattern: {p_name} | Found: {data['count']} | Examples: {data['samples']}")

4. Converting Tables to DataFrames

Since Docling handles tables well, your text file likely contains Markdown tables. Instead of using Regex to parse the entire table, use Regex to find the start and end of the table, then let pandas do the heavy lifting.

import pandas as pd
from io import StringIO

# Regex to find a Markdown table (lines starting and ending with |)
table_pattern = r"(\|.*\|(?:\n\|.*\|)+)"
tables = re.findall(table_pattern, content)

if tables:
    # Clean the first table found and load into a DataFrame
    table_str = tables[0]
    # Remove the '---' separator line if present for cleaner parsing
    clean_table = re.sub(r"\|?[\s-]*:?---:?[\s-]*\|", "", table_str)

    # The leading/trailing pipes produce empty edge columns; dropna removes them
    df = pd.read_table(StringIO(clean_table), sep="|", skipinitialspace=True).dropna(axis=1, how='all')
    df.columns = [str(c).strip() for c in df.columns]  # pipes leave stray whitespace in headers
    print(df.head())

5. Helpful Tips for Your Workflow

  • Use re.VERBOSE: When your regex gets long (especially for NASA compliance docs), use the verbose flag to add comments inside the regex string.

  • Regex101.com: Paste a chunk of your Docling output into regex101.com. It’s the gold standard for interactive testing; set the flavor to Python.

  • Lookarounds: If you need to extract text between two headers (like "Requirement" and "Assessment"), use Positive Lookahead/Lookbehind:

    • (?<=Requirement:).*?(?=Assessment:)
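
A minimal sketch of that lookaround in action (the sample text is invented; re.DOTALL lets the match cross line breaks):

import re

sample = "Requirement: The software shall log faults.\nAssessment: Compliant."
match = re.search(r"(?<=Requirement:).*?(?=Assessment:)", sample, re.DOTALL)
if match:
    print(match.group().strip())  # -> The software shall log faults.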

Would you like me to help you draft a specific Regex pattern for a particular NASA requirement format or table structure you're seeing in your files?


I want the Python code to automatically identify "labels" within the text file that may or may not be followed by a colon (":") and associated useful information. For each of these potential "labels", generate a regex expression for extracting that data on a following pass.

This approach effectively turns your Python script into a regex factory. Since Docling outputs structured text, labels often appear at the start of a line or within table cells, typically followed by a colon or a specific whitespace pattern.

Here is a script that scans your text, identifies potential "Key: Value" pairs, and generates a dictionary of ready-to-use Python regex strings.


Automated Label & Regex Generator

This script uses a "discovery regex" to find potential labels and then builds specific extraction patterns for each one.

import re
from collections import Counter

def generate_regex_map(file_path, min_count=1):
    with open(file_path, "r") as f:
        content = f.read()

    # 1. Discovery Phase: Look for short labels (roughly 2-4 words) at the
    # start of a line, followed by a colon. [\w ] rather than [\w\s] keeps
    # the label from spilling across line breaks; the generated patterns
    # later make the colon optional.
    discovery_pattern = r"^(?P<label>[\w ]{2,30}?)\s*:\s*(?P<value>.*)"
    
    found_labels = []
    # We use MULTILINE so ^ matches the start of every line in the doc
    matches = re.finditer(discovery_pattern, content, re.MULTILINE)
    
    for match in matches:
        label = match.group("label").strip()
        if label:
            found_labels.append(label)

    # 2. Refinement: Deduplicate, and optionally drop rare labels that
    # appear fewer than min_count times (likely false positives)
    label_counts = Counter(found_labels)
    unique_labels = sorted(label for label, count in label_counts.items() if count >= min_count)

    # 3. Regex Generation: Create a mapping of Label -> Specific Regex
    regex_factory = {}
    for label in unique_labels:
        # We escape the label to handle any special characters like ( ) or [ ]
        escaped_label = re.escape(label)
        
        # This generated regex captures everything after the label until the line ends
        # (?i) makes it case-insensitive
        # \s*[:]?\s* handles cases with or without a colon
        generated_pattern = fr"(?i)^{escaped_label}\s*[:]?\s*(.*)"
        
        regex_factory[label] = generated_pattern

    return regex_factory

# Execution
file_name = "your_docling_output.txt"
generated_map = generate_regex_map(file_name)

print(f"--- Identified {len(generated_map)} Potential Labels ---")
for label, pattern in generated_map.items():
    print(f"Label: {label:<20} | Suggested Regex: {pattern}")

How to use the generated patterns

Once you have the generated_map, you can use it in a second pass to extract your data into a structured format like a list of dictionaries or a CSV.

Data Extraction Script

import re
import json

def extract_structured_data(content, regex_map):
    results = {}
    for label, pattern in regex_map.items():
        match = re.search(pattern, content, re.MULTILINE)
        if match:
            results[label] = match.group(1).strip()
    return results

# Example Usage
with open("your_docling_output.txt", "r") as f:
    text_data = f.read()

structured_info = extract_structured_data(text_data, generated_map)
print(json.dumps(structured_info, indent=2))

Enhancing the Discovery Logic

Since you are likely dealing with NASA technical standards, labels might not always be followed by a colon. They might be header cells in a table.

Strategy for Table-based Labels

If Docling converted your tables to Markdown (e.g., | Label | Value |), you should modify the discovery pattern in the script above to:

  • \|(?P<label>.*?)\|(?P<value>.*?)\|
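
Fleshed out, that table-oriented discovery might look like the sketch below (the two-column | Label | Value | layout is an assumption about your Docling output; re and content carry over from the script above):

# Discovery variant for two-column Markdown table rows: | Label | Value |
table_discovery = r"^\|(?P<label>[^|\n]+)\|(?P<value>[^|\n]+)\|"

for match in re.finditer(table_discovery, content, re.MULTILINE):
    label = match.group("label").strip()
    # Skip |---|---| separator rows, which contain only dashes, colons, and spaces
    if label and not set(label) <= set(":- "):
        print(f"Table label: {label!r} | Value: {match.group('value').strip()!r}")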

Dealing with Multi-line Values

Often a requirement description (the value) spans multiple lines. To capture those, change the generated regex to use the dot-all flag (re.DOTALL) together with a lookahead that stops at the next label:

  • Pattern: fr"(?i)^{escaped_label}\s*[:]?\s*(.*?)(?=\n[A-Z\s]{{2,20}}:|\Z)"

    • Explanation: With re.DOTALL, the lazy group captures everything (newlines included) until it sees a line starting with a label-like structure (2-20 uppercase letters or spaces followed by a colon) or the end of the text. The braces in {{2,20}} are doubled so the f-string leaves them in the regex, and \Z is used instead of $, which under re.MULTILINE would cut the capture at the first line break.
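
A self-contained sketch of that multi-line pattern (the DESCRIPTION/RATIONALE sample is invented; re.IGNORECASE is deliberately omitted because it would also make the [A-Z] class in the lookahead match lowercase lines):

import re

sample = (
    "DESCRIPTION: The project shall maintain\n"
    "a software development plan.\n"
    "RATIONALE: Planning reduces rework risk."
)

pattern = r"^DESCRIPTION\s*:?\s*(.*?)(?=\n[A-Z\s]{2,20}:|\Z)"
match = re.search(pattern, sample, re.MULTILINE | re.DOTALL)
if match:
    print(match.group(1).strip())  # captures both lines of the description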

Would you like me to adjust the discovery logic to specifically prioritize NPR 7150.2 formatting, such as finding "SWE" IDs and their corresponding requirement text?


Does extract_structured_data find more than one occurrence of an item?

No, the extract_structured_data function I provided uses re.search(), which stops after finding the first match in the text.

In NASA documentation or complex technical manuals, a label like "Requirement" or "SWE ID" will appear dozens or hundreds of times. To capture every instance, you need to switch from re.search() to re.findall() or re.finditer().

Here is the upgraded logic to handle multiple occurrences and organize them into a list.


1. Upgraded Extraction Script (Multiple Occurrences)

This version returns a list of values for every label it finds, which is essential for parsing tables or repeated requirement blocks.

import re

def extract_all_occurrences(content, regex_map):
    results = {}
    for label, pattern in regex_map.items():
        # Using re.MULTILINE to ensure ^ matches starts of lines
        # findall returns a list of all captured groups
        matches = re.findall(pattern, content, re.MULTILINE)
        
        # Clean up whitespace for each match found
        results[label] = [m.strip() for m in matches if m.strip()]
        
    return results

# Example Usage:
# If "SWE-001" and "SWE-002" both exist, results["SWE"] will be a list of both.

2. Handling "Blocks" of Data

When you have multiple occurrences, the data is often "chunked." For example, you might have:

  • Label: Requirement ID

  • Label: Description

  • Label: Compliance Matrix

If you use the function above, you will get three separate lists. The problem is that if one requirement is missing a description, your lists will be "out of sync" (e.g., Requirement #10 might match with Description #9).

To fix this, you should identify a Record Separator (like a horizontal rule --- or a specific ID pattern) and split the text into chunks first:

def extract_as_records(content, record_delimiter, regex_map):
    # Split the document into individual sections/records
    records = re.split(record_delimiter, content)
    
    structured_records = []
    
    for record in records:
        entry = {}
        for label, pattern in regex_map.items():
            match = re.search(pattern, record, re.MULTILINE)
            if match:
                entry[label] = match.group(1).strip()
        
        if entry: # Only add if we actually found data
            structured_records.append(entry)
            
    return structured_records

# Example: Split by SWE ID patterns
# records = extract_as_records(text_data, r"(?=SWE-\d+)", generated_map)
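
A small self-contained demonstration (the SWE IDs, labels, and sample text are invented for illustration). The zero-width lookahead delimiter splits before each ID without consuming it:

import re

sample = (
    "SWE-033: Acquisition Assessment\n"
    "Rationale: Make-buy decisions affect risk.\n"
    "SWE-034: Acceptance Criteria\n"
    "Rationale: Criteria must be testable.\n"
)

demo_map = {
    "ID": r"^(SWE-\d+):",
    "Rationale": r"^Rationale:\s*(.*)",
}

records = extract_as_records(sample, r"(?=\nSWE-\d+)", demo_map)
# -> [{'ID': 'SWE-033', 'Rationale': 'Make-buy decisions affect risk.'},
#     {'ID': 'SWE-034', 'Rationale': 'Criteria must be testable.'}]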

3. Comparison of Methods

| Method     | Best For...                                        | Returns                           |
|------------|----------------------------------------------------|-----------------------------------|
| re.search  | Metadata (Title, Date, Document ID)                | A single string.                  |
| re.findall | Independent lists (all emails, all part numbers)   | A list of strings.                |
| Chunking   | Requirements/Tables (keeping ID and Text together) | A list of dictionaries (Records). |

4. Pro-Tip: The "Non-Greedy" Trap

When extracting multiple occurrences of labels followed by colons, make sure your generated regex uses the non-greedy quantifier .*? together with a terminating boundary.

If your regex is Label: (.*) and there are two "Label:" entries on the same line, the greedy .* swallows the second label. A bare Label: (.*?) is not enough on its own, though: a lazy group with nothing after it happily matches the empty string. Pair it with a boundary so it knows where to stop, as shown below.
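
A quick illustration of the difference:

import re

line = "Label: alpha Label: beta"

print(re.findall(r"Label: (.*)", line))                   # ['alpha Label: beta'] - greedy swallows the second entry
print(re.findall(r"Label: (.*?)(?=\s*Label:|$)", line))   # ['alpha', 'beta'] - lazy plus boundary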

Would you like me to show you how to define a Record Separator specifically for the NPR 7150.2 format so your extracted data stays grouped correctly?


From <https://gemini.google.com/app/>  Google Gemini 3.0 Pro (Thinking)
