#3996 Optimized Prompt for Parsing a PDF for Known Data Fields -> Regex

#3996 Provide an optimized prompt for the following: """User Prompt: Read the uploaded PDF file to find the following data fields, as described, and provide the regex string to parse the file to collect the associated data:

Data_Fields = [
  { 'name':'field1', 'description':'field description 1', 'regex':'' },
  { 'name':'field2', 'description':'field description 2', 'regex':'' },
  { 'name':'field3', 'description':'field description 3', 'regex':'' },
  { 'name':'field4', 'description':'field description 4', 'regex':'' },
  { 'name':'field5', 'description':'field description 5', 'regex':'' }
]

"""

This is a classic data extraction task. Since you have a deep background in software engineering and process engineering, you know that the "garbage in, garbage out" rule applies heavily to LLM-generated regex.

To get the most accurate results—especially for complex NASA-style documentation or technical specs—the prompt needs to instruct the AI to first analyze the document's layout and then verify the regex against the actual text it sees.

Here is an optimized version of your prompt:


Optimized Prompt

Role: Act as an expert Data Engineer and Regex Specialist.

Task: Analyze the attached PDF to extract specific data patterns. You will identify the location and format of the requested fields and provide a robust Regular Expression (PCRE) for each.

Process:

  1. Analyze Text Layout: Examine how the data is structured (e.g., key-value pairs, tables, or nested within paragraphs).

  2. Identify Samples: Find three distinct examples of each field within the document to ensure the regex accounts for variations in whitespace or character length.

  3. Generate Regex: Create a regex string that is specific enough to avoid false positives but flexible enough to handle standard PDF encoding quirks (like inconsistent spacing).
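The sampling step above can be sketched in Python: draft a candidate pattern, then accept it only if it matches several distinct occurrences pulled from the extracted text. The "Hazard ID" anchor and sample snippets below are illustrative, not taken from any particular document.

```python
import re

# Hypothetical snippets sampled from three places in the extracted PDF text.
samples = [
    "Hazard ID: HR-001",
    "Hazard ID:   HR-042",   # extra spacing, a common PDF quirk
    "Hazard ID:\xa0HR-117",  # non-breaking space
]

# Candidate regex: flexible whitespace, anchor text kept outside the capture.
candidate = re.compile(r"Hazard ID:\s*(HR-\d{3})")

# The regex is only accepted if it matches every sampled variant.
matches = [candidate.search(s) for s in samples]
assert all(m is not None for m in matches)
print([m.group(1) for m in matches])  # ['HR-001', 'HR-042', 'HR-117']
```

Note that in Python 3, `\s` in a `str` pattern already matches the non-breaking space, which is why the third variant passes.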

Output Format: Return the data in the following JSON format. Ensure the regex field contains a valid, properly escaped regex string:

[
  { "name": "field1", "description": "field description 1", "regex": "insert_regex_here" },
  { "name": "field2", "description": "field description 2", "regex": "insert_regex_here" },
  { "name": "field3", "description": "field description 3", "regex": "insert_regex_here" },
  { "name": "field4", "description": "field description 4", "regex": "insert_regex_here" },
  { "name": "field5", "description": "field description 5", "regex": "insert_regex_here" }
]

Data Fields to Map:

  • Field 1: [Insert Name] - [Insert Description]

  • Field 2: [Insert Name] - [Insert Description]

  • Field 3: [Insert Name] - [Insert Description]

  • Field 4: [Insert Name] - [Insert Description]

  • Field 5: [Insert Name] - [Insert Description]


Why this works better:

  • Contextual Awareness: By asking the AI to "Analyze Text Layout" first, you prevent it from giving "generic" regex. It will look for specific anchors (like colons, specific headers, or coordinate-based positioning) present in your PDF.

  • Validation Step: Forcing it to find "three distinct examples" mimics the unit testing process you’re familiar with, reducing the likelihood of a regex that only works for the first instance it finds.

  • Structured Output: It returns a clean JSON array that you can immediately pipe into a Python script or a Knowledge Graph ingestion tool.
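As a minimal sketch of that piping step: the JSON array can be loaded and each pattern compiled and applied in a few lines. The response string and document text below are placeholders standing in for real model output and extracted PDF text.

```python
import json
import re

# Placeholder model output in the requested JSON format.
response = r'''[
  { "name": "field1", "description": "hazard id", "regex": "ID:\\s*(HR-\\d+)" }
]'''

document_text = "ID: HR-001 ... ID:  HR-042"

extracted = {}
for field in json.loads(response):
    pattern = re.compile(field["regex"])
    extracted[field["name"]] = pattern.findall(document_text)

print(extracted)  # {'field1': ['HR-001', 'HR-042']}
```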

A Note on PDF "Noise"

When parsing PDFs, invisible characters or ligatures (like "fi" becoming a single character) can break regex. If this is for a high-stakes NASA SMA project, would you like me to add logic to the prompt that specifically handles non-breaking spaces and Unicode variations common in technical PDFs?
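The ligature problem is easy to reproduce, and Python's standard library can often repair it before any regex runs: NFKC normalization folds compatibility characters such as the "fi" ligature (U+FB01) and the non-breaking space (U+00A0) back into plain ASCII.

```python
import unicodedata

raw = "Speci\ufb01cation:\xa0approved"   # "fi" ligature + non-breaking space
clean = unicodedata.normalize("NFKC", raw)

print(clean)
assert "Specification" in clean
assert "\xa0" not in clean
```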


Yes

Since these PDFs likely involve technical specifications, safety standards, or mission-critical data, the regex needs to be "battle-hardened."

To handle the "noise" (like varying whitespace, ligatures, or OCR artifacts), I have added specific instructions to use non-capturing groups, flexible whitespace meta-characters, and boundary anchors.

The "Battle-Hardened" Optimized Prompt

Role: Act as a Senior Software Engineer and Regex Specialist with expertise in technical document parsing.

Context: You are parsing a technical PDF that may contain inconsistent spacing, non-breaking spaces (\xA0), or standard OCR ligatures. Accuracy is mission-critical.

Task: Read the uploaded PDF and generate robust Regular Expressions (PCRE) for the fields listed below.

Requirements for the Regex:

  1. Whitespace Resilience: Use \s+ or [ \t\xA0]+ instead of literal spaces to account for PDF encoding quirks.

  2. Anchor Identification: Look for consistent "anchor" text (e.g., "ID:", "Status:", "Hazard Level") and include them in the regex as non-capturing groups (?:...) to increase precision.

  3. Boundary Protection: Use word boundaries \b to prevent partial matches.

  4. Validation: For each field, provide a brief "match explanation" explaining why that specific pattern was chosen based on the document's structure.
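Requirements 1–3 above combine into patterns like the following illustrative one; the "Hazard Level" anchor and its value set are assumptions for demonstration, not drawn from a real document.

```python
import re

# Non-capturing anchor, whitespace-resilient separators, word-boundary guard.
pattern = re.compile(r"(?:Hazard\s+Level)\s*:\s*\b(CRITICAL|HIGH|MODERATE|LOW)\b")

variants = [
    "Hazard Level: HIGH",
    "Hazard  Level :\xa0HIGH",   # doubled space, stray space, non-breaking space
]
for text in variants:
    m = pattern.search(text)
    assert m and m.group(1) == "HIGH"
```

The anchor text sits in a non-capturing group `(?:...)` so only the value is captured, and `\s` absorbs the encoding quirks the Context paragraph warns about.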

Output Format: Return a JSON array.

[
  { 
    "name": "field1", 
    "description": "description", 
    "regex": "string",
    "explanation": "why this pattern fits the document layout"
  }
]

Fields to Extract:

  • Field 1: [Insert Name] - [Insert Description]

  • Field 2: [Insert Name] - [Insert Description]

  • Field 3: [Insert Name] - [Insert Description]

  • Field 4: [Insert Name] - [Insert Description]

  • Field 5: [Insert Name] - [Insert Description]


Pro-Tip for your Workflow:

When you run this, if the PDF has a complex table structure, the AI might struggle with "reading order" (the way PDFs stream text can jump between columns).

If you find the regex is failing because the data is in a multi-column format, would you like me to provide a Python snippet using pdfplumber or PyMuPDF that you can use to pre-process the text into a clean string before applying these regexes?
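As a hedged sketch of that pre-processing step, assuming pdfplumber is installed: crop each page into its columns so the text stream follows reading order, then collapse whitespace before applying the regexes. The column-split x-coordinate is something you would determine per document; the function names here are illustrative.

```python
import re

def normalize(text):
    """Collapse PDF whitespace quirks before applying the regexes."""
    return re.sub(r"[ \t\xa0]+", " ", text)

def extract_pdf_text(path, column_split=None):
    """Extract text page by page; crop into two columns if a split x is given."""
    import pdfplumber  # deferred import so the sketch loads without the dependency
    chunks = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            if column_split is None:
                chunks.append(page.extract_text() or "")
            else:
                # Read the left column fully before the right, preserving order.
                left = page.crop((0, 0, column_split, page.height))
                right = page.crop((column_split, 0, page.width, page.height))
                chunks.append((left.extract_text() or "") + "\n"
                              + (right.extract_text() or ""))
    return normalize("\n".join(chunks))
```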


From <https://gemini.google.com/app/>  Google Gemini 3.0 Pro (Thinking)
