#4187 Scripted Document Review

#4187 Consider the following application design: """

INPUTS

#   Document to be reviewed

#   Script.txt: Text file with questions, requirements, regex, prompts

# OUTPUTS

#   Report file


Optional: For all PDF documents in directory:

  For each document:

    Read Script.txt

    For each text_row in Script:

      Call LLM with sys_instruction

      CALL LLM with document and text_row

      Write text_row and response to report file


Sys_Instructions:

You are a [...] specialist. Your role is to assess a file that the user provides against specified criteria, either requirements, questions, regex, or a prompt.


#======================================================================================


Example Prompts:


Requirement

  Analyze the provided document to determine whether the specified [REQUIREMENT] is satisfied, and explain your answer in a paragraph or less. """{requirement}"""


Question

  Answer the specified _QUESTION_ about the provided document, and explain your answer in a paragraph or less. """{question}"""


Regex

  Apply the specified _REGEX_ to the provided document to extract the content defined by the regex statement. """{regex}"""


Scripted Prompt

  """{prompt}"""


#======================================================================================


Script.txt


[SYS_INSTRUCTIONS]: .....................


[SCRIPT_PROMPT]: ........................


[REQUIREMENT]: ..........................


[REQUIREMENT]: ..........................



[QUESTION]: .............................



[REGEX]: ................................



[USER PROMPT]: ..........................

"""

I have reviewed your application design. It outlines a solid, highly flexible architecture for an automated, batch-processing document review system. By decoupling the evaluation criteria (Script.txt) from the execution logic, you make the system easily adaptable for different types of documents and review standards without needing to change the core code.

However, to make this design more efficient, cost-effective, and robust for a production environment, there are a few architectural optimizations you should consider.

Here is an assessment of your design, along with recommendations for improvement:

1. The Regex Anti-Pattern

  • Current Design: Passing a Regex string to the LLM and asking the LLM to apply it to the document.

  • Recommendation: Do not use an LLM for Regular Expressions. LLMs are probabilistic, whereas Regex is strictly deterministic. Asking an LLM to process Regex is significantly slower, more expensive, and prone to hallucination.

  • Fix: Handle [REGEX] tags directly in your application code (e.g., using Python's re module) against the extracted document text, bypassing the LLM entirely for those specific script rows.
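The fix above can be sketched with Python's standard `re` module; the pattern and sample text here are illustrative, and `run_regex_row` is a hypothetical helper name:

```python
import re

def run_regex_row(pattern: str, document_text: str) -> list:
    """Apply one [REGEX] script row locally, with no LLM call."""
    try:
        return re.findall(pattern, document_text)
    except re.error as exc:
        # Surface a malformed pattern in the report instead of crashing
        return [f"REGEX ERROR: {exc}"]

# Example: extract email addresses deterministically
matches = run_regex_row(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
                        "Contact alice@example.com or bob@example.org.")
```

Because this runs locally, the result is exact and repeatable, and those script rows cost zero tokens.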

2. Token Optimization and API Calls

  • Current Design: The loop calls the LLM with the entire document for every single text_row in the script.

  • Recommendation: If a document is 50 pages and your script has 20 requirements/questions, you are sending the same 50 pages to the LLM 20 separate times. This will result in massive token costs and slow execution times.

  • Fix: Consolidate the prompts. Send the document to the LLM once per file, along with a batched list of all the [REQUIREMENT] and [QUESTION] items formatted as a single structured prompt (e.g., asking the LLM to return a JSON object containing the answers to all 20 questions at once).
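A minimal sketch of that consolidation (the prompt wording, task strings, and `build_batched_prompt` name are illustrative, not a fixed API):

```python
def build_batched_prompt(document_text: str, llm_tasks: list) -> str:
    """Combine every [REQUIREMENT]/[QUESTION] row into one numbered prompt."""
    numbered = "\n".join(f"{i}. {task}" for i, task in enumerate(llm_tasks, start=1))
    return (
        "Review the document below and address every numbered task.\n"
        'Return a JSON object whose keys are the task numbers ("1", "2", ...).\n\n'
        f"DOCUMENT:\n---\n{document_text}\n---\n\nTASKS:\n{numbered}"
    )

tasks = [
    "[REQUIREMENT]: The document must name a responsible data owner.",
    "[QUESTION]: What is the main objective of this document?",
]
prompt = build_batched_prompt("(document text here)", tasks)
```

The document text now appears once per file instead of once per script row, which is where the token savings come from.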

3. Loop Efficiency

  • Current Design: Read Script.txt happens inside the "For each document" loop.

  • Recommendation: Disk I/O should be minimized.

  • Fix: Read, parse, and structure the contents of Script.txt into a list or dictionary of tasks before initiating the document loop.
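As a rough sketch, the script could be parsed a single time into a task dictionary before the document loop begins (the tag handling and `load_script_tasks` name are illustrative):

```python
import os
import tempfile

def load_script_tasks(path: str) -> dict:
    """Read and parse Script.txt once, before the document loop."""
    tasks = {"REGEX": [], "LLM": []}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line.startswith("[REGEX]:"):
                tasks["REGEX"].append(line.split(":", 1)[1].strip())
            elif line.startswith(("[REQUIREMENT]:", "[QUESTION]:", "[SCRIPT_PROMPT]:")):
                tasks["LLM"].append(line)
    return tasks

# Demonstrate with a throwaway script file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("[REGEX]: \\d{4}-\\d{2}-\\d{2}\n[QUESTION]: What is the main objective?\n")
script_tasks = load_script_tasks(tmp.name)
os.unlink(tmp.name)
```

The resulting dictionary is built once and reused for every PDF, so the disk read never repeats inside the loop.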

4. PDF Handling and Context Windows

  • Current Design: Passing the "document" to the LLM.

  • Recommendation: Be specific about how the PDF is handled. Standard LLMs require plain text, meaning you need a PDF parsing library (like pdfplumber or PyMuPDF) to extract text first. If the PDF contains crucial images, tables, or exceeds the token limit, you will need a multimodal LLM (like Gemini 1.5 Pro) or a Retrieval-Augmented Generation (RAG) chunking strategy.
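For documents that exceed the context window, a simple character-based chunker (a rough stand-in for real token counting; `chunk_text` and its limits are illustrative assumptions) could look like:

```python
def chunk_text(text: str, max_chars: int = 15000, overlap: int = 500) -> list:
    """Split extracted PDF text into overlapping chunks that fit a model's
    context window. Character counts are only a rough proxy for tokens."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Overlap so content split at a chunk boundary still appears
        # whole in at least one chunk
        start = end - overlap
    return chunks
```

Each chunk would then be sent in its own batched query, with the answers merged afterward; a full RAG pipeline would add retrieval on top of this.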


Suggested Refined Pseudocode

# INPUTS
#   Directory of PDFs
#   Script.txt

# INITIALIZATION
Parse Script.txt ONCE into a task list (separating LLM tasks from Regex tasks)

# EXECUTION
For each PDF in directory:
    Extract text from PDF
    
    # Deterministic tasks (Fast, Free)
    For each regex_task in regex_tasks:
        Execute local Regex search on PDF text
        Append result to Report data
        
    # Probabilistic tasks (Slower, Costly)
    Batch all questions/requirements/prompts into ONE single prompt instruction
    Call LLM with sys_instruction + combined_prompt + PDF text
    (Instruct LLM to output structured JSON mapping to the questions)
    
    Append LLM JSON responses to Report data

# OUTPUT
Write Report data to CSV/JSON

Would you like me to write a working Python prototype based on this optimized architecture, or would you prefer to refine the specific prompt templates inside Script.txt first?


Yes, support AskSage.ai

Here is a working Python prototype tailored for the Ask Sage platform.

This script uses the official asksageclient to interface with the API, PyMuPDF (fitz) for fast, local PDF text extraction, and standard Python regular expressions to keep the deterministic extractions completely local and free of token costs.

Prerequisites

First, you will need to install the required libraries:

pip install asksageclient pymupdf

The Python Prototype

Save this as reviewer.py. You will need to set your Ask Sage credentials as environment variables (ASKSAGE_EMAIL and ASKSAGE_API_KEY) to keep them secure.

import os
import re
import json
import fitz  # PyMuPDF
from asksageclient import AskSageClient

# --- Configuration ---
PDF_DIRECTORY = "./documents"
SCRIPT_FILE = "Script.txt"
REPORT_FILE = "Review_Report.json"

# Initialize Ask Sage Client
ASKSAGE_EMAIL = os.getenv("ASKSAGE_EMAIL", "your_email@example.com")
ASKSAGE_API_KEY = os.getenv("ASKSAGE_API_KEY", "your_api_key")

try:
    client = AskSageClient(ASKSAGE_EMAIL, ASKSAGE_API_KEY)
except Exception as e:
    print(f"Error initializing Ask Sage Client: {e}")
    exit(1)


def parse_script(filepath):
    """
    Parses Script.txt to separate deterministic Regex tasks from probabilistic LLM tasks.
    """
    sys_instruction = ""
    regex_tasks = []
    llm_tasks = []

    with open(filepath, 'r', encoding='utf-8') as f:
        content = f.read()

    # Simple parsing logic (can be made more robust depending on your exact formatting)
    lines = content.split('\n')
    current_section = None

    for line in lines:
        if line.startswith("[SYS_INSTRUCTIONS]:"):
            sys_instruction = line.replace("[SYS_INSTRUCTIONS]:", "").strip()
        elif line.startswith("[REGEX]:"):
            regex_tasks.append(line.replace("[REGEX]:", "").strip())
        elif line.startswith("[REQUIREMENT]:") or line.startswith("[QUESTION]:") or line.startswith("[SCRIPT_PROMPT]:"):
            llm_tasks.append(line.strip())

    return sys_instruction, regex_tasks, llm_tasks


def extract_text_from_pdf(filepath):
    """
    Extracts plain text from a PDF document using PyMuPDF.
    """
    text = ""
    try:
        with fitz.open(filepath) as doc:
            for page in doc:
                text += page.get_text() + "\n"
    except Exception as e:
        print(f"Error reading {filepath}: {e}")
    return text


def process_documents():
    """
    Main loop to process all PDFs in the directory against the script tasks.
    """
    print("Parsing Script...")
    sys_instruction, regex_tasks, llm_tasks = parse_script(SCRIPT_FILE)
    
    final_report = {}

    # Ensure the directory exists
    if not os.path.exists(PDF_DIRECTORY):
        print(f"Directory not found: {PDF_DIRECTORY}")
        return

    # Process each PDF
    for filename in os.listdir(PDF_DIRECTORY):
        if not filename.lower().endswith('.pdf'):
            continue

        filepath = os.path.join(PDF_DIRECTORY, filename)
        print(f"\nProcessing: {filename}")
        
        doc_text = extract_text_from_pdf(filepath)
        if not doc_text.strip():
            print("  Skipping: No text extracted.")
            continue

        file_results = {"Regex_Matches": [], "LLM_Analysis": {}}

        # 1. Execute Deterministic Tasks (Regex) locally
        print("  Running Regex tasks...")
        for pattern in regex_tasks:
            try:
                # Run the pattern locally; findall returns every match
                matches = re.findall(pattern, doc_text)
                file_results["Regex_Matches"].append({"Pattern": pattern, "Matches": matches})
            except re.error as e:
                print(f"  Regex Error for pattern {pattern}: {e}")

        # 2. Execute Probabilistic Tasks (Ask Sage LLM) via batching
        if llm_tasks:
            print("  Querying Ask Sage...")
            
            # Format the batched prompt
            batched_questions = "\n".join([f"{i+1}. {task}" for i, task in enumerate(llm_tasks)])
            
            # Truncate the document text to avoid token overload; adjust the
            # limit (or chunk the text) based on the Ask Sage model's context window.
            prompt_payload = f"""
{sys_instruction}

Please review the following document text and address the listed requirements/questions.
Return your answers in a structured JSON format where the keys are the question numbers ("1", "2", etc.).

DOCUMENT TEXT:
---
{doc_text[:15000]}
---

TASKS TO EVALUATE:
{batched_questions}
"""
            # Send the batched query to Ask Sage
            response = client.query(message=prompt_payload)
            
            if response:
                # Store the raw response (you can add JSON parsing logic here if you strictly enforce JSON output)
                file_results["LLM_Analysis"] = response.get('message', str(response))
            else:
                file_results["LLM_Analysis"] = "Error: No response from Ask Sage."

        final_report[filename] = file_results

    # Write output to JSON
    with open(REPORT_FILE, 'w', encoding='utf-8') as f:
        json.dump(final_report, f, indent=4)
        
    print(f"\nProcessing complete. Report saved to {REPORT_FILE}")


if __name__ == "__main__":
    # Create dummy files for testing if they don't exist
    if not os.path.exists(SCRIPT_FILE):
        with open(SCRIPT_FILE, 'w') as f:
            f.write("[SYS_INSTRUCTIONS]: You are a compliance specialist.\n")
            f.write("[REGEX]: \\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b\n") # Email extraction
            f.write("[QUESTION]: What is the main objective of this document?\n")
            
    if not os.path.exists(PDF_DIRECTORY):
        os.makedirs(PDF_DIRECTORY)
        print(f"Created '{PDF_DIRECTORY}' folder. Please place some PDFs inside and run again.")
    else:
        process_documents()

Key Optimizations in this Script

  1. Local Text Extraction: PyMuPDF is exceptionally fast. It grabs the text locally before any API calls are made.

  2. Definitive Regex Extraction: As you've noted in the past, a regex approach is vastly superior for definitively catching and extracting structured elements (like identifiers, SWEs, emails, etc.). By keeping regex in the re module rather than the LLM, you guarantee zero hallucinations and instantaneous execution.

  3. Ask Sage Query Concatenation: Because Ask Sage's client.query() endpoint accepts a primary message string, we concatenate the system instructions, the truncated document text, and a numbered list of all requirements into a single payload.

  4. JSON Structuring: We instruct the LLM to return its answers mapped to the task numbers. This makes the resulting Review_Report.json highly structured and machine-readable for downstream applications.
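Since models often wrap JSON in Markdown code fences or surrounding prose, a small best-effort parser (a sketch, not part of the Ask Sage client; `parse_llm_json` is a name introduced here) can recover the structured answers from the raw reply before writing the report:

```python
import json
import re

def parse_llm_json(raw: str):
    """Best-effort extraction of a JSON object from an LLM reply,
    tolerating Markdown code fences or prose around it. Returns None on failure."""
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", raw, re.DOTALL)
    if fenced:
        candidate = fenced.group(1)
    else:
        # Fall back to the outermost braces in the reply
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end == -1:
            return None
        candidate = raw[start:end + 1]
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None
```

Plugging this into `process_documents` would let `file_results["LLM_Analysis"]` hold a real dictionary keyed by task number instead of a raw string.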

Would you like to adjust how the script handles long PDFs (e.g., chunking the text to fit specific Ask Sage model limits), or is this architecture ready for your testing?


From <https://gemini.google.com/app/891f8fa939376d16>  Google Gemini 3.1 Pro
