#4021 LangExtract

LangExtract is a Python library that uses LLMs to extract structured information from unstructured text documents.

https://github.com/google/langextract

#4021 Tell me about LangExtract

LangExtract is a sophisticated data extraction framework designed to bridge the gap between unstructured text and structured, machine-readable data. While several tools exist in the "ETL for LLMs" space, LangExtract focuses specifically on precision and schema adherence using Large Language Models (LLMs).

It is primarily used to pull specific entities, relationships, or attributes out of messy documents (like PDFs, emails, or logs) and format them into JSON, Pydantic objects, or database entries.

### Key Capabilities

LangExtract is built on the philosophy that extraction should be predictable and verifiable. Its core features include:

  • Schema-Driven Extraction: You define a target structure (e.g., a Pydantic model or a JSON schema), and the tool ensures the LLM output matches that structure exactly.

  • Multi-Engine Support: It is typically designed to work across various model providers, including OpenAI, Anthropic, and local models via Ollama or vLLM.

  • Context Chunking: To handle large documents that exceed an LLM's context window, it breaks down text into manageable segments while maintaining the integrity of the data being extracted.

  • Validation & Cleaning: It includes built-in hooks to validate extracted data (e.g., checking if a date is valid or if a value falls within a specific range) before it hits your production database.


### How It Differs from Standard Prompting

If you simply ask an LLM, "Give me the names in this text," you might get a conversational response or a bulleted list. LangExtract formalizes this process:

| Feature | Basic Prompting | LangExtract / Structured Extraction |
| --- | --- | --- |
| Output Format | Unpredictable (Text/Markdown) | Strictly Structured (JSON/Objects) |
| Error Handling | Manual retry | Automated validation and re-prompting |
| Scalability | Hard to automate for 1,000s of files | Designed for batch processing pipelines |
| Precision | High chance of "hallucinated" fields | Enforces schema constraints |

### Common Use Cases

  • Invoice Processing: Extracting line items, tax IDs, and totals from diverse PDF layouts.

  • Medical Records: Pulling symptoms, dosages, and dates from clinical notes.

  • Legal Analysis: Identifying parties, effective dates, and termination clauses in contracts.

  • Knowledge Graph Construction: Identifying "Subject-Predicate-Object" triples to populate a graph database.

### Getting Started (Conceptual Example)

In a typical Python workflow, using a tool like LangExtract (or similar libraries such as Instructor or LangChain's extraction chains) looks like this (a concrete langextract sketch follows the steps):

  1. Define your Model: Create a class representing the data you want (e.g., a Company class with name and ticker).

  2. Pass the Text: Feed the unstructured document into the extractor.

  3. Receive the Object: The tool returns a populated Python object, not just a string of text.
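
For the actual google/langextract library, the flow is example-driven rather than class-driven: you supply a prompt description plus a few worked examples, and the library grounds each extraction to exact spans of the source text. A minimal sketch, based on the project's README (exact parameter names may shift between releases):

```python
import langextract as lx

# A short task description plus one worked example "teach" the schema
prompt = "Extract company names and their stock tickers, using exact text spans."

examples = [
    lx.data.ExampleData(
        text="Shares of Alphabet (GOOGL) rose sharply on Tuesday.",
        extractions=[
            lx.data.Extraction(
                extraction_class="company",
                extraction_text="Alphabet",
                attributes={"ticker": "GOOGL"},
            )
        ],
    )
]

result = lx.extract(
    text_or_documents="Apple (AAPL) and Microsoft (MSFT) closed higher today.",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",  # local models via Ollama are also supported
)

for extraction in result.extractions:
    print(extraction.extraction_class, extraction.extraction_text, extraction.attributes)
```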

Note: If you are looking for a specific GitHub repository or a niche proprietary tool named "LangExtract," be aware that the ecosystem moves fast. Many developers are currently transitioning from standalone extraction libraries to integrated features within frameworks like LangChain, LlamaIndex, or the Instructor library.

Would you like me to provide a Python code snippet showing how to set up a basic extraction schema using one of these frameworks?


Yes. Extract a software hazard report title, hazard causes, hazard controls for each cause, and verifications for each hazard control

To perform this extraction effectively, we use a nested schema. This ensures that each hazard cause is linked to its specific controls, and each control is linked to its specific verifications, maintaining the parent-child relationships required for safety-critical documentation.

Below is an implementation using the Instructor library and Pydantic, a widely used approach for forcing LLMs to follow a strict hierarchy.


1. Define the Nested Schema

In this structure, we define the "bottom" level first (Verifications) and work our way up to the Report Title.

```python
from typing import List
from pydantic import BaseModel, Field
import instructor
from openai import OpenAI

# 1. The lowest level: How do we prove the control works?
class Verification(BaseModel):
    title: str = Field(..., description="The ID or title of the verification activity (e.g., Unit Test, Inspection)")
    method: str = Field(..., description="Details on how the control is verified.")

# 2. The control level: What prevents the cause?
class HazardControl(BaseModel):
    title: str = Field(..., description="Name of the control or mitigation strategy.")
    description: str
    verifications: List[Verification] = Field(..., description="List of verifications for this specific control.")

# 3. The cause level: Why does the hazard happen?
class HazardCause(BaseModel):
    title: str = Field(..., description="The specific root cause or contributing factor.")
    description: str
    controls: List[HazardControl] = Field(..., description="Controls mapped directly to this cause.")

# 4. The top level: The Report itself
class HazardReport(BaseModel):
    report_title: str = Field(..., description="The official title of the software hazard report.")
    causes: List[HazardCause]
```

2. Execute the Extraction

This script sends your unstructured text (like a PDF export or a meeting transcript) to the LLM and returns a fully populated Python object.

```python
# Wrap the OpenAI client so responses are parsed into validated Pydantic objects
client = instructor.from_openai(OpenAI())

unstructured_text = """
Hazard Report: Autonomous Flight Software - Unexpected Engine Shutdown.
One primary cause of this hazard is 'Memory Leak in Guidance Module' where the heap 
is exhausted over long durations. To control this, we implement 'Static Memory Allocation' 
for all guidance loops. This control is verified through 'Static Code Analysis (Polyspace)' 
and 'Stress Testing for 48 hours'. 
Another cause is 'Sensor Data Corruption'...
"""

# Extract the structured data
report = client.chat.completions.create(
    model="gpt-4o",
    response_model=HazardReport,
    messages=[
        {"role": "system", "content": "You are a Safety and Mission Assurance specialist."},
        {"role": "user", "content": f"Extract the hazard details from this text: {unstructured_text}"}
    ]
)

# You can now access the data via dot notation
print(f"Report: {report.report_title}")
for cause in report.causes:
    print(f" - Cause: {cause.title}")
    for control in cause.controls:
        print(f"   - Control: {control.title}")
        for v in control.verifications:
            print(f"     - Verification: {v.title}")

Why this is effective for NASA-style S&MA:

  • Traceability: Because the schema is nested, the LLM cannot "float" a verification; it must attach it to a specific control, preserving the Safety Traceability Matrix.

  • Validation: You can add Pydantic validators to ensure that every HazardControl has at least one Verification. If one is missing, the extraction can automatically re-prompt the LLM or flag the omission as a safety gap (see the sketch after this list).

  • Downstream Integration: This Python object can be instantly converted to a JSON object, making it ready for injection into a Knowledge Graph (like Neo4j) or a requirements management tool.
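
To make the validation point concrete, here is a minimal sketch of a Pydantic v2 validator that rejects unverified controls; with Instructor, a ValidationError feeds back into the LLM automatically (up to max_retries). The validator name is illustrative:

```python
from typing import List
from pydantic import BaseModel, field_validator

class HazardControl(BaseModel):
    title: str
    description: str
    verifications: List[Verification]  # Verification model from step 1 above

    @field_validator("verifications")
    @classmethod
    def require_at_least_one_verification(cls, value):
        # An unverified control is a traceability gap: reject and re-prompt
        if not value:
            raise ValueError("Every HazardControl needs at least one Verification")
        return value

# Instructor retries extraction when validation fails:
# report = client.chat.completions.create(
#     model="gpt-4o", response_model=HazardReport,
#     messages=[...], max_retries=2,
# )
```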

Would you like me to show you how to transform this extracted object into a format compatible with a Knowledge Graph (nodes and edges)?


Provide a python app using Docling to extract markdown text from a PDF file, including tables.

Docling (by IBM Research) is currently one of the most powerful tools for PDF-to-Markdown conversion because it uses specialized AI models to understand document layout and reconstruct tables accurately, rather than just stripping text.

Prerequisites

You will need to install the docling library. Note that Docling is a heavy-duty library that downloads AI models (for layout and table analysis) upon its first run.

```bash
pip install docling
```

The Python Application

This script takes a PDF path as input and saves the resulting Markdown (including formatted tables) to a file.

```python
from pathlib import Path
from docling.document_converter import DocumentConverter

def convert_pdf_to_markdown(source_pdf_path: str, output_dir: str = "output"):
    """
    Converts a PDF to Markdown including tables using Docling.
    """
    # 1. Initialize the DocumentConverter
    # This loads the AI models for layout and table recognition
    converter = DocumentConverter()

    print(f"Converting {source_pdf_path}...")

    try:
        # 2. Perform the conversion
        result = converter.convert(source_pdf_path)

        # 3. Export the document to Markdown format
        # Docling handles tables automatically by converting them to Markdown tables
        markdown_content = result.document.export_to_markdown()

        # 4. Save the output
        pdf_path = Path(source_pdf_path)
        output_folder = Path(output_dir)
        output_folder.mkdir(parents=True, exist_ok=True)
        
        output_file = output_folder / f"{pdf_path.stem}.md"
        
        with open(output_file, "w", encoding="utf-8") as f:
            f.write(markdown_content)

        print(f"Successfully converted! Markdown saved to: {output_file}")
        return markdown_content

    except Exception as e:
        print(f"An error occurred: {e}")
        return None

if __name__ == "__main__":
    # Change 'your_document.pdf' to your actual file path
    input_pdf = "your_document.pdf" 
    
    if Path(input_pdf).exists():
        convert_pdf_to_markdown(input_pdf)
    else:
        print(f"File {input_pdf} not found.")

Key Features of this approach:

  1. High-Fidelity Tables: Unlike standard PDF libraries (like PyPDF2), Docling identifies table boundaries and converts them into standard GitHub-flavored Markdown tables (| Col 1 | Col 2 |).

  2. Structural Intelligence: It identifies headers, footers, and page numbers and can be configured to ignore them, ensuring the Markdown is "clean" for LLM consumption.

  3. OCR Support: Docling has built-in support for OCR (Optical Character Recognition) if your PDF is a scanned image rather than a digital document.

  4. No Manual Parsing: You don't need to write logic to find tables; the DocumentConverter handles the computer vision aspect of identifying rows and columns automatically.

Advanced Configuration (Optional)

If you want to fine-tune how tables are handled (e.g., enabling OCR for scanned documents), you can modify the converter initialization:

```python
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions

# Enable OCR if the PDF contains images of text
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True  # Ensure table analysis is active

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
```

When to use this:

  • RAG Pipelines: When you need to feed PDF data into a Large Language Model and tables are critical to the context.

  • Documentation Migration: When converting legacy PDF manuals into a modern Markdown-based wiki or Knowledge Base.

  • Safety Reports: Since you mentioned Hazard Reports earlier, this is the ideal way to ensure that "Cause/Control" tables in those reports stay aligned and readable (a glue sketch connecting Docling to the earlier extraction step follows).
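
To connect this converter to the extraction step from the earlier answer, a hypothetical glue function (it reuses convert_pdf_to_markdown from above plus the HazardReport schema and Instructor client defined previously):

```python
def pdf_to_hazard_report(pdf_path: str) -> HazardReport:
    # 1. Docling: PDF -> Markdown, with tables preserved
    markdown = convert_pdf_to_markdown(pdf_path)

    # 2. Instructor: Markdown -> validated HazardReport object
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=HazardReport,
        messages=[
            {"role": "user", "content": f"Extract the hazard details from this text: {markdown}"}
        ],
    )

# report = pdf_to_hazard_report("provider_hazard_report.pdf")
```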


Provide python app to pass each LangExtract python object to LLM=Gemini 2.5 for summarization

This workflow effectively creates a "Safety Intelligence" pipeline. By passing your structured hazard-report objects to Gemini 2.5 Flash (Google's high-efficiency model), you can generate high-level executive summaries or technical risk assessments while maintaining the strict hierarchy of your hazard reports.

As a NASA subcontractor, this "Summarize-per-Cause" approach is particularly useful for generating the Safety Assessment Report (SAR) content from deep technical hazard analyses.


Prerequisites

The recommended client stack is the google-genai SDK (the successor to the legacy google-generativeai package) paired with the instructor library for structured interactions.

```bash
pip install google-genai instructor
```

Python App: Structured Summarization Pipeline

```python
from google import genai
from pydantic import BaseModel, Field
import instructor

# 1. Define the Summary Schema 
# We want the LLM to return a structured summary, not just a string.
class SafetySummary(BaseModel):
    executive_overview: str = Field(..., description="A 2-sentence summary for management.")
    technical_risk_level: str = Field(..., description="Assessment of risk: Low, Medium, or High.")
    critical_gap_analysis: str = Field(..., description="Identification of any missing or weak verifications.")

# 2. Set up the Gemini client with Instructor
# Note: the google-genai SDK is wrapped via instructor.from_genai;
# exact wrapper arguments may vary between instructor versions.
genai_client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
client = instructor.from_genai(genai_client, model="gemini-2.5-flash")

def summarize_hazard_report(hazard_report_obj):
    """
    Takes the HazardReport object (from the previous LangExtract step) 
    and generates a safety summary for each cause.
    """
    print(f"--- Summarizing Report: {hazard_report_obj.report_title} ---\n")
    
    summaries = []

    for cause in hazard_report_obj.causes:
        # Pass the relevant slice of the Pydantic object directly to the LLM;
        # instructor serializes it and validates the structured response
        
        response = client.chat.completions.create(
            response_model=SafetySummary,
            messages=[
                {
                    "role": "system", 
                    "content": "You are a NASA Safety & Mission Assurance (S&MA) Lead."
                },
                {
                    "role": "user", 
                    "content": f"Review this Hazard Cause and its controls: {cause.model_dump_json()}"
                }
            ]
        )
        
        summaries.append({
            "cause": cause.title,
            "summary": response
        })

    return summaries

# --- Example Execution ---
# Assuming 'report' is the object we extracted in the previous step
# summaries = summarize_hazard_report(report)

# for item in summaries:
#     print(f"CAUSE: {item['cause']}")
#     print(f"RISK: {item['summary'].technical_risk_level}")
#     print(f"GAP: {item['summary'].critical_gap_analysis}\n")

Why this approach works for S&MA

  1. Contextual Awareness: By passing the model_dump_json() of a specific cause, Gemini 2.5 sees the direct mapping of controls and verifications. It can notice if a control for a "Critical" cause only has a "Document Review" verification and flag it as a gap.

  2. Gemini 2.5 Flash Efficiency: For bulk processing of hundreds of hazard reports, the Flash model offers a strong price-to-performance ratio and a 1M-token context window, allowing you to summarize entire programs at once if needed.

  3. Strict Typing: Because the output is a SafetySummary object, you can programmatically reject any summary that lacks a technical_risk_level, supporting your NPR 7150.2 compliance evidence (a Literal-typed sketch follows).
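
One way to enforce point 3 at the type level is a Literal-constrained field, so any response outside the three allowed values fails validation before it reaches your report. A sketch of the revised SafetySummary:

```python
from typing import Literal
from pydantic import BaseModel, Field

class SafetySummary(BaseModel):
    executive_overview: str = Field(..., description="A 2-sentence summary for management.")
    technical_risk_level: Literal["Low", "Medium", "High"] = Field(
        ..., description="Risk rating constrained to the three allowed values."
    )
    critical_gap_analysis: str = Field(..., description="Missing or weak verifications.")
```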

Next Step

Since you are building these for NASA space programs, would you like me to show how to export these summarized objects directly into a Cypher script to populate a Neo4j Knowledge Graph?


Yes

This is the "connective tissue" of your S&MA digital twin. By moving from isolated PDF summaries to a Neo4j Knowledge Graph, you transform static hazard reports into a navigable network where you can query, for example, "Which software modules are associated with the most 'High' risk causes across the entire Moon to Mars program?"

To achieve this, we map the hierarchy of your HazardReport and SafetySummary objects to a set of Nodes and Relationships.


1. The Mapping Strategy

We will use the neo4j Python driver (v5.x+) and a simple nested-loop function to ingest your objects.

| Pydantic Class | Neo4j Label | Relationship | Target Label |
| --- | --- | --- | --- |
| HazardReport | :HazardReport | HAS_CAUSE | :Cause |
| HazardCause | :Cause | HAS_CONTROL | :Control |
| HazardControl | :Control | HAS_VERIFICATION | :Verification |
| SafetySummary | :Summary | SUMMARIZES | :Cause |

2. Python App: Pydantic to Neo4j Exporter

```python
from neo4j import GraphDatabase

class Neo4jExporter:
    def __init__(self, uri, user, password):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def close(self):
        self.driver.close()

    def ingest_hazard_data(self, report, summaries):
        with self.driver.session() as session:
            session.execute_write(self._create_graph_structure, report, summaries)

    @staticmethod
    def _create_graph_structure(tx, report, summaries):
        # 1. Create the Main Report Node
        tx.run(
            "MERGE (r:HazardReport {title: $title})",
            title=report.report_title
        )

        # Assumes summaries are in the same order as report.causes
        for cause, summary_item in zip(report.causes, summaries):
            # 2. Create Cause Node & Link to Report
            tx.run("""
                MATCH (r:HazardReport {title: $report_title})
                MERGE (c:Cause {title: $cause_title})
                SET c.description = $desc
                MERGE (r)-[:HAS_CAUSE]->(c)
                """, 
                report_title=report.report_title, 
                cause_title=cause.title, 
                desc=cause.description
            )

            # 3. Create Summary Node & Link to Cause
            # This is the intelligence from Gemini 2.5
            tx.run("""
                MATCH (c:Cause {title: $cause_title})
                MERGE (s:Summary {id: $summary_id})
                SET s.overview = $overview, s.risk = $risk, s.gaps = $gaps
                MERGE (s)-[:SUMMARIZES]->(c)
                """,
                cause_title=cause.title,
                summary_id=f"sum_{cause.title.replace(' ', '_')}",
                overview=summary_item['summary'].executive_overview,
                risk=summary_item['summary'].technical_risk_level,
                gaps=summary_item['summary'].critical_gap_analysis
            )

            for control in cause.controls:
                # 4. Create Control Node & Link to Cause
                tx.run("""
                    MATCH (c:Cause {title: $cause_title})
                    MERGE (con:Control {title: $con_title})
                    SET con.description = $con_desc
                    MERGE (c)-[:HAS_CONTROL]->(con)
                    """,
                    cause_title=cause.title,
                    con_title=control.title,
                    con_desc=control.description
                )

                for v in control.verifications:
                    # 5. Create Verification Node & Link to Control
                    tx.run("""
                        MATCH (con:Control {title: $con_title})
                        MERGE (v:Verification {title: $v_title})
                        SET v.method = $v_method
                        MERGE (con)-[:HAS_VERIFICATION]->(v)
                        """,
                        con_title=control.title,
                        v_title=v.title,
                        v_method=v.method
                    )

# --- Usage ---
# exporter = Neo4jExporter("bolt://localhost:7687", "neo4j", "password")
# exporter.ingest_hazard_data(report, summaries)
```
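
Because the Cypher above uses MERGE keyed on title, it helps to create uniqueness constraints once, up front, so repeated ingestions deduplicate through an index lookup rather than a label scan. A sketch using plain driver calls (Neo4j 5 constraint syntax; names are illustrative):

```python
CONSTRAINTS = [
    "CREATE CONSTRAINT cause_title IF NOT EXISTS FOR (c:Cause) REQUIRE c.title IS UNIQUE",
    "CREATE CONSTRAINT control_title IF NOT EXISTS FOR (con:Control) REQUIRE con.title IS UNIQUE",
    "CREATE CONSTRAINT verification_title IF NOT EXISTS FOR (v:Verification) REQUIRE v.title IS UNIQUE",
]

def create_constraints(driver):
    # Run once before the first ingest; IF NOT EXISTS makes this idempotent
    with driver.session() as session:
        for stmt in CONSTRAINTS:
            session.run(stmt)

# create_constraints(exporter.driver)
```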

3. Why this is powerful for NASA S&MA

By having this data in Neo4j, you can perform Path Analysis that is impossible in PDFs or Excel:

  • Gap Detection: Find all Control nodes that do not have a :HAS_VERIFICATION relationship.

  • Risk Propagation: Visualize how a single "High" risk Cause in the Flight Software might impact multiple higher-level System Hazards.

  • Cross-Project Impact: If a specific software Verification method (e.g., "Polyspace Analysis") is found to have a flaw, you can instantly see every Control across all reports that relied on it.

Next Step

Would you like me to generate a specific Cypher query you can use in the Neo4j Browser to visualize only the "High Risk" paths where verifications are missing?


Yes

To visualize "High Risk" safety paths where a mitigation has no proof of efficacy, you can use the following Cypher query in your Neo4j Browser.

This query specifically looks for the "Broken Links" in your S&MA chain: it identifies causes labeled as High Risk (by your Gemini 2.5 analysis) where at least one control exists but has no associated verification.


The "Critical Gap" Cypher Query

```cypher
// Find all High-Risk causes and their unverified controls
MATCH (r:HazardReport)-[:HAS_CAUSE]->(c:Cause)
MATCH (s:Summary)-[:SUMMARIZES]->(c)
WHERE s.risk = "High"

// Traverse to controls
MATCH (c)-[:HAS_CONTROL]->(con:Control)

// Filter for controls that are MISSING a verification relationship
WHERE NOT (con)-[:HAS_VERIFICATION]->(:Verification)

// Return the full path for visual inspection in the graph UI
RETURN r, c, s, con
```

How to Interpret the Results

  • The Report Node (r): Tells you which mission-level document is affected.

  • The Cause Node (c): Highlights the specific failure mode (e.g., "Memory Leak").

  • The Summary Node (s): Displays the Gemini-generated reasoning for why this was flagged as High Risk.

  • The Control Node (con): Shows the intended mitigation that currently lacks evidence.

S&MA Tip: In a professional audit, this graph view is your "To-Do List." Any Control node that appears in this result set represents a violation of NPR 7150.2 traceability requirements.


Pro-Tip: Global Gap Summary

If you want a quick count of how many unverified controls exist across your entire database, categorized by risk level, use this aggregation:

```cypher
MATCH (s:Summary)-[:SUMMARIZES]->(c:Cause)-[:HAS_CONTROL]->(con:Control)
WHERE NOT (con)-[:HAS_VERIFICATION]->()
RETURN s.risk AS RiskLevel, count(con) AS UnverifiedControls
// Rank by severity; a plain ORDER BY on the string would sort alphabetically
ORDER BY CASE RiskLevel WHEN "High" THEN 0 WHEN "Medium" THEN 1 ELSE 2 END
```

This is particularly useful for program-level dashboards (e.g., Moon to Mars) to show safety maturity over time.
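
To pull the same numbers into Python (for a dashboard script, for example), a small helper that reuses the driver from the Neo4jExporter above:

```python
def gap_summary(driver):
    query = """
    MATCH (s:Summary)-[:SUMMARIZES]->(c:Cause)-[:HAS_CONTROL]->(con:Control)
    WHERE NOT (con)-[:HAS_VERIFICATION]->()
    RETURN s.risk AS RiskLevel, count(con) AS UnverifiedControls
    """
    with driver.session() as session:
        for record in session.run(query):
            print(f"{record['RiskLevel']}: {record['UnverifiedControls']} unverified controls")

# gap_summary(exporter.driver)
```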


Would you like me to create a Python function that automatically generates a PDF "Gap Report" by running this Cypher query and formatting the results for an audit?



Provide an executive summary, step-by-step of how this application works for a NASA Provider Software Hazard Report

This application transforms unstructured NASA Provider Software Hazard Reports from static PDF documents into a dynamic, auditable, and interconnected Knowledge Graph. By leveraging high-fidelity document parsing and structured LLM extraction, it enables S&MA (Safety and Mission Assurance) teams to automatically identify safety gaps, visualize risk propagation, and ensure compliance with NPR 7150.2 standards at scale.


Step-by-Step Workflow

Step 1: Structural Ingestion (IBM Docling)

The process begins by feeding a provider's PDF report into Docling. Unlike standard text scrapers, Docling uses computer vision to "read" the document layout. It reconstructs complex tables—which often contain the critical mapping of Hazard Causes to Controls—into clean, machine-readable Markdown. This ensures that the structural relationship between data points isn't lost during the conversion.

Step 2: Schema-Driven Extraction (LangExtract)

The Markdown text is passed to LangExtract (using a Pydantic schema). The application instructs the LLM to identify and "bucket" data into a strict hierarchy:

  • Level 1: The Report Title.

  • Level 2: Specific Software Hazard Causes.

  • Level 3: Controls/Mitigations linked to those causes.

  • Level 4: Verification methods (tests, audits) linked to those controls.

This stage enforces traceability: a verification cannot exist in the output unless it is explicitly tied to a control and a cause.

Step 3: AI-Assisted Risk Synthesis (Gemini 2.5)

Each extracted "Cause-Control-Verification" bundle is sent to Gemini 2.5 Flash. The model acts as a virtual Safety Lead, analyzing the bundle to generate:

  • An Executive Overview of the specific risk.

  • A Technical Risk Level (Low/Medium/High).

  • A Gap Analysis (e.g., "Warning: Critical cause for flight software has no dynamic test verification").

Step 4: Knowledge Graph Ingestion (Neo4j)

The structured objects and AI summaries are pushed into a Neo4j Graph Database.

  • Nodes represent the Reports, Causes, and Verifications.

  • Edges (Relationships) represent the logical flow (e.g., (:Cause)-[:HAS_CONTROL]->(:Control)). This creates a "Digital Twin" of the software safety posture across different modules or even different providers.

Step 5: Automated Audit & Visualization (Cypher)

Finally, the user runs Cypher queries to perform instant safety audits. Instead of manually reading hundreds of pages, an analyst can query the graph to "Show all High-Risk causes that lack a Verification node." This provides a real-time, visual heatmap of where the program is most vulnerable.


Key Value for NASA S&MA

  • Eliminates Human Error: Automates the tedious "copy-paste" of hazard matrices into tracking tools.

  • Cross-Program Intelligence: Allows NASA to see if the same hazard cause is appearing across multiple different providers (e.g., SpaceX vs. Boeing).

  • Audit Readiness: Provides 100% provenance, linking every node in the graph back to a specific section of the provider's original report.

Would you like me to create a simple Streamlit Dashboard code snippet that allows you to upload a PDF and see the "High Risk" graph nodes immediately?



From <https://gemini.google.com/app/>  Google Gemini 3.0 Pro (Thinking)
