#4021 LangExtract
LangExtract is a Python library that uses LLMs to extract structured information from unstructured text documents
https://github.com/google/langextract
Tell me about LangExtract
LangExtract is a sophisticated data extraction framework designed to bridge the gap between unstructured text and structured, machine-readable data. While several tools exist in the "ETL for LLMs" space, LangExtract focuses specifically on precision and schema adherence using Large Language Models (LLMs).
It is primarily used to pull specific entities, relationships, or attributes out of messy documents (like PDFs, emails, or logs) and format them into JSON, Pydantic objects, or database entries.
### Key Capabilities
LangExtract is built on the philosophy that extraction should be predictable and verifiable. Its core features include:

- Schema-Driven Extraction: You define a target structure (e.g., a Pydantic model or a JSON schema), and the tool ensures the LLM output matches that structure exactly.
- Multi-Engine Support: It is designed to work across model providers, including OpenAI, Anthropic, and local models via Ollama or vLLM.
- Context Chunking: To handle large documents that exceed an LLM's context window, it breaks the text into manageable segments while preserving the integrity of the data being extracted.
- Validation & Cleaning: It includes built-in hooks to validate extracted data (e.g., checking that a date is valid or that a value falls within a specific range) before it hits your production database (see the sketch after this list).
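To make the schema-driven and validation ideas concrete, here is a minimal, hedged sketch of the validate-and-retry core that such a tool wraps around the LLM call. The names (`Company`, `Filing`, `validate_extraction`) are illustrative, not LangExtract's actual API:

```python
import json
from typing import List, Optional
from pydantic import BaseModel, ValidationError

# The target structure: extraction must produce exactly this shape.
class Company(BaseModel):
    name: str
    ticker: str

class Filing(BaseModel):
    companies: List[Company]

def validate_extraction(raw_llm_output: str) -> Optional[Filing]:
    """Parse the LLM's raw output and validate it against the schema."""
    try:
        return Filing.model_validate(json.loads(raw_llm_output))
    except (json.JSONDecodeError, ValidationError):
        # A schema-driven extractor would re-prompt the LLM here,
        # typically feeding the validation error back into the retry prompt.
        return None

good = '{"companies": [{"name": "Acme Corp", "ticker": "ACME"}]}'
print(validate_extraction(good))        # -> Filing(companies=[Company(...)])
print(validate_extraction("not json"))  # -> None, i.e. retry
```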
### How It Differs from Standard Prompting
If you simply ask an LLM, "Give me the names in this text," you might get a conversational response or a bulleted list. LangExtract formalizes this process:
| Feature | Basic Prompting | LangExtract / Structured Extraction |
| --- | --- | --- |
| Output Format | Unpredictable (Text/Markdown) | Strictly Structured (JSON/Objects) |
| Error Handling | Manual retry | Automated validation and re-prompting |
| Scalability | Hard to automate for 1,000s of files | Designed for batch processing pipelines |
| Precision | High chance of "hallucinated" fields | Enforces schema constraints |
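The "Precision" row has a concrete mechanical basis. A hedged sketch of how schema constraints reject hallucinated fields, using Pydantic (which several extraction libraries build on):

```python
from pydantic import BaseModel, ConfigDict, ValidationError

class Person(BaseModel):
    model_config = ConfigDict(extra="forbid")  # reject fields the schema doesn't declare
    name: str

# A response containing an invented field fails validation
# instead of silently flowing into your data.
try:
    Person.model_validate({"name": "Alice", "middle_name": "Q."})
except ValidationError as exc:
    print("Rejected hallucinated field:", exc.errors()[0]["loc"])  # -> ('middle_name',)
```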
### Common Use Cases
- Invoice Processing: Extracting line items, tax IDs, and totals from diverse PDF layouts.
- Medical Records: Pulling symptoms, dosages, and dates from clinical notes.
- Legal Analysis: Identifying parties, effective dates, and termination clauses in contracts.
- Knowledge Graph Construction: Identifying "Subject-Predicate-Object" triples to populate a graph database.
### Getting Started (Conceptual Example)
In a typical Python workflow, using a tool like LangExtract (or similar libraries such as Instructor or LangChain's extraction chains) looks like this:

1. Define your Model: Create a class representing the data you want (e.g., a `Company` class with `name` and `ticker`).
2. Pass the Text: Feed the unstructured document into the extractor.
3. Receive the Object: The tool returns a populated Python object, not just a string of text.
Note: If you are looking for a specific GitHub repository or a niche proprietary tool named "LangExtract," be aware that the ecosystem moves fast. Many developers are currently transitioning from standalone extraction libraries to integrated features within frameworks like LangChain, LlamaIndex, or the Instructor library.
Would you like me to provide a Python code snippet showing how to set up a basic extraction schema using one of these frameworks?
Yes. Extract a software hazard report title, hazard causes, hazard controls for each cause, and verifications for each hazard control
To perform this extraction effectively, we use a nested schema. This ensures that each hazard cause is linked to its specific controls, and each control is linked to its specific verifications, maintaining the parent-child relationships required for safety-critical documentation.
Below is an implementation using the Instructor library and Pydantic, which is the industry-standard approach for forcing LLMs to follow a strict hierarchy.
### 1. Define the Nested Schema
In this structure, we define the "bottom" level first (Verifications) and work our way up to the Report Title.
```python
from typing import List
from pydantic import BaseModel, Field
import instructor
from openai import OpenAI

# 1. The lowest level: how do we prove the control works?
class Verification(BaseModel):
    title: str = Field(..., description="The ID or title of the verification activity (e.g., Unit Test, Inspection)")
    method: str = Field(..., description="Details on how the control is verified.")

# 2. The control level: what prevents the cause?
class HazardControl(BaseModel):
    title: str = Field(..., description="Name of the control or mitigation strategy.")
    description: str
    verifications: List[Verification] = Field(..., description="List of verifications for this specific control.")

# 3. The cause level: why does the hazard happen?
class HazardCause(BaseModel):
    title: str = Field(..., description="The specific root cause or contributing factor.")
    description: str
    controls: List[HazardControl] = Field(..., description="Controls mapped directly to this cause.")

# 4. The top level: the report itself
class HazardReport(BaseModel):
    report_title: str = Field(..., description="The official title of the software hazard report.")
    causes: List[HazardCause]
```
### 2. Execute the Extraction
This script sends your unstructured text (like a PDF export or a meeting transcript) to the LLM and returns a fully populated Python object.
```python
# Initialize the patched client (instructor wraps the OpenAI client so that
# responses are validated against the Pydantic response_model)
client = instructor.patch(OpenAI())

unstructured_text = """
Hazard Report: Autonomous Flight Software - Unexpected Engine Shutdown.
One primary cause of this hazard is 'Memory Leak in Guidance Module' where the heap
is exhausted over long durations. To control this, we implement 'Static Memory Allocation'
for all guidance loops. This control is verified through 'Static Code Analysis (Polyspace)'
and 'Stress Testing for 48 hours'.
Another cause is 'Sensor Data Corruption'...
"""

# Extract the structured data
report = client.chat.completions.create(
    model="gpt-4o",
    response_model=HazardReport,
    messages=[
        {"role": "system", "content": "You are a Safety and Mission Assurance specialist."},
        {"role": "user", "content": f"Extract the hazard details from this text: {unstructured_text}"}
    ]
)

# You can now access the data via dot notation
print(f"Report: {report.report_title}")
for cause in report.causes:
    print(f"- Cause: {cause.title}")
    for control in cause.controls:
        print(f"  - Control: {control.title}")
        for v in control.verifications:
            print(f"    - Verification: {v.title}")
```
### Why this is effective for NASA-style S&MA

- Traceability: Because the schema is nested, the LLM cannot "float" a verification; it must attach it to a specific control, preserving the Safety Traceability Matrix.
- Validation: You can add Pydantic validators to ensure that every `HazardControl` has at least one `Verification`. If it doesn't, the extraction can be set to automatically re-prompt the LLM or flag the control as a safety gap (a sketch follows this list).
- Downstream Integration: This Python object can be instantly converted to JSON, making it ready for injection into a Knowledge Graph (like Neo4j) or a requirements management tool.
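As a hedged sketch of the validation point above (reusing the `HazardControl` class defined earlier; the subclass name is illustrative):

```python
from pydantic import field_validator

class VerifiedHazardControl(HazardControl):
    """HazardControl variant that treats a missing verification as a hard failure."""

    @field_validator("verifications")
    @classmethod
    def require_at_least_one_verification(cls, verifications):
        if not verifications:
            # Under Instructor, a raised ValueError is fed back to the LLM
            # as a re-prompt; alternatively, catch it and flag a safety gap.
            raise ValueError("Safety gap: control has no verification evidence.")
        return verifications
```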
Would you like me to show you how to transform this extracted object into a format compatible with a Knowledge Graph (nodes and edges)?
Provide a python app using Docling to extract markdown text from a PDF file, including tables.
Docling (by IBM Research) is currently one of the most powerful tools for PDF-to-Markdown conversion because it uses specialized AI models to understand document layout and reconstruct tables accurately, rather than just stripping text.
### Prerequisites
You will need to install the docling library. Note that Docling is a heavy-duty library that downloads its AI models (for layout and table analysis) on first run.

```bash
pip install docling
```
### The Python Application
This script takes a PDF path as input and saves the resulting Markdown (including formatted tables) to a file.
```python
from pathlib import Path

from docling.document_converter import DocumentConverter


def convert_pdf_to_markdown(source_pdf_path: str, output_dir: str = "output"):
    """
    Converts a PDF to Markdown, including tables, using Docling.
    """
    # 1. Initialize the DocumentConverter
    #    (this loads the AI models for layout and table recognition)
    converter = DocumentConverter()

    print(f"Converting {source_pdf_path}...")
    try:
        # 2. Perform the conversion
        result = converter.convert(source_pdf_path)

        # 3. Export the document to Markdown format
        #    (Docling converts detected tables to Markdown tables automatically)
        markdown_content = result.document.export_to_markdown()

        # 4. Save the output
        pdf_path = Path(source_pdf_path)
        output_folder = Path(output_dir)
        output_folder.mkdir(parents=True, exist_ok=True)
        output_file = output_folder / f"{pdf_path.stem}.md"

        with open(output_file, "w", encoding="utf-8") as f:
            f.write(markdown_content)

        print(f"Successfully converted! Markdown saved to: {output_file}")
        return markdown_content

    except Exception as e:
        print(f"An error occurred: {e}")
        return None


if __name__ == "__main__":
    # Change 'your_document.pdf' to your actual file path
    input_pdf = "your_document.pdf"
    if Path(input_pdf).exists():
        convert_pdf_to_markdown(input_pdf)
    else:
        print(f"File {input_pdf} not found.")
```
### Key Features of this approach

- High-Fidelity Tables: Unlike standard PDF libraries (like PyPDF2), Docling identifies table boundaries and converts them into standard GitHub-flavored Markdown tables (`| Col 1 | Col 2 |`); a table-export sketch follows this list.
- Structural Intelligence: It identifies headers, footers, and page numbers, and can be configured to ignore them, ensuring the Markdown is "clean" for LLM consumption.
- OCR Support: Docling has built-in support for OCR (Optical Character Recognition) if your PDF is a scanned image rather than a digital document.
- No Manual Parsing: You don't need to write logic to find tables; the `DocumentConverter` handles the computer-vision task of identifying rows and columns automatically.
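If you also want to inspect or post-process the recognized tables individually (for example, to sanity-check a Cause/Control matrix), recent Docling versions expose them on the converted document. A hedged sketch, assuming pandas is installed and the API of current releases:

```python
from pathlib import Path

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("your_document.pdf")

Path("output").mkdir(exist_ok=True)
for i, table in enumerate(result.document.tables):
    df = table.export_to_dataframe()  # each recognized table as a pandas DataFrame
    print(f"Table {i}: {df.shape[0]} rows x {df.shape[1]} columns")
    df.to_csv(f"output/table_{i}.csv", index=False)
```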
### Advanced Configuration (Optional)
If you want to fine-tune how tables are handled (e.g., enabling OCR for scanned documents), you can modify the converter initialization:
```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Enable OCR if the PDF contains images of text
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True  # ensure table analysis is active

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
```
### When to use this

- RAG Pipelines: When you need to feed PDF data into a Large Language Model and tables are critical to the context.
- Documentation Migration: When converting legacy PDF manuals into a modern Markdown-based wiki or Knowledge Base.
- Safety Reports: Since you mentioned Hazard Reports earlier, this is the ideal way to ensure that "Cause/Control" tables in those reports stay aligned and readable.
Provide python app to pass each LangExtract python object to LLM=Gemini 2.5 for summarization
This workflow effectively creates a "Safety Intelligence" pipeline. By passing your structured LangExtract objects to Gemini 2.5 Flash (the stable, high-efficiency model as of early 2026), you can generate high-level executive summaries or technical risk assessments while maintaining the strict hierarchy of your hazard reports.
As a NASA subcontractor, this "Summarize-per-Cause" approach is particularly useful for generating the Safety Assessment Report (SAR) content from deep technical hazard analyses.
### Prerequisites
In 2026, the official recommendation is to use the google-genai SDK together with the instructor library for structured interactions.

```bash
pip install google-genai instructor
```
### Python App: Structured Summarization Pipeline
```python
from google import genai
from pydantic import BaseModel, Field
import instructor

# 1. Define the summary schema.
#    We want the LLM to return a structured summary, not just a string.
class SafetySummary(BaseModel):
    executive_overview: str = Field(..., description="A 2-sentence summary for management.")
    technical_risk_level: str = Field(..., description="Assessment of risk: Low, Medium, or High.")
    critical_gap_analysis: str = Field(..., description="Identification of any missing or weak verifications.")

# 2. Set up the Gemini 2.5 client with Instructor.
#    Note: in 2026, this uses 'google-genai' with the instructor.from_genai wrapper.
genai_client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
client = instructor.from_genai(genai_client, model="gemini-2.5-flash")

def summarize_hazard_report(hazard_report_obj):
    """
    Takes the HazardReport object (from the previous LangExtract step)
    and generates a safety summary for each cause.
    """
    print(f"--- Summarizing Report: {hazard_report_obj.report_title} ---\n")
    summaries = []

    for cause in hazard_report_obj.causes:
        # Pass the relevant slice of the Pydantic object directly to the LLM;
        # instructor handles the object-to-prompt conversion and validation.
        response = client.chat.completions.create(
            response_model=SafetySummary,
            messages=[
                {
                    "role": "system",
                    "content": "You are a NASA Safety & Mission Assurance (S&MA) Lead."
                },
                {
                    "role": "user",
                    "content": f"Review this Hazard Cause and its controls: {cause.model_dump_json()}"
                }
            ]
        )
        summaries.append({
            "cause": cause.title,
            "summary": response
        })

    return summaries

# --- Example execution ---
# Assuming 'report' is the object we extracted in the previous step:
# summaries = summarize_hazard_report(report)
# for item in summaries:
#     print(f"CAUSE: {item['cause']}")
#     print(f"RISK: {item['summary'].technical_risk_level}")
#     print(f"GAP: {item['summary'].critical_gap_analysis}\n")
```
### Why this approach works for S&MA

- Contextual Awareness: By passing the `model_dump_json()` of a specific cause, Gemini 2.5 sees the direct mapping of controls and verifications. It can notice if a control for a "Critical" cause only has a "Document Review" verification and flag it as a gap.
- Gemini 2.5 Flash Efficiency: For bulk processing of hundreds of hazard reports, the Flash model offers a strong price-to-performance ratio and a 1M+ token context window, allowing you to summarize entire programs at once if needed.
- Strict Typing: Because the output is a `SafetySummary` object, you can programmatically block any summary that doesn't include a `technical_risk_level`, ensuring your final report meets NPR 7150.2 compliance (a sketch of this gate follows below).
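A short sketch of how that gate can be tightened further: narrowing the risk field to a `Literal` makes out-of-vocabulary ratings fail validation outright (the class name is illustrative):

```python
from typing import Literal
from pydantic import BaseModel, Field

class StrictSafetySummary(BaseModel):
    executive_overview: str = Field(..., description="A 2-sentence summary for management.")
    # Literal restricts the model to exactly these ratings; a missing,
    # misspelled, or free-form value is rejected before it reaches a report.
    technical_risk_level: Literal["Low", "Medium", "High"]
    critical_gap_analysis: str = Field(..., description="Missing or weak verifications.")
```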
### Next Step
Since you are building these for NASA space programs, would you like me to show how to export these summarized objects directly into a Cypher script to populate a Neo4j Knowledge Graph?
Yes
This is the "connective tissue" of your S&MA digital twin. By moving from isolated PDF summaries to a Neo4j Knowledge Graph, you transform static hazard reports into a navigable network where you can query, for example, "Which software modules are associated with the most 'High' risk causes across the entire Moon to Mars program?"
To achieve this, we map the hierarchy of your HazardReport and SafetySummary objects to a set of Nodes and Relationships.
### 1. The Mapping Strategy
We will use the neo4j Python driver (v5.x+) and a single write transaction with nested loops to ingest your objects.
| Pydantic Class | Neo4j Label | Relationship | Target Label |
| --- | --- | --- | --- |
| HazardReport | :HazardReport | HAS_CAUSE | :Cause |
| HazardCause | :Cause | HAS_CONTROL | :Control |
| HazardControl | :Control | HAS_VERIFICATION | :Verification |
| SafetySummary | :Summary | SUMMARIZES | :Cause |
### 2. Python App: Pydantic to Neo4j Exporter
```python
from neo4j import GraphDatabase

class Neo4jExporter:
    def __init__(self, uri, user, password):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def close(self):
        self.driver.close()

    def ingest_hazard_data(self, report, summaries):
        with self.driver.session() as session:
            session.execute_write(self._create_graph_structure, report, summaries)

    @staticmethod
    def _create_graph_structure(tx, report, summaries):
        # 1. Create the main report node
        tx.run(
            "MERGE (r:HazardReport {title: $title})",
            title=report.report_title
        )

        for cause, summary_item in zip(report.causes, summaries):
            # 2. Create cause node & link to report
            tx.run("""
                MATCH (r:HazardReport {title: $report_title})
                MERGE (c:Cause {title: $cause_title})
                SET c.description = $desc
                MERGE (r)-[:HAS_CAUSE]->(c)
                """,
                report_title=report.report_title,
                cause_title=cause.title,
                desc=cause.description
            )

            # 3. Create summary node & link to cause
            #    (this is the intelligence from Gemini 2.5)
            tx.run("""
                MATCH (c:Cause {title: $cause_title})
                MERGE (s:Summary {id: $summary_id})
                SET s.overview = $overview, s.risk = $risk, s.gaps = $gaps
                MERGE (s)-[:SUMMARIZES]->(c)
                """,
                cause_title=cause.title,
                summary_id=f"sum_{cause.title.replace(' ', '_')}",
                overview=summary_item['summary'].executive_overview,
                risk=summary_item['summary'].technical_risk_level,
                gaps=summary_item['summary'].critical_gap_analysis
            )

            for control in cause.controls:
                # 4. Create control node & link to cause
                tx.run("""
                    MATCH (c:Cause {title: $cause_title})
                    MERGE (con:Control {title: $con_title})
                    SET con.description = $con_desc
                    MERGE (c)-[:HAS_CONTROL]->(con)
                    """,
                    cause_title=cause.title,
                    con_title=control.title,
                    con_desc=control.description
                )

                for v in control.verifications:
                    # 5. Create verification node & link to control
                    tx.run("""
                        MATCH (con:Control {title: $con_title})
                        MERGE (v:Verification {title: $v_title})
                        SET v.method = $v_method
                        MERGE (con)-[:HAS_VERIFICATION]->(v)
                        """,
                        con_title=control.title,
                        v_title=v.title,
                        v_method=v.method
                    )

# --- Usage ---
# exporter = Neo4jExporter("bolt://localhost:7687", "neo4j", "password")
# exporter.ingest_hazard_data(report, summaries)
# exporter.close()
```
### 3. Why this is powerful for NASA S&MA
By having this data in Neo4j, you can perform Path Analysis that is impossible in PDFs or Excel:

- Gap Detection: Find all `Control` nodes that do not have a `:HAS_VERIFICATION` relationship.
- Risk Propagation: Visualize how a single "High" risk `Cause` in the Flight Software might impact multiple higher-level System Hazards.
- Cross-Project Impact: If a specific software `Verification` method (e.g., "Polyspace Analysis") is found to have a flaw, you can instantly see every `Control` across all reports that relied on it.
### Next Step
Would you like me to generate a specific Cypher query you can use in the Neo4j Browser to visualize only the "High Risk" paths where verifications are missing?
Yes
To visualize "High Risk" safety paths where a mitigation has no proof of efficacy, you can use the following Cypher query in your Neo4j Browser.
This query specifically looks for the "Broken Links" in your S&MA chain: it identifies causes labeled as High Risk (by your Gemini 2.5 analysis) where at least one control exists but has no associated verification.
The "Critical Gap" Cypher Query
```cypher
// Find all High-Risk causes and their unverified controls
MATCH (r:HazardReport)-[:HAS_CAUSE]->(c:Cause)
MATCH (s:Summary)-[:SUMMARIZES]->(c)
WHERE s.risk = "High"

// Traverse to controls
MATCH (c)-[:HAS_CONTROL]->(con:Control)

// Filter for controls that are MISSING a verification relationship
WHERE NOT (con)-[:HAS_VERIFICATION]->(:Verification)

// Return the full path for visual inspection in the graph UI
RETURN r, c, s, con
```
### How to Interpret the Results

- The Report Node (`r`): Tells you which mission-level document is affected.
- The Cause Node (`c`): Highlights the specific failure mode (e.g., "Memory Leak").
- The Summary Node (`s`): Displays the Gemini-generated reasoning for why this was flagged as High Risk.
- The Control Node (`con`): Shows the intended mitigation that currently lacks evidence.

S&MA Tip: In a professional audit, this graph view is your "To-Do List." Any `Control` node that appears in this result set represents a violation of NPR 7150.2 traceability requirements.
### Pro-Tip: Global Gap Summary
If you want a quick count of how many unverified controls exist across your entire database, categorized by risk level, use this aggregation:
```cypher
MATCH (s:Summary)-[:SUMMARIZES]->(c:Cause)-[:HAS_CONTROL]->(con:Control)
WHERE NOT (con)-[:HAS_VERIFICATION]->()
RETURN s.risk AS RiskLevel, count(con) AS UnverifiedControls
ORDER BY RiskLevel DESC
```
This is particularly useful for program-level dashboards (e.g., Moon to Mars) to show safety maturity over time.
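If you would rather pull these counts from Python (say, for a recurring dashboard job) than run them in the Browser, here is a minimal sketch using the neo4j 5.x driver's `execute_query`; connection details are placeholders:

```python
from neo4j import GraphDatabase

GAP_QUERY = """
MATCH (s:Summary)-[:SUMMARIZES]->(c:Cause)-[:HAS_CONTROL]->(con:Control)
WHERE NOT (con)-[:HAS_VERIFICATION]->()
RETURN s.risk AS RiskLevel, count(con) AS UnverifiedControls
ORDER BY RiskLevel DESC
"""

def unverified_controls_by_risk(uri: str, user: str, password: str) -> dict:
    """Return {risk_level: unverified_control_count} for dashboarding."""
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        records, _summary, _keys = driver.execute_query(GAP_QUERY)
        return {r["RiskLevel"]: r["UnverifiedControls"] for r in records}

# Example (placeholder credentials):
# print(unverified_controls_by_risk("bolt://localhost:7687", "neo4j", "password"))
```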
Would you like me to create a Python function that automatically generates a PDF "Gap Report" by running this Cypher query and formatting the results for an audit?
Provide an executive summary, step-by-step of how this application works for a NASA Provider Software Hazard Report
This application transforms unstructured NASA Provider Software Hazard Reports from static PDF documents into a dynamic, auditable, and interconnected Knowledge Graph. By leveraging high-fidelity document parsing and structured LLM extraction, it enables S&MA (Safety and Mission Assurance) teams to automatically identify safety gaps, visualize risk propagation, and ensure compliance with NPR 7150.2 standards at scale.
### Step-by-Step Workflow

#### Step 1: Structural Ingestion (IBM Docling)
The process begins by feeding a provider's PDF report into Docling. Unlike standard text scrapers, Docling uses computer vision to "read" the document layout. It reconstructs complex tables—which often contain the critical mapping of Hazard Causes to Controls—into clean, machine-readable Markdown. This ensures that the structural relationship between data points isn't lost during the conversion.
#### Step 2: Schema-Driven Extraction (LangExtract)
The Markdown text is passed to LangExtract (using a Pydantic schema). The application instructs the LLM to identify and "bucket" data into a strict hierarchy:

- Level 1: The Report Title.
- Level 2: Specific Software Hazard Causes.
- Level 3: Controls/Mitigations linked to those causes.
- Level 4: Verification methods (tests, audits) linked to those controls.

This stage enforces Traceability: a verification cannot exist in the output unless it is explicitly tied to a control and a cause.
#### Step 3: AI-Assisted Risk Synthesis (Gemini 2.5)
Each extracted "Cause-Control-Verification" bundle is sent to Gemini 2.5 Flash. The model acts as a virtual Safety Lead, analyzing the bundle to generate:

- An Executive Overview of the specific risk.
- A Technical Risk Level (Low/Medium/High).
- A Gap Analysis (e.g., "Warning: Critical cause for flight software has no dynamic test verification").
#### Step 4: Knowledge Graph Ingestion (Neo4j)
The structured objects and AI summaries are pushed into a Neo4j Graph Database.

- Nodes represent the Reports, Causes, Controls, and Verifications.
- Edges (Relationships) represent the logical flow (e.g., `(:Cause)-[:HAS_CONTROL]->(:Control)`, following the mapping table above).

This creates a "Digital Twin" of the software safety posture across different modules or even different providers.
#### Step 5: Automated Audit & Visualization (Cypher)
Finally, the user runs Cypher queries to perform instant safety audits. Instead of manually reading hundreds of pages, an analyst can query the graph to "Show all High-Risk causes that lack a Verification node." This provides a real-time, visual heatmap of where the program is most vulnerable.
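For orientation, here is a hedged sketch of how the pieces from the preceding sections compose into a single pipeline. It reuses `convert_pdf_to_markdown`, `HazardReport`, `summarize_hazard_report`, and `Neo4jExporter` as defined earlier; connection details and model names are placeholders:

```python
def run_hazard_pipeline(pdf_path: str):
    # Step 1: Docling -> Markdown (tables preserved)
    markdown = convert_pdf_to_markdown(pdf_path)

    # Step 2: Markdown -> nested HazardReport object (Instructor-patched client)
    report = client.chat.completions.create(
        model="gpt-4o",
        response_model=HazardReport,
        messages=[{"role": "user",
                   "content": f"Extract the hazard details from this text: {markdown}"}],
    )

    # Step 3: per-cause risk summaries from Gemini 2.5 Flash
    summaries = summarize_hazard_report(report)

    # Step 4: push nodes and relationships into Neo4j
    exporter = Neo4jExporter("bolt://localhost:7687", "neo4j", "password")
    exporter.ingest_hazard_data(report, summaries)
    exporter.close()

    # Step 5: audits now run as Cypher queries against the graph
    return report, summaries
```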
### Key Value for NASA S&MA

- Eliminates Human Error: Automates the tedious "copy-paste" of hazard matrices into tracking tools.
- Cross-Program Intelligence: Allows NASA to see if the same hazard cause is appearing across multiple providers (e.g., SpaceX vs. Boeing).
- Audit Readiness: Provides 100% provenance, linking every node in the graph back to a specific section of the provider's original report.
Would you like me to create a simple Streamlit Dashboard code snippet that allows you to upload a PDF and see the "High Risk" graph nodes immediately?
From <https://gemini.google.com/app/> Google Gemini 3.0 Pro (Thinking)
