#4160 Upload File Types for LLMs: PDF or HTML; +PDF-to-XYZ Tools

#4160 When accepting files as input, do LLMs work better with PDF or HTML files?

In 2026, the consensus among developers and AI researchers is that HTML (and its distilled cousin, Markdown) is significantly more effective for LLM input than PDF.

While modern multimodal models (like Gemini 2.0 Flash or Claude 3.7) are remarkably good at "seeing" PDFs, HTML provides a native semantic structure that LLMs interpret with much higher reliability and lower token overhead.

Why HTML Wins

  • Semantic Hierarchy: HTML tags like <h1>, <table>, and <ul> provide explicit signals about data relationships. In contrast, a PDF is a "display format" where a header is just text with a larger font size.

  • Token Efficiency: "Raw" PDFs often contain redundant layout data. Converting them to "clean" HTML or Markdown can reduce token counts by up to 90% without losing information, allowing you to fit more content into the context window.

  • Table Integrity: Tables in PDFs are notorious for "shredding" when read as text (columns merging, rows misaligning). HTML <table> structures preserve the relationship between cells, which is critical for reasoning.

  • Context Continuity: PDFs often break sentences or tables across pages. HTML provides a continuous stream, preventing the "mid-sentence gap" that can lead to context rot.
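The table-integrity point is easy to see with a toy example. The sketch below is illustrative and not tool-specific: the same small table as it often arrives from naive PDF text extraction (column boundaries lost), versus rebuilt as a Markdown table from parsed rows. `to_markdown_table` is a hypothetical helper, not part of any library.

```python
# Illustrative only: the same table as naive PDF text extraction often
# delivers it (cells flattened into a stream) vs. structured Markdown.

pdf_extracted = "Part Qty Mass (kg) Bolt 4 0.2 Bracket 2 1.1"  # column boundaries lost

rows = [["Part", "Qty", "Mass (kg)"],
        ["Bolt", "4", "0.2"],
        ["Bracket", "2", "1.1"]]

def to_markdown_table(rows):
    """Render parsed rows as a Markdown table the LLM can read unambiguously."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

print(to_markdown_table(rows))
```

The flat string forces the model to guess where each cell ends; the Markdown form makes every cell boundary explicit.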


Comparison at a Glance

| Feature | PDF (Native/OCR) | HTML / Markdown |
|---|---|---|
| Reliability | Moderate (prone to layout errors) | High (native semantic tags) |
| Context Usage | Token-heavy (includes layout info) | Efficient (content-focused) |
| Table Accuracy | Low (requires Vision or complex OCR) | High (structural integrity) |
| Best Use Case | Scanned forms, visual-heavy docs | Data extraction, RAG, knowledge bases |

The "Vision" Exception

The one area where PDFs still hold an edge is complex, non-standard layouts (like high-end magazine spreads or complex engineering diagrams). For these, using a Multimodal LLM (Vision-based) to "look" at the PDF pages often yields better results than a messy text-extraction process.

Expert Tip: The "Markdown Sandwich"

For your NASA-related knowledge graphs or AI-centric services, the industry standard is to:

  1. Extract the PDF using a tool like LlamaParse or PyMuPDF.

  2. Convert it into Markdown (rather than raw HTML).

  3. Feed the Markdown to the LLM.

Markdown is essentially "HTML Lite"—it provides all the semantic benefits (headers, lists, tables) with the lowest possible token cost.
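The token-efficiency claim can be sanity-checked with a crude proxy. The snippet below is illustrative only: it compares the same content expressed as verbose HTML versus Markdown using a whitespace-based token count (real tokenizers differ, but the direction of the ratio holds).

```python
import re

# The same heading and list, once as class-laden HTML, once as Markdown.
html = ('<div class="section"><h2 class="title">Results</h2>'
        '<ul class="list"><li class="item">Accuracy: 97%</li>'
        '<li class="item">Latency: 40 ms</li></ul></div>')

markdown = "## Results\n- Accuracy: 97%\n- Latency: 40 ms"

def rough_tokens(text):
    # Crude proxy: split on whitespace and angle brackets; real BPE
    # tokenizers count differently, but the relative gap is similar.
    return len([t for t in re.split(r"[\s<>]+", text) if t])

print(rough_tokens(html), rough_tokens(markdown))
```

The Markdown version carries the same information in a fraction of the tokens, which is exactly what you want before the text hits a context window.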

Would you like me to recommend a specific Python-based workflow for converting your documents into LLM-optimized Markdown?


Is there a tool to convert PDF to HTML, particularly for PDF files with tables?

In 2026, the landscape for PDF-to-HTML conversion has shifted toward AI-native parsers that treat document layout as a vision problem rather than just a text-extraction task.

For PDF files with tables—which are notoriously difficult to extract without losing structural integrity—the following tools are the current industry leaders for LLM-ready output.

1. Docling (Best for Accuracy)

IBM’s Docling has become a developer favorite because it provides an incredibly balanced "layout-to-structure" mapping.

  • Why it wins: It boasts a 97.9% accuracy rate for complex table extraction in recent 2026 benchmarks.

  • LLM Benefit: It produces highly structured HTML or Markdown that preserves hierarchical nesting (e.g., a table within a specific section header).

  • Best for: Production-grade RAG (Retrieval-Augmented Generation) pipelines where data precision is non-negotiable.

2. MinerU (Best Open-Source for Agents)

If you are looking for an open-source tool to integrate into your "Idea Miner" or NASA-related projects, MinerU is a powerhouse.

  • Key Feature: It was specifically designed for "Agentic workflows." It can automatically recognize and convert complex tables into clean HTML and formulas into LaTeX.

  • Advanced Logic: It detects scanned PDFs and automatically triggers an OCR layer, ensuring that even "image-only" tables are turned into machine-readable HTML code.

3. LlamaParse (Best Speed/Cost Balance)

From the creators of LlamaIndex, this is a managed cloud service that is essentially "PDF parsing as a Service."

  • Performance: It is consistently fast (~6 seconds per document) and is heavily optimized for LLM context windows.

  • The Catch: While it handles "traditional" tables very well, it can occasionally struggle with deeply nested or non-standard financial tables compared to Docling.

  • Best for: High-volume applications where you don't want to manage your own GPU infrastructure for parsing.


Summary Table for Decision Making

| Tool | Core Strength | Table Accuracy | Format Output |
|---|---|---|---|
| Docling | Structural Integrity | Ultra-High | HTML, Markdown, JSON |
| MinerU | Agent-Ready / Open Source | High | HTML, Markdown, JSON |
| LlamaParse | Speed & Integration | Moderate-High | Markdown |
| Mathpix | STEM & Math tables | High | HTML, LaTeX |

Pro-Tip for your Projects

Since you identify as a MAKER and work with specific hardware like the MSI Creator laptop, you might find that running Docling locally is quite efficient. It scales linearly with page count and avoids the "cloud dependency" that can sometimes be a bottleneck for specialized NASA subcontractor work.

Would you like me to provide a Python snippet to set up a basic Docling or MinerU conversion script?


Docling seems extremely slow

You are right—Docling is a heavy-lifter, not a sprinter. It uses multiple deep-learning models (like Layout Transformer and TableFormer) for every page, which makes it incredibly accurate but "painfully slow" on standard CPUs.

If you are seeing conversion times of 10–30 seconds per page, here is how to speed it up or switch to something faster.

1. Speed Up Docling (The "Quick Fix")

Since you have an RTX 3090 Ti, you can drastically improve Docling's speed by forcing it onto your GPU and disabling the most intensive features.

  • Disable OCR: If your PDFs are digital (not scans), turn off OCR. This is usually the biggest bottleneck.

  • Switch the Backend: Use the newer dlparse_v2 backend, which is significantly faster.

  • Parallelize: Use num_threads to leverage your MSI Creator's CPU.

Python Optimization Snippet:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, AcceleratorOptions, AcceleratorDevice
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend

# Force GPU acceleration
accel_options = AcceleratorOptions(num_threads=8, device=AcceleratorDevice.CUDA)

pipeline_options = PdfPipelineOptions()
pipeline_options.accelerator_options = accel_options
pipeline_options.do_ocr = False  # Only disable OCR if the PDFs are digital text, not scans

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=DoclingParseV2DocumentBackend,  # faster parsing backend
        )
    }
)

result = converter.convert("document.pdf")  # then: result.document.export_to_markdown()

2. High-Speed Alternatives (2026 Edition)

If Docling still feels like overkill for your workflow, consider these tools that prioritize velocity:

Marker (The "Middle Ground")

Marker is currently the gold standard for balancing speed and table quality. It converts PDFs to Markdown/HTML at roughly 0.8 seconds per page (vs. Docling's 10+ seconds).

  • Why it's faster: It uses a more efficient sequence of smaller models.

  • Table Performance: Excellent for scientific and technical tables.

MarkItDown (The "Speed King")

Microsoft’s MarkItDown is essentially a wrapper around faster libraries like pdfminer.

  • Speed: Near-instantaneous.

  • The Catch: It is not "AI-aware" regarding layout. If you have complex multi-column tables, they may occasionally "shred" (lose their structure). It's best for simple, clean documents.

Kreuzberg

A rising favorite in 2026 for production RAG pipelines. It is a lightweight Python library designed to be 30-50x faster than Docling by using a more traditional (non-transformer) layout analysis.


Comparison Summary

| Tool | Speed | Table Quality | Best Use Case |
|---|---|---|---|
| Docling | 🐢 Slow | Elite | 100% accuracy for NASA-grade data |
| Marker | 🐎 Fast | High | Bulk processing scientific papers |
| MarkItDown | 🚀 Blazing | Moderate | Quick extraction for simple chat apps |

Since you're a solopreneur, are you looking for a tool that you can run locally on your 3090 Ti for batch processing, or do you need a super-fast response for a live user-facing application?


Batch processing

For batch processing, your RTX 3090 Ti is a major asset that changes the math on "slow" tools. In 2026, the strategy for batching depends on whether you value pure throughput (pages per second) or data fidelity (NASA-grade table accuracy).

1. The High-Speed Champion: Marker

If you have thousands of pages and need them done today, Marker is the industry standard for local batch processing.

  • Performance: It can hit ~25 pages per second on high-end hardware. On your 3090 Ti, you can expect lightning-fast results compared to Docling.

  • Batch Feature: It has a built-in marker CLI command specifically for folders.

  • Setup for 3090 Ti:

    # Install PyTorch with CUDA 12.1 wheels first, then Marker
    pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
    pip install marker-pdf
    
    # Run batch process on a folder
    # --workers: increase based on VRAM (24GB can handle 3-4 workers easily)
    # --batch_multiplier: set to 2 or 3 to use more VRAM for speed
    marker /path/to/input/folder /path/to/output --workers 4 --batch_multiplier 2
    

2. The Precision Powerhouse: Docling (Optimized)

If you found Docling slow, it’s likely because it was defaulting to CPU or processing files one by one. In January 2026, IBM released a major update for RTX acceleration that yields a 6x speedup.

  • Why use it for Batching: It is better than Marker at "logical" reconstruction—it doesn't just see a table; it understands the nested headers, which is vital for your knowledge graph projects.

  • Optimized Batch Script:

    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions, AcceleratorOptions, AcceleratorDevice
    from docling.document_converter import DocumentConverter, PdfFormatOption
    
    # Force 3090 Ti usage
    pipeline_options = PdfPipelineOptions()
    pipeline_options.accelerator_options = AcceleratorOptions(
        device=AcceleratorDevice.CUDA, 
        num_threads=8
    )
    pipeline_options.do_ocr = False  # Critical: disable if PDFs are already digital text
    
    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )
    
    # Use the convert_all method for native batching
    input_files = ["file1.pdf", "file2.pdf"] 
    results = converter.convert_all(input_files)
    

Comparison for Batching

| Tool | Throughput (Pages/Min) | VRAM Usage | Best For... |
|---|---|---|---|
| Marker | ~800 - 1,200 | High (~5GB per worker) | Converting a library of books/docs quickly |
| Docling | ~100 - 300 | Medium | Technical specs where table structure is 100% vital |
| MinerU | ~400 - 600 | High | Documents with heavy math/LaTeX and tables |
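The throughput figures above translate directly into wall-clock estimates for a batch job. This is plain arithmetic on the table's rough midpoints; `batch_minutes` and the numbers are illustrative, not benchmarks.

```python
# Rough capacity estimate from the throughput table (pages per minute).
# Values are midpoints of the quoted ranges, not measured benchmarks.
THROUGHPUT_PPM = {"Marker": 1000, "Docling": 200, "MinerU": 500}

def batch_minutes(tool, total_pages):
    """Estimated wall-clock minutes to convert `total_pages` with `tool`."""
    return total_pages / THROUGHPUT_PPM[tool]

for tool in THROUGHPUT_PPM:
    print(f"{tool}: ~{batch_minutes(tool, 10_000):.0f} min for 10,000 pages")
```

At 10,000 pages, the gap between the fast and precise paths is roughly 10 minutes versus 50, which is why routing only the hard documents to Docling pays off.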

Recommendation for your "MAKER" Workflow

Given your background in Software Process Engineering and your specific hardware:

  1. Use Marker for the bulk of your documents. It provides the best "Markdown" for LLMs to consume without clogging your pipeline.

  2. Use Docling only for the high-complexity documents (like NASA S&MA technical tables) where Marker might skip a column or merge a row.

Would you like a Python script that automatically routes files to either Marker or Docling based on their complexity?


Yes

To optimize your batch processing on the RTX 3090 Ti, you can use a "complexity-aware" router. Since you are an experienced software engineer, this approach leverages a lightweight "scout" (PyMuPDF) to peek at the document's metadata and structure before committing the heavy GPU resources of Docling or the high-speed inference of Marker.

The Logic for Your Router

  • Fast Path (Marker): For standard documents, text-heavy reports, and simple tables. It is ~10x faster for bulk processing.

  • Precision Path (Docling): For documents with a high "table-to-text" ratio or complex nested structures (like NASA S&MA technical specs).


Python Script: The Batch Router

This script requires pymupdf (for scouting), marker-pdf, and docling.

import os
import shlex
import fitz  # PyMuPDF
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, AcceleratorOptions, AcceleratorDevice
from docling.document_converter import DocumentConverter, PdfFormatOption

# Configuration for your 3090 Ti
PRECISION_THRESHOLD = 0.15  # If table density (tables per page) exceeds this, use Docling

def get_pdf_complexity(pdf_path):
    """Scout the PDF to decide which engine to use."""
    doc = fitz.open(pdf_path)
    total_tables = 0
    total_pages = len(doc)
    
    for page in doc:
        # Find tables using PyMuPDF's lightweight heuristic
        tabs = page.find_tables()
        total_tables += len(tabs.tables)
    
    doc.close()
    table_density = total_tables / total_pages if total_pages > 0 else 0
    return table_density

def process_batch(input_dir, output_dir):
    # Initialize Docling once with GPU acceleration
    pipeline_options = PdfPipelineOptions()
    pipeline_options.accelerator_options = AcceleratorOptions(
        device=AcceleratorDevice.CUDA, num_threads=8
    )
    docling_converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )

    for filename in os.listdir(input_dir):
        if not filename.lower().endswith(".pdf"):
            continue
        
        path = os.path.join(input_dir, filename)
        stem = os.path.splitext(filename)[0]
        density = get_pdf_complexity(path)
        
        if density > PRECISION_THRESHOLD:
            print(f"[PRECISION] {filename} (Table Density: {density:.2f}) -> Using Docling")
            result = docling_converter.convert(path)
            with open(os.path.join(output_dir, f"{stem}.md"), "w") as f:
                f.write(result.document.export_to_markdown())
        else:
            print(f"[SPEED] {filename} (Table Density: {density:.2f}) -> Using Marker")
            # Marker is often best called via CLI for its internal batch optimizations
            os.system(f"marker_single {shlex.quote(path)} --output_dir {shlex.quote(output_dir)}")

# Example Usage
# process_batch("./raw_nasa_docs", "./structured_output")

Why this works for your hardware

  1. VRAM Management: Running marker_single via os.system ensures that VRAM is cleared between high-speed runs, preventing the "Out of Memory" (OOM) errors that can occur when stacking multiple models in a single long-running Python process.

  2. 3090 Ti Throughput: Marker will utilize the CUDA cores for rapid OCR/Layout detection, while Docling will use the 24GB VRAM to handle the massive Transformer models required for perfect table reconstruction.

  3. NASA S&MA Context: Technical documents often have "dense" pages where a single table spans the whole sheet. The PRECISION_THRESHOLD ensures these are handled by Docling’s TableFormer model, which is far less likely to "shred" columns than Marker.

A Note on 2026 Models

If you find even Docling's table extraction isn't enough for specific aerospace diagrams, you can now pass the --use_llm flag to Marker. This uses a multimodal model (like Qwen2.5-VL) to "visually verify" the table, essentially using AI to double-check the work of the parser.

Would you like me to show you how to integrate this output directly into a Knowledge Graph schema for your NASA projects?


From <https://gemini.google.com/app/>  Google Gemini 3.0 Pro (Fast)
