#4205 Docling Chunk Size, Overlap, Separator Optimization for Embedding
Request: Provide a Python application using Docling that can determine the best chunk size, overlap, and separators for embedding.
Finding the "perfect" chunk size and overlap is often the most tedious part of building a RAG (Retrieval-Augmented Generation) pipeline. Since Docling excels at understanding document structure (headers, tables, and lists), we can use it to intelligently analyze a document's layout before deciding on splitting parameters.
The following application uses Docling to export a document to Markdown, then iterates through different chunking configurations to evaluate them based on basic heuristics like Average Chunk Richness and Context Preservation.
The Docling Optimizer Application
This script uses the Docling library to parse documents and `langchain-text-splitters` (optional, but recommended for the splitter) to simulate the chunking process.
```python
import os

from docling.document_converter import DocumentConverter
from langchain_text_splitters import RecursiveCharacterTextSplitter


class ChunkOptimizer:
    def __init__(self, source_path):
        self.source_path = source_path
        self.converter = DocumentConverter()

    def get_document_text(self):
        """Converts document to Markdown using Docling for structural integrity."""
        result = self.converter.convert(self.source_path)
        # Exporting to Markdown preserves headers, which are vital for separators
        return result.document.export_to_markdown()

    def analyze_separators(self, text):
        """Identifies the best separators based on document structure."""
        potential_separators = ["\n# ", "\n## ", "\n### ", "\n\n", "\n", " ", ""]
        found = [sep for sep in potential_separators if sep in text]
        return found

    def evaluate_configs(self, text, chunk_sizes, overlaps):
        """Simulates different splits and calculates basic metrics."""
        results = []
        separators = self.analyze_separators(text)
        print(f"{'Size':<8} | {'Overlap':<8} | {'Count':<8} | {'Avg Length':<10}")
        print("-" * 45)
        for size in chunk_sizes:
            for overlap in overlaps:
                if overlap >= size:
                    continue
                splitter = RecursiveCharacterTextSplitter(
                    chunk_size=size,
                    chunk_overlap=overlap,
                    separators=separators,
                )
                chunks = splitter.split_text(text)
                avg_len = sum(len(c) for c in chunks) / len(chunks) if chunks else 0
                results.append({
                    "size": size,
                    "overlap": overlap,
                    "count": len(chunks),
                    "avg_len": avg_len,
                })
                print(f"{size:<8} | {overlap:<8} | {len(chunks):<8} | {avg_len:<10.2f}")
        return results


# --- Execution ---
if __name__ == "__main__":
    # Path to your PDF, DOCX, or HTML file
    FILE_PATH = "your_document.pdf"

    if os.path.exists(FILE_PATH):
        optimizer = ChunkOptimizer(FILE_PATH)
        print(f"--- Processing: {FILE_PATH} ---")
        md_text = optimizer.get_document_text()

        # Test ranges
        sizes = [512, 1024, 2048]
        overlaps = [50, 100, 200]
        optimizer.evaluate_configs(md_text, sizes, overlaps)
    else:
        print("Please provide a valid file path.")
```
Why This Approach Works
1. Docling's Structural Advantage
Standard PDF parsers often lose the "hierarchy" of a document. Docling recognizes headers (#, ##). By converting to Markdown first, we ensure that the RecursiveCharacterTextSplitter can use these headers as primary separators. This prevents a chunk from being sliced in the middle of a vital section title.
2. Choosing Your Separators
Based on the Docling output, your separators list should usually follow this priority:
1. `\n# `, `\n## ` (Headers)
2. `\n\n` (Paragraphs)
3. `\n` (Lines)
4. `. ` (Sentences)
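To make the priority concrete, here is a minimal pure-Python sketch of recursive splitting. It is a simplified stand-in for `RecursiveCharacterTextSplitter` (no overlap, no re-merging of small pieces), meant only to show how higher-priority separators get first crack at the text:

```python
def recursive_split(text, separators, max_len):
    """Split text on the first applicable separator; recurse on oversized pieces.

    Simplified illustration only: real splitters also merge small pieces
    and apply overlap, which this sketch deliberately omits.
    """
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        # Fall through to the next-priority separator
        return recursive_split(text, rest, max_len)
    chunks = []
    for i, piece in enumerate(text.split(sep)):
        # Re-attach the separator so no characters are lost
        part = (sep + piece) if i > 0 else piece
        if len(part) <= max_len:
            chunks.append(part)
        else:
            chunks.extend(recursive_split(part, rest, max_len))
    return chunks


sample = "# A\npara one.\n\npara two.\n\n# B\npara three."
pieces = recursive_split(sample, ["\n# ", "\n\n", "\n", " "], 20)
```

Because each level of recursion only falls back to finer-grained separators when a piece is still too large, header boundaries are honored before paragraph or line breaks.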
3. Determining "Best"
While "best" is subjective to your embedding model (e.g., text-embedding-3-small likes ~512-800 tokens), use these rules of thumb:
High Overlap (15-20%): Use this if your document has complex cross-references.
Large Chunks (1024+): Use this for creative writing or legal docs where context is dense.
Small Chunks (~512): Use this for granular fact retrieval or technical manuals.
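The rules of thumb above can be encoded as a small starting-point helper. This is a hypothetical sketch: the profile names (`"granular"`, `"dense"`) and the default 18% overlap ratio are illustrative assumptions, not a standard taxonomy:

```python
def suggest_params(profile, overlap_ratio=0.18):
    """Map a coarse document profile to a starting (chunk_size, chunk_overlap).

    Illustrative heuristic only: "granular" = fact retrieval / technical
    manuals, "dense" = legal or context-heavy prose; anything else gets a
    middle-of-the-road default. Overlap lands in the 15-20% band.
    """
    sizes = {"granular": 512, "dense": 1024}
    size = sizes.get(profile, 768)
    return size, int(size * overlap_ratio)
```

Treat the output as a seed for `evaluate_configs`, not a final answer.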
Pro-Tip: The "Goldilocks" Method
If you want to automate the selection, look for the configuration where the Average Length is closest to 80% of your Chunk Size. This indicates the splitter is finding natural break points (separators) rather than being forced to cut mid-sentence because it ran out of character space.
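This selection rule can be applied directly to the list returned by `evaluate_configs`. A minimal sketch (the name `pick_goldilocks` is my own; the dict keys match the script above):

```python
def pick_goldilocks(results, target_ratio=0.8):
    """Pick the config whose average chunk length is closest to
    target_ratio * chunk_size (default: 80%)."""
    return min(results, key=lambda r: abs(r["avg_len"] / r["size"] - target_ratio))


# Example with three evaluated configurations:
results = [
    {"size": 512, "overlap": 50, "avg_len": 300},    # splitter cutting early
    {"size": 1024, "overlap": 100, "avg_len": 830},  # ~81% fill: natural breaks
    {"size": 2048, "overlap": 200, "avg_len": 1200}, # chunks far below capacity
]
best = pick_goldilocks(results)
```

Here the 1024-character configuration wins because its average length sits closest to 80% of the chunk size.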
From <https://gemini.google.com/app/> Google Gemini 3.0 (Fast)