#4205 Docling Chunk Size, Overlap, Separator Optimization for Embedding
Provide a Python software application using Docling that can determine the best chunk size and overlap, as well as separators, for embedding.

Finding the "perfect" chunk size and overlap is often the most tedious part of building a RAG (Retrieval-Augmented Generation) pipeline. Since Docling excels at understanding document structure (headers, tables, and lists), we can use it to analyze a document's layout before deciding on splitting parameters.

The following application uses Docling to export a document to Markdown, then iterates through different chunking configurations and evaluates each one with basic heuristics such as Average Chunk Richness and Context Preservation.

The Docling Optimizer Application

This script uses the Docling library to parse documents and langchain (optional, but recommended for the splitter) to simulate the chunking process.

import os
from docling.datamode...
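
The sweep described above (try several chunk sizes, overlaps, and separators, then score each configuration) can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the script from the response: `split_text`, `score`, and `sweep` are hypothetical helpers, and the two scoring formulas are assumed stand-ins for Average Chunk Richness and Context Preservation, which the text names but does not define. The input is a Markdown string; with Docling that string would typically come from `DocumentConverter().convert(path).document.export_to_markdown()`.

```python
from itertools import product


def split_text(text, chunk_size, overlap, separator):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters, carrying the last `overlap` characters forward.
    (Hypothetical helper; a simplified stand-in for a real splitter.)"""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = current + separator + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            # Close the current chunk and seed the next one with its tail.
            chunks.append(current)
            tail = current[-overlap:] if overlap else ""
            current = tail + separator + piece if tail else piece
        else:
            current = piece
        # Hard-cut anything still longer than chunk_size, keeping the overlap.
        while len(current) > chunk_size:
            chunks.append(current[:chunk_size])
            current = current[chunk_size - overlap:]
    if current.strip():
        chunks.append(current)
    return chunks


def score(chunks, chunk_size):
    """Assumed scoring, equal-weighted:
    richness = how full the chunks are on average (0..1);
    preservation = share of chunks ending on sentence-like punctuation."""
    if not chunks:
        return 0.0
    richness = sum(len(c) for c in chunks) / (len(chunks) * chunk_size)
    endings = (".", "!", "?", ":")
    preservation = sum(c.rstrip().endswith(endings) for c in chunks) / len(chunks)
    return 0.5 * richness + 0.5 * preservation


def sweep(markdown, sizes, overlaps, separators):
    """Score every valid (size, overlap, separator) combination, best first."""
    results = []
    for size, overlap, sep in product(sizes, overlaps, separators):
        if overlap >= size:
            continue  # skip configurations that cannot make progress
        chunks = split_text(markdown, size, overlap, sep)
        results.append({
            "chunk_size": size,
            "overlap": overlap,
            "separator": sep,
            "chunks": len(chunks),
            "score": round(score(chunks, size), 3),
        })
    return sorted(results, key=lambda r: r["score"], reverse=True)


if __name__ == "__main__":
    sample = "# Title\n\nFirst paragraph. It has two sentences.\n\nSecond one.\n\n" * 20
    for row in sweep(sample, [128, 256, 512], [0, 32, 64], ["\n\n", "\n", ". "])[:3]:
        print(row)
```

The weights and the punctuation set are arbitrary choices for illustration. In practice a retrieval-based evaluation against real queries gives a more trustworthy ranking than layout heuristics like these.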