#4296 Chunking Markdown; +Embedding with Nomic and ChromaDB

#4296 Provide Python code to parse markdown text for logical chunking, with metadata for each chunk

To handle "logical chunking," the most effective approach is to parse Markdown by its structure (headers) rather than just character counts. This ensures that a section about "Installation" stays together rather than being sliced in half.

We can use the markdown-it-py library because it’s fast, CommonMark-compliant, and produces a predictable token stream, making it easy to track which header a piece of text belongs to.

Python Implementation: Header-Based Chunker

First, install the library: pip install markdown-it-py

from markdown_it import MarkdownIt

def chunk_markdown(text):
    md = MarkdownIt()
    tokens = md.parse(text)
    
    chunks = []
    current_chunk = {"content": [], "metadata": {"headers": []}}
    header_stack = []

    for i, token in enumerate(tokens):
        # Handle Headers
        if token.type == "heading_open":
            # If we already have content, save the previous chunk
            if current_chunk["content"]:
                current_chunk["content"] = "\n".join(current_chunk["content"]).strip()
                chunks.append(current_chunk)
            
            # Get header level and content
            level = int(token.tag[1])
            content_token = tokens[i + 1]
            header_text = content_token.content
            
            # Manage header hierarchy for metadata
            header_stack = header_stack[:level - 1]
            header_stack.append(header_text)
            
            # Start new chunk
            current_chunk = {
                "content": [f"{'#' * level} {header_text}"],
                "metadata": {
                    "headers": list(header_stack),
                    "level": level,
                    "title": header_text
                }
            }
            
        # Handle body text (paragraphs, list items, etc.)
        # Collecting "inline" tokens directly avoids double-counting list
        # items, whose paragraphs are nested inside list containers.
        elif token.type == "inline":
            # Skip a heading's own inline token (already added above)
            if i > 0 and tokens[i - 1].type == "heading_open":
                continue
            current_chunk["content"].append(token.content)

    # Add the final chunk
    if current_chunk["content"]:
        current_chunk["content"] = "\n".join(current_chunk["content"]).strip()
        chunks.append(current_chunk)
        
    return chunks

# --- Example Usage ---
markdown_input = """
# Project Alpha
This is the intro.

## Installation
1. Run pip install.
2. Configure settings.

## Usage
Refer to the docs.
"""

result = chunk_markdown(markdown_input)

for chunk in result:
    print(f"Metadata: {chunk['metadata']}")
    print(f"Content: {chunk['content']}\n{'-'*20}")

Why this approach works

  • Context Preservation: Each chunk knows its "breadcrumb" path (e.g., Project Alpha > Installation). This is vital for RAG (Retrieval-Augmented Generation) applications.

  • Granularity: It splits at every header. If your sections are too long, you can add a secondary "Length Splitter" inside each logical chunk.

  • Metadata Richness: You can easily extend the metadata dictionary to include line numbers, word counts, or even extracted keywords.
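
As a minimal sketch of that last point (the helper name enrich_metadata is hypothetical), statistics can be attached to each chunk after chunking:

```python
def enrich_metadata(chunks):
    """Attach simple statistics to each chunk's metadata dict."""
    for chunk in chunks:
        chunk["metadata"]["word_count"] = len(chunk["content"].split())
        chunk["metadata"]["char_count"] = len(chunk["content"])
    return chunks

# Works on the output shape of chunk_markdown above
sample = [{"content": "## Usage\nRefer to the docs.",
           "metadata": {"headers": ["Usage"]}}]
enriched = enrich_metadata(sample)
```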

Refinement for Large Sections

If a single header contains 5,000 words, logical chunking by header alone might exceed your LLM's context window. In that case, you'd want to wrap the output of the function above in a Recursive Character Splitter.

Note: If you are building a production RAG pipeline, LangChain's MarkdownHeaderTextSplitter (and LlamaIndex's Markdown node parser) implements this same header-aware logic out of the box.


Would you like me to add a secondary "token-count" limit to this code so it splits extra-long sections while keeping the metadata?


Yes

To implement a secondary splitting mechanism (by length) while preserving logical metadata, we can combine header-based parsing with a recursive text splitter.

This approach ensures that if a single section (e.g., a long "Introduction") exceeds your token/character limit, it is broken down into smaller pieces that still "know" they belong to the "Introduction" section of "Project Alpha."

Python Implementation

This script uses markdown-it-py for structure and a custom recursive function for length management.

from markdown_it import MarkdownIt

def split_text_by_length(text, max_length, overlap=100):
    """
    Splits a long string into chunks of at most max_length characters,
    attempting to break at newlines or spaces.
    """
    # Keep the overlap well below max_length so the loop always advances
    overlap = min(overlap, max_length // 4)
    chunks = []
    start = 0
    while start < len(text):
        end = start + max_length
        if end >= len(text):
            chunks.append(text[start:])
            break
        
        # Try to find a logical break point (newline or space) within the last 20% of the window
        break_point = text.rfind('\n', start + int(max_length * 0.8), end)
        if break_point == -1:
            break_point = text.rfind(' ', start + int(max_length * 0.8), end)
        
        if break_point != -1:
            end = break_point
        
        chunks.append(text[start:end].strip())
        start = max(end - overlap, start + 1)  # Backtrack for overlap context, but always advance
    return chunks

def chunk_markdown_with_limit(text, max_chunk_size=1000, overlap=100):
    md = MarkdownIt()
    tokens = md.parse(text)
    
    logical_chunks = []
    current_content = []
    header_stack = []
    
    # --- Phase 1: Logical Grouping ---
    for i, token in enumerate(tokens):
        if token.type == "heading_open":
            # Save previous section if it exists
            if current_content:
                logical_chunks.append({
                    "content": "\n".join(current_content).strip(),
                    "headers": list(header_stack)
                })
            
            level = int(token.tag[1])
            header_text = tokens[i + 1].content
            header_stack = header_stack[:level - 1]
            header_stack.append(header_text)
            current_content = [f"{'#' * level} {header_text}"]
            
        # Collect "inline" tokens directly to avoid double-counting list
        # items, whose paragraphs are nested inside list containers
        elif token.type == "inline":
            if i > 0 and tokens[i - 1].type == "heading_open":
                continue  # heading text was already added above
            current_content.append(token.content)
                
    # Add final section
    if current_content:
        logical_chunks.append({
            "content": "\n".join(current_content).strip(),
            "headers": list(header_stack)
        })

    # --- Phase 2: Length Splitting ---
    final_chunks = []
    for section in logical_chunks:
        if len(section["content"]) > max_chunk_size:
            sub_chunks = split_text_by_length(section["content"], max_chunk_size, overlap)
            for idx, sub in enumerate(sub_chunks):
                final_chunks.append({
                    "content": sub,
                    "metadata": {
                        "headers": section["headers"],
                        "is_split": True,
                        "sub_index": idx,
                        "total_subs": len(sub_chunks)
                    }
                })
        else:
            final_chunks.append({
                "content": section["content"],
                "metadata": {
                    "headers": section["headers"],
                    "is_split": False
                }
            })
            
    return final_chunks

# --- Example Usage ---
markdown_input = """
# Deep Learning Guide
## Neural Networks
A very long explanation of backpropagation... (imagine 5000 words here)
## Optimization
Discussion on Adam vs SGD.
"""

# Set a small limit to demonstrate the split
chunks = chunk_markdown_with_limit(markdown_input, max_chunk_size=50)

for chunk in chunks:
    print(f"Headers: {' > '.join(chunk['metadata']['headers'])}")
    if chunk['metadata'].get('is_split'):
        print(f"Part: {chunk['metadata']['sub_index'] + 1}/{chunk['metadata']['total_subs']}")
    print(f"Text: {chunk['content']}\n{'-'*30}")

Key Enhancements in this Version:

  1. Context-Aware Metadata: Every sub-chunk retains the full header_stack. Even if a piece of text is separated from its header, it knows it belongs to # Deep Learning Guide > ## Neural Networks.

  2. Overlap Support: The overlap parameter ensures that the end of one chunk is repeated at the start of the next. This prevents "sentence decapitation" where a model might lose the meaning of a thought because it was cut in half.

  3. Split Indicators: Metadata includes sub_index and total_subs, which is extremely helpful for debugging or for reconstructing the document later.

  4. Logical Breaks: The length splitter looks for \n or spaces at the end of the window to avoid cutting words in the middle.

Pro-Tip for RAG:

When searching through these chunks, you can reconstruct the context for the LLM like this:

Document Context:
Section: {headers} (Part {sub_index})
Content: {content}
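
A minimal sketch of that reconstruction (the helper name build_context is hypothetical), matching the chunk shape produced by chunk_markdown_with_limit:

```python
def build_context(chunk):
    """Render a retrieved chunk into an LLM-ready context string."""
    meta = chunk["metadata"]
    section = " > ".join(meta["headers"])
    part = ""
    if meta.get("is_split"):
        part = f" (Part {meta['sub_index'] + 1}/{meta['total_subs']})"
    return f"Document Context:\nSection: {section}{part}\nContent: {chunk['content']}"

ctx = build_context({
    "content": "A very long explanation...",
    "metadata": {"headers": ["Deep Learning Guide", "Neural Networks"],
                 "is_split": True, "sub_index": 0, "total_subs": 3}
})
```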


Provide the follow-up Python code to embed into Chromadb

To integrate the chunking logic with ChromaDB, we will use the chromadb library. Since Chroma handles the heavy lifting of vector storage and similarity search, our job is to map our logical chunks and metadata into the format Chroma expects.

I’ll use the default embedding function (Sentence Transformers) for simplicity, but you can swap this for OpenAI or Cohere easily.

Python Implementation: ChromaDB Integration

First, ensure you have the library: pip install chromadb

import chromadb
from chromadb.utils import embedding_functions

# 1. Initialize Chroma Client (Persistent storage)
client = chromadb.PersistentClient(path="./my_markdown_db")

# 2. Define an embedding function (Default: all-MiniLM-L6-v2)
default_ef = embedding_functions.DefaultEmbeddingFunction()

# 3. Create or Get a Collection
collection = client.get_or_create_collection(
    name="markdown_knowledge_base",
    embedding_function=default_ef
)

def add_markdown_to_chroma(markdown_text, doc_id_prefix="doc_001"):
    # Reuse the chunking function from the previous step
    # (Assuming chunk_markdown_with_limit is defined in your script)
    chunks = chunk_markdown_with_limit(markdown_text, max_chunk_size=500)
    
    ids = []
    documents = []
    metadatas = []

    for i, chunk in enumerate(chunks):
        # Create a unique ID for every chunk
        chunk_id = f"{doc_id_prefix}_chunk_{i}"
        
        # Format metadata: Chroma requires flat dictionaries (no lists)
        # We join the header list into a single string
        metadata = {
            "source_id": doc_id_prefix,
            "headers": " > ".join(chunk["metadata"]["headers"]),
            "is_split": str(chunk["metadata"].get("is_split", False)),
            "sub_index": chunk["metadata"].get("sub_index", 0)
        }
        
        ids.append(chunk_id)
        documents.append(chunk["content"])
        metadatas.append(metadata)

    # Upsert into Chroma
    collection.upsert(
        ids=ids,
        documents=documents,
        metadatas=metadatas
    )
    print(f"Successfully embedded {len(chunks)} chunks.")

# --- Example Usage ---
markdown_content = """
# Neural Architecture
## Transformers
The transformer architecture relies on self-attention mechanisms.
## CNNs
Convolutional layers are excellent for spatial data like images.
"""

add_markdown_to_chroma(markdown_content, doc_id_prefix="ai_guide_v1")

# --- Querying the Database ---
results = collection.query(
    query_texts=["How do transformers work?"],
    n_results=2
)

print("\nTop Query Result:")
print(f"Content: {results['documents'][0][0]}")
print(f"Metadata: {results['metadatas'][0][0]}")

Important Considerations for ChromaDB

  • Metadata Flattening: ChromaDB metadata values must be strings, integers, floats, or booleans. Since our headers were a list, I used " > ".join() to flatten them into a single searchable string.

  • Upsert vs Add: I used upsert. This prevents duplicate entries if you run the script multiple times on the same document; it will simply update the existing record if the ID matches.

  • Persistence: By using PersistentClient, your embeddings are saved to the ./my_markdown_db folder. You won't lose your data when the script ends.
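
The flattening rule can be generalized into a small helper (the name flatten_metadata is hypothetical), so any future metadata fields stay Chroma-safe:

```python
def flatten_metadata(meta):
    """Coerce metadata values into Chroma-safe scalars.
    Lists are joined into a delimited string; other non-scalar
    values are stringified."""
    flat = {}
    for key, value in meta.items():
        if isinstance(value, list):
            flat[key] = " > ".join(map(str, value))
        elif isinstance(value, (str, int, float, bool)):
            flat[key] = value
        else:
            flat[key] = str(value)
    return flat

flat = flatten_metadata({"headers": ["Neural Architecture", "CNNs"],
                         "is_split": False, "sub_index": 0})
```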

Pro-Tip: Filtering by Section

One of the coolest things about logical chunking is that you can now filter your search by a specific header. For example:

# Only search within the "CNNs" section.
# Note: Chroma's metadata `where` filter supports equality/comparison
# operators ($eq, $in, ...), not substring matching, so match the exact
# joined header path (use `where_document={"$contains": ...}` to filter
# on raw document text instead).
filtered_results = collection.query(
    query_texts=["spatial data"],
    where={"headers": {"$eq": "Neural Architecture > CNNs"}},
    n_results=1
)

Would you like me to show you how to swap the default embeddings for OpenAI's text-embedding-3-small instead?


What is the vector dimensions of all-MiniLM-L6-v2?

The all-MiniLM-L6-v2 model has a vector dimension of 384.

It is widely considered the "industry standard" for lightweight applications because it strikes an excellent balance between performance and efficiency.

Quick Stats for all-MiniLM-L6-v2:

  • Dimensions: 384

  • Max Sequence Length: 256 tokens (roughly 150–200 words)

  • Model Size: ~80 MB

  • Speed: Extremely fast (optimized for CPU and edge devices)

Comparison with Alternatives

If you find that 384 dimensions aren't capturing enough nuance for complex technical documentation, you might consider these common alternatives:

| Model Name | Dimensions | Max Tokens | Best Use Case |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 256 | Speed, low memory, basic RAG |
| all-mpnet-base-v2 | 768 | 384 | Higher accuracy, better semantic drift handling |
| text-embedding-3-small (OpenAI) | 1536 | 8192 | Large context windows, state-of-the-art retrieval |

Pro-Tip for ChromaDB

If you decide to change models later, remember that vector databases are dimension-specific. If you switch from MiniLM (384) to MPNet (768), you must create a new collection; you cannot store vectors of different lengths in the same collection.
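
A sketch of a defensive check along those lines (the helper check_dims is hypothetical; real vector stores raise their own errors on mismatch):

```python
def check_dims(collection_dim, model_dim):
    """Fail fast if an embedding model's output length does not match
    the fixed dimensionality of an existing vector collection."""
    if collection_dim != model_dim:
        raise ValueError(
            f"Collection stores {collection_dim}-dim vectors but the model "
            f"produces {model_dim}-dim vectors; create a new collection."
        )

check_dims(384, 384)  # OK: MiniLM vectors into a MiniLM collection
```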

Would you like me to show you how to verify the dimensions directly in your Python code using the sentence-transformers library?


How do I use OpenAI's text-embedding-3-small instead?

To use OpenAI's text-embedding-3-small, you just need to swap the embedding function in ChromaDB.

Unlike the local model we used previously, this requires an API Key and the openai Python library. This model defaults to 1536 dimensions, offering significantly more "semantic space" than the 384 dimensions of the MiniLM model.

1. Install Dependencies

pip install openai chromadb

2. Updated ChromaDB Implementation

You will replace the DefaultEmbeddingFunction with OpenAIEmbeddingFunction.

import os
import chromadb
from chromadb.utils import embedding_functions

# 1. Set your API Key (Best practice: use environment variables)
# os.environ["OPENAI_API_KEY"] = "your-sk-..." 

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv("OPENAI_API_KEY"),
    model_name="text-embedding-3-small"
)

# 2. Initialize Chroma
client = chromadb.PersistentClient(path="./openai_markdown_db")

# 3. Create Collection with the OpenAI Function
# Note: If you already have a collection with 384 dimensions, 
# you MUST give this one a new name because dimensions will be 1536.
collection = client.get_or_create_collection(
    name="markdown_openai_v3",
    embedding_function=openai_ef
)

# Now, when you call collection.add(), Chroma automatically 
# sends your text to OpenAI's servers to get the 1536-dim vectors.
def add_to_db(chunks, doc_id):
    collection.upsert(
        ids=[f"{doc_id}_{i}" for i in range(len(chunks))],
        documents=[c["content"] for c in chunks],
        metadatas=[{
            "headers": " > ".join(c["metadata"]["headers"]),
            "is_split": str(c["metadata"].get("is_split", False))
        } for c in chunks]
    )

Why use text-embedding-3-small?

| Feature | text-embedding-3-small | all-MiniLM-L6-v2 |
|---|---|---|
| Dimensions | 1536 | 384 |
| Max Context | 8192 tokens | 256 tokens |
| Cost | $0.02 per 1M tokens | Free (local) |
| Matryoshka | Yes (can be shortened) | No |

Pro-Tip: The "Matryoshka" Trick

One unique feature of OpenAI’s v3 models is that they are trained to store the most important information in the first few dimensions. If you want to save storage space but still use a high-quality model, you can explicitly set a lower dimension (like 512) and it will still be highly accurate:

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv("OPENAI_API_KEY"),
    model_name="text-embedding-3-small",
    dimensions=512  # Shrinks the vector while keeping it "smart"
)

Would you like me to show you how to set up a quick search function to retrieve these chunks by header name?


What are other high-dimension, large token count embedding functions available that do not require an API key?

In 2026, the open-source landscape for "Long-Context" embeddings has advanced significantly. You no longer need an API key to get OpenAI-level performance (1536+ dimensions and 8k+ token windows).

The models below are the current "heavy hitters" for local RAG (Retrieval-Augmented Generation) because they handle entire document sections without needing aggressive chunking.

1. The High-Performance Leader: Qwen3-Embedding-8B

Released by Alibaba’s Qwen team, this is currently the top-ranked open-weights model on the MTEB (Massive Text Embedding Benchmark).

  • Dimensions: Variable (up to 4,096)

  • Token Count: 32,768 tokens (~25,000 words)

  • Best For: When accuracy is your only priority and you have a GPU (requires ~16GB VRAM for smooth local inference).

  • Unique Feature: It uses Matryoshka Representation Learning, meaning you can truncate the 4,096-dimensional vector down to 1,024 without losing much accuracy.
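
Matryoshka truncation amounts to slicing the vector and re-normalizing it for cosine search; a stdlib-only sketch (the helper name truncate_embedding is hypothetical):

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components of a Matryoshka-trained embedding,
    then re-normalize to unit length for cosine similarity search."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

short = truncate_embedding([3.0, 4.0, 0.1, 0.2], 2)  # -> [0.6, 0.8]
```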

2. The Efficiency King: BAAI BGE-M3

"M3" stands for Multi-lingual, Multi-granular, and Multi-functional. It is the industry standard for versatile local deployments.

  • Dimensions: 1,024

  • Token Count: 8,192 tokens

  • Best For: Multilingual documents (supports 100+ languages) and hybrid search (combines dense and sparse vectors).

  • Unique Feature: It performs "All-in-One" retrieval, meaning it can handle short sentences and long documents with the same model.

3. The Balanced Workhorse: Jina-Embeddings-V3

Jina AI’s newest open-weights model is specifically tuned for RAG and technical documentation.

  • Dimensions: 1,024

  • Token Count: 8,192 tokens

  • Best For: Technical documentation and code.

  • Unique Feature: It includes "Task-Specific Adapters." When you embed, you can tell the model if the text is a query, a passage, or code, which significantly improves retrieval precision.

4. The Lightweight Champion: Nomic-Embed-Text-v1.5

If you need high dimensions but have limited hardware (like a laptop), Nomic is the way to go.

  • Dimensions: 768 (but highly dense)

  • Token Count: 8,192 tokens

  • Best For: High-speed indexing on consumer hardware.

  • Unique Feature: It is fully auditable and open-source (data + training code), which is rare for models of this quality.


Comparison Table (Local Models)

| Model | Dimensions | Max Tokens | Hardware Requirement | License |
|---|---|---|---|---|
| Qwen3-Embedding-8B | 4,096 | 32k | High (GPU) | Apache 2.0 |
| BGE-M3 | 1,024 | 8k | Medium (CPU/GPU) | MIT |
| jina-embeddings-v3 | 1,024 | 8k | Medium (CPU/GPU) | CC BY-NC 4.0 |
| Nomic-v1.5 | 768 | 8k | Low (CPU) | Apache 2.0 |

How to use them in Python

Since these are hosted on Hugging Face, you can use them in your ChromaDB setup by switching the embedding function to SentenceTransformerEmbeddingFunction:

from chromadb.utils import embedding_functions

# Example using BGE-M3 (8k context, 1024 dimensions)
local_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="BAAI/bge-m3"
)

# Use this 'local_ef' when creating your Chroma collection
collection = client.get_or_create_collection(
    name="long_context_local_db",
    embedding_function=local_ef
)

Would you like me to adjust your chunking script to utilize the larger 8k token window of these models?


Provide the code using Nomic embedding function

To use the Nomic Embed Text v1.5 model locally (without an API key), we utilize the sentence-transformers library. This model is a significant upgrade over the default because it supports a massive 8,192 token context window and provides 768 dimensions.

1. Install Dependencies

You'll need sentence-transformers and einops (which Nomic uses for its efficient architecture).

pip install chromadb sentence-transformers einops

2. Implementation with Metadata & Logical Chunking

We will wrap the Nomic model into a ChromaDB-compatible EmbeddingFunction. Note that Nomic requires a prefix (search_document: ) for the text being stored to achieve maximum accuracy.

import os
import chromadb
from chromadb.utils import embedding_functions
from sentence_transformers import SentenceTransformer

# 1. Define a Custom Nomic Embedding Function
class NomicEmbeddingFunction(chromadb.EmbeddingFunction):
    def __init__(self, model_name="nomic-ai/nomic-embed-text-v1.5"):
        # trust_remote_code is required for Nomic's architecture
        self.model = SentenceTransformer(model_name, trust_remote_code=True)
        
    def __call__(self, input: chromadb.Documents) -> chromadb.Embeddings:
        # Nomic v1.5 expects a "search_document: " prefix for indexing
        prefixed_texts = [f"search_document: {text}" for text in input]
        embeddings = self.model.encode(prefixed_texts, convert_to_tensor=False)
        return embeddings.tolist()

# 2. Initialize Chroma and the Nomic Function
client = chromadb.PersistentClient(path="./nomic_local_db")
nomic_ef = NomicEmbeddingFunction()

collection = client.get_or_create_collection(
    name="nomic_markdown_docs",
    embedding_function=nomic_ef
)

# 3. Add chunks (Reusing your previous chunking logic)
def add_markdown_to_nomic_db(markdown_text, doc_id):
    # Since Nomic handles 8k tokens, we can use much larger max_chunk_size
    chunks = chunk_markdown_with_limit(markdown_text, max_chunk_size=2000)
    
    collection.upsert(
        ids=[f"{doc_id}_{i}" for i in range(len(chunks))],
        documents=[c["content"] for c in chunks],
        metadatas=[{
            "headers": " > ".join(c["metadata"]["headers"]),
            "is_split": str(c["metadata"].get("is_split", False))
        } for c in chunks]
    )

# 4. Querying (Important: queries need a different prefix!)
# Note: passing prefixed text via query_texts would run it through the
# embedding function above, which would prepend "search_document: " on top.
# Instead, embed the query manually with the "search_query: " prefix.
def query_nomic(query_text, n_results=3):
    query_embedding = nomic_ef.model.encode(
        [f"search_query: {query_text}"]
    ).tolist()
    
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=n_results
    )
    return results

# --- Example Run ---
markdown_data = "# Research Paper\n## Methodology\nDetailed local embedding logic..."
add_markdown_to_nomic_db(markdown_data, "paper_001")

res = query_nomic("How are local embeddings handled?")
print(res['documents'][0][0])

Why Nomic v1.5 is unique

  1. Prefix Requirement: It uses "task-specific" prefixes. Use search_document: for data you are storing and search_query: for questions you are asking.

  2. Matryoshka Embeddings: While it defaults to 768 dimensions, it is designed so you can truncate the vector to 256 or 128 dimensions to save database space while keeping ~90% of the accuracy.

  3. Local Context: Because it handles 8,192 tokens, you can often store entire Markdown sub-sections as a single chunk, which keeps the context much cleaner for the LLM later.
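
The prefix convention can be captured in a tiny helper (the name nomic_prefix is hypothetical) so the two task modes never get mixed up:

```python
def nomic_prefix(text, task):
    """Apply the task prefix Nomic Embed v1.5 expects:
    'search_document' when indexing, 'search_query' when querying."""
    if task not in ("search_document", "search_query"):
        raise ValueError(f"Unknown task: {task}")
    return f"{task}: {text}"

doc_text = nomic_prefix("Convolutional layers handle spatial data.", "search_document")
query_text = nomic_prefix("How do CNNs work?", "search_query")
```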

Pro-Tip: Hardware Acceleration

The first time you run this, it will download roughly 500 MB of model weights. If you have a GPU, sentence-transformers will automatically detect it (CUDA, or MPS on Apple Silicon) and speed up the embedding process significantly.

Would you like me to show you how to implement the "Matryoshka" truncation to reduce your database size?


From <https://gemini.google.com/app/>  Google Gemini 3.0 (Fast)
