#4296 Chunking Markdown; +Embedding with Nomic and ChromaDB
#4296 Provide Python code to parse markdown text for logical chunking, with metadata for each chunk
To handle "logical chunking," the most effective approach is to parse Markdown by its structure (headers) rather than just character counts. This ensures that a section about "Installation" stays together rather than being sliced in half.
We can use the markdown-it-py library because it’s fast and provides a clear Abstract Syntax Tree (AST), making it easy to track which header a piece of text belongs to.
Python Implementation: Header-Based Chunker
First, install the library:
pip install markdown-it-py
from markdown_it import MarkdownIt

def chunk_markdown(text):
    md = MarkdownIt()
    tokens = md.parse(text)
    chunks = []
    current_chunk = {"content": [], "metadata": {"headers": []}}
    header_stack = []

    for i, token in enumerate(tokens):
        # Handle headers
        if token.type == "heading_open":
            # If we already have content, save the previous chunk
            if current_chunk["content"]:
                current_chunk["content"] = "\n".join(current_chunk["content"]).strip()
                chunks.append(current_chunk)

            # Get header level and content
            level = int(token.tag[1])
            content_token = tokens[i + 1]
            header_text = content_token.content

            # Manage header hierarchy for metadata
            header_stack = header_stack[:level - 1]
            header_stack.append(header_text)

            # Start a new chunk
            current_chunk = {
                "content": [f"{'#' * level} {header_text}"],
                "metadata": {
                    "headers": list(header_stack),
                    "level": level,
                    "title": header_text
                }
            }
        # Handle paragraphs and lists
        elif token.type in ["paragraph_open", "bullet_list_open", "ordered_list_open"]:
            # Walk forward and collect the inline content before the matching close tag
            j = i + 1
            while j < len(tokens) and tokens[j].type != token.type.replace("_open", "_close"):
                if tokens[j].type == "inline":
                    current_chunk["content"].append(tokens[j].content)
                j += 1

    # Add the final chunk
    if current_chunk["content"]:
        current_chunk["content"] = "\n".join(current_chunk["content"]).strip()
        chunks.append(current_chunk)
    return chunks

# --- Example Usage ---
markdown_input = """
# Project Alpha
This is the intro.
## Installation
1. Run pip install.
2. Configure settings.
## Usage
Refer to the docs.
"""

result = chunk_markdown(markdown_input)
for chunk in result:
    print(f"Metadata: {chunk['metadata']}")
    print(f"Content: {chunk['content']}\n{'-'*20}")
Why this approach works
Context Preservation: Each chunk knows its "breadcrumb" path (e.g., Project Alpha > Installation). This is vital for RAG (Retrieval-Augmented Generation) applications.
Granularity: It splits at every header. If your sections are too long, you can add a secondary "Length Splitter" inside each logical chunk.
Metadata Richness: You can easily extend the metadata dictionary to include line numbers, word counts, or even extracted keywords.
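As a sketch, that kind of enrichment can be done in a small post-processing pass over the chunks. The enrich_chunks helper below is illustrative (it is not part of the parser above); it assumes each chunk has the "content" and "metadata" keys produced by chunk_markdown:

```python
def enrich_chunks(chunks):
    """Add simple derived fields to each chunk's metadata in place."""
    for chunk in chunks:
        content = chunk["content"]
        chunk["metadata"]["word_count"] = len(content.split())
        chunk["metadata"]["line_count"] = content.count("\n") + 1
    return chunks

sample = [{"content": "## Installation\nRun pip install.",
           "metadata": {"headers": ["Installation"]}}]
enriched = enrich_chunks(sample)
print(enriched[0]["metadata"]["word_count"])  # 5
```

The same pattern extends naturally to keyword extraction or token counts.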
Refinement for Large Sections
If a single header contains 5,000 words, logical chunking by header alone might exceed your LLM's context window. In that case, you'd want to wrap the output of the function above in a Recursive Character Splitter.
Note: If you are building a production RAG pipeline, libraries like LangChain or LlamaIndex have MarkdownHeaderTextSplitter classes built in that perform this exact logic.
Would you like me to add a secondary "token-count" limit to this code so it splits extra-long sections while keeping the metadata?
Yes
To implement a secondary splitting mechanism (by length) while preserving logical metadata, we can combine header-based parsing with a recursive text splitter.
This approach ensures that if a single section (e.g., a long "Introduction") exceeds your token/character limit, it is broken down into smaller pieces that still "know" they belong to the "Introduction" section of "Project Alpha."
Python Implementation
This script uses markdown-it-py for structure and a custom recursive function for length management.
from markdown_it import MarkdownIt

def split_text_by_length(text, max_length, overlap=100):
    """
    Splits a long string into smaller chunks of max_length,
    attempting to split at newlines or spaces.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = start + max_length
        if end >= len(text):
            chunks.append(text[start:])
            break

        # Try to find a logical break point (newline or space)
        # within the last 20% of the window
        break_point = text.rfind('\n', start + int(max_length * 0.8), end)
        if break_point == -1:
            break_point = text.rfind(' ', start + int(max_length * 0.8), end)
        if break_point != -1:
            end = break_point

        chunks.append(text[start:end].strip())
        # Backtrack for overlap context, but always advance by at least
        # one character so a large overlap can never cause an infinite loop
        start = max(end - overlap, start + 1)
    return chunks

def chunk_markdown_with_limit(text, max_chunk_size=1000, overlap=100):
    md = MarkdownIt()
    tokens = md.parse(text)
    logical_chunks = []
    current_content = []
    header_stack = []

    # --- Phase 1: Logical Grouping ---
    for i, token in enumerate(tokens):
        if token.type == "heading_open":
            # Save previous section if it exists
            if current_content:
                logical_chunks.append({
                    "content": "\n".join(current_content).strip(),
                    "headers": list(header_stack)
                })

            level = int(token.tag[1])
            header_text = tokens[i + 1].content
            header_stack = header_stack[:level - 1]
            header_stack.append(header_text)
            current_content = [f"{'#' * level} {header_text}"]
        elif token.type in ["paragraph_open", "bullet_list_open", "ordered_list_open"]:
            j = i + 1
            while j < len(tokens) and tokens[j].type != token.type.replace("_open", "_close"):
                if tokens[j].type == "inline":
                    current_content.append(tokens[j].content)
                j += 1

    # Add final section
    if current_content:
        logical_chunks.append({
            "content": "\n".join(current_content).strip(),
            "headers": list(header_stack)
        })

    # --- Phase 2: Length Splitting ---
    final_chunks = []
    for section in logical_chunks:
        if len(section["content"]) > max_chunk_size:
            sub_chunks = split_text_by_length(section["content"], max_chunk_size, overlap)
            for idx, sub in enumerate(sub_chunks):
                final_chunks.append({
                    "content": sub,
                    "metadata": {
                        "headers": section["headers"],
                        "is_split": True,
                        "sub_index": idx,
                        "total_subs": len(sub_chunks)
                    }
                })
        else:
            final_chunks.append({
                "content": section["content"],
                "metadata": {
                    "headers": section["headers"],
                    "is_split": False
                }
            })
    return final_chunks
# --- Example Usage ---
markdown_input = """
# Deep Learning Guide
## Neural Networks
A very long explanation of backpropagation... (imagine 5000 words here)
## Optimization
Discussion on Adam vs SGD.
"""

# Set a small limit (and a proportionally small overlap) to demonstrate the split
chunks = chunk_markdown_with_limit(markdown_input, max_chunk_size=50, overlap=10)
for chunk in chunks:
    print(f"Headers: {' > '.join(chunk['metadata']['headers'])}")
    if chunk['metadata'].get('is_split'):
        print(f"Part: {chunk['metadata']['sub_index'] + 1}/{chunk['metadata']['total_subs']}")
    print(f"Text: {chunk['content']}\n{'-'*30}")
Key Enhancements in this Version:
Context-Aware Metadata: Every sub-chunk retains the full header_stack. Even if a piece of text is separated from its header, it knows it belongs to # Deep Learning Guide > ## Neural Networks.
Overlap Support: The overlap parameter ensures that the end of one chunk is repeated at the start of the next. This prevents "sentence decapitation," where a model might lose the meaning of a thought because it was cut in half.
Split Indicators: Metadata includes sub_index and total_subs, which is extremely helpful for debugging or for reconstructing the document later.
Logical Breaks: The length splitter looks for \n or spaces at the end of the window to avoid cutting words in the middle.
Pro-Tip for RAG:
When searching through these chunks, you can reconstruct the context for the LLM like this:
Document Context: Section: {headers} (Part {sub_index}) Content: {content}
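That template can be sketched as a small helper. The build_context function below is a hypothetical name (not part of the chunker); it only assumes the metadata fields produced by chunk_markdown_with_limit:

```python
def build_context(chunk):
    """Format a retrieved chunk into an LLM-ready context string,
    using the metadata produced by the chunker above."""
    meta = chunk["metadata"]
    section = " > ".join(meta["headers"])
    part = ""
    if meta.get("is_split"):
        part = f" (Part {meta['sub_index'] + 1}/{meta['total_subs']})"
    return f"Document Context:\nSection: {section}{part}\nContent: {chunk['content']}"

chunk = {
    "content": "A very long explanation of backpropagation...",
    "metadata": {"headers": ["Deep Learning Guide", "Neural Networks"],
                 "is_split": True, "sub_index": 0, "total_subs": 3},
}
print(build_context(chunk))
```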
Provide the follow-up Python code to embed into Chromadb
To integrate the chunking logic with ChromaDB, we will use the chromadb library. Since Chroma handles the heavy lifting of vector storage and similarity search, our job is to map our logical chunks and metadata into the format Chroma expects.
I’ll use the default embedding function (Sentence Transformers) for simplicity, but you can swap this for OpenAI or Cohere easily.
Python Implementation: ChromaDB Integration
First, ensure you have the library:
pip install chromadb
import chromadb
from chromadb.utils import embedding_functions

# 1. Initialize Chroma Client (Persistent storage)
client = chromadb.PersistentClient(path="./my_markdown_db")

# 2. Define an embedding function (Default: all-MiniLM-L6-v2)
default_ef = embedding_functions.DefaultEmbeddingFunction()

# 3. Create or Get a Collection
collection = client.get_or_create_collection(
    name="markdown_knowledge_base",
    embedding_function=default_ef
)

def add_markdown_to_chroma(markdown_text, doc_id_prefix="doc_001"):
    # Reuse the chunking function from the previous step
    # (Assuming chunk_markdown_with_limit is defined in your script)
    chunks = chunk_markdown_with_limit(markdown_text, max_chunk_size=500)

    ids = []
    documents = []
    metadatas = []

    for i, chunk in enumerate(chunks):
        # Create a unique ID for every chunk
        chunk_id = f"{doc_id_prefix}_chunk_{i}"

        # Format metadata: Chroma requires flat dictionaries (no lists)
        # We join the header list into a single string
        metadata = {
            "source_id": doc_id_prefix,
            "headers": " > ".join(chunk["metadata"]["headers"]),
            "is_split": str(chunk["metadata"].get("is_split", False)),
            "sub_index": chunk["metadata"].get("sub_index", 0)
        }

        ids.append(chunk_id)
        documents.append(chunk["content"])
        metadatas.append(metadata)

    # Upsert into Chroma
    collection.upsert(
        ids=ids,
        documents=documents,
        metadatas=metadatas
    )
    print(f"Successfully embedded {len(chunks)} chunks.")

# --- Example Usage ---
markdown_content = """
# Neural Architecture
## Transformers
The transformer architecture relies on self-attention mechanisms.
## CNNs
Convolutional layers are excellent for spatial data like images.
"""

add_markdown_to_chroma(markdown_content, doc_id_prefix="ai_guide_v1")

# --- Querying the Database ---
results = collection.query(
    query_texts=["How do transformers work?"],
    n_results=2
)

print("\nTop Query Result:")
print(f"Content: {results['documents'][0][0]}")
print(f"Metadata: {results['metadatas'][0][0]}")
Important Considerations for ChromaDB
Metadata Flattening: ChromaDB metadata values must be strings, integers, floats, or booleans. Since our headers were a list, I used " > ".join() to flatten them into a single searchable string.
Upsert vs Add: I used upsert. This prevents duplicate entries if you run the script multiple times on the same document; it will simply update the existing record if the ID matches.
Persistence: By using PersistentClient, your embeddings are saved to the ./my_markdown_db folder. You won't lose your data when the script ends.
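The flattening is lossless as long as no header text itself contains " > ", so you can recover the original list when reading chunks back. A tiny illustrative helper (unflatten_headers is a hypothetical name):

```python
def unflatten_headers(headers_str):
    """Invert the ' > '.join() flattening applied before upserting.
    Assumes no individual header contains the ' > ' separator."""
    return headers_str.split(" > ") if headers_str else []

meta = {"headers": "Neural Architecture > CNNs", "is_split": "False"}
print(unflatten_headers(meta["headers"]))  # ['Neural Architecture', 'CNNs']
```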
Pro-Tip: Filtering by Section
One of the coolest things about logical chunking is that you can now filter your search by a specific header. For example:
# Only search within the "CNNs" section.
# Note: Chroma's "$contains" operator applies to document text (via the
# where_document argument), not to metadata. For metadata, use an exact match:
filtered_results = collection.query(
    query_texts=["spatial data"],
    where={"headers": {"$eq": "Neural Architecture > CNNs"}},
    n_results=1
)
Would you like me to show you how to swap the default embeddings for OpenAI's text-embedding-3-small instead?
What is the vector dimensions of all-MiniLM-L6-v2?
The all-MiniLM-L6-v2 model has a vector dimension of 384.
It is widely considered the "industry standard" for lightweight applications because it strikes an excellent balance between performance and efficiency.
Quick Stats for all-MiniLM-L6-v2:
Dimensions: 384
Max Sequence Length: 256 tokens (roughly 150–200 words)
Model Size: ~80 MB
Speed: Extremely fast (optimized for CPU and edge devices)
Comparison with Alternatives
If you find that 384 dimensions aren't capturing enough nuance for complex technical documentation, you might consider these common alternatives:
| Model Name | Dimensions | Max Tokens | Best Use Case |
| --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 384 | 256 | Speed, low memory, basic RAG. |
| all-mpnet-base-v2 | 768 | 384 | Higher accuracy, better semantic drift handling. |
| text-embedding-3-small (OpenAI) | 1536 | 8192 | Large context windows, state-of-the-art retrieval. |
Pro-Tip for ChromaDB
If you decide to change models later, remember that vector databases are dimension-specific. If you switch from MiniLM (384) to MPNet (768), you must create a new collection; you cannot store vectors of different lengths in the same collection.
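One simple way to enforce this is to bake the model name and dimension into the collection name, so vectors from two different models can never land in the same collection. This naming scheme is just an illustrative convention, not a ChromaDB feature:

```python
def collection_name_for(model_name, dimensions):
    """Derive a collection name that is unique per embedding model,
    so vectors of different lengths never share a collection."""
    slug = model_name.lower().replace("/", "-").replace("_", "-")
    return f"md-kb-{slug}-{dimensions}d"

print(collection_name_for("all-MiniLM-L6-v2", 384))   # md-kb-all-minilm-l6-v2-384d
print(collection_name_for("all-mpnet-base-v2", 768))  # md-kb-all-mpnet-base-v2-768d
```

Switching models then automatically points your code at a fresh (or matching) collection instead of raising a dimension-mismatch error.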
Would you like me to show you how to verify the dimensions directly in your Python code using the sentence-transformers library?
How do I use OpenAI's text-embedding-3-small instead?
To use OpenAI's text-embedding-3-small, you just need to swap the embedding function in ChromaDB.
Unlike the local model we used previously, this requires an API Key and the openai Python library. This model defaults to 1536 dimensions, offering significantly more "semantic space" than the 384 dimensions of the MiniLM model.
1. Install Dependencies
pip install openai chromadb
2. Updated ChromaDB Implementation
You will replace the DefaultEmbeddingFunction with OpenAIEmbeddingFunction.
import os
import chromadb
from chromadb.utils import embedding_functions

# 1. Set your API Key (Best practice: use environment variables)
# os.environ["OPENAI_API_KEY"] = "your-sk-..."

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv("OPENAI_API_KEY"),
    model_name="text-embedding-3-small"
)

# 2. Initialize Chroma
client = chromadb.PersistentClient(path="./openai_markdown_db")

# 3. Create Collection with the OpenAI Function
# Note: If you already have a collection with 384 dimensions,
# you MUST give this one a new name because dimensions will be 1536.
collection = client.get_or_create_collection(
    name="markdown_openai_v3",
    embedding_function=openai_ef
)

# Now, when you call collection.add(), Chroma automatically
# sends your text to OpenAI's servers to get the 1536-dim vectors.
def add_to_db(chunks, doc_id):
    collection.upsert(
        ids=[f"{doc_id}_{i}" for i in range(len(chunks))],
        documents=[c["content"] for c in chunks],
        metadatas=[{
            "headers": " > ".join(c["metadata"]["headers"]),
            "is_split": str(c["metadata"].get("is_split", False))
        } for c in chunks]
    )
Why use text-embedding-3-small?
| Feature | text-embedding-3-small | all-MiniLM-L6-v2 |
| --- | --- | --- |
| Dimensions | 1536 | 384 |
| Max Context | 8192 tokens | 256 tokens |
| Cost | $0.02 per 1M tokens | Free (Local) |
| Matryoshka | Yes (Can be shortened) | No |
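To put the cost row in perspective, here is a quick back-of-the-envelope estimate, assuming the $0.02 per 1M tokens price above (the helper name and the example corpus size are illustrative):

```python
def embedding_cost_usd(total_tokens, price_per_million=0.02):
    """Estimate text-embedding-3-small cost from a token count."""
    return total_tokens / 1_000_000 * price_per_million

# e.g. a corpus of 5,000 chunks at ~500 tokens each
tokens = 5_000 * 500
print(f"${embedding_cost_usd(tokens):.4f}")  # $0.0500
```

Embedding an entire documentation set is usually cents; the recurring cost is re-embedding on every update, which upsert with stable IDs helps keep bounded.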
Pro-Tip: The "Matryoshka" Trick
One unique feature of OpenAI’s v3 models is that they are trained to store the most important information in the first few dimensions. If you want to save storage space but still use a high-quality model, you can explicitly set a lower dimension (like 512) and it will still be highly accurate:
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv("OPENAI_API_KEY"),
    model_name="text-embedding-3-small",
    dimensions=512  # Shrinks the vector while keeping it "smart"
)
Would you like me to show you how to set up a quick search function to retrieve these chunks by header name?
What are other high-dimension, large token count embedding functions available that do not require an API key?
In 2026, the open-source landscape for "Long-Context" embeddings has advanced significantly. You no longer need an API key to get OpenAI-level performance (1536+ dimensions and 8k+ token windows).
The models below are the current "heavy hitters" for local RAG (Retrieval-Augmented Generation) because they handle entire document sections without needing aggressive chunking.
1. The High-Performance Leader: Qwen3-Embedding-8B
Released by Alibaba’s Qwen team, this is currently the top-ranked open-weights model on the MTEB (Massive Text Embedding Benchmark).
Dimensions: Variable (up to 7,168)
Token Count: 32,768 tokens (~25,000 words)
Best For: When accuracy is your only priority and you have a GPU (requires ~16GB VRAM for smooth local inference).
Unique Feature: It uses Matryoshka Learning, meaning you can truncate the 7,168-dimensional vector down to 1,024 without losing much accuracy.
2. The Efficiency King: BAAI BGE-M3
"M3" stands for Multi-lingual, Multi-granular, and Multi-functional.
Dimensions: 1,024
Token Count: 8,192 tokens
Best For: Multilingual documents (supports 100+ languages) and hybrid search (combines dense and sparse vectors).
Unique Feature: It performs "All-in-One" retrieval, meaning it can handle short sentences and long documents with the same model.
3. The Balanced Workhorse: Jina-Embeddings-V3
Jina AI’s newest open-weights model is specifically tuned for RAG and technical documentation.
Dimensions: 1,024
Token Count: 8,192 tokens
Best For: Technical documentation and code.
Unique Feature: It includes "Task-Specific Adapters." When you embed, you can tell the model if the text is a query, a passage, or code, which significantly improves retrieval precision.
4. The Lightweight Champion: Nomic-Embed-Text-v1.5
If you need high dimensions but have limited hardware (like a laptop), Nomic is the way to go.
Dimensions: 768 (but highly dense)
Token Count: 8,192 tokens
Best For: High-speed indexing on consumer hardware.
Unique Feature: It is fully auditable and open-source (data + training code), which is rare for models of this quality.
Comparison Table (Local Models)
| Model | Dimensions | Max Tokens | Hardware Requirement | License |
| --- | --- | --- | --- | --- |
| Qwen3-Embedding-8B | 7,168 | 32k | High (GPU) | Apache 2.0 |
| BGE-M3 | 1,024 | 8k | Medium (CPU/GPU) | MIT |
| Jina-v3 | 1,024 | 8k | Medium (CPU/GPU) | Apache 2.0 |
| Nomic-v1.5 | 768 | 8k | Low (CPU) | Apache 2.0 |
How to use them in Python
Since these are hosted on Hugging Face, you can use them in your ChromaDB setup by switching the embedding function to SentenceTransformerEmbeddingFunction:
from chromadb.utils import embedding_functions

# Example using BGE-M3 (8k context, 1024 dimensions)
local_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="BAAI/bge-m3"
)

# Use this 'local_ef' when creating your Chroma collection
collection = client.get_or_create_collection(
    name="long_context_local_db",
    embedding_function=local_ef
)
Would you like me to adjust your chunking script to utilize the larger 8k token window of these models?
Provide the code using Nomic embedding function
To use the Nomic Embed Text v1.5 model locally (without an API key), we utilize the sentence-transformers library. This model is a significant upgrade over the default because it supports a massive 8,192 token context window and provides 768 dimensions.
1. Install Dependencies
You'll need sentence-transformers and einops (which Nomic uses for its efficient architecture).
pip install chromadb sentence-transformers einops
2. Implementation with Metadata & Logical Chunking
We will wrap the Nomic model into a ChromaDB-compatible EmbeddingFunction. Note that Nomic requires a prefix (search_document: ) for the text being stored to achieve maximum accuracy.
import chromadb
from sentence_transformers import SentenceTransformer

# 1. Define a Custom Nomic Embedding Function
class NomicEmbeddingFunction(chromadb.EmbeddingFunction):
    def __init__(self, model_name="nomic-ai/nomic-embed-text-v1.5"):
        # trust_remote_code is required for Nomic's architecture
        self.model = SentenceTransformer(model_name, trust_remote_code=True)

    def __call__(self, input: chromadb.Documents) -> chromadb.Embeddings:
        # Nomic v1.5 expects a "search_document: " prefix for indexing
        prefixed_texts = [f"search_document: {text}" for text in input]
        embeddings = self.model.encode(prefixed_texts, convert_to_tensor=False)
        return embeddings.tolist()

# 2. Initialize Chroma and the Nomic Function
client = chromadb.PersistentClient(path="./nomic_local_db")
nomic_ef = NomicEmbeddingFunction()

collection = client.get_or_create_collection(
    name="nomic_markdown_docs",
    embedding_function=nomic_ef
)

# 3. Add chunks (Reusing your previous chunking logic)
def add_markdown_to_nomic_db(markdown_text, doc_id):
    # Since Nomic handles 8k tokens, we can use much larger max_chunk_size
    chunks = chunk_markdown_with_limit(markdown_text, max_chunk_size=2000)
    collection.upsert(
        ids=[f"{doc_id}_{i}" for i in range(len(chunks))],
        documents=[c["content"] for c in chunks],
        metadatas=[{
            "headers": " > ".join(c["metadata"]["headers"]),
            "is_split": str(c["metadata"].get("is_split", False))
        } for c in chunks]
    )

# 4. Querying (Important: Queries need a different prefix!)
# If we passed query_texts, the collection's embedding function would prepend
# "search_document: " to the query too. To apply the correct "search_query: "
# prefix, we embed the query ourselves and pass the vector directly.
def query_nomic(query_text, n_results=3):
    query_embedding = nomic_ef.model.encode(f"search_query: {query_text}")
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=n_results
    )
    return results

# --- Example Run ---
markdown_data = "# Research Paper\n## Methodology\nDetailed local embedding logic..."
add_markdown_to_nomic_db(markdown_data, "paper_001")

res = query_nomic("How are local embeddings handled?")
print(res['documents'][0][0])
Why Nomic v1.5 is unique
Prefix Requirement: It uses "task-specific" prefixes. Use search_document: for data you are storing and search_query: for questions you are asking.
Matryoshka Embeddings: While it defaults to 768 dimensions, it is designed so you can truncate the vector to 256 or 128 dimensions to save database space while keeping ~90% of the accuracy.
Local Context: Because it handles 8,192 tokens, you can often store entire Markdown sub-sections as a single chunk, which keeps the context much cleaner for the LLM later.
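A minimal sketch of the truncation step itself, assuming you already have the full 768-dim vectors as a NumPy array (recent sentence-transformers versions can also do this for you via a truncate_dim argument; check your version's docs):

```python
import numpy as np

def truncate_embeddings(embeddings, target_dim=256):
    """Keep only the first `target_dim` dimensions, then re-normalize
    so cosine similarity still behaves as expected."""
    truncated = np.asarray(embeddings)[:, :target_dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Demo with random stand-in vectors (in practice: the model.encode(...) output)
full = np.random.rand(5, 768)
small = truncate_embeddings(full, target_dim=256)
print(small.shape)  # (5, 256)
```

Remember that a collection built on 256-dim vectors is, again, dimension-specific: store the truncated vectors in their own collection.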
Pro-Tip: Hardware Acceleration
The first time you run this, it will download about 500MB of model weights. If you have a GPU, sentence-transformers will automatically detect it (CUDA or Mac Metal) and speed up the embedding process significantly.
Would you like me to show you how to implement the "Matryoshka" truncation to reduce your database size?
From <https://gemini.google.com/app/> Google Gemini 3.0 (Fast)