
Wednesday, May 13, 2026

Day 11 - AI Engineering - Testing Chunking Strategies in RAG

AI Engineering — Day by Day
My journey to becoming an AI Engineer


In my previous post, I explored chunking conceptually and realized something important:

Chunking is not just preprocessing — it directly affects retrieval quality.

But this time, I wanted to go beyond theory.

I wanted to actually test:

  • How different chunking strategies behave
  • How retrieval scores change with each strategy
  • Why some approaches fail and what that looks like in real numbers
  • How embeddings work under the hood
  • How to set up HuggingFace models locally

So I built a small experiment pipeline locally — and what I saw completely changed how I think about RAG systems.


The Goal of the Experiment

The idea was simple — build a mini retrieval pipeline from scratch:

Document
   ↓
Chunking Strategy (split text into pieces)
   ↓
Embeddings (convert each chunk into numbers)
   ↓
Similarity Search (compare query numbers to chunk numbers)
   ↓
Retrieved Chunk (best matching piece)

Instead of blindly assuming one strategy is better, I wanted to observe retrieval behavior directly — with actual cosine similarity scores.
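Here's a minimal sketch of that pipeline in code, using two toy chunks instead of the full document but the same model and query as the real experiment (the complete version lives in embedding.py):

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

# Two toy chunks standing in for a chunked document
chunks = [
    "Customers can return products within 7 days with receipt.",
    "Orders are shipped within 3 business days.",
]

chunk_vectors = model.encode(chunks)                    # one 384-dim vector per chunk
query_vector = model.encode(["Can I return a product?"])

scores = cosine_similarity(query_vector, chunk_vectors)[0]
best = scores.argmax()
print(f"Best match (score {scores[best]:.4f}): {chunks[best]}")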


Setting Up the Environment

Step 1 — Install Required Libraries

pip install sentence-transformers scikit-learn numpy

What each library does:

Library                Purpose
sentence-transformers  Loads pre-trained embedding models from HuggingFace and converts text → numerical vectors
scikit-learn           Provides the cosine_similarity function to compare vectors
numpy                  Handles numerical array operations under the hood

Step 2 — Understanding HuggingFace Hub

This was new to me, so let me explain what I learned:

HuggingFace Hub is like a "GitHub for AI models." Thousands of pre-trained models are hosted there for free. When you write:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

Here's what happens step by step:

  1. The library connects to huggingface.co
  2. Downloads the model files (~90 MB for MiniLM) to your local cache folder
  3. Loads the model into memory so you can use it
  4. Next time you run it, it loads from the local cache instantly — no download needed
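You can see the caching yourself by timing two runs. (The cache inspection at the end is optional and assumes a recent huggingface_hub version, which sentence-transformers pulls in automatically.)

import time
from sentence_transformers import SentenceTransformer

start = time.time()
model = SentenceTransformer('all-MiniLM-L6-v2')  # slow on first run (download), fast afterwards (cache)
print(f"Model loaded in {time.time() - start:.1f}s")

# Optional: inspect the local HuggingFace cache
from huggingface_hub import scan_cache_dir
info = scan_cache_dir()
print(f"Cache: {info.size_on_disk / 1e6:.0f} MB across {len(info.repos)} cached repos")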

Step 3 — Handling the HuggingFace Authentication Warning

When I first ran the code, I got this warning:

Warning: You are sending unauthenticated requests to the HF Hub.
Please set a HF_TOKEN to enable higher rate limits and faster downloads.

This happens because HuggingFace lets you download public models without logging in (unauthenticated), but with restrictions — slower download speeds and lower request limits. My first download attempt actually got stuck at 0% because of this rate limiting.

To fix it (optional but recommended):

  1. Go to huggingface.co → Create a free account
  2. Go to Settings → Access Tokens → Create a new token (read access is enough)
  3. Set it as an environment variable (run once in PowerShell/Terminal):
OS                   Command
Windows (PowerShell) [System.Environment]::SetEnvironmentVariable("HF_TOKEN", "hf_your_token_here", "User")
Linux / macOS        Add export HF_TOKEN="hf_your_token_here" to ~/.bashrc or ~/.zshrc
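After restarting the terminal, a quick sanity check confirms Python can see the token (the libraries pick it up from the environment automatically):

import os

if os.environ.get("HF_TOKEN"):
    print("HF_TOKEN is set: downloads will be authenticated")
else:
    print("HF_TOKEN is NOT set: downloads fall back to unauthenticated rate limits")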
    



Understanding SentenceTransformer Models

This was one of my biggest learning moments. There are thousands of models on HuggingFace, but only a handful matter for most use cases.

What the Model Actually Does

model.encode(["Can I return a product?"])
→ Returns: [0.023, -0.156, 0.089, ...]  (384 numbers)

The model converts text into a 384-dimensional number array (called an embedding or vector). Texts with similar meanings produce similar numbers. Then cosine_similarity compares these number arrays — the higher the score (closer to 1.0), the more semantically similar the texts are.
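You can verify the shape yourself (assuming the model from Step 2 is already loaded):

vec = model.encode(["Can I return a product?"])
print(vec.shape)   # (1, 384): one input text, 384 numbers each
print(vec.dtype)   # float32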

What the Model Name Means

all-MiniLM-L6-v2
│   │       │  └─ version 2
│   │       └──── 6 transformer layers (L12 = 12 layers = deeper)
│   └──────────── MiniLM architecture (small and fast)
└──────────────── trained on "all" datasets (general purpose)

Popular Models — When to Use Which

Model                   Size    Dimensions  Speed    Quality         Best For
all-MiniLM-L6-v2        90 MB   384         Fastest  Good            Learning, prototyping, small apps
all-MiniLM-L12-v2       130 MB  384         Fast     Better          When you need a quality bump
all-mpnet-base-v2       420 MB  768         Medium   Best (old gen)  Production with good hardware
BAAI/bge-small-en-v1.5  130 MB  384         Fast     Very Good       Modern alternative to MiniLM
BAAI/bge-base-en-v1.5   440 MB  768         Medium   Excellent       Production-grade retrieval
BAAI/bge-large-en-v1.5  1.3 GB  1024        Slow     Top tier        Maximum quality, needs GPU

Thumb Rules for Choosing a Model

Rule 1 — Match model to your hardware:

CPU only (no GPU)  → Use "small" models (< 150 MB): all-MiniLM-L6-v2, bge-small-en-v1.5
GPU available      → Use "base" or "large" models: bge-base-en-v1.5, bge-large-en-v1.5

Rule 2 — Match model to your stage:

Learning / experimenting  → all-MiniLM-L6-v2
Building a prototype      → all-MiniLM-L6-v2 (still fine)
Production MVP            → bge-base-en-v1.5
Production at scale       → bge-large-en-v1.5 or OpenAI/Cohere API

Rule 3 — Match model to your language:

English only   → any "en" model (bge-base-en-v1.5)
Multilingual   → paraphrase-multilingual-MiniLM-L12-v2 or BAAI/bge-m3

Rule 4 — Match model to your data volume:

< 10K chunks    → Model size doesn't matter much
10K—100K chunks → Small/base model (speed matters now)
> 1M chunks     → Small model OR use an API (OpenAI, Cohere)

Rule 5 — The 80/20 rule:

all-MiniLM-L6-v2 handles 80% of use cases. Only upgrade when you've proven that retrieval quality is the bottleneck — not your chunks, not your prompts, not your data. (This is what Opus 4.6 suggested.)

Decision Flowchart:

Start
  │
  ├─ Just learning? → all-MiniLM-L6-v2 ✅ STOP
  │
  ├─ Building something real?
  │    ├─ English only? → bge-base-en-v1.5
  │    └─ Multiple languages? → bge-m3
  │
  └─ Retrieval quality still bad after fixing chunks?
       ├─ Have GPU? → bge-large-en-v1.5
       └─ No GPU?  → Use OpenAI text-embedding-3-small API

And when you want to switch, it's just one line:

model = SentenceTransformer('BAAI/bge-base-en-v1.5')  # that's it — rest of code stays the same

The Test Document

I used a small support-style document containing four distinct policies:

Refund Policy:
Customers can return products within 7 days with receipt.

Shipping Policy:
Orders are shipped within 3 business days.

Account Setup:
Users must verify email before login.

Cancellation Policy:
Orders can be cancelled before shipping.

This document was intentionally small and had clear topic boundaries — perfect for observing how chunking strategies handle (or break) those boundaries.
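In code, the document is just one string with blank lines between the policies; this is the layout the chunking functions below assume:

document = """Refund Policy:
Customers can return products within 7 days with receipt.

Shipping Policy:
Orders are shipped within 3 business days.

Account Setup:
Users must verify email before login.

Cancellation Policy:
Orders can be cancelled before shipping."""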

The retrieval query I tested with:

Can I return a product?

The expected result: the system should retrieve the Refund Policy chunk, since "return a product" is semantically closest to "return products within 7 days with receipt."


The Chunking Experiments

I compared four approaches and measured actual cosine similarity scores for each:

Experiment    Strategy                             Best Score  Result
Experiment 1  Semantic Chunking (paragraph-based)  0.6204      Best — clean match
Experiment A  Very Large Chunks (3 topics merged)  0.5186      Score diluted by noise
Experiment B  Tiny Chunks (sentence-by-sentence)   0.6159      Found sentence, lost heading
Experiment C  Overlap Chunking                     0.5637      Better than large, not best

Experiment 1 — Semantic Chunking (Score: 0.6204)

This approach split content using paragraph boundaries (\n\n):

semantic_chunks = [
    chunk.strip()
    for chunk in document.split("\n\n")
    if chunk.strip()
]

This produced 4 clean chunks, one per policy:

Chunk 1: "Refund Policy:\nCustomers can return products within 7 days with receipt."
Chunk 2: "Shipping Policy:\nOrders are shipped within 3 business days."
Chunk 3: "Account Setup:\nUsers must verify email before login."
Chunk 4: "Cancellation Policy:\nOrders can be cancelled before shipping."

Each chunk contained:

  • One complete topic (heading + full detail together)
  • One coherent meaning with no noise from other topics
  • Natural topic boundaries preserved

Result: The highest retrieval score of 0.6204. The refund-related query matched the refund chunk accurately.

Full scores:

Chunk 1 (Refund):       0.6204 ← BEST
Chunk 2 (Shipping):     0.1692
Chunk 3 (Account):     -0.0397
Chunk 4 (Cancellation): 0.2943
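These scores came from comparing the query vector against each chunk vector. A sketch of what embedding.py does, reusing model and semantic_chunks from above (variable names are illustrative):

# Embed all 4 semantic chunks and the query with the same model
chunk_vectors = model.encode(semantic_chunks)
query_vector = model.encode(["Can I return a product?"])

# One similarity score per chunk; the highest is the retrieved chunk
scores = cosine_similarity(query_vector, chunk_vectors)[0]
for chunk, score in zip(semantic_chunks, scores):
    print(f"{score:7.4f}  {chunk.splitlines()[0]}")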

Why this worked best:

The embedding model receives a coherent, complete thought. The resulting vector accurately represents one topic, so cosine similarity can precisely match it to the relevant query. No noise from other topics diluting the signal.

Experiment A — Very Large Chunks (Score: 0.5186)

I intentionally merged three policies into one giant chunk:

Chunk 1: "Refund Policy: ... + Shipping Policy: ... + Account Setup: ..."
Chunk 2: "Cancellation Policy: ..."

Result: Score dropped from 0.6204 to 0.5186 — a 16% reduction.

This created a fascinating failure:

  • The refund information existed inside the chunk
  • But the embedding became an "average meaning" of three unrelated topics
  • The vector partially matched the query, but weakly

Think of it this way:

It's like asking someone "What's your refund policy?" and they hand you a 3-page document containing refund, shipping, AND account setup. The answer is technically in there… but buried in noise.

This was one of the biggest insights for me — the embedding model doesn't "pick out" the relevant sentence from a large chunk. It creates one vector that represents the overall meaning of the entire chunk. When that chunk contains multiple unrelated topics, the vector becomes a diluted average that weakly matches everything but strongly matches nothing.

Larger chunks preserve context… but reduce retrieval precision.

Experiment B — Tiny Chunks (Score: 0.6159)

Then I went to the opposite extreme — splitting text line by line:

Chunk 1: "Refund Policy:"
Chunk 2: "Customers can return products within 7 days with receipt."
Chunk 3: "Shipping Policy:"
Chunk 4: "Orders are shipped within 3 business days."
... (8 chunks total)
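The splitter for this one is a single list comprehension (a minimal sketch; the version in embedding.py may differ slightly):

tiny_chunks = [line.strip() for line in document.split("\n") if line.strip()]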

Result: Chunk 2 scored 0.6159 — close to semantic chunking, but with a critical problem.

Full scores tell the story:

Chunk 1 (Refund Policy:):           0.4351
Chunk 2 (Customers can return...):  0.6159 ← BEST
Chunk 3 (Shipping Policy:):         0.2428
Chunk 4 (Orders are shipped...):    0.1821
Chunk 5 (Account Setup:):          -0.0352
Chunk 6 (Users must verify...):     0.0033
Chunk 7 (Cancellation Policy:):     0.2601
Chunk 8 (Orders can be cancelled):  0.3344

The problems:

  • Context fragmentation: "Refund Policy:" (the label) and "Customers can return products..." (the detail) became separate chunks
  • If the system returns only the best chunk, you get the sentence but lose the policy name
  • The heading "Refund Policy:" scored 0.4351 separately — decent but not the best match
  • Meaning was divorced — the label that gives context is split from the content it describes

Small chunks improve precision on individual sentences… but lose surrounding context and relationships between ideas. The heading and its explanation should stay together.

Experiment C — Overlap Chunking (Score: 0.5637)

Finally, I tested overlap chunking — where consecutive chunks share some text:

def overlap_chunk(text, chunk_size=100, overlap=30):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end].strip())
        start += chunk_size - overlap  # advance so consecutive chunks share `overlap` characters
    return [c for c in chunks if c]

Instead of creating completely separate chunks:

Without overlap:
  Chunk 1 → characters 1—100
  Chunk 2 → characters 101—200   (hard boundary)

With overlap (30 chars):
  Chunk 1 → characters 1—100
  Chunk 2 → characters 71—170    (30 chars shared)

Result: Score was 0.5637 — better than large chunks (0.5186) but not as good as semantic chunking (0.6204).

What overlap does well:

  • Preserves continuity across chunk boundaries
  • Reduces the chance of a sentence being cut in half
  • Important context that falls on a boundary appears in both chunks

What overlap doesn't solve:

  • Still uses character-based splitting — doesn't respect topic boundaries
  • Chunks can still mix unrelated topics
  • Creates redundant data (some text is stored and embedded twice)

Overlap is a good improvement over naive fixed chunking, but it's a workaround — not a solution. Semantic chunking avoids the problem entirely by splitting at natural boundaries.

Full Retrieval Score Comparison

Strategy           Best Score  Score vs Semantic  Observation
Semantic Chunking  0.6204      Baseline           Best balance of context + retrieval quality
Tiny Chunks        0.6159      -0.7%              Close precision but fragmented meaning — lost the heading
Overlap Chunks     0.5637      -9.1%              Improved continuity but still mixes topics at boundaries
Large Chunks       0.5186      -16.4%             Semantic dilution — 3 topics averaging out the signal

The Fixed Chunking Failure (Why We Didn't Even Test It for Retrieval)

Before the embedding experiments, I tested naive fixed chunking in chunking.py:

def fixed_chunk(text, chunk_size=80):
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i:i+chunk_size])
    return chunks

The output showed the fundamental problem:

Chunk 1: "Refund Policy:\nCustomers can return products within 7 days with re"
Chunk 2: "ceipt.\n\nShipping Policy:\nOrders are shipped within 3 business days."

Three failures happening at once:

  1. Word destruction: "receipt" is split into "re" and "ceipt" — the word itself is broken
  2. Topic mixing: Chunk 2 contains the tail of Refund + the start of Shipping — neither topic is complete
  3. Incomplete sentences: Chunk 1 ends mid-sentence — the embedding model can't fully understand the refund policy

Context is damaged before embeddings even see it. No amount of good embedding or good LLM can fix chunks that were broken during splitting.

Some questions came to my mind, and I asked AI to answer them:

Q: Why do large chunks reduce retrieval quality?

A: Because embeddings represent the overall meaning of a chunk, not individual sentences within it. When you stuff 3 topics into one chunk, the embedding becomes a diluted average. It partially matches many queries but strongly matches none. In our experiment, the score dropped from 0.6204 to 0.5186 — a 16% loss.

Q: Why do small chunks reduce answer quality?

A: Because they divorce related information. "Refund Policy:" as a heading means nothing without "Customers can return products within 7 days with receipt." When they're separate chunks, the system might return the detail but lose the policy name — or vice versa. The retrieved answer lacks context.

Q: Why does overlap improve retrieval?

A: Overlap ensures that content near chunk boundaries appears in both adjacent chunks. So if an important sentence happens to fall on a boundary, it won't be cut in half — at least one chunk will contain the complete sentence. But overlap is a workaround for character-based splitting, not a fundamental solution.

Q: What's the tradeoff between chunk size and retrieval quality?

A: 
Too large → Score drops (0.5186) — unrelated topics dilute the embedding
Too small → Meaning splits (heading in one chunk, details in another)
Sweet spot → One coherent topic per chunk (0.6204) — maximum precision + full context

This is the Goldilocks problem of chunking:

You need chunks that are large enough to contain complete meaning, but small enough to represent a single focused topic.

Q: How does the embedding model convert text to numbers?
A: 

Your text → SentenceTransformer model → Vector (array of 384 numbers)
                                              ↓
Query text → Same model → Vector → cosine_similarity() → Score (-1 to 1)

Similar meanings produce similar vectors. cosine_similarity measures the cosine of the angle between two vectors — the smaller the angle, the closer the score is to 1.0 and the more semantically similar the texts are. (Scores can go slightly negative for unrelated texts, as the Account chunk did above.)
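Under the hood it's just a dot product divided by the vector lengths. A minimal numpy version, equivalent to sklearn's cosine_similarity for a single pair (reusing the model loaded earlier):

import numpy as np

def cosine(a, b):
    # cos(angle) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = model.encode("Can I return a product?")
v2 = model.encode("Customers can return products within 7 days with receipt.")
print(f"{cosine(v1, v2):.4f}")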

Q: What do "dimensions" mean?

model.encode(["hello"])
  MiniLM-L6  → array of 384 numbers
  mpnet-base → array of 768 numbers
  bge-large  → array of 1024 numbers

More dimensions = richer representation = better at distinguishing subtle meaning differences. But also more memory usage and slower similarity search. For learning and prototyping, 384 dimensions (MiniLM) is more than enough.
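You can query a model's dimensionality directly; get_sentence_embedding_dimension() is part of the sentence-transformers API:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
print(model.get_sentence_embedding_dimension())  # 384
print(model.encode("hello").shape)               # (384,)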


The Biggest Realizations

1. Chunking matters more than model choice

Before this experiment, I thought models were the most important part of RAG. Now I know:

Improving your chunking strategy gives 10x more improvement than upgrading the embedding model.

Proof: Semantic chunking scored 0.6204 vs large chunks at 0.5186 — same model, same query, same document. The only difference was how the text was split.

2. Bad chunks can't be fixed downstream

If "receipt" is split into "re" and "ceipt", no embedding model — no matter how powerful — can reconstruct that meaning. Chunking is the first step in the pipeline, and errors here cascade through everything that follows.

3. The embedding model doesn't "understand" — it averages

An embedding model doesn't read a chunk and pick out the relevant part. It creates one vector that represents the overall meaning. This is why topic isolation matters — each chunk should represent one clear idea.

4. Retrieval quality controls RAG quality

Chunking directly controls retrieval precision and context preservation. And retrieval quality often matters more than model quality.

Quick Reference: Chunking Strategy Cheat Sheet (I asked GPT 5 to create this table)

Strategy              How It Works                                   Pros                                 Cons                                       When to Use
Fixed Size            Split at every N characters                    Simple to implement                  Breaks words, sentences, meaning           Almost never — use as a baseline only
Semantic (paragraph)  Split at topic boundaries (\n\n)               Preserves complete meaning per chunk  Requires well-formatted documents          Default choice for structured text
Sentence-level        Split at every sentence                        High precision on exact matches      Loses context and relationships            When you need granular fact retrieval
Overlap               Fixed size with shared text between chunks     Preserves boundary context           Redundant storage, still character-based   When you must use fixed-size but want better quality
Recursive (advanced)  Try paragraph → sentence → word progressively  Adaptive to document structure       More complex to implement                  Production RAG systems (LangChain uses this)

Code

I've uploaded the complete experiment code here:

https://github.com/VikasKad/ai-engineering-learning

Two files:

  • chunking.py — Demonstrates fixed vs semantic chunking (text splitting only, no embeddings)
  • embedding.py — Complete experiment with all 4 strategies, embeddings, and cosine similarity scores

What's Next

Now that I understand how chunking affects retrieval, the next challenge is:

How do real systems store and search embeddings efficiently?

In our experiment, we compared a query against just 4—8 chunks using cosine_similarity. But real systems have millions of chunks. You can't compare against every single one.

In the next post (Day 12), I'll explore:

  • Vector databases (Pinecone, Chroma, FAISS) — purpose-built for similarity search
  • How they index embeddings for fast retrieval at scale
  • Building an actual retrieval pipeline that could power a real RAG application

💭 Final Thought

This experiment completely changed how I think about RAG systems.

Before:

  • I thought chunking was just text splitting
  • I thought the model was the most important part
  • I didn't know what an embedding actually was

Now:

  • I see chunking as one of the most important retrieval design decisions
  • I know that fixing chunks gives bigger improvements than upgrading models
  • I understand that embeddings are numerical representations of meaning, and their quality depends on what text you feed them
  • I can set up HuggingFace models locally and choose the right one for my use case

This is Day 11 of my AI engineering journey — and this was the first time I truly saw retrieval quality changing live through experimentation.
