Chunking strategies, metadata filters, and hybrid search (BM25 + vectors)
This post focuses on three levers that dramatically improve retrieval quality: how you chunk, how you filter, and how you fuse lexical with semantic search.
Chunking strategies
Your chunker shapes recall and answer quality. Start simple, test, then evolve.
Fixed size with overlap
- Split by tokens/characters with an overlap (e.g., 800 tokens, 15–20% overlap)
- Pros: simple, robust; Cons: can split concepts awkwardly
Structural (semantic) chunking
- Split at natural boundaries: headings, sections, paragraphs, code blocks
- Pros: coherent chunks; Cons: variable sizes; needs a parser
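A minimal sketch of the idea for Markdown‑style text, splitting at heading lines (treating "#" lines as boundaries is an assumption for illustration; a real parser would also honor code fences and lists):
using System;
using System.Collections.Generic;
// Split at Markdown-style headings; each chunk is a heading plus its body.
static IEnumerable<string> ChunkByHeadings(string text)
{
    var current = new List<string>();
    foreach (var line in text.Split('\n'))
    {
        if (line.StartsWith("#") && current.Count > 0)   // heading starts a new chunk
        {
            yield return string.Join("\n", current);
            current.Clear();
        }
        current.Add(line);
    }
    if (current.Count > 0)
        yield return string.Join("\n", current);          // flush the final chunk
}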
Sliding window (query‑time)
- At retrieval, expand hits to include neighboring windows for context continuity
- Pairs well with fixed‑size chunking
- Pros: easy to add without re‑indexing; preserves local context around hits; reduces mid‑sentence cutoffs
- Cons: increases token usage; can drag in irrelevant neighbors; needs a carefully sized window to avoid bloat
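A sketch of query‑time expansion, assuming chunks carry sequential indices within a document (getChunk is a placeholder for your store lookup):
using System;
using System.Collections.Generic;
// Expand a hit to its neighbors. Assumes chunk indices are sequential within
// a document; getChunk returns null for indices past either end (placeholder).
static string ExpandHit(int hitIdx, Func<int, string> getChunk, int window = 1)
{
    var parts = new List<string>();
    for (int i = hitIdx - window; i <= hitIdx + window; i++)
    {
        var chunk = getChunk(i);
        if (chunk != null) parts.Add(chunk);
    }
    return string.Join("\n", parts);     // contiguous context around the hit
}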
Parent–child (multi‑vector) indexing
- Index children (paragraphs) but return the parent (section/page) to the LLM
- Improves coherence and reduces truncation vs. many tiny fragments
- Pros: richer, coherent context; fewer truncations; better groundedness and citation surface
- Cons: more storage/joins and ingestion complexity; must maintain a robust child→parent mapping; risk of pulling an over‑broad parent if the child match is weak
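A sketch of the child→parent hop, assuming children store their parent's id and parents live in a dictionary (the Child/Parent records are illustrative):
using System;
using System.Collections.Generic;
using System.Linq;
record Child(string Id, string ParentId, double[] Emb);
record Parent(string Id, string Text);
// Search runs over children; the LLM receives the deduplicated parents.
static IEnumerable<Parent> ResolveParents(IEnumerable<Child> hits, IReadOnlyDictionary<string, Parent> parents)
{
    return hits
        .Select(h => h.ParentId)
        .Distinct()                      // several children may share one parent
        .Where(parents.ContainsKey)      // tolerate stale child→parent mappings
        .Select(id => parents[id]);
}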
Tiny C# example: fixed chunker with overlap
using System;
using System.Collections.Generic;
// Fixed-size chunker with overlap: each chunk starts (size - overlap) characters
// after the previous one, so consecutive chunks share `overlap` characters.
// Note: iterator methods validate lazily, on first enumeration.
static IEnumerable<(int idx, string text)> Chunk(string text, int size = 1200, int overlap = 200)
{
    if (size <= 0 || overlap < 0 || overlap >= size)
        throw new ArgumentException("size must be positive and overlap must be in [0, size)");
    for (int start = 0, i = 0; start < text.Length; start += (size - overlap), i++)
    {
        var len = Math.Min(size, text.Length - start);
        yield return (i, text.Substring(start, len));
        if (start + len >= text.Length) yield break;     // last chunk reached the end
    }
}
Metadata filters
Metadata lets you narrow search to the right slice of the corpus before ranking.
Common fields:
- type (doc, faq, api, policy), product, version, locale
- author, date range, tags, path/tenant, security label
Filter patterns:
- Pre‑filter at the vector DB level (fast, coarse)
- Post‑filter after initial retrieval (precise, but you've already paid the retrieval cost)
C# sketch: applying filters to candidates
using System;
using System.Collections.Generic;
using System.Linq;
record Doc(string Id, double[] Emb, string Type, string Product, DateTime Dt);
// Null parameters mean "no constraint"; each non-null filter narrows the set.
static IEnumerable<Doc> ApplyFilters(IEnumerable<Doc> docs, string type = null, string product = null, DateTime? from = null)
{
    return docs.Where(d => (type == null || d.Type == type)
                        && (product == null || d.Product == product)
                        && (!from.HasValue || d.Dt >= from.Value));
}
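For example (values are illustrative):
var docs = new List<Doc>
{
    new("a1", Array.Empty<double>(), "faq", "widgets", new DateTime(2024, 5, 1)),  // embeddings elided
    new("a2", Array.Empty<double>(), "doc", "widgets", new DateTime(2023, 1, 15)),
};
var recentFaqs = ApplyFilters(docs, type: "faq", from: new DateTime(2024, 1, 1)); // → a1 only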
Tip: keep filters mirrored in both your storage schema and your retrieval API so they’re cheap to apply.
Hybrid search (BM25 + vectors)
Lexical and semantic signals are complementary. Hybrid search improves both head queries (exact terms) and tail queries (paraphrases).
Fusion strategies:
- Score fusion (weighted sum): normalize BM25 and cosine to [0,1] and combine
- Reciprocal Rank Fusion (RRF): fuse by rank positions; robust default
- Two‑stage: take union of BM25 and vector top‑N, then re‑rank with a cross‑encoder
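Here is a minimal sketch of the first strategy, weighted score fusion with min–max normalization (the 0.5/0.5 weights are placeholders to tune on your eval set); the RRF variant follows below.
using System;
using System.Collections.Generic;
using System.Linq;
// Min–max normalize each score map to [0,1], then combine with weights.
static Dictionary<string, double> WeightedFuse(Dictionary<string, double> bm25, Dictionary<string, double> cosine, double wLex = 0.5, double wVec = 0.5)
{
    Dictionary<string, double> Norm(Dictionary<string, double> s)
    {
        if (s.Count == 0) return s;
        double min = s.Values.Min(), max = s.Values.Max(), range = max - min;
        return s.ToDictionary(kv => kv.Key, kv => range == 0 ? 1.0 : (kv.Value - min) / range);
    }
    var a = Norm(bm25);
    var b = Norm(cosine);
    return a.Keys.Union(b.Keys).ToDictionary(
        id => id,
        id => wLex * a.GetValueOrDefault(id) + wVec * b.GetValueOrDefault(id)); // missing → 0
}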
Tiny C# example: RRF score fusion
using System;
using System.Collections.Generic;
using System.Linq;
// RRF score for an item at 1-based rank r: 1 / (c + r), commonly c = 60.
static List<KeyValuePair<string, double>> RrfFuse(IList<string> bm25, IList<string> vec, int c = 60)
{
    var scores = new Dictionary<string, double>();
    void Add(IList<string> ranking)
    {
        for (int i = 0; i < ranking.Count; i++)
        {
            var id = ranking[i];
            var s = 1.0 / (c + i + 1);                     // i is 0-based, rank is i + 1
            scores[id] = scores.GetValueOrDefault(id) + s; // sum contributions across lists
        }
    }
    Add(bm25);
    Add(vec);
    // Return an ordered list; Dictionary enumeration order is not guaranteed.
    return scores.OrderByDescending(kv => kv.Value).ToList();
}
var bm25 = new List<string> { "doc3", "doc8", "doc1" };
var vec = new List<string> { "doc1", "doc2", "doc8" };
var fused = RrfFuse(bm25, vec);
foreach (var kv in fused)        // doc1 and doc8 rank highest: they appear in both lists
    Console.WriteLine($"{kv.Key}: {kv.Value:F4}");
Practical knobs and tips
- Normalize scores before weighted fusion; otherwise ranks can be unstable
- Use BM25 filters (must‑include terms) for exact IDs, codes, numbers
- Increase vector top‑N when you re‑rank; small N hides good candidates
- Measure with your eval set: recall@K and latency per stage
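A minimal recall@K helper, assuming you have a set of judged‑relevant ids per query:
using System;
using System.Collections.Generic;
using System.Linq;
// Fraction of the relevant ids that appear in the top-K retrieved list.
static double RecallAtK(IList<string> retrieved, ISet<string> relevant, int k)
{
    if (relevant.Count == 0) return 0;  // convention: no judged docs → 0
    return retrieved.Take(k).Count(relevant.Contains) / (double)relevant.Count;
}
// e.g. RecallAtK(new[] { "doc1", "doc8", "doc3" }, new HashSet<string> { "doc1", "doc2" }, k: 3) → 0.5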
Putting it together: a simple recipe
- Start with fixed chunks (overlap 15–20%) and a few metadata fields (type, product, date)
- Add hybrid retrieval (BM25 ∪ vectors) and fuse with RRF
- Re‑rank the union top‑50 → top‑5
- Upgrade chunking to structural/parent–child if you see truncation or incoherence
- Tune k, filters, and fusion weights using recall/latency
Key takeaways
- Chunking coherence and overlap matter as much as the embedding model
- Metadata filters reduce noise and cost by cutting the search space
- Hybrid (BM25 + vectors) is a strong, robust default beyond toy datasets
- Evaluate and tune with your data, not benchmarks alone