Chunking strategies, metadata filters, and hybrid search (BM25 + vectors)
This post focuses on three levers that dramatically improve retrieval quality: how you chunk, how you filter, and how you fuse lexical with semantic search.
Chunking strategies
Your chunker shapes recall and answer quality. Start simple, test, then evolve.
Fixed size with overlap
- Split by tokens/characters with an overlap (e.g., 800 tokens, 15–20% overlap)
- Pros: simple, robust; Cons: can split concepts awkwardly
Structural (semantic) chunking
- Split at natural boundaries: headings, sections, paragraphs, code blocks
- Pros: coherent chunks; Cons: variable sizes; needs a parser
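A minimal sketch of the idea for Markdown‑style text, splitting at heading lines (treating "#" lines as boundaries is an assumption for illustration; a real parser would also honor code fences and lists):
using System;
using System.Collections.Generic;
// Split at Markdown-style headings; each chunk is a heading plus its body.
static IEnumerable<string> ChunkByHeadings(string text)
{
    var current = new List<string>();
    foreach (var line in text.Split('\n'))
    {
        if (line.StartsWith("#") && current.Count > 0)   // heading starts a new chunk
        {
            yield return string.Join("\n", current);
            current.Clear();
        }
        current.Add(line);
    }
    if (current.Count > 0)
        yield return string.Join("\n", current);          // flush the final chunk
}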
Sliding window (query‑time)
- At retrieval, expand hits to include neighboring windows for context continuity
- Pairs well with fixed‑size chunking
- Pros: easy to add without re‑indexing; preserves local context around hits; reduces mid‑sentence cutoffs
- Cons: increases token usage; can drag in irrelevant neighbors; needs a carefully sized window to avoid bloat
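A sketch of query‑time expansion, assuming chunks carry sequential indices within a document (getChunk is a placeholder for your store lookup):
using System;
using System.Collections.Generic;
// Expand a hit to its neighbors. Assumes chunk indices are sequential within
// a document; getChunk returns null for indices past either end (placeholder).
static string ExpandHit(int hitIdx, Func<int, string> getChunk, int window = 1)
{
    var parts = new List<string>();
    for (int i = hitIdx - window; i <= hitIdx + window; i++)
    {
        var chunk = getChunk(i);
        if (chunk != null) parts.Add(chunk);
    }
    return string.Join("\n", parts);     // contiguous context around the hit
}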
Parent–child (multi‑vector) indexing
- Index children (paragraphs) but return the parent (section/page) to the LLM
- Improves coherence and reduces truncation vs. many tiny fragments
- Pros: richer, coherent context; fewer truncations; better groundedness and citation surface
- Cons: more storage/joins and ingestion complexity; must maintain a robust child→parent mapping; risk of pulling an over‑broad parent if the child match is weak
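A sketch of the child→parent hop, assuming children store their parent's id and parents live in a dictionary (the Child/Parent records are illustrative):
using System;
using System.Collections.Generic;
using System.Linq;
record Child(string Id, string ParentId, double[] Emb);
record Parent(string Id, string Text);
// Search runs over children; the LLM receives the deduplicated parents.
static IEnumerable<Parent> ResolveParents(IEnumerable<Child> hits, IReadOnlyDictionary<string, Parent> parents)
{
    return hits
        .Select(h => h.ParentId)
        .Distinct()                      // several children may share one parent
        .Where(parents.ContainsKey)      // tolerate stale child→parent mappings
        .Select(id => parents[id]);
}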
Tiny C# example: fixed chunker with overlap
using System;
using System.Collections.Generic;
// Fixed-size chunker with overlap: each chunk starts (size - overlap) characters
// after the previous one, so consecutive chunks share `overlap` characters.
// Note: iterator methods validate lazily, on first enumeration.
static IEnumerable<(int idx, string text)> Chunk(string text, int size = 1200, int overlap = 200)
{
    if (size <= 0 || overlap < 0 || overlap >= size)
        throw new ArgumentException("size must be positive and overlap must be in [0, size)");
    for (int start = 0, i = 0; start < text.Length; start += (size - overlap), i++)
    {
        var len = Math.Min(size, text.Length - start);
        yield return (i, text.Substring(start, len));
        if (start + len >= text.Length) yield break;     // last chunk reached the end
    }
}
Metadata filters
Metadata lets you narrow search to the right slice of the corpus before ranking.
Common fields:
- type (doc, faq, api, policy), product, version, locale
- author, date range, tags, path/tenant, security label
Filter patterns:
- Pre‑filter at the vector DB level (fast, coarse)
- Post‑filter after initial retrieval (precise, but you've already paid the retrieval cost)
C# sketch: applying filters to candidates
using System;
using System.Collections.Generic;
using System.Linq;
record Doc(string Id, double[] Emb, string Type, string Product, DateTime Dt);
// Null parameters mean "no constraint"; each non-null filter narrows the set.
static IEnumerable<Doc> ApplyFilters(IEnumerable<Doc> docs, string type = null, string product = null, DateTime? from = null)
{
    return docs.Where(d => (type == null || d.Type == type)
                        && (product == null || d.Product == product)
                        && (!from.HasValue || d.Dt >= from.Value));
}
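For example (values are illustrative):
var docs = new List<Doc>
{
    new("a1", Array.Empty<double>(), "faq", "widgets", new DateTime(2024, 5, 1)),  // embeddings elided
    new("a2", Array.Empty<double>(), "doc", "widgets", new DateTime(2023, 1, 15)),
};
var recentFaqs = ApplyFilters(docs, type: "faq", from: new DateTime(2024, 1, 1)); // → a1 only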
Tip: keep filters mirrored in both your storage schema and your retrieval API so they’re cheap to apply.
Hybrid search (BM25 + vectors)
Lexical and semantic signals are complementary. Hybrid search improves both head queries (exact terms) and tail queries (paraphrases).
Fusion strategies:
- Score fusion (weighted sum): normalize BM25 and cosine to [0,1] and combine
- Reciprocal Rank Fusion (RRF): fuse by rank positions; robust default
- Two‑stage: take union of BM25 and vector top‑N, then re‑rank with a cross‑encoder
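Here is a minimal sketch of the first strategy, weighted score fusion with min–max normalization (the 0.5/0.5 weights are placeholders to tune on your eval set); the RRF variant follows below.
using System;
using System.Collections.Generic;
using System.Linq;
// Min–max normalize each score map to [0,1], then combine with weights.
static Dictionary<string, double> WeightedFuse(Dictionary<string, double> bm25, Dictionary<string, double> cosine, double wLex = 0.5, double wVec = 0.5)
{
    Dictionary<string, double> Norm(Dictionary<string, double> s)
    {
        if (s.Count == 0) return s;
        double min = s.Values.Min(), max = s.Values.Max(), range = max - min;
        return s.ToDictionary(kv => kv.Key, kv => range == 0 ? 1.0 : (kv.Value - min) / range);
    }
    var a = Norm(bm25);
    var b = Norm(cosine);
    return a.Keys.Union(b.Keys).ToDictionary(
        id => id,
        id => wLex * a.GetValueOrDefault(id) + wVec * b.GetValueOrDefault(id)); // missing → 0
}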
Tiny C# example: RRF score fusion
using System;
using System.Collections.Generic;
using System.Linq;
// RRF score for an item at 1-based rank r: 1 / (c + r), commonly c = 60.
static List<KeyValuePair<string, double>> RrfFuse(IList<string> bm25, IList<string> vec, int c = 60)
{
    var scores = new Dictionary<string, double>();
    void Add(IList<string> ranking)
    {
        for (int i = 0; i < ranking.Count; i++)
        {
            var id = ranking[i];
            var s = 1.0 / (c + i + 1);                     // i is 0-based, rank is i + 1
            scores[id] = scores.GetValueOrDefault(id) + s; // sum contributions across lists
        }
    }
    Add(bm25);
    Add(vec);
    // Return an ordered list; Dictionary enumeration order is not guaranteed.
    return scores.OrderByDescending(kv => kv.Value).ToList();
}
var bm25 = new List<string> { "doc3", "doc8", "doc1" };
var vec = new List<string> { "doc1", "doc2", "doc8" };
var fused = RrfFuse(bm25, vec);
foreach (var kv in fused)        // doc1 and doc8 rank highest: they appear in both lists
    Console.WriteLine($"{kv.Key}: {kv.Value:F4}");
Practical knobs and tips
- Normalize scores before weighted fusion; otherwise ranks can be unstable
- Use BM25 filters (must‑include terms) for exact IDs, codes, numbers
- Increase vector top‑N when you re‑rank; small N hides good candidates
- Measure with your eval set: recall@K and latency per stage
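A minimal recall@K helper, assuming you have a set of judged‑relevant ids per query:
using System;
using System.Collections.Generic;
using System.Linq;
// Fraction of the relevant ids that appear in the top-K retrieved list.
static double RecallAtK(IList<string> retrieved, ISet<string> relevant, int k)
{
    if (relevant.Count == 0) return 0;  // convention: no judged docs → 0
    return retrieved.Take(k).Count(relevant.Contains) / (double)relevant.Count;
}
// e.g. RecallAtK(new[] { "doc1", "doc8", "doc3" }, new HashSet<string> { "doc1", "doc2" }, k: 3) → 0.5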
Putting it together: a simple recipe
- Start with fixed chunks (overlap 15–20%) and a few metadata fields (type, product, date)
- Add hybrid retrieval (BM25 ∪ vectors) and fuse with RRF
- Re‑rank the union top‑50 → top‑5
- Upgrade chunking to structural/parent–child if you see truncation or incoherence
- Tune k, filters, and fusion weights using recall/latency
Key takeaways
- Chunking coherence and overlap matter as much as the embedding model
- Metadata filters reduce noise and cost by cutting the search space
- Hybrid (BM25 + vectors) is a strong, robust default beyond toy datasets
- Evaluate and tune with your data, not benchmarks alone