Retrieval‑augmented generation (RAG) patterns and evaluation
This post is a concise, practical guide to RAG: the patterns that actually ship and how to evaluate them. We’ll keep it grounded and hands‑on, building on the vector search basics from the previous post.
What is RAG (really)?
“Retrieve then generate.” You fetch factual context (chunks) relevant to the user’s question and pass it to the LLM to synthesize an answer. Retrieval handles recall; the LLM handles reasoning and writing.
High‑level flow:
- Embed the query → find top‑k similar chunks (hybrid/vector search)
- Optionally re‑rank candidates with a stronger model
- Assemble a compact context window (with citations)
- Prompt the LLM to answer grounded in that context
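In code, that flow is a short function. The sketch below assumes hypothetical search, rerank, and generate delegates standing in for your index, re-ranker, and LLM client; the shape is the point, not the specific APIs.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// End-to-end flow: broad recall -> optional re-rank -> numbered context -> grounded prompt.
static async Task<string> AnswerAsync(
    string question,
    Func<string, int, Task<IReadOnlyList<(string Id, string Text)>>> search,   // hybrid/vector retrieval (hypothetical)
    Func<string, IReadOnlyList<(string Id, string Text)>, int, Task<IReadOnlyList<(string Id, string Text)>>> rerank,
    Func<string, Task<string>> generate,                                       // LLM call (hypothetical)
    int finalK = 5)
{
    var candidates = await search(question, 50);               // cheap, broad recall
    var top = await rerank(question, candidates, finalK);      // precision pass
    var context = string.Join("\n\n",
        top.Select((c, i) => $"[{i + 1}] ({c.Id}) {c.Text}")); // numbered so the answer can cite [n]
    var prompt = "Answer using only the context below and cite sources like [1].\n\n" +
                 $"Context:\n{context}\n\nQuestion: {question}";
    return await generate(prompt);
}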
Core building blocks
- Query understanding: normalization, expansion, or rewriting (e.g., HyDE)
- Retrieval: vector or hybrid (BM25 + vectors), metadata filters, top‑k tuning
- Post‑retrieval: de‑dupe, windowing, cross‑encoder re‑ranking
- Context construction: ordering, trimming, citations/attributions
- Generation: system prompt, answer format, citation style, safety rails
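As a taste of the context-construction step, here is a rough sketch that packs ranked chunks into a numbered block under a budget. The BuildContext helper is hypothetical, and the character budget is a crude stand-in for real token counting.

using System;
using System.Collections.Generic;
using System.Text;

// Pack ranked chunks into a numbered context block, stopping once the budget is hit.
static string BuildContext(IReadOnlyList<(string Id, string Text)> rankedChunks, int charBudget, out List<string> citedIds)
{
    var sb = new StringBuilder();
    citedIds = new List<string>();
    foreach (var (id, text) in rankedChunks)              // assumes best-first order
    {
        var entry = $"[{citedIds.Count + 1}] ({id}) {text}\n\n";
        if (sb.Length + entry.Length > charBudget) break;
        sb.Append(entry);
        citedIds.Add(id);
    }
    return sb.ToString().TrimEnd();
}

var chunks = new List<(string Id, string Text)>
{
    ("doc42", "Battery life tests show 12–14 hours under mixed workloads."),
    ("doc7",  "The chassis weighs 1.2 kg."),
};
Console.WriteLine(BuildContext(chunks, charBudget: 2000, out var cited));
Console.WriteLine($"Cited: {string.Join(", ", cited)}");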
RAG patterns you’ll use
1) Basic single‑hop RAG
- Vector search top‑k, optionally hybrid with BM25
- Concatenate chunks, add a compact instruction prompt, return an answer with citations
When to use: small/medium corpora, straightforward questions.
2) Hybrid search + re‑rank
- Combine lexical (BM25) and semantic search, then re‑rank with a cross‑encoder or a small LLM
- Improves precision on head and tail queries; great default beyond toy scale
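One common way to fuse the lexical and semantic lists before re-ranking is reciprocal rank fusion (RRF). The sketch below is one way to write it, not any particular library's API.

using System;
using System.Collections.Generic;
using System.Linq;

// Reciprocal rank fusion: each list votes 1 / (k + rank) for a doc; k = 60 is a common default.
static List<string> FuseRrf(IList<string> lexicalRanked, IList<string> vectorRanked, int k = 60)
{
    var scores = new Dictionary<string, double>();
    foreach (var list in new[] { lexicalRanked, vectorRanked })
        for (int rank = 0; rank < list.Count; rank++)
            scores[list[rank]] = scores.GetValueOrDefault(list[rank]) + 1.0 / (k + rank + 1);
    return scores.OrderByDescending(kv => kv.Value).Select(kv => kv.Key).ToList();
}

var fused = FuseRrf(
    new List<string> { "doc7", "doc42", "doc3" },    // BM25 ranking
    new List<string> { "doc42", "doc99", "doc7" });  // vector ranking
Console.WriteLine(string.Join(", ", fused));         // doc42, doc7, ... (docs ranked well in both lists rise)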
3) Parent–child (multi‑vector) retrieval
- Embed “children” (paragraphs) but return the “parent” (section/page) to give the LLM richer context
- Reduces truncation and improves coherence vs. many tiny fragments
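The retrieval-side change is small: search over child embeddings, then map hits back to distinct parents. A sketch, assuming each hit carries a ParentId (the ChildHit record here is hypothetical):

using System;
using System.Collections.Generic;
using System.Linq;

// Search over child embeddings, then hand the LLM each distinct parent (best child score first).
static List<string> ParentsFromChildHits(IEnumerable<ChildHit> childHits, int maxParents) =>
    childHits.GroupBy(h => h.ParentId)
             .OrderByDescending(g => g.Max(h => h.Score))
             .Take(maxParents)
             .Select(g => g.Key)
             .ToList();

var hits = new[]
{
    new ChildHit("page1-para2", "page1", 0.91),
    new ChildHit("page3-para1", "page3", 0.88),
    new ChildHit("page1-para5", "page1", 0.80),   // same parent as the top hit: collapsed
};
Console.WriteLine(string.Join(", ", ParentsFromChildHits(hits, maxParents: 2)));  // page1, page3

// Shape of a child hit as it might come back from vector search (hypothetical).
record ChildHit(string ChildId, string ParentId, double Score);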
4) Multi‑query / HyDE
- Generate alternative queries or a hypothetical answer (HyDE), embed, and retrieve for each; merge and de‑dupe
- Helps when the user query is short, ambiguous, or domain‑jargony
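Mechanically, multi-query retrieval is a merge problem: run each variant (rewrites, or the HyDE pseudo-answer) through the same retriever, then union and de-dupe by doc ID. A sketch, with the retriever stubbed as a delegate:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Run each query variant through the same retriever, union the hits,
// and keep the first occurrence of each doc ID.
static async Task<List<string>> RetrieveMergedAsync(
    IEnumerable<string> queryVariants,
    Func<string, Task<IList<string>>> retrieve)   // your vector/hybrid search call (hypothetical)
{
    var merged = new List<string>();
    var seen = new HashSet<string>();
    foreach (var q in queryVariants)
        foreach (var id in await retrieve(q))
            if (seen.Add(id)) merged.Add(id);
    return merged;
}

// Fake retriever for illustration only.
var results = await RetrieveMergedAsync(
    new[] { "ultrabook battery life", "how long does the laptop battery last?" },
    q => Task.FromResult<IList<string>>(new List<string> { "doc42", "doc7" }));
Console.WriteLine(string.Join(", ", results));    // doc42, doc7 (duplicates across runs removed)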
5) Multi‑stage cascades
- Cheap, broad recall (vector/hybrid) → narrow with filters → re‑rank → final K
- Tune each stage for the best recall/latency trade‑off
6) Agentic/tool‑augmented RAG
- If retrieval confidence is low, call tools (search, APIs, DB queries) then regenerate
- Useful for fresh data and when the corpus can’t answer by itself
Evaluation: what to measure and how
Evaluate retrieval and generation separately, then together. Keep a small, realistic eval set; automate scoring where possible and spot‑check with humans.
Retrieval metrics
- Recall@K: did the gold passage(s) appear in the top‑K?
- Precision@K: how many of the top‑K were actually relevant?
- nDCG / MRR: position‑sensitive ranking quality
- Coverage: how often at least one supporting chunk is present
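Reciprocal rank is as cheap to script as recall; a per-query sketch, using the same gold/retrieved shapes as the recall snippet further down (average it across queries to get MRR):

using System;
using System.Collections.Generic;

// Reciprocal rank of the first relevant hit; 0 if nothing relevant was retrieved.
static double ReciprocalRank(HashSet<string> gold, IList<string> retrieved)
{
    for (int i = 0; i < retrieved.Count; i++)
        if (gold.Contains(retrieved[i])) return 1.0 / (i + 1);
    return 0.0;
}

var rr = ReciprocalRank(new HashSet<string> { "doc42" }, new List<string> { "doc10", "doc42", "doc3" });
Console.WriteLine($"RR = {rr:F2}");   // 0.50 (first relevant doc at rank 2)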
Answer metrics
- Groundedness (faithfulness): is each claim supported by the provided context?
- Correctness: does the answer match the reference (exact‑match or model‑graded)?
- Citation accuracy: do citations point to supporting chunks?
- Helpfulness/format: is the answer concise, direct, and in the requested format?
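Citation accuracy has a cheap structural proxy: check that every [n] marker in the answer points at a chunk you actually provided. It says nothing about whether the chunk supports the claim; that still needs model grading. A sketch, assuming numbered [n] citations:

using System;
using System.Linq;
using System.Text.RegularExpressions;

// Do all [n] citations in the answer point at one of the provided, numbered chunks?
static bool CitationsResolve(string answer, int providedChunkCount) =>
    Regex.Matches(answer, @"\[(\d+)\]")
         .Cast<Match>()
         .Select(m => int.Parse(m.Groups[1].Value))
         .All(n => n >= 1 && n <= providedChunkCount);

Console.WriteLine(CitationsResolve("Battery life is 12–14 hours [2].", providedChunkCount: 3));  // True
Console.WriteLine(CitationsResolve("It weighs 1.2 kg [7].", providedChunkCount: 3));             // False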
Efficiency metrics
- Latency breakdown: embed, retrieve, re‑rank, generate
- Token/cost accounting: prompt tokens, output tokens, cache hit rate
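For the latency breakdown, a Stopwatch per stage is usually enough. A sketch, with the stages stubbed by delays for illustration:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

// Record per-stage wall-clock time so regressions show up per stage, not just end to end.
var timings = new Dictionary<string, TimeSpan>();

async Task<T> Timed<T>(string stage, Func<Task<T>> step)
{
    var sw = Stopwatch.StartNew();
    var result = await step();
    timings[stage] = sw.Elapsed;
    return result;
}

// Stubbed stages; swap in your real embed/retrieve/re-rank/generate calls.
await Timed("embed",    async () => { await Task.Delay(10); return new float[384]; });
await Timed("retrieve", async () => { await Task.Delay(25); return new[] { "doc42", "doc7" }; });
await Timed("generate", async () => { await Task.Delay(40); return "12–14 hours [1]"; });

foreach (var (stage, elapsed) in timings)
    Console.WriteLine($"{stage}: {elapsed.TotalMilliseconds:F0} ms");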
Tiny C# snippets for eval
The idea is to keep scoring simple and fast so you can iterate daily.
Recall@K (per question)
using System;
using System.Collections.Generic;
using System.Linq;
// gold: the set of relevant doc IDs for this query
// retrieved: the ranked list of doc IDs returned by your system
static double RecallAtK(HashSet<string> gold, IList<string> retrieved, int k)
{
    if (gold.Count == 0) return 1.0; // no gold passages for this query: treat as trivially covered
    var topK = retrieved.Take(Math.Min(k, retrieved.Count)).ToHashSet();
    var hits = gold.Intersect(topK).Count();
    return (double)hits / gold.Count;
}
// Example
var gold = new HashSet<string> { "doc42", "doc7" };
var retrieved = new List<string> { "doc10", "doc42", "doc3", "doc99" };
Console.WriteLine($"Recall@3 = {RecallAtK(gold, retrieved, 3):F2}");
Simple groundedness check (proxy)
For quick iteration, a rough lexical proxy can help catch clear hallucinations. It’s not perfect—use model‑graded faithfulness for serious evals.
using System;
using System.Linq;
// Very naive: require that each quoted span in the answer appears in the context
static bool QuotesGrounded(string answer, string context)
{
    var quotes = answer
        .Split('"')
        .Where((_, i) => i % 2 == 1)   // odd-indexed segments are the quoted spans
        .Select(q => q.Trim())
        .Where(q => q.Length > 0);
    return quotes.All(q => context.IndexOf(q, StringComparison.OrdinalIgnoreCase) >= 0);
}
var ctx = "Battery life tests show 12–14 hours under mixed workloads.";
var ans = "According to tests, \"12–14 hours\" is typical for travel ultrabooks.";
Console.WriteLine($"Quotes grounded: {QuotesGrounded(ans, ctx)}");
Grading with an LLM (concept)
- Provide: question, context, answer, rubric
- Ask the model to return a JSON score for correctness and faithfulness with short rationales
- Keep prompts short and deterministic; sample multiple times if needed and take the median
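A sketch of what that grading call might look like. The completeAsync delegate stands in for whatever LLM client you use, and the 0-5 rubric and JSON shape are just one workable choice:

using System;
using System.Text.Json;
using System.Threading.Tasks;

// Ask the grader for strict JSON, then deserialize; completeAsync is your LLM client call (hypothetical).
static string GradingPrompt(string question, string context, string answer) =>
    "You are grading a RAG answer. Return ONLY JSON like " +
    "{\"correctness\": 0-5, \"faithfulness\": 0-5, \"rationale\": \"...\"}.\n" +
    $"Question: {question}\nContext: {context}\nAnswer: {answer}";

static async Task<Grade?> GradeAsync(string question, string context, string answer,
                                     Func<string, Task<string>> completeAsync)
{
    var json = await completeAsync(GradingPrompt(question, context, answer));
    return JsonSerializer.Deserialize<Grade>(json,
        new JsonSerializerOptions { PropertyNameCaseInsensitive = true });
}

// Expected shape of the grader's JSON reply.
record Grade(int Correctness, int Faithfulness, string Rationale);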
Tuning playbook (short)
- Establish a baseline: basic RAG, k=5, no re‑rank. Measure recall@K and latency.
- Add hybrid search and increase k until recall plateaus; cap by latency.
- Try cross‑encoder re‑ranking or a small LLM re‑rank on the top 50 → top 5.
- Upgrade chunking (parent–child) and add metadata filters.
- Introduce HyDE or multi‑query for ambiguous/short queries.
- Lock an eval set and track metrics over time. Regressions fail the build.
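That last step can literally be a small console check wired into CI; a sketch with placeholder metric names and numbers:

using System;
using System.Collections.Generic;

// CI gate: exit non-zero if any tracked metric falls below its locked baseline (values are placeholders).
var baselines = new Dictionary<string, double> { ["recall@5"] = 0.85, ["groundedness"] = 0.90 };
var current   = new Dictionary<string, double> { ["recall@5"] = 0.87, ["groundedness"] = 0.88 };

var failed = false;
foreach (var (metric, floor) in baselines)
{
    if (current[metric] >= floor) continue;
    Console.Error.WriteLine($"REGRESSION: {metric} = {current[metric]:F2} (baseline {floor:F2})");
    failed = true;
}
Environment.Exit(failed ? 1 : 0);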
Key takeaways
- RAG = retrieval for recall + LLM for reasoning; evaluate both sides.
- Hybrid search and re‑ranking are strong defaults for production.
- Parent–child chunking and query rewriting (HyDE) boost robustness.
- Measure recall@K, groundedness, and latency, then tune your ANN index parameters (efSearch/nprobe) and k.