Retrieval‑augmented generation (RAG) patterns and evaluation
This post is a concise, practical guide to RAG: the patterns that actually ship and how to evaluate them. We’ll keep it grounded and hands‑on, building on the vector search basics from the previous post.
What is RAG (really)?
“Retrieve then generate.” You fetch factual context (chunks) relevant to the user’s question and pass it to the LLM to synthesize an answer. Retrieval handles recall; the LLM handles reasoning and writing.
High‑level flow:
- Embed the query → find top‑k similar chunks (hybrid/vector search)
- Optionally re‑rank candidates with a stronger model
- Assemble a compact context window (with citations)
- Prompt the LLM to answer grounded in that context
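In code, that flow is a short function. The sketch below assumes hypothetical search, rerank, and generate delegates standing in for your index, re-ranker, and LLM client; the shape is the point, not the specific APIs.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// End-to-end flow: broad recall -> optional re-rank -> numbered context -> grounded prompt.
static async Task<string> AnswerAsync(
    string question,
    Func<string, int, Task<IReadOnlyList<(string Id, string Text)>>> search,   // hybrid/vector retrieval (hypothetical)
    Func<string, IReadOnlyList<(string Id, string Text)>, int, Task<IReadOnlyList<(string Id, string Text)>>> rerank,
    Func<string, Task<string>> generate,                                       // LLM call (hypothetical)
    int finalK = 5)
{
    var candidates = await search(question, 50);               // cheap, broad recall
    var top = await rerank(question, candidates, finalK);      // precision pass
    var context = string.Join("\n\n",
        top.Select((c, i) => $"[{i + 1}] ({c.Id}) {c.Text}")); // numbered so the answer can cite [n]
    var prompt = "Answer using only the context below and cite sources like [1].\n\n" +
                 $"Context:\n{context}\n\nQuestion: {question}";
    return await generate(prompt);
}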
Core building blocks
- Query understanding: normalization, expansion, or rewriting (e.g., HyDE)
- Retrieval: vector or hybrid (BM25 + vectors), metadata filters, top‑k tuning
- Post‑retrieval: de‑dupe, windowing, cross‑encoder re‑ranking
- Context construction: ordering, trimming, citations/attributions
- Generation: system prompt, answer format, citation style, safety rails
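As a taste of the context-construction step, here is a rough sketch that packs ranked chunks into a numbered block under a budget. The BuildContext helper is hypothetical, and the character budget is a crude stand-in for real token counting.

using System;
using System.Collections.Generic;
using System.Text;

// Pack ranked chunks into a numbered context block, stopping once the budget is hit.
static string BuildContext(IReadOnlyList<(string Id, string Text)> rankedChunks, int charBudget, out List<string> citedIds)
{
    var sb = new StringBuilder();
    citedIds = new List<string>();
    foreach (var (id, text) in rankedChunks)              // assumes best-first order
    {
        var entry = $"[{citedIds.Count + 1}] ({id}) {text}\n\n";
        if (sb.Length + entry.Length > charBudget) break;
        sb.Append(entry);
        citedIds.Add(id);
    }
    return sb.ToString().TrimEnd();
}

var chunks = new List<(string Id, string Text)>
{
    ("doc42", "Battery life tests show 12–14 hours under mixed workloads."),
    ("doc7",  "The chassis weighs 1.2 kg."),
};
Console.WriteLine(BuildContext(chunks, charBudget: 2000, out var cited));
Console.WriteLine($"Cited: {string.Join(", ", cited)}");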
RAG patterns you’ll use
1) Basic single‑hop RAG
- Vector search top‑k, optionally hybrid with BM25
- Concatenate chunks, add a compact instruction prompt, return an answer with citations
When to use: small/medium corpora, straightforward questions.
2) Hybrid search + re‑rank
- Combine lexical (BM25) and semantic search, then re‑rank with a cross‑encoder or a small LLM
- Improves precision on head and tail queries; great default beyond toy scale
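One common way to fuse the lexical and semantic lists before re-ranking is reciprocal rank fusion (RRF). The sketch below is one way to write it, not any particular library's API.

using System;
using System.Collections.Generic;
using System.Linq;

// Reciprocal rank fusion: each list votes 1 / (k + rank) for a doc; k = 60 is a common default.
static List<string> FuseRrf(IList<string> lexicalRanked, IList<string> vectorRanked, int k = 60)
{
    var scores = new Dictionary<string, double>();
    foreach (var list in new[] { lexicalRanked, vectorRanked })
        for (int rank = 0; rank < list.Count; rank++)
            scores[list[rank]] = scores.GetValueOrDefault(list[rank]) + 1.0 / (k + rank + 1);
    return scores.OrderByDescending(kv => kv.Value).Select(kv => kv.Key).ToList();
}

var fused = FuseRrf(
    new List<string> { "doc7", "doc42", "doc3" },    // BM25 ranking
    new List<string> { "doc42", "doc99", "doc7" });  // vector ranking
Console.WriteLine(string.Join(", ", fused));         // doc42, doc7, ... (docs ranked well in both lists rise)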
3) Parent–child (multi‑vector) retrieval
- Embed “children” (paragraphs) but return the “parent” (section/page) to give the LLM richer context
- Reduces truncation and improves coherence vs. many tiny fragments
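The retrieval-side change is small: search over child embeddings, then map hits back to distinct parents. A sketch, assuming each hit carries a ParentId (the ChildHit record here is hypothetical):

using System;
using System.Collections.Generic;
using System.Linq;

// Search over child embeddings, then hand the LLM each distinct parent (best child score first).
static List<string> ParentsFromChildHits(IEnumerable<ChildHit> childHits, int maxParents) =>
    childHits.GroupBy(h => h.ParentId)
             .OrderByDescending(g => g.Max(h => h.Score))
             .Take(maxParents)
             .Select(g => g.Key)
             .ToList();

var hits = new[]
{
    new ChildHit("page1-para2", "page1", 0.91),
    new ChildHit("page3-para1", "page3", 0.88),
    new ChildHit("page1-para5", "page1", 0.80),   // same parent as the top hit: collapsed
};
Console.WriteLine(string.Join(", ", ParentsFromChildHits(hits, maxParents: 2)));  // page1, page3

// Shape of a child hit as it might come back from vector search (hypothetical).
record ChildHit(string ChildId, string ParentId, double Score);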
4) Multi‑query / HyDE
- Generate alternative queries or a hypothetical answer (HyDE), embed, and retrieve for each; merge and de‑dupe
- Helps when the user query is short, ambiguous, or domain‑jargony
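Mechanically, multi-query retrieval is a merge problem: run each variant (rewrites, or the HyDE pseudo-answer) through the same retriever, then union and de-dupe by doc ID. A sketch, with the retriever stubbed as a delegate:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Run each query variant through the same retriever, union the hits,
// and keep the first occurrence of each doc ID.
static async Task<List<string>> RetrieveMergedAsync(
    IEnumerable<string> queryVariants,
    Func<string, Task<IList<string>>> retrieve)   // your vector/hybrid search call (hypothetical)
{
    var merged = new List<string>();
    var seen = new HashSet<string>();
    foreach (var q in queryVariants)
        foreach (var id in await retrieve(q))
            if (seen.Add(id)) merged.Add(id);
    return merged;
}

// Fake retriever for illustration only.
var results = await RetrieveMergedAsync(
    new[] { "ultrabook battery life", "how long does the laptop battery last?" },
    q => Task.FromResult<IList<string>>(new List<string> { "doc42", "doc7" }));
Console.WriteLine(string.Join(", ", results));    // doc42, doc7 (duplicates across runs removed)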
5) Multi‑stage cascades
- Cheap, broad recall (vector/hybrid) → narrow with filters → re‑rank → final K
- Tune each stage for the best recall/latency trade‑off
6) Agentic/tool‑augmented RAG
- If retrieval confidence is low, call tools (search, APIs, DB queries) then regenerate
- Useful for fresh data and when the corpus can’t answer by itself
Evaluation: what to measure and how
Evaluate retrieval and generation separately, then together. Keep a small, realistic eval set; automate scoring where possible and spot‑check with humans.
Retrieval metrics
- Recall@K: did the gold passage(s) appear in the top‑K?
- Precision@K: how many of the top‑K were actually relevant?
- nDCG / MRR: position‑sensitive ranking quality
- Coverage: how often at least one supporting chunk is present
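Reciprocal rank is as cheap to script as recall; a per-query sketch, using the same gold/retrieved shapes as the recall snippet further down (average it across queries to get MRR):

using System;
using System.Collections.Generic;

// Reciprocal rank of the first relevant hit; 0 if nothing relevant was retrieved.
static double ReciprocalRank(HashSet<string> gold, IList<string> retrieved)
{
    for (int i = 0; i < retrieved.Count; i++)
        if (gold.Contains(retrieved[i])) return 1.0 / (i + 1);
    return 0.0;
}

var rr = ReciprocalRank(new HashSet<string> { "doc42" }, new List<string> { "doc10", "doc42", "doc3" });
Console.WriteLine($"RR = {rr:F2}");   // 0.50 (first relevant doc at rank 2)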
Answer metrics
- Groundedness (faithfulness): is each claim supported by the provided context?
- Correctness: does the answer match the reference (exact‑match or model‑graded)?
- Citation accuracy: do citations point to supporting chunks?
- Helpfulness/format: is the answer concise, direct, and in the requested format?
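Citation accuracy has a cheap structural proxy: check that every [n] marker in the answer points at a chunk you actually provided. It says nothing about whether the chunk supports the claim; that still needs model grading. A sketch, assuming numbered [n] citations:

using System;
using System.Linq;
using System.Text.RegularExpressions;

// Do all [n] citations in the answer point at one of the provided, numbered chunks?
static bool CitationsResolve(string answer, int providedChunkCount) =>
    Regex.Matches(answer, @"\[(\d+)\]")
         .Cast<Match>()
         .Select(m => int.Parse(m.Groups[1].Value))
         .All(n => n >= 1 && n <= providedChunkCount);

Console.WriteLine(CitationsResolve("Battery life is 12–14 hours [2].", providedChunkCount: 3));  // True
Console.WriteLine(CitationsResolve("It weighs 1.2 kg [7].", providedChunkCount: 3));             // False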
Efficiency metrics
- Latency breakdown: embed, retrieve, re‑rank, generate
- Token/cost accounting: prompt tokens, output tokens, cache hit rate
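For the latency breakdown, a Stopwatch per stage is usually enough. A sketch, with the stages stubbed by delays for illustration:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

// Record per-stage wall-clock time so regressions show up per stage, not just end to end.
var timings = new Dictionary<string, TimeSpan>();

async Task<T> Timed<T>(string stage, Func<Task<T>> step)
{
    var sw = Stopwatch.StartNew();
    var result = await step();
    timings[stage] = sw.Elapsed;
    return result;
}

// Stubbed stages; swap in your real embed/retrieve/re-rank/generate calls.
await Timed("embed",    async () => { await Task.Delay(10); return new float[384]; });
await Timed("retrieve", async () => { await Task.Delay(25); return new[] { "doc42", "doc7" }; });
await Timed("generate", async () => { await Task.Delay(40); return "12–14 hours [1]"; });

foreach (var (stage, elapsed) in timings)
    Console.WriteLine($"{stage}: {elapsed.TotalMilliseconds:F0} ms");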
Tiny C# snippets for eval
The idea is to keep scoring simple and fast so you can iterate daily.
Recall@K (per question)
using System;
using System.Collections.Generic;
using System.Linq;
// gold: the set of relevant doc IDs for this query
// retrieved: the ranked list of doc IDs returned by your system
static double RecallAtK(HashSet<string> gold, IList<string> retrieved, int k)
{
    if (gold.Count == 0) return 1.0; // no gold passages for this query: treat as trivially covered
    var topK = retrieved.Take(Math.Min(k, retrieved.Count)).ToHashSet();
    var hits = gold.Intersect(topK).Count();
    return (double)hits / gold.Count;
}
// Example
var gold = new HashSet<string> { "doc42", "doc7" };
var retrieved = new List<string> { "doc10", "doc42", "doc3", "doc99" };
Console.WriteLine($"Recall@3 = {RecallAtK(gold, retrieved, 3):F2}");
Simple groundedness check (proxy)
For quick iteration, a rough lexical proxy can help catch clear hallucinations. It’s not perfect—use model‑graded faithfulness for serious evals.
using System;
using System.Linq;
// Very naive: require that each quoted span in the answer appears in the context
static bool QuotesGrounded(string answer, string context)
{
    var quotes = answer
        .Split('"')
        .Where((_, i) => i % 2 == 1)   // odd-indexed segments are the quoted spans
        .Select(q => q.Trim())
        .Where(q => q.Length > 0);
    return quotes.All(q => context.IndexOf(q, StringComparison.OrdinalIgnoreCase) >= 0);
}
var ctx = "Battery life tests show 12–14 hours under mixed workloads.";
var ans = "According to tests, \"12–14 hours\" is typical for travel ultrabooks.";
Console.WriteLine($"Quotes grounded: {QuotesGrounded(ans, ctx)}");
Grading with an LLM (concept)
- Provide: question, context, answer, rubric
- Ask the model to return a JSON score for correctness and faithfulness with short rationales
- Keep prompts short and deterministic; sample multiple times if needed and take the median
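A sketch of what that grading call might look like. The completeAsync delegate stands in for whatever LLM client you use, and the 0-5 rubric and JSON shape are just one workable choice:

using System;
using System.Text.Json;
using System.Threading.Tasks;

// Ask the grader for strict JSON, then deserialize; completeAsync is your LLM client call (hypothetical).
static string GradingPrompt(string question, string context, string answer) =>
    "You are grading a RAG answer. Return ONLY JSON like " +
    "{\"correctness\": 0-5, \"faithfulness\": 0-5, \"rationale\": \"...\"}.\n" +
    $"Question: {question}\nContext: {context}\nAnswer: {answer}";

static async Task<Grade?> GradeAsync(string question, string context, string answer,
                                     Func<string, Task<string>> completeAsync)
{
    var json = await completeAsync(GradingPrompt(question, context, answer));
    return JsonSerializer.Deserialize<Grade>(json,
        new JsonSerializerOptions { PropertyNameCaseInsensitive = true });
}

// Expected shape of the grader's JSON reply.
record Grade(int Correctness, int Faithfulness, string Rationale);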
Tuning playbook (short)
- Establish a baseline: basic RAG, k=5, no re‑rank. Measure recall@K and latency.
- Add hybrid search and increase k until recall plateaus; cap by latency.
- Try cross‑encoder re‑ranking or a small LLM re‑rank on the top 50 → top 5.
- Upgrade chunking (parent–child) and add metadata filters.
- Introduce HyDE or multi‑query for ambiguous/short queries.
- Lock an eval set and track metrics over time. Regressions fail the build.
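That last step can literally be a small console check wired into CI; a sketch with placeholder metric names and numbers:

using System;
using System.Collections.Generic;

// CI gate: exit non-zero if any tracked metric falls below its locked baseline (values are placeholders).
var baselines = new Dictionary<string, double> { ["recall@5"] = 0.85, ["groundedness"] = 0.90 };
var current   = new Dictionary<string, double> { ["recall@5"] = 0.87, ["groundedness"] = 0.88 };

var failed = false;
foreach (var (metric, floor) in baselines)
{
    if (current[metric] >= floor) continue;
    Console.Error.WriteLine($"REGRESSION: {metric} = {current[metric]:F2} (baseline {floor:F2})");
    failed = true;
}
Environment.Exit(failed ? 1 : 0);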
Key takeaways
- RAG = retrieval for recall + LLM for reasoning; evaluate both sides.
- Hybrid search and re‑ranking are strong defaults for production.
- Parent–child chunking and query rewriting (HyDE) boost robustness.
- Measure recall@K, groundedness, and latency, then tune your ANN index parameters (efSearch/nprobe) and k.