Vector Search Basics — Cosine Similarity and ANN Indexes (HNSW, IVF, PQ)

This post is a practical guide to the building blocks of vector search. We’ll keep things hands-on and short: what cosine similarity is, why we use ANN (approximate nearest neighbor) indexes, and how HNSW, IVF, and PQ work in practice.

Cosine similarity in a nutshell

Cosine similarity measures how aligned two vectors are. Values are in [-1, 1], where 1 means “pointing in the same direction” (very similar), 0 is orthogonal (unrelated), and -1 is opposite.

  • We embed text into vectors (numbers)
  • We compare a query vector with document vectors
  • We rank by similarity and take the top-k

Tiny C# example: cosine and top-k

using System;
using System.Linq;
using System.Collections.Generic;

static double Cosine(double[] a, double[] b)
{
    if (a.Length != b.Length) throw new ArgumentException("Vector lengths differ");
    double dot = a.Zip(b, (x, y) => x * y).Sum();
    double na = Math.Sqrt(a.Sum(x => x * x));
    double nb = Math.Sqrt(b.Sum(x => x * x));
    if (na == 0 || nb == 0) return 0; // guard: define similarity with a zero vector as 0
    return dot / (na * nb);
}

static IEnumerable<(string id, double score)> TopK(
    Dictionary<string, double[]> corpus,
    double[] query,
    int k = 3)
{
    return corpus
        .Select(kv => (id: kv.Key, score: Cosine(kv.Value, query)))
        .OrderByDescending(x => x.score)
        .Take(k);
}

// Tiny toy dataset
var docs = new Dictionary<string, double[]>
{
    ["espresso"]     = new [] { 0.60, 0.08, -0.25, 0.47 },
    ["coffee"]       = new [] { 0.62, 0.11, -0.28, 0.44 },
    ["battery_life"] = new [] { 0.15, -0.32, 0.77, 0.21 },
    ["cat"]          = new [] { -0.12, 0.91, 0.33, -0.05 },
};

var query = new [] { 0.61, 0.10, -0.26, 0.45 }; // “I like coffee/espresso”
foreach (var (id, score) in TopK(docs, query, k: 2))
{
    Console.WriteLine($"{id}: {score:F3}");
}
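
Run against the toy data, coffee and espresso come out on top with scores near 1.0, while cat and battery_life score far lower (their dot products with the query are slightly negative).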

Brute-force top-k like this works for small corpora. At scale (millions+ vectors), you need an ANN index.

Why ANN indexes?

  • Exact search (compare with everyone) is O(N × d) per query — expensive at large N.
  • ANN structures speed up queries dramatically by exploring only promising regions.
  • You trade a tiny bit of recall for huge gains in latency and throughput.
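
To put a number on that: N = 10 million vectors at d = 768 is roughly 7.7 billion multiply-adds for a single exact query, before any ranking work.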

Common families:

  • HNSW: graph-based; fast and strong recall; popular default.
  • IVF(-Flat, -PQ): partition-based; great for very large datasets; flexible memory/latency trade-offs.
  • PQ: quantization to compress vectors; lowers memory and I/O, modestly affecting recall.

At‑a‑glance: definitions

  • HNSW: layered small-world graphs for ANN
  • IVF: coarse quantization with cell probing (nlist/nprobe)
  • PQ: subspace codebooks for compact approximate distances

Glossary: common ANN parameters

Quick reference for the knobs you’ll tune most often; see the HNSW and IVF sections below for context.

  • M (HNSW): Number of neighbors per node (graph connectivity). Higher M → denser graph, better recall, more memory/build time.
  • efConstruction (HNSW): Candidate list size during index build. Higher → better recall, slower build, more memory during construction.
  • efSearch (HNSW): Candidate list size during query. Higher → better recall, higher latency; tune to balance speed/quality.
  • nlist (IVF): Number of coarse centroids (cells). Higher → finer partitioning; more memory and training cost.
  • nprobe (IVF): Number of cells probed at query time. Higher → better recall, slower queries; key runtime dial.

HNSW (Hierarchical Navigable Small World)

HNSW builds a layered, small-world graph. Search walks the graph from coarse layers down to fine layers, greedily moving to neighbors that look closer to the query.

In practice, imagine three layers (top: coarse, bottom: fine). A query like “lightweight travel laptop” starts at a top-layer entry node and hops to whichever neighbor looks closer to the query (e.g., a node representing “portable notebooks”). It then descends a layer and continues greedy hops among local neighbors (e.g., “ultrabooks”, “battery life”), finally reaching the bottom layer, where connections are dense. Throughout, the search keeps a small candidate list (controlled by efSearch), so it explores a handful of promising paths without scanning the whole graph and returns the nearest items (e.g., chunks about ultrabooks with long battery life) quickly and with high recall.
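
To make the greedy walk concrete, here is a minimal single-layer sketch. It assumes a hypothetical adjacency-list graph (the vectors and neighbors dictionaries) and reuses the Cosine function from the example above; real HNSW runs this descent once per layer and keeps an efSearch-sized candidate list instead of a single current node.

static string GreedyStep(
    Dictionary<string, double[]> vectors,        // node id -> embedding
    Dictionary<string, List<string>> neighbors,  // node id -> adjacent node ids
    double[] query,
    string entry)                                // entry point for this layer
{
    var current = entry;
    var best = Cosine(vectors[current], query);
    while (true)
    {
        // Find the neighbor of the current node most similar to the query.
        var next = current;
        foreach (var n in neighbors[current])
        {
            var s = Cosine(vectors[n], query);
            if (s > best) { best = s; next = n; }
        }
        if (next == current) return current; // no neighbor is closer: stop here
        current = next;                      // hop and repeat
    }
}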

Diagram (HNSW layers)

[Figure: three stacked layers. Layer 2 (coarse) holds the entry point and node X; Layer 1 (mid) holds hub nodes A–D; Layer 0 (fine) holds dense nodes n1–n6, where the query’s nearest neighbors are found.]

Key knobs:

  • M: number of neighbors per node (graph connectivity)
  • efConstruction: candidate list size while building (higher → better recall, slower build)
  • efSearch: candidate list size while querying (higher → better recall, slower queries)

Pros:

  • Excellent accuracy/speed balance
  • Great default for many workloads

Cons:

  • Build time and memory overhead can be higher than centroid/IVF approaches

Conceptual usage

// Pseudocode-ish structure — real libraries expose similar knobs
var index = new HnswIndex(dim: 768, m: 16, efConstruction: 200);
index.Add(vectors);  // build the graph; can be streamed/batched
index.Optimize();    // optional depending on library

var results = index.Search(queryVector, k: 10, efSearch: 64);
// returns top-10 approximate neighbors with their scores/ids

IVF (Inverted File Index)

IVF partitions the vector space into a fixed number of coarse cells (centroids). Each vector is assigned to its closest centroid. At query time, you only search a subset of cells.

In practice, imagine nlist=4096 centroids. When you add data, each vector is routed to the bucket of its nearest centroid. A query like “long battery life travel laptop” is embedded and compared to the centroids; with nprobe=8, you probe the 8 closest buckets and scan only their vectors. This avoids touching most of the corpus while keeping recall high if nprobe is tuned properly. Larger nlist gives finer partitions; larger nprobe explores more buckets (better recall, more work).
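
A minimal sketch of that query path, assuming a hypothetical data layout (a centroid array plus per-cell buckets) and reusing Cosine from above: rank the centroids, keep the nprobe closest cells, and scan only their buckets.

static IEnumerable<(string id, double score)> IvfSearch(
    double[][] centroids,                       // centroids[i] = center of cell i
    List<(string id, double[] vec)>[] buckets,  // buckets[i] = vectors assigned to cell i
    double[] query, int k, int nprobe)
{
    // Rank cells by centroid similarity and keep the nprobe closest.
    var cells = Enumerable.Range(0, centroids.Length)
        .OrderByDescending(i => Cosine(centroids[i], query))
        .Take(nprobe);

    // Scan only the selected buckets; every other bucket is skipped entirely.
    return cells
        .SelectMany(i => buckets[i])
        .Select(e => (e.id, score: Cosine(e.vec, query)))
        .OrderByDescending(x => x.score)
        .Take(k);
}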

Diagram (IVF cells and probing)

[Figure: the query is compared against centroids C0–C3; only the buckets of the closest centroids are scanned, and the remaining buckets are skipped.]

Key knobs:

  • nlist: number of coarse cells (more cells → finer partition)
  • nprobe: how many cells to probe at query time (higher → better recall, slower)

Variants:

  • IVF-Flat: store full vectors in each cell; accurate, higher memory
  • IVF-PQ: store PQ-compressed vectors in each cell; lower memory, slightly lower recall

Conceptual usage

// Pseudocode-ish structure
var ivf = new IvfIndex(dim: 768, nlist: 4096);
ivf.Train(sampleVectors);  // learn centroids
ivf.Add(vectors);          // assign vectors to their nearest centroid

var results = ivf.Search(queryVector, k: 10, nprobe: 8); // probe 8 cells

PQ (Product Quantization)

PQ compresses a vector by splitting it into subspaces and quantizing each subspace with its own codebook. The result is a compact code per vector from which approximate distances can be computed quickly.

In practice, instead of storing a full 768-dim float vector, you split it into m sub-vectors (e.g., m = 64) and map each sub-vector to a small code from its codebook. At search time, you precompute lookup tables of distances between the query’s sub-vectors and each codebook’s centroids, then sum table entries to approximate distances. This lets you scan far more candidates because much more of the data fits in cache. Pair PQ with IVF (IVF-PQ) to both skip most buckets and compress what you do scan. To recover accuracy, re-rank the final top-k using full-precision vectors when available.
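
A minimal sketch of the lookup-table trick (asymmetric distance computation), under simplified assumptions: hypothetical codebooks[s][c] arrays holding centroid c of subspace s, squared-L2 distance, and one byte per sub-vector code.

// Precompute, per subspace, the distance from the query's sub-vector to every centroid.
static double[][] BuildLookupTables(double[][][] codebooks, double[] query, int subDim)
{
    var tables = new double[codebooks.Length][];
    for (int s = 0; s < codebooks.Length; s++)
    {
        tables[s] = new double[codebooks[s].Length];
        for (int c = 0; c < codebooks[s].Length; c++)
        {
            double d = 0;
            for (int j = 0; j < subDim; j++)
            {
                double diff = query[s * subDim + j] - codebooks[s][c][j];
                d += diff * diff; // squared-L2 between query sub-vector and centroid
            }
            tables[s][c] = d;
        }
    }
    return tables;
}

// Approximate the distance to one encoded vector: a single table lookup per subspace.
static double ApproxDistance(byte[] code, double[][] tables)
{
    double d = 0;
    for (int s = 0; s < code.Length; s++) d += tables[s][code[s]];
    return d;
}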

Diagram (PQ subspaces)

[Figure: a 768-dim vector split into four sub-vectors (dims 0–191, 192–383, 384–575, 576–767); each sub-vector is quantized by its own codebook (Codebook 0–3) into a code c0–c3, and the stored compressed code is the concatenation c0 c1 c2 c3.]

Why use PQ?

  • Memory: shrink 768-d float vectors by 4–16× or more
  • Speed: smaller data means more fits in cache, faster scans
  • Trade-off: recall drops a bit vs. full-precision vectors
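
For concreteness: a 768-dim float32 vector takes 3,072 bytes, while a PQ code with m = 64 sub-vectors at 8 bits each takes 64 bytes, roughly a 48× reduction before codebook overhead.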

Conceptual usage

// Pseudocode-ish structure
var pq = new ProductQuantizer(dim: 768, m: 64, bits: 8); // m subspaces, 8-bit codes
pq.Train(sampleVectors);
var codes = pq.Encode(vectors);

// Can be used standalone for compressed search or inside IVF-PQ/HNSW-PQ hybrids
var results = pq.SearchCompressed(codes, queryVector, k: 10);

Putting it together: a practical recipe

  • Start simple: brute-force search to validate embeddings and evaluation.
  • Move to HNSW as a strong baseline for general-purpose ANN.
  • For very large datasets: consider IVF-Flat; when memory is tight, IVF-PQ.
  • Tune the key knobs (efSearch, nprobe) using an evaluation set, measuring recall vs. latency; see the recall@k sketch after this list.
  • Consider hybrid search (keyword + vector) and reranking if your domain benefits from lexical signals.
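
As a starting point for that tuning loop, here is a minimal recall@k sketch; it assumes you have already collected, per query id, the top-k ids from the ANN index and from exact brute-force search (and, like the earlier examples, it uses System.Linq).

static double RecallAtK(
    Dictionary<string, string[]> annTopK,    // query id -> top-k ids from the ANN index
    Dictionary<string, string[]> exactTopK)  // query id -> top-k ids from exact search
{
    double total = 0;
    foreach (var (queryId, truth) in exactTopK)
    {
        // Fraction of the true top-k that the ANN index recovered for this query.
        total += (double)annTopK[queryId].Intersect(truth).Count() / truth.Length;
    }
    return total / exactTopK.Count; // averaged over all queries
}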

Key takeaways

  • Cosine similarity ranks by angular closeness; embeddings turn meaning into geometry.
  • ANN indexes avoid scanning everything; HNSW and IVF(-PQ) are proven choices.
  • PQ compresses vectors to save memory and speed scans at small recall cost.
  • Evaluate with your data: measure recall/latency and tune efSearch/nprobe.