Semantic Index with Azure AI Search

Azure AI Search gives you a single, enterprise-ready retrieval engine for RAG and classic search: vector and keyword search, semantic ranking, filters, and security in one place. In this post we’ll build an end‑to‑end setup in C#:

  • Create a vector index
  • Load data with embeddings (push API) or via an indexer pipeline
  • Chunk PDFs and images with OCR and image analysis
  • Choose embedding models and dimensions
  • Enforce permissions (filters and ACL/RBAC)
  • Bind the index to Semantic Kernel, with key auth, managed identity, and On‑Behalf‑Of (OBO)

What is Azure AI Search (quickly)

Azure AI Search is a managed search service. For vector search, you store embeddings in vector fields and query them with a vector query (often combined with keyword filters or semantic ranking). A vector index schema includes:

  • A key field (string)
  • One or more vector fields (float arrays, with dimensions matching your embedding model)
  • Optional non‑vector fields for filters, facets, and readable content
  • A vector search configuration (HNSW or exhaustiveKnn, optional compression) [docs]

References: Vector index schema, vector field rules, vector search config.

Create a vector index in C#

The Azure SDK exposes SearchIndexClient for index management. The snippet below creates a simple index with a vector field for content and an HNSW algorithm configuration referenced by a vector profile.

using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

var endpoint = new Uri("https://<your-search-service>.search.windows.net");
var credential = new AzureKeyCredential("<admin-key>"); // or use DefaultAzureCredential (see auth section)

var indexClient = new SearchIndexClient(endpoint, credential);

var fields = new List<SearchField>
{
  new SimpleField("id", SearchFieldDataType.String) { IsKey = true, IsFilterable = true },
  new SearchField("content", SearchFieldDataType.String) { IsSearchable = true },
  // Vector field must be searchable and sized to your embedding model dimensions
  new VectorSearchField("contentVector", 1536, "my-vector-profile")
};

var vectorSearch = new VectorSearch
{
  Algorithms = { new HnswAlgorithmConfiguration(name: "hnsw-config") },
  Profiles   = { new VectorSearchProfile(name: "my-vector-profile", algorithmConfigurationName: "hnsw-config") }
};

var semantic = new SemanticSearch
{
  Configurations =
  {
    new SemanticConfiguration(
      name: "semantic-config",
      prioritizedFields: new SemanticPrioritizedFields
      {
        ContentFields = { new SemanticField("content") }
      })
  }
};

var index = new SearchIndex("docs-vector")
{
  Fields = fields,
  VectorSearch = vectorSearch,
  SemanticSearch = semantic
};

await indexClient.CreateOrUpdateIndexAsync(index);

Notes:

  • For text-embedding-3-small the default dimension is 1536; text-embedding-3-large defaults to 3072. Use the exact dimension your model emits.
  • You can also configure a vectorizer for query‑time text‑to‑vector conversion to keep client code simpler (see the sketch below).
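
For the second note, here is a sketch of what a vectorizer configuration could look like, assuming Azure.Search.Documents 11.6 or later (class and property names changed slightly across earlier betas); the Azure OpenAI endpoint and deployment name are placeholders:

// Replaces the VectorSearch block above: the profile references both the HNSW config and a vectorizer
var vectorSearchWithVectorizer = new VectorSearch
{
  Algorithms = { new HnswAlgorithmConfiguration("hnsw-config") },
  Vectorizers =
  {
    new AzureOpenAIVectorizer("aoai-vectorizer")
    {
      Parameters = new AzureOpenAIVectorizerParameters
      {
        ResourceUri = new Uri("https://<your-aoai>.openai.azure.com"),
        DeploymentName = "text-embedding-3-small",
        ModelName = AzureOpenAIModelName.TextEmbedding3Small
      }
    }
  },
  Profiles =
  {
    new VectorSearchProfile("my-vector-profile", "hnsw-config") { VectorizerName = "aoai-vectorizer" }
  }
};

With a vectorizer in place, queries can send plain text (for example via VectorizableTextQuery) and the service computes the query embedding for you.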

Option A: Load data with precomputed embeddings (C# push)

Use Azure OpenAI (or another embedding service) to compute vectors, then push documents with both content and vectors.

using Azure;
using Azure.AI.OpenAI;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;

var searchClient = new SearchClient(endpoint, "docs-vector", credential);

// 1) Create embeddings
var aoai = new OpenAIClient(new Uri("https://<your-aoai>.openai.azure.com"), new AzureKeyCredential("<aoai-key>"));
var text = "Azure AI Search unifies vector and keyword search.";
var embedding = (await aoai.GetEmbeddingsAsync(
  new EmbeddingsOptions("text-embedding-3-small", new[] { text }) // first argument is your Azure OpenAI deployment name
)).Value.Data[0].Embedding.ToArray();

// 2) Upload document with vector
var doc = new
{
  id = Guid.NewGuid().ToString("N"),
  content = text,
  contentVector = embedding // float[] length must match vector field dimensions
};

await searchClient.IndexDocumentsAsync(IndexDocumentsBatch.Upload(new[] { doc }));

Pros: Full control over chunking and vectorization in your app. Cons: You manage all preprocessing.
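
To verify what you pushed, run a vector query against the index. A minimal sketch, reusing the aoai client from above to embed the query text (the query string and k are arbitrary):

// 3) Query by vector
var queryText = "How does Azure AI Search handle vectors?";
var queryVector = (await aoai.GetEmbeddingsAsync(
  new EmbeddingsOptions("text-embedding-3-small", new[] { queryText })
)).Value.Data[0].Embedding.ToArray();

var queryOptions = new SearchOptions
{
  VectorSearch = new()
  {
    Queries = { new VectorizedQuery(queryVector) { KNearestNeighborsCount = 3, Fields = { "contentVector" } } }
  }
};

var response = await searchClient.SearchAsync<SearchDocument>(searchText: null, queryOptions);
await foreach (var result in response.Value.GetResultsAsync())
{
  Console.WriteLine($"{result.Score}: {result.Document["content"]}");
}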

Option B: Build an indexer pipeline (chunk + OCR + image → embeddings)

If your content lives in Azure Storage (PDFs, images, Office files), use an indexer + skillset. The pipeline cracks files, chunks text, runs OCR and image analysis, and generates embeddings via a built‑in embedding skill.

using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

var indexerClient = new SearchIndexerClient(endpoint, credential);

// Text Split: chunk merged text into pages
var split = new SplitSkill(
  inputs: new[] { new InputFieldMappingEntry("text") { Source = "/document/merged_text" } },
  outputs: new[] { new OutputFieldMappingEntry("textItems") { TargetName = "pages" } })
{
  Context = "/document",
  TextSplitMode = TextSplitMode.Pages,
  MaximumPageLength = 2000,
  PageOverlapLength = 200,
  DefaultLanguageCode = SplitSkillLanguage.En
};

// OCR for images inside PDFs
var ocr = new OcrSkill(
  inputs: new[] { new InputFieldMappingEntry("image") { Source = "/document/normalized_images/*" } },
  outputs: new[] { new OutputFieldMappingEntry("text") { TargetName = "text" } })
{
  Context = "/document/normalized_images/*",
  ShouldDetectOrientation = true,
  DefaultLanguageCode = OcrSkillLanguage.En
};

// Merge original text + OCR text before chunking
var merge = new MergeSkill(
  inputs: new[]
  {
    new InputFieldMappingEntry("text") { Source = "/document/content" },
    new InputFieldMappingEntry("itemsToInsert") { Source = "/document/normalized_images/*/text" },
    new InputFieldMappingEntry("offsets") { Source = "/document/normalized_images/*/contentOffset" }
  },
  outputs: new[] { new OutputFieldMappingEntry("mergedText") { TargetName = "merged_text" } })
{
  Context = "/document"
};

// Embedding skill (Azure OpenAI) – generates vectors per page
var embed = new AzureOpenAIEmbeddingSkill(
  inputs: new[] { new InputFieldMappingEntry("text") { Source = "/document/pages/*" } },
  outputs: new[] { new OutputFieldMappingEntry("embedding") { TargetName = "contentVector" } })
{
  Context = "/document/pages/*",
  // Also set: ResourceUri, DeploymentName, ModelName, Dimensions to match your model
};

var skillset = new SearchIndexerSkillset(
  name: "rag-skillset",
  skills: new SearchIndexerSkill[] { ocr, merge, split, embed })
{
  Description = "OCR + merge + chunk + embeddings"
};

await indexerClient.CreateOrUpdateSkillsetAsync(skillset);

// In your indexer parameters: generateNormalizedImages so OCR has input
var parameters = new IndexingParameters
{
  Configuration = { ["dataToExtract"] = "contentAndMetadata", ["imageAction"] = "generateNormalizedImages" }
};
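
The skillset and parameters only take effect once they are wired to a data source and an indexer. A minimal sketch, with the storage connection string and container name as placeholders, targeting the docs-vector index from earlier:

var dataSource = new SearchIndexerDataSourceConnection(
  name: "rag-blob-ds",
  type: SearchIndexerDataSourceType.AzureBlob,
  connectionString: "<storage-connection-string>",
  container: new SearchIndexerDataContainer("docs"));

await indexerClient.CreateOrUpdateDataSourceConnectionAsync(dataSource);

var indexer = new SearchIndexer(
  name: "rag-indexer",
  dataSourceName: dataSource.Name,
  targetIndexName: "docs-vector")
{
  SkillsetName = "rag-skillset",
  Parameters = parameters
  // For one search document per chunk, also add index projections to the skillset
  // so each /document/pages/* item (text + contentVector) lands as its own document.
};

await indexerClient.CreateOrUpdateIndexerAsync(indexer);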

Tips:

  • For “headline to headline” chunking, consider the Document Layout skill for structure‑aware chunks, or tune the SplitSkill page and overlap lengths.
  • To “describe images and store them as chunks”, combine OCR with ImageAnalysisSkill or the GenAI Prompt skill to verbalize images, then embed those captions (see the sketch below).
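
A hedged sketch of the image-captioning variant from the second tip, using ImageAnalysisSkill (the imageCaption target name is an arbitrary choice):

// Caption images so the captions can be merged and embedded alongside the text
var imageAnalysis = new ImageAnalysisSkill(
  inputs: new[] { new InputFieldMappingEntry("image") { Source = "/document/normalized_images/*" } },
  outputs: new[] { new OutputFieldMappingEntry("description") { TargetName = "imageCaption" } })
{
  Context = "/document/normalized_images/*",
  VisualFeatures = { VisualFeature.Description, VisualFeature.Tags },
  DefaultLanguageCode = ImageAnalysisSkillLanguage.En
};

The caption text produced here can then be routed into the merge or embedding steps in the same way as the OCR output, depending on how you want chunks assembled.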

Indexing a Blob container (website dump)

Azure AI Search can index content directly from Azure Blob Storage. A common pattern is to mirror a public website into a container and let an indexer extract HTML content and metadata for search or RAG grounding.

What you’ll set up

  • A storage account and a blob container (holds the mirrored site)
  • An Azure AI Search index (fields for id, url, title, content, metadata)
  • A data source connection to the blob container
  • An indexer that maps blob metadata to your fields and runs on a schedule

You can do this in the Azure portal (Data sources → Indexers), or script it end‑to‑end as shown below.

PowerShell end‑to‑end sample (download → upload → index)

The script uses wget to mirror a website locally, uploads the files to a blob container, then creates a Search index, data source, and indexer via the Search REST API.

Requirements:

  • PowerShell 7+
  • Azure CLI (logged in) and the Search extension: az extension add -n search
  • wget installed (on macOS: brew install wget)

$ErrorActionPreference = "Stop"

# ====== CONFIG (edit to your values) ======
$subscriptionId   = "<SUBSCRIPTION-ID>"
$location         = "westeurope"
$rg               = "rg-search-website-demo"
$searchService    = "<unique-search-name>"   # globally unique, e.g., petkirsearchdemo
$sku              = "basic"                  # basic|standard|standard2 etc.
$storageAccount   = "<uniquestorageacct>"    # lowercase, 3-24 chars
$container        = "website"
$indexName        = "webpages"
$websiteUrl       = "https://example.com"    # site to mirror
$downloadFolder   = Join-Path $PWD "site"

# ====== Azure resources ======
az account set --subscription $subscriptionId
az group create -n $rg -l $location | Out-Null

# Storage account + container
if (-not (az storage account show -n $storageAccount -g $rg -o none 2>$null)) {
  az storage account create -n $storageAccount -g $rg -l $location --sku Standard_LRS --kind StorageV2 | Out-Null
}
$storageKey = az storage account keys list -n $storageAccount -g $rg --query "[0].value" -o tsv
az storage container create --name $container --account-name $storageAccount --account-key $storageKey | Out-Null

# Search service
if (-not (az search service show --name $searchService -g $rg -o none 2>$null)) {
  az search service create --name $searchService -g $rg -l $location --sku $sku | Out-Null
}
$adminKey = az search admin-key show --service-name $searchService -g $rg --query primaryKey -o tsv

# ====== Mirror website locally with wget ======
if (-not (Get-Command wget -ErrorAction SilentlyContinue)) {
  throw "wget is required. On macOS: brew install wget"
}
if (Test-Path $downloadFolder) { Remove-Item -Recurse -Force $downloadFolder }
New-Item -ItemType Directory -Path $downloadFolder | Out-Null

# Flags:
# -e robots=off : ignore robots.txt for the mirror (ensure you have rights to crawl)
# -r            : recursive
# -np           : no parent (stay under the URL)
# -nH           : no host directory prefix
# -E            : add .html where needed
# -k            : convert links to local
# -p            : get all page requisites (images, css)
# -P <dir>      : download directory
wget -e robots=off -r -np -nH -E -k -p -P "$downloadFolder" "$websiteUrl"

# ====== Upload to blob ======
az storage blob upload-batch `
  --account-name $storageAccount `
  --account-key $storageKey `
  --destination $container `
  --source $downloadFolder `
  --overwrite true | Out-Null

# ====== Create index, data source, indexer ======
$headers = @{ "api-key" = $adminKey; "Content-Type" = "application/json" }
$baseUri = "https://$searchService.search.windows.net"
$api     = "2023-11-01"

# Index schema (text search focused; add vector fields if you plan RAG)
$indexJson = @"
{
  "name": "$indexName",
  "fields": [
    { "name": "id",   "type": "Edm.String", "key": true,  "searchable": false },
    { "name": "url",  "type": "Edm.String", "searchable": true },
    { "name": "title","type": "Edm.String", "searchable": true },
    { "name": "content","type": "Edm.String", "searchable": true },
    { "name": "metadata_storage_name", "type": "Edm.String", "searchable": true, "filterable": true, "sortable": true, "facetable": true },
    { "name": "metadata_content_type", "type": "Edm.String", "filterable": true, "facetable": true },
    { "name": "metadata_storage_size", "type": "Edm.Int64",  "filterable": true, "facetable": true, "sortable": true },
    { "name": "metadata_storage_last_modified", "type": "Edm.DateTimeOffset", "filterable": true, "sortable": true }
  ],
  "suggesters": [
    { "name": "sg", "searchMode": "analyzingInfixMatching", "sourceFields": [ "title", "content" ] }
  ]
}
"@

Invoke-RestMethod -Method Put `
  -Uri "$baseUri/indexes/$indexName?api-version=$api" `
  -Headers $headers -Body $indexJson | Out-Null

# Data source: points to the blob container
$datasourceName = "$($indexName)-ds"
$dsJson = @"
{
  "name": "$datasourceName",
  "type": "azureblob",
  "credentials": {
    "connectionString": "DefaultEndpointsProtocol=https;AccountName=$storageAccount;AccountKey=$storageKey;EndpointSuffix=core.windows.net"
  },
  "container": { "name": "$container" },
  "dataChangeDetectionPolicy": {
    "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
    "highWaterMarkColumnName": "metadata_storage_last_modified"
  }
}
"@

Invoke-RestMethod -Method Put `
  -Uri "$baseUri/datasources/$datasourceName?api-version=$api" `
  -Headers $headers -Body $dsJson | Out-Null

# Indexer: maps blob fields to index fields
$indexerName = "$($indexName)-idx"
$idxJson = @"
{
  "name": "$indexerName",
  "dataSourceName": "$datasourceName",
  "targetIndexName": "$indexName",
  "schedule": { "interval": "PT5M" },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "id",
      "mappingFunction": { "name": "base64Encode" }
    },
    { "sourceFieldName": "metadata_storage_path", "targetFieldName": "url" },
    { "sourceFieldName": "metadata_title",       "targetFieldName": "title" },
    { "sourceFieldName": "content",               "targetFieldName": "content" }
  ],
  "parameters": {
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default",
      "failOnUnsupportedContentType": false,
      "failOnUnprocessableDocument": false
    }
  }
}
"@

Invoke-RestMethod -Method Put `
  -Uri "$baseUri/indexers/$indexerName?api-version=$api" `
  -Headers $headers -Body $idxJson | Out-Null

# Kick off the indexer now
Invoke-RestMethod -Method Post `
  -Uri "$baseUri/indexers/$indexerName/run?api-version=$api" `
  -Headers $headers | Out-Null

Write-Host "All set. Indexer '$indexerName' is running. Check portal → Search Service → Indexers for status." -ForegroundColor Green

Notes:

  • For RAG with vectors, augment the index with a vector field and switch to an indexer+skillset that generates embeddings (see the Option B section above).
  • Prefer managed identity for the data source in production; this sample uses a storage key for simplicity.
  • Indexer schedule (PT5M) runs every 5 minutes. Adjust or trigger on demand.

Choosing embedding models (C# focus)

  • Azure OpenAI: text-embedding-3-small (1536 dims, economical) and text-embedding-3-large (3072 dims, highest recall).
  • Vision/multimodal: Azure AI Vision multimodal embeddings (1024 dims) for image retrieval.
  • Keep model parity: the model used at indexing must match the model used at query time (via client‑side embeddings or an index vectorizer).

References: Supported models and vectorizers (Azure OpenAI, Vision, AML).

Enforce permissions (filters and ACL/RBAC)

Two complementary patterns:

  1. Security filters (simple and GA)

Store user/group identifiers (for example, Entra group object IDs) in a filterable string collection field such as acl. At query time, apply a filter built from the caller’s groups to trim results (a combined vector + filter sketch follows after the second pattern).

var options = new SearchOptions { Filter = "acl/any(g: search.in(g, 'group-a, group-b'))" };
var results = await searchClient.SearchAsync<SearchDocument>("*", options);

  2. Native document-level ACL/RBAC (preview)

With preview APIs and SDKs, you can enable permission filters in the index and have AI Search enforce ACLs/RBAC at query time using the user token you pass in the request (x-ms-query-source-authorization). For ADLS Gen2 sources, the indexer can ingest ACL metadata automatically; for push, you include permission metadata in each document. See: Document‑level access control overview, RBAC enforcement, push/indexer guides.
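
With the filter pattern, security trimming and vector retrieval happen in the same request. A sketch combining the two (the group IDs are placeholders, and queryVector is an embedding of the user’s question computed as in Option A):

var secureOptions = new SearchOptions
{
  // Trim results to the groups from the signed-in user's token
  Filter = "acl/any(g: search.in(g, 'group-a, group-b'))",
  VectorSearch = new()
  {
    Queries = { new VectorizedQuery(queryVector) { KNearestNeighborsCount = 5, Fields = { "contentVector" } } }
  }
};
var trimmedResults = await searchClient.SearchAsync<SearchDocument>(searchText: null, secureOptions);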

Query with keys, managed identity, or OBO

Key (simple):

var clientWithKey = new SearchClient(endpoint, "docs-vector", new AzureKeyCredential("<query-or-admin-key>"));

Managed identity (recommended in Azure):

using Azure.Identity;

var clientMI = new SearchClient(endpoint, "docs-vector", new DefaultAzureCredential());

On‑Behalf‑Of (OBO) – delegate the user’s identity to Search:

using Azure.Identity;

var obo = new OnBehalfOfCredential(
  tenantId: "<tenant>",
  clientId: "<appId>",
  clientSecret: "<clientSecret>",
  userAssertion: "<user-access-token>");

var clientOBO = new SearchClient(endpoint, "docs-vector", obo);

Ensure your app and users have appropriate Azure roles (for example, Search Index Data Reader/Contributor) and that the index was created with permission filtering if you’re using native ACLs.

Bind your index to Semantic Kernel (C#)

Semantic Kernel has a first‑party Azure AI Search vector store connector.

// dotnet add package Microsoft.SemanticKernel.Connectors.AzureAISearch --prerelease
using Azure;
using Azure.Search.Documents.Indexes;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.AzureAISearch;

var builder = Kernel.CreateBuilder();
builder.Services.AddSingleton(new SearchIndexClient(endpoint, new AzureKeyCredential("<key>")));
builder.Services.AddAzureAISearchVectorStore();

var kernel = builder.Build();

// Or construct directly and target a specific collection (index)
var vectorStore = new AzureAISearchVectorStore(new SearchIndexClient(endpoint, new AzureKeyCredential("<key>")));
var collection = new AzureAISearchCollection<string, MyChunk>(
  new SearchIndexClient(endpoint, new AzureKeyCredential("<key>")),
  "docs-vector"); // the collection name maps to the index name

// Note: depending on the connector version, the record model typically needs
// [VectorStoreKey], [VectorStoreData], and [VectorStoreVector(1536)] attributes
// (from Microsoft.Extensions.VectorData) or an explicit record definition so the
// key, data, and vector fields map onto the index correctly.
public sealed class MyChunk
{
  public string id { get; set; } = default!;
  public string content { get; set; } = default!;
  public float[] contentVector { get; set; } = default!;
  public string[]? acl { get; set; }
}

You can pair this with SK’s ITextEmbeddingGenerationService to vectorize queries, or configure an index vectorizer and send plain text queries that the service vectorizes for you.
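
A hedged usage sketch against the collection, assuming the MyChunk model carries the vector store attributes noted above and that you have already computed an embedding for the text (method names follow the Microsoft.Extensions.VectorData-style API and may differ between connector releases):

// Create the index if needed, upsert one chunk, then search by vector
await collection.EnsureCollectionExistsAsync();
await collection.UpsertAsync(new MyChunk { id = "1", content = text, contentVector = embedding });

await foreach (var match in collection.SearchAsync(embedding, top: 3))
{
  Console.WriteLine($"{match.Score}: {match.Record.content}");
}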

Putting it together

  • Start small: create the index and push a few documents with embeddings.
  • Move file-based data into the pipeline (OCR + chunking + embeddings).
  • Add filters (department, customer, region) and document‑level permissions.
  • Wire the index to Semantic Kernel and your chat orchestration.

Once your retrieval is solid, iterate on prompt templates and grounding. Azure AI Search gives you hybrid retrieval, vector quality, and authorization in one service: everything your C# RAG app needs.

