Chat with DeepSeek locally and on Azure AI Foundry

I like DeepSeek’s capabilities, but I care a lot about data boundaries. Personally, I only run models locally or in cloud regions I select (Europe or the US); I don’t use endpoints hosted in jurisdictions I don’t control. This post shows how to use DeepSeek in two ways: locally (via OpenAI‑compatible local servers) and in Azure AI Foundry (formerly Azure AI Studio), with a strong focus on region and network controls.

Model used in the examples: DeepSeek R1 Distill (Llama 8B)

1) Run DeepSeek locally (OpenAI‑compatible server)

There are several ways to run models locally (for example via OpenAI‑compatible servers like LM Studio or community runtimes). Model availability depends on licenses and distribution channels; always obtain weights from a trusted source and verify the license allows local inference.

Common local pattern (Node/TS): point the OpenAI client at your local server with a base URL and a dummy key.

import OpenAI from "openai";

// Example local server: http://localhost:1234/v1 (LM Studio and similar)
// Some runtimes may expose http://localhost:11434/v1 or another base URL.
const client = new OpenAI({
  apiKey: process.env.LOCAL_API_KEY || "sk-local", // often ignored locally
  baseURL: process.env.LOCAL_BASE_URL || "http://localhost:1234/v1"
});

const model = process.env.LOCAL_MODEL || "<your local tag for DeepSeek R1 Distill (Llama 8B)>";

const response = await client.chat.completions.create({
  model: model, // ensure this matches your local runtime's model tag/name
  messages: [
    { role: "system", content: "You are a concise, helpful assistant." },
    { role: "user", content: "Give me two prompt-writing tips." }
  ],
  stream: true
});

let full = "";
for await (const part of response) {
  const delta = part.choices?.[0]?.delta?.content ?? "";
  if (delta) {
    full += delta;
    process.stdout.write(delta);
  }
}
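
After the loop, full contains the complete reply; that accumulated text is also what I feed the token estimate near the end of this post.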

Semantic Kernel (C#) using OpenAI‑compatible local endpoint

If your local runtime exposes an OpenAI‑style API, you can wire it into SK without changing the rest of your app.

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

var builder = Kernel.CreateBuilder()
  .AddOpenAIChatCompletion(
    modelId: Environment.GetEnvironmentVariable("LOCAL_MODEL") ?? "<your local tag for DeepSeek R1 Distill (Llama 8B)>",
    apiKey: Environment.GetEnvironmentVariable("LOCAL_API_KEY") ?? "sk-local",
    endpoint: new Uri(Environment.GetEnvironmentVariable("LOCAL_BASE_URL") ?? "http://localhost:1234/v1")
  );

Kernel kernel = builder.Build();
var chat = kernel.GetRequiredService<IChatCompletionService>();
var history = new ChatHistory();
history.AddSystemMessage("You are a concise, helpful assistant.");
history.AddUserMessage("Give me two prompt-writing tips.");

await foreach (var chunk in chat.GetStreamingChatMessageContentsAsync(history, kernel: kernel))
{
  if (!string.IsNullOrEmpty(chunk.Content)) Console.Write(chunk.Content);
}

Notes:

  • Use a model name that matches your local runtime’s tag; some local servers let you alias models. The snippet below lists the tags your server actually exposes.
  • Streaming behavior and token accounting work the same as with other chat models. See the “How streaming works” section in the previous post.
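
If you’re unsure which tag your runtime exposes, most OpenAI‑compatible servers also implement the /v1/models endpoint, so you can list tags with the same client. A minimal sketch, assuming the LM Studio default port from above:

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "sk-local", // often ignored locally
  baseURL: process.env.LOCAL_BASE_URL || "http://localhost:1234/v1"
});

// List the model tags the server exposes via /v1/models.
for await (const m of client.models.list()) {
  console.log(m.id); // use one of these ids as LOCAL_MODEL
}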

2) Azure AI Foundry (formerly Azure AI Studio) approach

DeepSeek is not part of Azure OpenAI. Instead, check the Azure AI Model Catalog for availability in your region, or bring your own container if model weights and licensing permit. Two main paths:

  1. If DeepSeek appears in the Model Catalog for your desired region (EU/US), deploy a real‑time serverless endpoint. You get an endpoint URL with key or Microsoft Entra ID (AAD) authentication. Lock it down with Private Link and disable public network access.
  2. If it isn’t available in the catalog, deploy your own inference server container (AKS or Azure Container Apps) and place it behind an internal endpoint or Application Gateway. Authenticate with Managed Identity and keep all data in‑region (see the sketch after this list).
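
If your container speaks the OpenAI protocol, the client wiring is the same as in the local example; only the base URL and credentials change. A minimal sketch, assuming a hypothetical internal endpoint and Entra ID token scope (both placeholders for your own values):

import OpenAI from "openai";
import { DefaultAzureCredential } from "@azure/identity";

// Managed Identity (or local developer credentials) via DefaultAzureCredential.
const credential = new DefaultAzureCredential();
// Hypothetical scope; use whatever audience your gateway validates.
const token = await credential.getToken("api://<your-app-id>/.default");

const client = new OpenAI({
  apiKey: token.token, // sent as "Authorization: Bearer <token>"
  baseURL: "https://deepseek.internal.example/v1" // hypothetical internal endpoint
});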

Key data‑boundary practices:

  • Create resources in EU/US regions only. Validate region for storage, logs, and monitoring sinks.
  • Disable data logging and content capture where possible; review diagnostic settings.
  • Use Private Link and deny public network access on endpoints.
  • Prefer Managed Identity over static keys; scope roles minimally.

Example: Calling with the Azure OpenAI client (Node/TS)

import { OpenAIClient, AzureKeyCredential } from "@azure/openai";

const endpoint = process.env.AZURE_OPENAI_ENDPOINT!; // e.g., https://<resource>.openai.azure.com
const deployment = process.env.AZURE_OPENAI_DEPLOYMENT!; // your deployment name
const client = new OpenAIClient(endpoint, new AzureKeyCredential(process.env.AZURE_OPENAI_API_KEY!));

// @azure/openai streams via streamChatCompletions; getChatCompletions has no stream option.
const events = await client.streamChatCompletions(
  deployment,
  [
    { role: "system", content: "You are a concise, helpful assistant." },
    { role: "user", content: "Give me two prompt-writing tips." }
  ]
);

for await (const event of events) {
  for (const choice of event.choices ?? []) {
    process.stdout.write(choice?.delta?.content ?? "");
  }
}

Note: DeepSeek isn’t offered via Azure OpenAI, so use the Azure OpenAI client for Azure OpenAI models only. For DeepSeek endpoints created through Azure AI Foundry/Model Catalog or your own containers, use the Azure AI Inference SDK or direct REST for that endpoint type; the pattern is similar, but the SDK and base URL differ, as the sketch below shows.
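
A minimal sketch against a serverless Model Catalog endpoint with the Azure AI Inference REST client (@azure-rest/ai-inference); the environment variable names are my own convention, and the endpoint URL comes from your deployment:

import ModelClient, { isUnexpected } from "@azure-rest/ai-inference";
import { AzureKeyCredential } from "@azure/core-auth";

// e.g., https://<endpoint>.<region>.models.ai.azure.com
const client = ModelClient(
  process.env.AZURE_INFERENCE_ENDPOINT!,
  new AzureKeyCredential(process.env.AZURE_INFERENCE_KEY!)
);

const result = await client.path("/chat/completions").post({
  body: {
    messages: [
      { role: "system", content: "You are a concise, helpful assistant." },
      { role: "user", content: "Give me two prompt-writing tips." }
    ]
  }
});

if (isUnexpected(result)) throw result.body.error;
console.log(result.body.choices[0].message.content);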

Streaming and token accounting (recap)

Streaming sends partial tokens as they’re generated, which minimizes time to first token. The final response usually includes usage (prompt/completion/total tokens). For local servers that don’t return usage, estimate with a tokenizer and reconcile after completion; a rough sketch follows. See the streaming section in the previous post for detailed examples.
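
A rough estimation sketch with js-tiktoken (my choice of library; its cl100k_base encoding is not DeepSeek’s actual tokenizer, so treat the counts as approximations):

import { getEncoding } from "js-tiktoken";

// cl100k_base only approximates DeepSeek’s tokenizer; counts are estimates.
const enc = getEncoding("cl100k_base");
const estimateTokens = (text: string): number => enc.encode(text).length;

// In practice, pass the accumulated streamed text (e.g., full from the first example).
console.log(`~${estimateTokens("Give me two prompt-writing tips.")} tokens`);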

Troubleshooting

  • 401/403: Wrong key/AAD permissions or calling the wrong endpoint/region.
  • 404: Using a raw model name where your endpoint expects a deployment name (or vice versa).
  • Slow local responses: use quantized weights, reduce context, and avoid running multiple models concurrently.
  • Region/data boundary: double‑check resource groups, storage accounts, and logs are confined to EU/US.
