Chat with Localhost

Build conversational AI with streaming and context

This post kicks off the chat series: how to build a conversational experience with large language models using chat completion, streaming token-by-token responses, and clean context (chat history) management.

We’ll start local with Ollama as the LLM runtime and wire it up to Semantic Kernel. Running locally is great for quick iteration, privacy, and offline demos. Later we can swap the model/service without rewriting the app.

Local hardware notes — my setup and experience

Short summary of my reference hardware and real-world experience running local LLMs:

  • MacBook Pro (M3 Max, 64 GB unified RAM): runs 8B models (quantized) smoothly for single-model inference. Great developer experience and fast streaming response rendering.
  • Surface Pro for Business (Copilot+ PC, Snapdragon X Elite, 12-core, 32 GB): also handles 8B models well for single-process use, though CPU-bound tasks and Windows-on-ARM variations can make performance slightly lower than native macOS on M3.
  • In practice I run one model at a time. For 8B–13B class models this hardware gives very usable response times for business scenarios (chat, summarization, retrieval‑augmented generation). Larger models (>70B) significantly increase memory and compute needs and reduce throughput unless you have GPU acceleration or a large server.

Bottom line: for many business scenarios, smaller models in the 8B range are an excellent trade-off. They are fast, cheap to run, and often good enough for production prompts; reserve larger models for tasks that genuinely require the extra capability.

What “chat completion” actually means

  • A chat service accepts a list of messages (system, user, assistant) and returns the next assistant message (see the code sketch after this list).
  • “Context management” is about how you store, trim, and serialize that history so the model has enough memory without blowing the token budget.
  • “Streaming” lets you render partial tokens as they’re generated for a snappy UX.
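
In Semantic Kernel that message list is modeled by ChatHistory. Here is a minimal sketch of a multi-turn conversation, roles only, no model call yet (the message text is purely illustrative; the full example with a model call comes later):

using Microsoft.SemanticKernel.ChatCompletion;

// A conversation is just an ordered list of role-tagged messages.
var history = new ChatHistory();
history.AddSystemMessage("You are a concise, helpful assistant.");   // persona / behavior
history.AddUserMessage("What is Ollama?");                           // first user turn
history.AddAssistantMessage("Ollama is a local LLM runtime.");       // model reply, kept as context
history.AddUserMessage("How do I call it from C#?");                 // next turn sees everything above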

Prerequisites (local-first)

  1. Install Ollama (macOS/Linux/Windows). Then pull a model, for example llama4.
  2. Ensure the Ollama server is running on http://localhost:11434 (default).
  3. In your C# project, add Semantic Kernel + the Ollama connector.
dotnet add package Microsoft.SemanticKernel
dotnet add package Microsoft.SemanticKernel.Connectors.Ollama

Install and start Ollama (quick)

macOS (Homebrew):

# Install (if you use Homebrew)
brew install ollama

# Pull a model (example)
ollama pull llama4

# Run the model (this will start the local server and attach to the model)
ollama run llama4

# List downloaded models
ollama list

Windows (general):

  1. Download the Ollama installer from the official site for Windows (or use winget/choco if available).
  2. Pull and run the model similarly: ollama pull <model> and ollama run <model>.

The local server listens on http://localhost:11434 by default. Point your Semantic Kernel connector at that URI.
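
If you want a quick sanity check from code before wiring up the connector, a small HTTP probe works. This sketch assumes the default port and uses Ollama's /api/tags REST endpoint, which lists the models you have pulled:

using System;
using System.Net.Http;

// Probe the local Ollama server and list locally available models (JSON response).
using var http = new HttpClient { BaseAddress = new Uri("http://localhost:11434") };
var response = await http.GetAsync("/api/tags");
Console.WriteLine($"Ollama reachable: {response.IsSuccessStatusCode}");
Console.WriteLine(await response.Content.ReadAsStringAsync());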

Minimal chat with streaming (C#)

using System;
using System.Threading.Tasks;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

// The Ollama connector is experimental at the time of writing
#pragma warning disable SKEXP0070

class Program
{
  static async Task Main()
  {
    // 1) Create the kernel and add Ollama chat completion
    var builder = Kernel.CreateBuilder()
      .AddOllamaChatCompletion(
        modelId: "llama4",                    // any model you've pulled with Ollama
        endpoint: new Uri("http://localhost:11434")
      );

    Kernel kernel = builder.Build();

    // 2) Get the chat completion service
    var chat = kernel.GetRequiredService<IChatCompletionService>();

    // 3) Prepare chat history (context)
    var history = new ChatHistory();
    history.AddSystemMessage("You are a concise, helpful assistant.");
    history.AddUserMessage("Give me two tips for writing better prompts.");

    // 4) Stream the response, chunk-by-chunk
    await foreach (var chunk in chat.GetStreamingChatMessageContentsAsync(history, kernel: kernel))
    {
      // Each chunk is a StreamingChatMessageContent; write only the text part
      if (!string.IsNullOrEmpty(chunk.Content))
      {
        Console.Write(chunk.Content);
      }
    }

    Console.WriteLine();
  }
}

What’s happening

  • We add the Ollama chat connector to the kernel and point it at http://localhost:11434.
  • We use ChatHistory to keep context (system + user messages). Add turns as your UI collects them.
  • We call GetStreamingChatMessageContentsAsync(...) to produce incremental chunks and render them live (a non-streaming variant is sketched below).
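
For comparison, the non-streaming call waits for the complete reply and returns a single message. A minimal variant of step 4 using the same history, as a sketch:

// Non-streaming: wait for the full reply, then print it in one go.
var reply = await chat.GetChatMessageContentAsync(history, kernel: kernel);
Console.WriteLine(reply.Content);

// Append the reply to the history so the next turn has full context.
history.AddAssistantMessage(reply.Content ?? string.Empty);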

Notes on the SK connector and local behavior

  • The Ollama connector in Semantic Kernel behaves like any other chat completion service: register it once and call chat APIs (streaming or non‑streaming) through the Kernel abstraction.
  • Local inference is bound by the CPU (or, on Apple Silicon, the on-chip GPU/accelerators). Expect lower throughput than a GPU-backed cloud endpoint, but with no network hop, latency for small prompts is often lower.
  • For multi‑user or production scenarios, consider a managed endpoint (Azure AI Model Catalog or Azure OpenAI) to scale horizontally and manage authentication.

Managing context like a pro

As conversations grow, token limits bite. A few practical patterns:

  • Keep a thin system message to set behavior (“concise, helpful”).
  • Summarize older turns periodically and replace them with a brief recap.
  • Keep only the last N turns verbatim, plus a rolling summary.
  • Consider role-based memory: user profile, session goals, temporary scratch notes.

In code, you can prune history before each call:

// Keep the system message plus the most recent 6 messages (tune to your token budget)
while (history.Count > 7)
{
  history.RemoveAt(1); // drop the oldest message after the system message at index 0
}

Or, maintain a separate “summary” string you refresh every few turns and inject as an assistant message before the latest question.
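
A hedged sketch of that rolling-summary pattern is below. BuildPrunedHistory is a hypothetical helper, and rollingSummary is assumed to be produced elsewhere (for example by a separate summarization prompt run every few turns):

using System;
using System.Linq;
using Microsoft.SemanticKernel.ChatCompletion;

// Rebuild a compact history: system message + rolling summary + last few verbatim messages.
ChatHistory BuildPrunedHistory(ChatHistory full, string rollingSummary, int keepLast = 6)
{
  var pruned = new ChatHistory();
  pruned.AddSystemMessage("You are a concise, helpful assistant.");

  if (!string.IsNullOrEmpty(rollingSummary))
  {
    // Inject the recap so the model "remembers" older turns without the full transcript.
    pruned.AddAssistantMessage($"Summary of the conversation so far: {rollingSummary}");
  }

  // Copy only the most recent messages verbatim (skip the original system message at index 0).
  foreach (var message in full.Skip(Math.Max(1, full.Count - keepLast)))
  {
    pruned.Add(message);
  }

  return pruned;
}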

Swapping models and services later

One of SK’s strengths is easy service swapping. If you decide to move from local to cloud, you can register Azure OpenAI, OpenAI, or others with minimal code changes. See: https://learn.microsoft.com/semantic-kernel/concepts/ai-services/
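
For example, the local Ollama registration in the sketch above can be swapped for Azure OpenAI with a single builder call, and the rest of the chat code stays the same. This assumes the Azure OpenAI connector is available (bundled with recent Microsoft.SemanticKernel versions; otherwise add Microsoft.SemanticKernel.Connectors.AzureOpenAI) and that the deployment name, endpoint, and key below are placeholders for your own resource:

// Cloud variant: only the service registration changes; IChatCompletionService usage is identical.
// Drops into the same Program (and usings) as the local example above.
var builder = Kernel.CreateBuilder()
  .AddAzureOpenAIChatCompletion(
    deploymentName: "my-gpt-deployment",                              // placeholder deployment name
    endpoint: "https://my-resource.openai.azure.com/",                // placeholder resource endpoint
    apiKey: Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!   // keep secrets out of source
  );

Kernel kernel = builder.Build();
var chat = kernel.GetRequiredService<IChatCompletionService>();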

Model size, RAM and practical differences (8B model on 64 GB vs smaller hosts)

People often confuse “model size” (for example 8B = ~8 billion parameters) with the exact memory footprint on disk or in RAM. A few practical facts and heuristics:

  • Model parameters vs memory: the raw parameter count doesn’t map 1:1 to RAM. Frameworks store parameters in precisions like FP16, FP32, or quantized 8-bit/4-bit; FP16 uses roughly half the bytes of FP32 per parameter (see the quick calculation after this list).
  • Disk vs runtime memory: models are stored on disk (tens of GB for large models). Loading one for inference takes more memory than the file size because of the weights plus KV-cache/workspace buffers and activations.
  • Quantization matters: 4-bit or 8-bit quantization drastically reduces RAM needs and is standard for running 8B–13B models locally. Many 8B models are available pre-quantized or via tooling that converts them.
  • Unified memory on Apple Silicon: on the M3 Max the CPU and on-chip GPU/accelerators share the same RAM, which avoids copying data between separate memory pools and generally improves throughput compared to discrete CPU/RAM/GPU setups.
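
As a rough back-of-the-envelope illustration of the precision point (weights only; KV cache and runtime overhead add more on top, and actual usage depends on the runtime):

using System;

// Approximate weight memory = parameter count × bytes per parameter.
const double paramsInBillions = 8.0;                        // an "8B" model
Console.WriteLine($"FP32 : ~{paramsInBillions * 4} GB");    // 4 bytes/param   -> ~32 GB
Console.WriteLine($"FP16 : ~{paramsInBillions * 2} GB");    // 2 bytes/param   -> ~16 GB
Console.WriteLine($"8-bit: ~{paramsInBillions * 1} GB");    // 1 byte/param    -> ~8 GB
Console.WriteLine($"4-bit: ~{paramsInBillions * 0.5} GB");  // 0.5 bytes/param -> ~4 GB plus quantization metadata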

Rough guidance (very approximate):

  • Running an 8B model (quantized to 4/8-bit) comfortably fits on a 32 GB machine for single-process use. On a 64 GB machine you get extra headroom for larger context windows and additional processes.
  • Running a full 8B model in FP16 without quantization may require more RAM (doable on 64 GB but may be tight depending on context length and framework).
  • Larger models (30B, 70B+) will typically require GPU acceleration or a machine with much more memory; they’re not ideal for single‑machine local inference without specialized hardware.

Practical tips to make local LLMs pleasant:

  • Use quantized model builds where possible (4-bit/8-bit). They reduce RAM and often preserve acceptable quality.
  • Limit the context window you send; load long documents into a retriever and send only the relevant chunks.
  • Run one model process at a time on machines without GPUs. Multiple simultaneous models will compete for RAM and CPU.
  • Monitor memory with htop / Activity Monitor and watch for swap use — heavy swapping kills performance.

Troubleshooting local Ollama

  • Make sure Ollama is running and the model has been pulled (run ollama run llama4 once to download it, then exit).
  • If you changed the port or host, update the endpoint URI.
  • For very small models, keep prompts short and focused for better quality.
