AI · Intermediate · 45 min

Host LLM in Kubernetes

Learn how to host large language models in Kubernetes using Ollama, chat with them via Open-WebUI, and use K8sGPT to debug your cluster with AI.

Overview

In this workshop, you'll deploy a complete AI stack on Kubernetes using Kuberise.io:

  1. Ollama — serve LLMs (like DeepSeek, Llama) via a REST API
  2. Open-WebUI — a ChatGPT-like web interface for your self-hosted models
  3. K8sGPT — an AI assistant that diagnoses Kubernetes cluster issues

All three tools are available as Kuberise.io platform components and can be enabled with a single configuration change.

Part 1: Ollama — Self-Hosted LLM Server

Enable Ollama

In your enabler file, enable the Ollama component:

# app-of-apps/values-{name}.yaml (e.g. values-shared.yaml)
ArgocdApplications:
  ollama:
    enabled: true
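
Once ArgoCD has synced the change, confirm the Ollama pod is running. The ollama namespace and service name below match the in-cluster URL used later in this workshop (ollama.ollama.svc.cluster.local); adjust them if your platform uses different names.

# Watch the Ollama pod come up (namespace assumed to be "ollama")
kubectl get pods -n ollama

# The Service should expose Ollama's API on port 11434
kubectl get svc -n ollama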

Try It Locally First

Install the Ollama CLI on your machine and run a model locally:

ollama run deepseek-r1:1.5b

Example prompt: "Write a love letter from a smartphone to its charger"

Try Your Kubernetes-Hosted Model

Point the Ollama CLI at your cluster's Ollama instance:

OLLAMA_HOST=https://ollama.onprem.kuberise.dev ollama run deepseek-r1:1.5b
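
Since Ollama serves models over a plain REST API, you can also call the cluster-hosted instance directly, without the CLI. A minimal sketch against the same hostname, using Ollama's /api/generate endpoint:

# Ask the cluster-hosted model for a single, non-streamed completion
curl https://ollama.onprem.kuberise.dev/api/generate -d '{
  "model": "deepseek-r1:1.5b",
  "prompt": "Write a love letter from a smartphone to its charger",
  "stream": false
}'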

Compare and Observe

  • Compare the speed — is local inference faster or slower than the cluster?
  • Check resource usage — open the Grafana dashboard to see CPU and memory consumption of the Ollama pod
Docker Desktop for Mac does not expose the host GPU to containers. Models will run on CPU only in local Docker-based clusters.
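
As a quick CLI alternative to the Grafana dashboard, and assuming metrics-server is available in your cluster, you can snapshot the Ollama pod's consumption directly:

# Point-in-time CPU/memory usage of the Ollama pod (requires metrics-server)
kubectl top pod -n ollama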

Configuration Options

You can fine-tune Ollama's deployment through the values file:

Configuration        Description
replicaCount         Scale horizontally for higher throughput
GPU resources        Enable GPU passthrough for faster inference
Persistent Volume    Avoid re-downloading models on pod restart
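
As an illustration, such overrides might look roughly like the snippet below in the Ollama values. The exact key names depend on the Helm chart version Kuberise.io ships, so treat them as assumptions and check the chart's values.yaml before committing.

# Illustrative values only; verify key names against the deployed chart
replicaCount: 2          # run two Ollama replicas for more throughput
ollama:
  gpu:
    enabled: true        # request GPU passthrough for faster inference
    number: 1
persistentVolume:
  enabled: true          # keep downloaded models across pod restarts
  size: 30Gi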

Working with Multiple Models

  • Pull models: Ollama can host multiple models simultaneously (example commands follow this list)
  • Switch models: Use the /show info command in the Ollama CLI to verify which model is running
  • GitOps workflow: Change the model in your values file, push to Git, and ArgoCD automatically restarts the pod with the new model
For laptop-friendly workshops, run a single small model (like deepseek-r1:1.5b) to conserve resources. You can switch to larger models in production clusters with more capacity.
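
For example, to pull an additional model into the cluster-hosted Ollama ad hoc and confirm what the server currently offers (for a permanent change, prefer the GitOps route described above):

# Pull a second model into the cluster-hosted Ollama (ad hoc, not GitOps-managed)
OLLAMA_HOST=https://ollama.onprem.kuberise.dev ollama pull llama3.2:3b

# List the models the server currently hosts
OLLAMA_HOST=https://ollama.onprem.kuberise.dev ollama list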

Part 2: Open-WebUI — Chat Interface

Enable Open-WebUI

# app-of-apps/values-{name}.yaml (e.g. values-shared.yaml)
ArgocdApplications:
  open-webui:
    enabled: true

Access the Chat Interface

  1. Check that ArgoCD has deployed Open-WebUI successfully
  2. Verify a new Ingress resource has been created (a quick check follows this list)
  3. Open the URL: https://webui.onprem.kuberise.dev
  4. Select the model you want to chat with from the dropdown
  5. Start a conversation — your chat history is saved automatically
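
To check step 2 from the command line (the namespace below is an assumption; adjust it to wherever Open-WebUI is deployed in your cluster):

# The HOSTS column should show webui.onprem.kuberise.dev
kubectl get ingress -n open-webui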

Open-WebUI provides a familiar ChatGPT-like experience, but your data stays on your infrastructure. This is especially valuable for:

  • Data privacy — sensitive prompts never leave your cluster
  • Cost control — no per-token API charges
  • Customization — fine-tune models for your specific use cases

Part 3: K8sGPT — AI-Powered Cluster Diagnostics

K8sGPT connects your Kubernetes cluster to an LLM to automatically diagnose issues and provide human-readable explanations.

Enable K8sGPT

# app-of-apps/values-{name}.yaml (e.g. values-shared.yaml)
ArgocdApplications:
  k8sgpt:
    enabled: true

Configure the K8sGPT Custom Resource

Ensure the model specified in the K8sGPT CR matches a model running in your Ollama instance:

apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-ollama
spec:
  ai:
    enabled: true
    model: llama3.2:3b
    backend: localai
    baseUrl: http://ollama.ollama.svc.cluster.local:11434/

Enable the Ollama backend in the K8sGPT values:

enable_ollama: true
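
To confirm that the model named in the CR is actually available at that baseUrl, you can query Ollama's model list from inside the cluster. The helper pod below is a throwaway and the curl image is only an example.

# List the models Ollama hosts, using the same in-cluster URL the K8sGPT CR points at;
# the output should include llama3.2:3b
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://ollama.ollama.svc.cluster.local:11434/api/tags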

Test It

  1. Check existing results — K8sGPT creates Result resources in the k8sgpt namespace for any issues it detects:
kubectl get results -n k8sgpt
  2. Create a faulty pod to trigger a diagnostic:
kubectl run faulty-pod --image=fakeimage:latest
  3. Watch K8sGPT detect the issue — a new Result resource will appear, containing:
    • The error description
    • An AI-generated explanation of the problem
    • Suggested remediation steps
kubectl get results -n k8sgpt -o yaml
  4. Clean up and verify the report is removed:
kubectl delete pod faulty-pod
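
Once the pod is deleted, K8sGPT reconciles and the corresponding Result should disappear after a short delay:

# The Result for faulty-pod should no longer be listed
kubectl get results -n k8sgpt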

K8sGPT continuously monitors your cluster and provides actionable insights — like having an AI SRE watching over your infrastructure.

Key Takeaways

  1. Ollama makes it easy to self-host LLMs on Kubernetes with minimal configuration
  2. Open-WebUI provides a polished chat interface for team-wide access to AI
  3. K8sGPT turns your LLM into a Kubernetes debugging assistant
  4. All three integrate seamlessly with the GitOps workflow — enable, commit, push, done
  5. Self-hosting AI gives you data privacy, cost predictability, and full control