AI · Intermediate · 45 min

Host LLM in Kubernetes

Learn how to host large language models in Kubernetes using Ollama, chat with them via Open-WebUI, and use K8sGPT to debug your cluster with AI.

Overview

In this workshop, you'll deploy a complete AI stack on Kubernetes using Kuberise.io:

  1. Ollama — serve LLMs (like DeepSeek, Llama) via a REST API
  2. Open-WebUI — a ChatGPT-like web interface for your self-hosted models
  3. K8sGPT — an AI assistant that diagnoses Kubernetes cluster issues

All three tools are available as Kuberise.io platform components and can be enabled with a single configuration change.

Part 1: Ollama — Self-Hosted LLM Server

Enable Ollama

In your enabler file, enable the Ollama component:

# app-of-apps/values-{name}.yaml (e.g. values-shared.yaml)
ArgocdApplications:
  ollama:
    enabled: true
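
Once ArgoCD has synced the change, confirm the Ollama pod is running. The ollama namespace and service name below match the in-cluster URL used later in this workshop (ollama.ollama.svc.cluster.local); adjust them if your platform uses different names.

# Watch the Ollama pod come up (namespace assumed to be "ollama")
kubectl get pods -n ollama

# The Service should expose Ollama's API on port 11434
kubectl get svc -n ollama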

Try It Locally First

Install the Ollama CLI on your machine and run a model locally:

ollama run deepseek-r1:1.5b

Example prompt: "Write a love letter from a smartphone to its charger"

Try Your Kubernetes-Hosted Model

Point the Ollama CLI at your cluster's Ollama instance:

OLLAMA_HOST=https://ollama.onprem.kuberise.dev ollama run deepseek-r1:1.5b
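
Since Ollama serves models over a plain REST API, you can also call the cluster-hosted instance directly, without the CLI. A minimal sketch against the same hostname, using Ollama's /api/generate endpoint:

# Ask the cluster-hosted model for a single, non-streamed completion
curl https://ollama.onprem.kuberise.dev/api/generate -d '{
  "model": "deepseek-r1:1.5b",
  "prompt": "Write a love letter from a smartphone to its charger",
  "stream": false
}'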

Compare and Observe

  • Compare the speed — is local inference faster or slower than the cluster?
  • Check resource usage — open the Grafana dashboard to see CPU and memory consumption of the Ollama pod
Docker Desktop for Mac does not expose the host GPU to containers. Models will run on CPU only in local Docker-based clusters.
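
As a quick CLI alternative to the Grafana dashboard, and assuming metrics-server is available in your cluster, you can snapshot the Ollama pod's consumption directly:

# Point-in-time CPU/memory usage of the Ollama pod (requires metrics-server)
kubectl top pod -n ollama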

Configuration Options

You can fine-tune Ollama's deployment through the values file:

Configuration        Description
replicaCount         Scale horizontally for higher throughput
GPU resources        Enable GPU passthrough for faster inference
Persistent Volume    Avoid re-downloading models on pod restart
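
As an illustration, such overrides might look roughly like the snippet below in the Ollama values. The exact key names depend on the Helm chart version Kuberise.io ships, so treat them as assumptions and check the chart's values.yaml before committing.

# Illustrative values only; verify key names against the deployed chart
replicaCount: 2          # run two Ollama replicas for more throughput
ollama:
  gpu:
    enabled: true        # request GPU passthrough for faster inference
    number: 1
persistentVolume:
  enabled: true          # keep downloaded models across pod restarts
  size: 30Gi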

Working with Multiple Models

  • Pull models: Ollama can host multiple models simultaneously (example commands follow this list)
  • Switch models: Use the /show info command in the Ollama CLI to verify which model is running
  • GitOps workflow: Change the model in your values file, push to Git, and ArgoCD automatically restarts the pod with the new model
For laptop-friendly workshops, run a single small model (like deepseek-r1:1.5b) to conserve resources. You can switch to larger models in production clusters with more capacity.
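
For example, to pull an additional model into the cluster-hosted Ollama ad hoc and confirm what the server currently offers (for a permanent change, prefer the GitOps route described above):

# Pull a second model into the cluster-hosted Ollama (ad hoc, not GitOps-managed)
OLLAMA_HOST=https://ollama.onprem.kuberise.dev ollama pull llama3.2:3b

# List the models the server currently hosts
OLLAMA_HOST=https://ollama.onprem.kuberise.dev ollama list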

Part 2: Open-WebUI — Chat Interface

Enable Open-WebUI

# app-of-apps/values-{name}.yaml (e.g. values-shared.yaml)
ArgocdApplications:
  open-webui:
    enabled: true

Access the Chat Interface

  1. Check that ArgoCD has deployed Open-WebUI successfully
  2. Verify a new Ingress resource has been created (a quick check follows this list)
  3. Open the URL: https://webui.onprem.kuberise.dev
  4. Select the model you want to chat with from the dropdown
  5. Start a conversation — your chat history is saved automatically
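
To check step 2 from the command line (the namespace below is an assumption; adjust it to wherever Open-WebUI is deployed in your cluster):

# The HOSTS column should show webui.onprem.kuberise.dev
kubectl get ingress -n open-webui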

Open-WebUI provides a familiar ChatGPT-like experience, but your data stays on your infrastructure. This is especially valuable for:

  • Data privacy — sensitive prompts never leave your cluster
  • Cost control — no per-token API charges
  • Customization — fine-tune models for your specific use cases

Part 3: K8sGPT — AI-Powered Cluster Diagnostics

K8sGPT connects your Kubernetes cluster to an LLM to automatically diagnose issues and provide human-readable explanations.

Enable K8sGPT

# app-of-apps/values-{name}.yaml (e.g. values-shared.yaml)
ArgocdApplications:
  k8sgpt:
    enabled: true

Configure the K8sGPT Custom Resource

Ensure the model specified in the K8sGPT CR matches a model running in your Ollama instance:

apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-ollama
spec:
  ai:
    enabled: true
    model: llama3.2:3b
    backend: localai
    baseUrl: http://ollama.ollama.svc.cluster.local:11434/

Enable the Ollama backend in the K8sGPT values:

enable_ollama: true
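
To confirm that the model named in the CR is actually available at that baseUrl, you can query Ollama's model list from inside the cluster. The helper pod below is a throwaway and the curl image is only an example.

# List the models Ollama hosts, using the same in-cluster URL the K8sGPT CR points at;
# the output should include llama3.2:3b
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://ollama.ollama.svc.cluster.local:11434/api/tags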

Test It

  1. Check existing results — K8sGPT creates Result resources in the k8sgpt namespace for any issues it detects:
kubectl get results -n k8sgpt
  2. Create a faulty pod to trigger a diagnostic:
kubectl run faulty-pod --image=fakeimage:latest
  3. Watch K8sGPT detect the issue — a new Result resource will appear, containing:
    • The error description
    • An AI-generated explanation of the problem
    • Suggested remediation steps
kubectl get results -n k8sgpt -o yaml
  4. Clean up and verify the report is removed:
kubectl delete pod faulty-pod
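
Once the pod is deleted, K8sGPT reconciles and the corresponding Result should disappear after a short delay:

# The Result for faulty-pod should no longer be listed
kubectl get results -n k8sgpt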

K8sGPT continuously monitors your cluster and provides actionable insights — like having an AI SRE watching over your infrastructure.

Key Takeaways

  1. Ollama makes it easy to self-host LLMs on Kubernetes with minimal configuration
  2. Open-WebUI provides a polished chat interface for team-wide access to AI
  3. K8sGPT turns your LLM into a Kubernetes debugging assistant
  4. All three integrate seamlessly with the GitOps workflow — enable, commit, push, done
  5. Self-hosting AI gives you data privacy, cost predictability, and full control