This guide shows how to run the Ollama server on a Kubernetes node with an NVIDIA GPU. It assumes the NVIDIA device plugin is installed on the cluster.
FluxCD manages the Ollama deployment under gitops/clusters/homelab/apps/ollama/.
Commit the manifest files to the repository and Flux will create the namespace,
deployment and service automatically. The deployment mounts an emptyDir volume at
/root/.ollama for model storage. Replace it with a PersistentVolumeClaim if the models
should survive pod restarts.
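A minimal sketch of that PersistentVolumeClaim and the matching volume swap in the pod spec, assuming the cluster has a default StorageClass (the claim name, namespace, and size are illustrative assumptions, not taken from the repository):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models        # assumed name
  namespace: ollama          # assumed namespace
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 20Gi          # size the claim for the models you plan to keep
---
# In the deployment's pod spec, replace the emptyDir volume with:
# volumes:
#   - name: ollama-data
#     persistentVolumeClaim:
#       claimName: ollama-models
```

Commit the claim alongside the other manifests so Flux reconciles it with the deployment.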
Pull a model into the running pod:
kubectl exec deployment/ollama-gpu -- ollama pull qwen3:4b
Verify the model is available:
kubectl exec deployment/ollama-gpu -- ollama list
Send a test request to the service:
curl http://<service-ip>:80/api/generate -d '{"model":"qwen3:4b","prompt":"Hello"}'
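By default /api/generate streams its reply as one JSON object per line, each carrying a "response" fragment and a "done" flag; concatenating the fragments yields the full answer. A quick sketch of stitching them together with standard tools (the sample lines below are illustrative, not captured server output):

```shell
# Sample NDJSON lines in the shape /api/generate streams (illustrative data)
printf '%s\n' \
  '{"model":"qwen3:4b","response":"Hel","done":false}' \
  '{"model":"qwen3:4b","response":"lo","done":false}' \
  '{"model":"qwen3:4b","response":"","done":true}' |
grep -o '"response":"[^"]*"' |      # pull out each response fragment
sed 's/^"response":"//; s/"$//' |   # strip the key and surrounding quotes
tr -d '\n'                          # concatenate the fragments -> Hello
```

Add "stream": false to the request body to receive a single JSON object instead of a stream.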
To swap models (pull new, remove old, verify GPU):
scripts/ollama/update-model.sh qwen3:4b qwen2.5:3b
Add --ha to also update the Home Assistant conversation agent.