What is Ollama, and How to Deploy It in an Enterprise Data Stack?

Last updated on May 12, 2026

What is Ollama?

Ollama is a tool designed for running large language models such as Llama 2 in local environments. It packages model weights, configuration, and supporting data into a single module, which simplifies setup and configuration, including GPU optimization. Developers and researchers can run models locally without intricate setup, which makes advanced models more accessible.

Ollama Deployment Runbook

Deployment Overview

When you deploy Ollama, requests flow through the following path:

Client → Istio Gateway → VirtualService → Ollama Service (port 11434) → Ollama Pod → PVC (model storage)

The deployment creates these Kubernetes objects:

| Object | Purpose |
|--------|---------|
| **Deployment** | Manages the Ollama pod (standalone mode) |
| **PVC** | Persistent storage for downloaded models |
| **Service** | ClusterIP exposing port 11434 (REST) and 8080 (management) |
| **VirtualService** | Istio routing rule for external access |
| **ConfigMap** | Ollama server configuration |

Prerequisites

Before deploying Ollama, ensure you have:

  • Kubernetes cluster with Helm 3 installed
  • GPU nodes (if using GPU): NVIDIA device plugin or GPU Operator installed
  • Storage class available — standard-rwo or Longhorn
  • Istio/service mesh for external access and SSO
  • Node pool labeled appropriately (e.g., hyperplane.dev/nodeType: hyperplane-stack-component-pool)
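The prerequisites above can be checked before you run Helm. The following is a minimal preflight sketch, not an official tool: the function names `require_cmd` and `preflight` are hypothetical, and the checks assume that Istio is installed through its standard CRDs (so `virtualservices.networking.istio.io` exists). It does not verify GPU plugins or node labels.

```shell
# Preflight sketch: verify tooling and cluster prerequisites before deploying.

# Return 0 if the named command is available, non-zero otherwise.
require_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Run every check and report all failures instead of stopping at the first.
# Returns the number of failed checks (0 means everything passed).
preflight() {
  failures=0
  for cmd in kubectl helm; do
    if ! require_cmd "$cmd"; then
      echo "MISSING: $cmd is not installed or not on PATH"
      failures=$((failures + 1))
    fi
  done
  # Only query the cluster if kubectl is available.
  if require_cmd kubectl; then
    kubectl get storageclass >/dev/null 2>&1 \
      || { echo "FAIL: cannot list storage classes"; failures=$((failures + 1)); }
    kubectl get crd virtualservices.networking.istio.io >/dev/null 2>&1 \
      || { echo "FAIL: Istio VirtualService CRD not found"; failures=$((failures + 1)); }
  fi
  [ "$failures" -eq 0 ] && echo "Preflight passed"
  return "$failures"
}
```

Source the file and run `preflight`; a non-zero return code is the number of checks that failed.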

Hardware Planning

Choose your hardware based on the models you plan to run:

| Target Model | Min GPU | Min VRAM | Min System RAM | PVC Size |
|-------------|---------|----------|----------------|----------|
| phi3:mini (3.8B) | None (CPU OK) | — | 8 GB | 20 GB |
| llama3.1:8b | T4 or L4 | 16 GB | 16 GB | 50 GB |
| mistral:7b | T4 or L4 | 16 GB | 16 GB | 50 GB |
| codellama:13b | A10G | 24 GB | 32 GB | 80 GB |
| llama3.1:70b | A100 | 80 GB | 128 GB | 200 GB |

GPU recommendations by use case:

  • Development/testing — CPU only or T4
  • Production (small models) — L4 or A10G
  • Production (large models) — A100

Step-by-Step Deployment

Step 1: Prepare Values File

Key points:

  • ollama.models.clean: false — Never enable this initially. It can delete models you've already pulled.
  • podLabels.sidecar.istio.io/inject: "true" — Without this, Istio won't inject the sidecar proxy, and external routing will fail.
  • updateStrategy.type: Recreate — Ollama uses a single pod with PVC. RollingUpdate won't work properly with PVC binding.
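As an illustration of the key points above, here is a sketch that writes a starter values.yaml. The `ollama.models`, `podLabels`, and `updateStrategy` keys come from this runbook; the `persistentVolume` keys and the helper name `write_values` are assumptions, so check them against your chart's values schema before use.

```shell
# Write a starter values file to the given path (default: values.yaml).
write_values() {
  cat > "${1:-values.yaml}" <<'EOF'
ollama:
  models:
    clean: false          # never delete already-pulled models
    pull:
      - llama3.1:8b       # example model; adjust to your needs

podLabels:
  sidecar.istio.io/inject: "true"   # required for Istio sidecar injection

updateStrategy:
  type: Recreate          # single pod bound to a PVC, so no rolling update

# ASSUMPTION: key names for storage; verify against your chart.
persistentVolume:
  enabled: true
  size: 50Gi
  storageClass: standard-rwo
EOF
}
```

Calling `write_values values.yaml` produces the file that Step 2 passes to Helm with `-f values.yaml`.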

Step 2: Deploy with Helm

helm upgrade --install ollama <chart-path> \
 -n hyperplane-ollama \
 -f values.yaml \
 --create-namespace \
 --wait \
 --timeout 10m

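  • upgrade --install: installs the release if it is new, upgrades it if it already exists
  • --create-namespace: creates the hyperplane-ollama namespace if it does not exist
  • --wait: blocks until all pods are ready
  • --timeout 10m: fails if the deployment doesn't complete in 10 minutes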

Step 3: Configure Networking (Critical)

⚠️ This is a real issue we've hit in production. After a Helm upgrade, the VirtualService was missing, causing a 404 error. You may need to create it manually.

Ollama needs an Istio VirtualService to be accessible externally. Check if one exists:

kubectl get virtualservice -n hyperplane-ollama

If missing, create one:

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
 name: ollama-vs
 namespace: hyperplane-ollama
 labels:
   app.kubernetes.io/name: ollama
   hyperplane-service-name: ollama
   hyperplane.dev/stack-component: ollama
spec:
 gateways:
 - hyperplane-istio/ingress-gateway   # Your gateway name
 hosts:
 - ollama.staging.canopyhub.io         # Your domain
 http:
 - match:
   - uri:
       prefix: /
   route:
   - destination:
       host: ollama                    # Must match the Service name
       port:
         number: 11434

Apply it:

kubectl apply -f ollama-virtualservice.yaml

Common mistakes:

  • Wrong gateway name — check with kubectl get gateway -n hyperplane-istio
  • Wrong destination host — must match the Kubernetes Service name (usually ollama)
  • Wrong port — Ollama API runs on 11434, not 8080
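Because the VirtualService can go missing after a Helm upgrade, the check-and-recreate step can be scripted. This is a minimal sketch under stated assumptions: the function name `ensure_virtualservice` is hypothetical, and it treats any VirtualService in the namespace as sufficient rather than validating its gateway, host, or port.

```shell
# Re-apply the VirtualService manifest if none exists in the namespace.
# $1 = namespace, $2 = path to the VirtualService manifest.
ensure_virtualservice() {
  ns=$1
  manifest=$2
  # "-o name" prints one "kind/name" line per object, or nothing if none exist.
  existing=$(kubectl get virtualservice -n "$ns" -o name 2>/dev/null)
  if [ -n "$existing" ]; then
    echo "VirtualService present: $existing"
  else
    echo "VirtualService missing in $ns, applying $manifest"
    kubectl apply -f "$manifest"
  fi
}
```

For example, `ensure_virtualservice hyperplane-ollama ollama-virtualservice.yaml` can be run right after each `helm upgrade`.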

Step 4: Pull Models

After deployment, you need to pull at least one model before you can use Ollama.

Option A: Pull from within the pod

# Get the pod name
kubectl get pods -n hyperplane-ollama

# Pull a model
kubectl exec -n hyperplane-ollama <pod-name> -- ollama pull llama3.1

Option B: Pull via API

curl http://ollama.hyperplane-ollama.svc.cluster.local:11434/api/pull \
 -d '{"name": "llama3.1"}'

Option C: Auto-pull via Helm values

ollama:
 models:
   pull:
     - llama3.1
     - mistral

Recommended first model: llama3.1:8b — good balance of quality and speed.
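Whichever option you use, it helps to confirm the pull actually landed. The sketch below checks `ollama list` output for a model name. The helper name `model_present` is hypothetical, and it assumes the usual tabular output with a header row and the model name in the first column.

```shell
# Succeed if the model named in $1 appears in `ollama list` output on stdin.
# A name without a tag is matched as "<name>:latest", the default tag.
model_present() {
  want=$1
  case "$want" in
    *:*) ;;                     # tag given, match as-is
    *) want="$want:latest" ;;   # no tag, assume the default tag
  esac
  # Skip the header row, then compare the first column exactly.
  awk -v want="$want" 'NR > 1 && $1 == want { found = 1 } END { exit !found }'
}
```

With the pod name stored in a shell variable (here a hypothetical `$POD`), usage looks like `kubectl exec -n hyperplane-ollama "$POD" -- ollama list | model_present llama3.1 && echo "model ready"`.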

Step 5: Validate Deployment

Run these checks to confirm everything is working:

# 1. Pod is running and ready
kubectl get pods -n hyperplane-ollama

# 2. Service exists and points to the pod
kubectl get svc -n hyperplane-ollama

# 3. API responds
kubectl exec -n hyperplane-ollama <pod-name> -- curl -s http://localhost:11434/api/version

# 4. Model inference works
kubectl exec -n hyperplane-ollama <pod-name> -- \
 curl -s http://localhost:11434/api/generate \
 -d '{"model": "llama3.1", "prompt": "Hello", "stream": false}'

# 5. External access works (if VirtualService configured)
curl -s https://ollama.<your-domain>/api/version

# 6. Istio sidecar is injected
kubectl get pods -n hyperplane-ollama -o jsonpath='{.items[*].spec.containers[*].name}'
# Should show both "ollama" and "istio-proxy"

Success criteria:

  • ✅ Pod is Running (1/1 or 2/2 with Istio sidecar)
  • ✅ API version endpoint returns JSON
  • ✅ Model inference returns a response
  • ✅ External URL is accessible
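Two of the checks above (the version endpoint and sidecar injection) are easy to automate. The following is a rough sketch: the helper names are hypothetical, it assumes `curl` is available inside the Ollama container, as the manual checks do, and it assumes the sidecar appears as a regular container named `istio-proxy`.

```shell
# Succeed if the space-separated container names in $1 include istio-proxy.
has_sidecar() {
  case " $1 " in
    *" istio-proxy "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeed if $1 looks like a version response such as {"version":"0.17.7"}.
is_version_json() {
  printf '%s' "$1" | grep -Eq '^[[:space:]]*\{[[:space:]]*"version"[[:space:]]*:[[:space:]]*"[^"]+"[[:space:]]*\}[[:space:]]*$'
}

# Run the API and sidecar checks against pod $2 in namespace $1.
smoke_test() {
  ns=$1; pod=$2
  version=$(kubectl exec -n "$ns" "$pod" -- curl -s http://localhost:11434/api/version)
  is_version_json "$version" || { echo "FAIL: unexpected API response: $version"; return 1; }
  names=$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.containers[*].name}')
  has_sidecar "$names" || { echo "FAIL: istio-proxy container not found"; return 1; }
  echo "OK: API up, sidecar injected"
}
```

Call it as `smoke_test hyperplane-ollama "$POD"`, where `$POD` holds the pod name from check 1. Model inference and external access still need the manual checks above.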

GPU Deployment (Specifics)

For GPU-accelerated inference, add these to your values.yaml:

ollama:
 gpu:
   type: nvidia

# Resource requests for GPU
resources:
 limits:
   nvidia.com/gpu: 1

# Tolerations for GPU nodes
tolerations:
 - key: nvidia.com/gpu
   operator: Exists
   effect: NoSchedule

GPU setup checklist:

  • NVIDIA GPU Operator installed on the cluster
  • nvidia-device-plugin DaemonSet running on GPU nodes
  • Node pool labeled for GPU scheduling
  • Storage class available on GPU nodes (use standard-rwo if Longhorn is not present)
  • Runtime class set to nvidia if required by your cluster

DRA (Dynamic Resource Allocation): The Helm chart supports DRA for GPU allocation. This is a newer Kubernetes feature — leave it disabled unless your cluster explicitly supports it.
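Before requesting `nvidia.com/gpu` in your values, you may want to confirm that some node actually advertises that resource. Here is a small sketch. The helper name `gpu_nodes` is hypothetical, and it assumes the two-column output produced by the `kubectl` command in the trailing comment, where nodes without GPUs show `<none>`.

```shell
# Read "NAME GPU" rows (with a header line) on stdin and print the names of
# nodes that advertise at least one allocatable GPU.
gpu_nodes() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}

# Usage (the backslash escapes the dot inside the resource name):
#   kubectl get nodes \
#     -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | gpu_nodes
```

An empty result suggests the device plugin or GPU Operator is not running on any node, so a pod requesting a GPU would stay Pending.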

Upgrade Procedure

Based on our real upgrade from chart 1.18.0 (app 0.11.3) to chart 1.50.0 (app 0.17.7):

Before Upgrading

  1. Back up your values:

helm get values ollama -n hyperplane-ollama -o yaml > ollama-values-backup.yaml

  2. Document your model list:

kubectl exec -n hyperplane-ollama <pod-name> -- ollama list > ollama-models-backup.txt

  3. Save the deployment manifest:

kubectl get deployment ollama -n hyperplane-ollama -o yaml > ollama-deployment-backup.yaml

Execute Upgrade

# Dry run first
helm upgrade ollama <new-chart-path> \
 -n hyperplane-ollama \
 -f values.yaml \
 --dry-run --debug

# If dry run looks good, execute
helm upgrade ollama <new-chart-path> \
 -n hyperplane-ollama \
 -f values.yaml \
 --wait --timeout 15m

Post-Upgrade Checks

  • Verify pod reaches Ready state
  • Run ollama --version to confirm new version
  • Run ollama list to confirm models survived the upgrade
  • Test inference on at least one model
  • Check VirtualService — we've seen it go missing after upgrades
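To confirm that models survived, you can compare the ollama-models-backup.txt file saved before the upgrade with a fresh `ollama list` dump. This is a sketch with hypothetical helper names. It assumes both files are raw `ollama list` output with a header row and the model name in the first column.

```shell
# Print the sorted, de-duplicated model names from an `ollama list` dump ($1),
# skipping the header row and blank lines.
model_names() {
  awk 'NR > 1 && NF > 0 { print $1 }' "$1" | sort -u
}

# Print models present in the pre-upgrade dump ($1) but absent after ($2).
missing_models() {
  before=$(mktemp); after=$(mktemp)
  model_names "$1" > "$before"
  model_names "$2" > "$after"
  # comm -23 keeps lines unique to the first (sorted) file.
  comm -23 "$before" "$after"
  rm -f "$before" "$after"
}
```

After saving a new dump (for example to a hypothetical ollama-models-after.txt), run `missing_models ollama-models-backup.txt ollama-models-after.txt`; empty output means every model is still listed.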

Rollback

helm rollback ollama -n hyperplane-ollama

Known Upgrade Gotchas

  • Recreate strategy = downtime — Ollama uses Recreate update strategy, so there's a brief outage during upgrade
  • Keep models.clean: false — Don't enable model cleanup during upgrades
  • VirtualService may disappear — Check and recreate if needed (see Step 3)

Why is Ollama better on Shakudo?

Core Shakudo Features

Own Your AI

Keep data sovereign, protect IP, and avoid vendor lock-in with infra-agnostic deployments.

Faster Time-to-Value

Pre-built templates and automated DevOps accelerate time-to-value.

Flexible with Experts

Operating system and dedicated support ensure seamless adoption of the latest and greatest tools.