Ollama Integration | Deploy on Shakudo

Ollama Knowledge Base

Ollama Overview

Ollama is an open-source tool that lets you run large language models (LLMs) locally on your own infrastructure. Instead of sending data to external API providers, Ollama runs models like Llama, Mistral, and Phi directly on your servers — keeping your data private and your costs predictable.

For enterprises, this matters for three reasons:

Data privacy: No data leaves your environment. Prompts, responses, and documents never touch a third-party API.
Offline inference: Once a model has been pulled, it runs without internet access. Ideal for air-gapped or compliance-restricted environments.
No per-token costs: Once deployed, you pay for the infrastructure rather than per request. No surprise API bills, no provider-imposed rate limits.

Ollama handles model downloading, loading, and serving through a simple REST API. It supports both CPU and GPU inference, and it's compatible with the OpenAI API format — so apps built for OpenAI can switch to Ollama by changing a single URL.
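As a rough illustration of the REST API mentioned above, the sketch below calls Ollama's native `/api/generate` endpoint using only the Python standard library. The base URL `http://localhost:11434` and the model name `llama3.1` are assumptions; substitute whatever your deployment exposes.

```python
import json
import urllib.request


def build_generate_request(prompt, model="llama3.1",
                           base_url="http://localhost:11434"):
    """Build a non-streaming POST request for Ollama's /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())["response"]


# Example (needs a reachable Ollama server with the model pulled):
# print(generate("What is machine learning?"))
```

Setting `"stream": False` returns a single JSON object instead of a stream of chunks, which keeps a first test simple.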

Key Features

One-command model running — Pull and run models with simple commands (ollama run llama3.1)
OpenAI-compatible API — Use existing OpenAI SDKs and tools with Ollama as the backend
GPU acceleration — NVIDIA GPU support for fast inference (A100, A10G, L4, T4)
CPU inference — Runs on CPU-only nodes for smaller models
Model management — Pull, list, remove, and customize models easily
REST API — Full API for integration with any application
Multi-model support — Run different models for different tasks on the same deployment
Kubernetes-native — Deploy via Helm chart with PVC storage, Istio integration, and GPU node support

Architecture

Ollama has a straightforward architecture:

Client → Ollama Server (port 11434) → Model Storage (PVC)

In a Kubernetes deployment, the Ollama server loads models from the PVC into memory (RAM or GPU VRAM) on demand, serves inference requests through the REST API, and keeps models loaded for the duration set by OLLAMA_KEEP_ALIVE.
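The keep-alive behaviour can also be controlled per request. The sketch below builds request bodies for `POST /api/generate` that only load or unload a model, using the API's `keep_alive` field; the model name and the 30-minute duration are illustrative assumptions.

```python
import json


def keep_alive_payload(model, keep_alive):
    """Body for POST /api/generate that only changes how long a model stays loaded.

    Leaving out the prompt loads the model without generating text;
    keep_alive=0 asks the server to unload it right away.
    """
    return json.dumps({"model": model, "keep_alive": keep_alive})


preload = keep_alive_payload("llama3.1", "30m")  # load and keep for 30 minutes
unload = keep_alive_payload("llama3.1", 0)       # unload immediately
print(preload)
print(unload)
```

Preloading before a demo or a batch job avoids paying the model load time on the first real request.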

Supported Models

**Note:** Models with "Q4" quantization are the default. Full-precision models require significantly more memory.
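To make the memory difference concrete, here is a back-of-the-envelope estimate of weight size from parameter count and bits per weight. It is only an approximation: real model files and runtime usage are larger because of quantization metadata, the context (KV cache), and runtime overhead.

```python
def estimate_weights_gb(params_billions, bits_per_weight):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


q4 = estimate_weights_gb(8, 4)     # 8B model at 4 bits per weight
fp16 = estimate_weights_gb(8, 16)  # same model at 16 bits per weight
print(f"Q4: ~{q4:.0f} GB, FP16: ~{fp16:.0f} GB")
```

By this estimate the same 8B model needs roughly four times as much memory at 16-bit precision as at 4-bit.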

Ollama in the Shakudo Platform

When deployed through the Shakudo platform, Ollama is managed as a stack component. The platform handles:

Deployment — Helm chart-based deployment with proper resource allocation
Networking — Istio VirtualService for external access with SSO authentication
Storage — PVC provisioning for model persistence
GPU scheduling — Node selectors and tolerations for GPU node pools
Upgrades — Chart upgrades with rollback capability
Monitoring — Pod health, readiness probes, and log access

Ollama is typically accessible at https://ollama.<your-domain>/ with SSO authentication.

Running Your First Model - Getting Started

Step 1: Check Available Models

    # Via CLI (inside the pod)
    kubectl exec -n hyperplane-ollama <pod-name> -- ollama list

Step 2: Pull a Model

If no models are listed, pull one:

    # Via CLI
    kubectl exec -n hyperplane-ollama <pod-name> -- ollama pull llama3.1

Step 3: Run a Simple Prompt

    # Via CLI
    kubectl exec -n hyperplane-ollama <pod-name> -- ollama run llama3.1 "What is machine learning?"

Essential Commands

OpenAI-Compatible Endpoint

Ollama supports the OpenAI API format at /v1/chat/completions:

    from openai import OpenAI

    # Point the OpenAI SDK at Ollama's OpenAI-compatible endpoint.
    client = OpenAI(
        base_url="http://localhost:11434/v1",
        api_key="not-needed",  # the SDK requires a value; Ollama does not check it
    )

    response = client.chat.completions.create(
        model="llama3.1",
        messages=[{"role": "user", "content": "Hello, how are you?"}],
    )
    print(response.choices[0].message.content)

Other Useful Endpoints

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/api/tags` | GET | List all downloaded models |
| `/api/pull` | POST | Download a model |
| `/api/version` | GET | Get Ollama version |
| `/api/show` | POST | Show model details |
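As one example of using these endpoints from code, the sketch below lists downloaded models through `/api/tags` and converts sizes to GB. The base URL is an assumption, and the sample response body contains made-up values that only illustrate the shape of the JSON.

```python
import json
import urllib.request


def parse_tags(body):
    """Return (name, size_in_GB) pairs from a /api/tags response body."""
    models = json.loads(body).get("models", [])
    return [(m["name"], round(m["size"] / 1e9, 1)) for m in models]


def list_models(base_url="http://localhost:11434"):
    """Fetch and parse the model list from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return parse_tags(resp.read())


# Illustrative response body (values are invented for the example).
sample = '{"models": [{"name": "llama3.1:latest", "size": 4900000000}]}'
print(parse_tags(sample))
```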

Using LangChain

    from langchain_community.llms import Ollama

    llm = Ollama(base_url="http://localhost:11434", model="llama3.1")
    result = llm.invoke("What is the capital of France?")
    print(result)

Model Selection Guide

This section is for customers using Ollama as a managed component inside Shakudo. Start from the Shakudo platform instead of installing or exposing Ollama manually.

1. Access the component in Shakudo

Sign in to your Shakudo workspace with your organization-approved account.
Open the workspace or environment where this component is enabled.
Go to the Applications or component catalog area and select Ollama.
If you cannot see the component, ask your workspace administrator to confirm that it is enabled for your role and environment.

2. Open the component UI

Use the Shakudo-provided Open, Launch, or Access action for Ollama.
Let Shakudo handle authentication, networking, and workspace routing. Avoid using internal service URLs unless your administrator explicitly provides them.
Confirm that the component opens in the expected workspace before creating or changing resources.

3. Complete a first safe use case

Open the Ollama endpoint or UI exposed through Shakudo and run a small model test, such as a short completion or embedding request, using the model that your workspace administrator has enabled.

Use a small non-production example first, especially when testing credentials, scans, model calls, or data connections.
Name the test clearly so other workspace users can recognize it as a first-run validation.

4. Monitor and validate the result

Check the component UI for run status, logs, traces, scan results, job history, or project activity, depending on the component.
Return to Shakudo if you need platform-level status, access control changes, or administrator support.
Record any errors, missing permissions, or unexpected results before retrying with production workloads.

5. Next steps

Review the use cases, administration, and troubleshooting pages in this knowledge base for deeper examples.
For production usage, follow your team’s Shakudo workspace policies for credentials, data access, resource limits, and approvals.

Ollama Administration & Best Practices

Model Management

Pulling Models

Models can be pulled manually or automatically:

Manual pull (recommended for production):

kubectl exec -n hyperplane-ollama <pod-name> -- ollama pull llama3.1

Auto-pull via Helm values:

    ollama:
      models:
        pull:
          - llama3.1
          - mistral

Tip: In production, prefer manual pulls. Auto-pull runs on every pod start, which slows startup and can cause failures if the registry is unreachable.

Listing, Inspecting, and Removing Models

    # List all downloaded models
    ollama list

    # Show model details (size, format, parameters)
    ollama show llama3.1

    # Remove unused models to free PVC space
    ollama rm phi3:mini

Custom Models with Modelfile

Define the custom model in a file named Modelfile:

    FROM llama3.1
    PARAMETER temperature 0.3
    PARAMETER num_predict 512
    SYSTEM You are a helpful technical assistant. Be concise and accurate.

Then build and run it:

    ollama create my-assistant -f Modelfile
    ollama run my-assistant "What is Kubernetes?"
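If you only need these settings for a single call, the same values can be passed per request instead of baking them into a custom model. The sketch below builds a request body for `POST /api/generate` that mirrors the Modelfile settings above, using the API's `system` and `options` fields; the helper name is hypothetical.

```python
import json


def payload_with_overrides(prompt):
    """Request body for POST /api/generate with per-request settings."""
    return {
        "model": "llama3.1",
        "prompt": prompt,
        "system": "You are a helpful technical assistant. Be concise and accurate.",
        "options": {"temperature": 0.3, "num_predict": 512},
        "stream": False,
    }


body = json.dumps(payload_with_overrides("What is Kubernetes?"), indent=2)
print(body)
```

A custom model is the better fit when many clients should share the same defaults; per-request options suit one-off experiments.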

Model Storage and PVC Sizing

| Models to Store | Recommended PVC Size |
|-----------------|----------------------|
| 1–2 small models (7B–8B) | 30 GB |
| 3–5 mixed models | 100 GB |
| Large models (34B–70B) | 200 GB |
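A quick way to sanity-check a PVC size is to add up the model sizes you plan to store and leave some free space for pulls and updates. The sketch below does that; the 20% headroom and the model sizes are assumptions chosen for illustration, not platform recommendations.

```python
def fits_on_pvc(model_sizes_gb, pvc_gb, headroom=0.2):
    """True if the models fit while leaving `headroom` (a fraction) of the PVC free."""
    usable = pvc_gb * (1 - headroom)
    return sum(model_sizes_gb) <= usable


# Illustrative sizes in GB; read the real values from `ollama list`.
print(fits_on_pvc([4.9, 4.1], pvc_gb=30))   # two small models on a 30 GB PVC
print(fits_on_pvc([40.0, 4.9], pvc_gb=30))  # adding a much larger model
```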

GPU Configuration

Enabling GPU

    ollama:
      gpu:
        type: nvidia
    resources:
      limits:
        nvidia.com/gpu: 1
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

GPU Comparison

| GPU | VRAM | Good For | Approx. Speed (8B model) |
|-----|------|----------|--------------------------|
| T4 | 16 GB | Dev/test, small models | ~30 tokens/sec |
| L4 | 24 GB | Production, moderate load | ~40 tokens/sec |
| A10G | 24 GB | Production, good throughput | ~50 tokens/sec |
| A100 | 80 GB | Large models, heavy load | ~80+ tokens/sec |
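To translate these throughput figures into response times, divide the expected output length by the tokens-per-second value. The sketch below uses the approximate numbers from the table above and ignores model load time and prompt processing, so treat the results as lower bounds.

```python
# Approximate 8B-model speeds from the table above (tokens per second).
APPROX_TOKENS_PER_SEC = {"T4": 30, "L4": 40, "A10G": 50, "A100": 80}


def estimated_seconds(gpu, output_tokens):
    """Rough generation time, ignoring model load and prompt processing."""
    return output_tokens / APPROX_TOKENS_PER_SEC[gpu]


for gpu in APPROX_TOKENS_PER_SEC:
    print(f"{gpu}: ~{estimated_seconds(gpu, 512):.1f} s for 512 tokens")
```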

Monitoring GPU

nvidia-smi # Inside pod or on node

DRA (Dynamic Resource Allocation): The Helm chart supports it, but leave disabled unless your cluster explicitly supports it.

Networking & Security

Service Exposure

| Method | Use Case |
|--------|----------|
| **ClusterIP + VirtualService** | Default: expose via Istio with SSO |
| **ClusterIP + port-forward** | Development and debugging |
| **NodePort** | Direct access without Istio (not recommended for production) |

Authentication

SSO via Keycloak — Handled by the platform's OAuth2 proxy
API key — Ollama doesn't require API keys by default; authentication is handled at the gateway level
Network policies — Restrict access to the Ollama service from specific namespaces

Istio Sidecar

The Istio sidecar is required for external routing. Verify injection:

    kubectl get pods -n hyperplane-ollama -o jsonpath='{.items[*].spec.containers[*].name}'
    # Should show "ollama" and "istio-proxy"

If the sidecar is missing, add to values.yaml:

    podLabels:
      sidecar.istio.io/inject: "true"

Performance Tuning

Key Environment Variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `OLLAMA_NUM_PARALLEL` | 1 | Number of parallel request sequences |
| `OLLAMA_MAX_LOADED_MODELS` | 1 | Max models loaded in memory simultaneously |
| `OLLAMA_KEEP_ALIVE` | 5m | How long to keep models loaded after last request |
| `OLLAMA_MAX_VRAM` | 0 (auto) | Max VRAM to use (0 = all available) |

Tuning Recommendations

Increase OLLAMA_NUM_PARALLEL if you need to handle concurrent requests (requires more VRAM)
Increase OLLAMA_KEEP_ALIVE (e.g., 24h) to avoid cold-start delays on frequently used models
Set OLLAMA_MAX_VRAM if you need to reserve GPU memory for other workloads
Use smaller models for latency-sensitive applications
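On the client side, concurrent requests can be issued with a small thread pool. The sketch below is one way to do it; `fake_send` is a stand-in for a real HTTP call so the example runs without a server, and sizing `max_workers` to match OLLAMA_NUM_PARALLEL is a suggested starting point rather than a requirement.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(prompts, send, max_workers=4):
    """Call send(prompt) for each prompt in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))


def fake_send(prompt):
    """Stand-in for a function that posts the prompt to Ollama."""
    return f"echo: {prompt}"


results = run_concurrently(["a", "b", "c"], fake_send, max_workers=2)
print(results)
```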

Monitoring & Observability

Health Check

    curl http://localhost:11434/api/version
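The same check can be scripted, for example in a readiness test. This is a minimal sketch that assumes the server is reachable at `http://localhost:11434`; it returns the version string, or `None` if the server cannot be reached or returns an unexpected body.

```python
import json
import urllib.error
import urllib.request


def parse_version(body):
    """Extract the version string from a /api/version response body."""
    return json.loads(body)["version"]


def check_health(base_url="http://localhost:11434", timeout=5):
    """Return the server version, or None if the check fails."""
    try:
        url = f"{base_url}/api/version"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_version(resp.read())
    except (urllib.error.URLError, OSError, ValueError, KeyError):
        return None


# Example (needs a reachable Ollama server):
# print(check_health())
```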

Key Metrics to Monitor

GPU utilization — nvidia-smi or DCGM metrics
Memory usage — Pod memory consumption vs limits
Request latency — Time to first token and total response time
Error rate — Failed inference requests
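For latency, Ollama's final `/api/generate` response includes timing fields, reported in nanoseconds, that can be turned into these metrics. The sketch below derives them; the sample values are illustrative rather than measured, and the "before first token" figure is a server-side approximation that excludes network time.

```python
def summarize_timing(response):
    """Derive latency metrics from the timing fields of a final response."""
    ns = 1e9  # Ollama reports durations in nanoseconds
    before_first_token = (response.get("load_duration", 0)
                          + response.get("prompt_eval_duration", 0))
    return {
        "total_s": response["total_duration"] / ns,
        # Model load plus prompt evaluation, i.e. time before generation starts.
        "before_first_token_s": before_first_token / ns,
        "tokens_per_s": response["eval_count"] / (response["eval_duration"] / ns),
    }


# Illustrative values, not measurements.
sample = {
    "total_duration": 3_000_000_000,
    "load_duration": 500_000_000,
    "prompt_eval_duration": 500_000_000,
    "eval_count": 100,
    "eval_duration": 2_000_000_000,
}
print(summarize_timing(sample))
```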

Log Review

    kubectl logs -n hyperplane-ollama <pod-name> --tail=100
    kubectl logs -n hyperplane-ollama <pod-name> | grep -i error

Upgrades & Maintenance

Upgrade Process

Backup values, model list, and deployment manifest
Dry run: helm upgrade --dry-run --debug
Execute: helm upgrade --wait --timeout 15m
Validate: check pod, version, models, inference

Key Points

Recreate strategy = brief downtime — Plan accordingly
Keep models.clean: false — Never enable cleanup during upgrades
Check VirtualService — May disappear after upgrades (known issue)
Rollback available — helm rollback ollama -n hyperplane-ollama

Scaling Considerations

Vertical scaling — Move to a larger GPU (T4 → A10G → A100) for better performance
Horizontal scaling — Deploy multiple Ollama instances behind a load balancer
Model sharding — 70B+ models can be split across multiple GPUs
Dedicated GPU nodes — Isolate Ollama on its own node pool to avoid resource contention

Ollama Troubleshooting & FAQ

Common Issues

Model Not Loading

Problem: Model fails to load or "model not found" error.

What to check:

Run ollama list — is the model listed?
Check PVC disk space — df -h inside the pod
Verify the model name spelling (e.g., llama3.1 not llama-3.1)
Check pod logs for loading errors

Fix:

Pull the model again: ollama pull llama3.1
Free disk space by removing unused models: ollama rm <unused-model>
Use the exact model name from ollama list
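Name mismatches are easy to catch in code before sending a request. The sketch below checks a requested name against the list of downloaded models, treating a name without a tag as `<name>:latest`, which is how Ollama resolves untagged names. The `available` list is an illustrative assumption.

```python
def resolve_model(requested, available):
    """Return the matching entry from `available`, or None if there is none."""
    wanted = requested if ":" in requested else f"{requested}:latest"
    return wanted if wanted in available else None


available = ["llama3.1:latest", "phi3:mini"]  # names as shown by `ollama list`
print(resolve_model("llama3.1", available))   # matches llama3.1:latest
print(resolve_model("llama-3.1", available))  # misspelled, no match
```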

Slow Performance / Inference

Problem: Model responses are very slow (seconds per token).

What to check:

Is GPU being used? Run nvidia-smi to check
Which model size are you running? (70B on CPU will be extremely slow)
How many concurrent requests? Check OLLAMA_NUM_PARALLEL
Check pod memory usage — may be swapping to disk

Fix:

Enable GPU if available (see GPU Configuration in Admin guide)
Use a smaller model (switch from 70B to 8B)
Reduce OLLAMA_NUM_PARALLEL to 1
Increase pod memory limits
Set OLLAMA_KEEP_ALIVE to keep models loaded (avoids cold start)

Out of Memory Errors

Problem: Pod is OOMKilled or returns "out of memory" error.

What to check:

Pod resource limits vs model size
How many models are loaded simultaneously
GPU VRAM utilization with nvidia-smi

Fix:

Use a smaller model (8B instead of 70B)
Increase memory/VRAM limits in values.yaml
Set OLLAMA_MAX_LOADED_MODELS: 1 to limit concurrent model loading
Set OLLAMA_MAX_VRAM to prevent Ollama from using all GPU memory
Reduce OLLAMA_NUM_PARALLEL

API Not Responding / 404 Error

Problem: curl to Ollama API returns connection refused, 404, or white screen.

What to check:

Is the pod running? kubectl get pods -n hyperplane-ollama
Does the service exist? kubectl get svc -n hyperplane-ollama
Does the VirtualService exist? kubectl get virtualservice -n hyperplane-ollama
Is the port correct? (Should be 11434, not 8080)

Fix:

If pod is not running: check logs with kubectl logs
If VirtualService is missing: this is a known issue after upgrades — create it manually:

    apiVersion: networking.istio.io/v1
    kind: VirtualService
    metadata:
      name: ollama-vs
      namespace: hyperplane-ollama
    spec:
      gateways:
        - hyperplane-istio/ingress-gateway
      hosts:
        - ollama.<your-domain>
      http:
        - match:
            - uri:
                prefix: /
          route:
            - destination:
                host: ollama
                port:
                  number: 11434

If port is wrong: ensure VirtualService routes to port 11434

GPU Not Being Used

Problem: Ollama runs on CPU even though GPU nodes are available.

What to check:

Run nvidia-smi on the GPU node — is it functional?

