This runbook is based on what actually worked in staging. It is not a generic ideal-state guide.
This runbook covers the Helm-based deployment pattern that succeeded in staging and the checks that proved the cluster was healthy afterward.
The successful staging flow used:
- the chart in stack-components/flink/helm
- helm dependency build before upgrade
- helm upgrade --install with both the base values file and the environment patch
Before deployment, confirm you have:
From the Flink chart directory:
cd stack-components/flink/helm
helm dependency build
Expected result:
In staging, the base values pointed to the wrong node pool. The working fix was to keep the main values file and add a second patch file.
Example pattern:
jobmanager:
  nodeSelector:
    hyperplane.dev/nodeType: hyperplane-system-pool
taskmanager:
  nodeSelector:
    hyperplane.dev/nodeType: hyperplane-system-pool
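Before relying on a selector like this, it can help to confirm that at least one node carries the label. The following is a minimal sketch, assuming kubectl access with the same $KUBECONFIG and $CTX variables used in the commands in this runbook. The function name count_pool_nodes is hypothetical and is not part of the chart or the staging procedure.

```shell
# Sketch: count the nodes that carry the label targeted by the nodeSelector patch.
# Assumes KUBECONFIG and CTX are set as in the other commands of this runbook.
# count_pool_nodes is a hypothetical helper name.
count_pool_nodes() {
  # $1 = value of the hyperplane.dev/nodeType label to look for
  local pool="$1"
  kubectl --kubeconfig "$KUBECONFIG" --context "$CTX" \
    get nodes -l "hyperplane.dev/nodeType=${pool}" -o name \
    | grep -c .   # prints the number of non-empty lines, i.e. matching nodes
}
```

For example, if count_pool_nodes hyperplane-system-pool prints 0, pods using that selector would have no node to schedule on and would stay Pending.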
Why this mattered:
- valuesOverride.yaml pointed to hyperplane-stack-component-pool
- the patch selects hyperplane-system-pool instead
Before any upgrade, capture the live release state.
Example pattern:
kubectl --kubeconfig "$KUBECONFIG" --context "$CTX" \
  get deploy,sts,svc -n "$NS" -l app.kubernetes.io/instance="$RELEASE" -o yaml \
  > /tmp/flink-release-preupgrade.yaml
This gives you a clean rollback reference in addition to Helm history.
Use the same files you plan to deploy.
helm --kubeconfig "$KUBECONFIG" --kube-context "$CTX" \
  upgrade --install "$RELEASE" . \
  --namespace "$NS" \
  --values valuesOverride.yaml \
  --values staging-node-selector-patch.yaml \
  --dry-run --debug
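One way to inspect the node selector in the dry-run output is to scan it for the pool label. The following is a minimal sketch, assuming the label key is hyperplane.dev/nodeType as in the patch file. rendered_node_pools is a hypothetical helper, and because --debug also echoes the values, it scans everything Helm prints, not only the rendered manifests.

```shell
# Sketch: list the distinct hyperplane.dev/nodeType values found in text read from stdin.
# rendered_node_pools is a hypothetical helper name.
rendered_node_pools() {
  grep 'hyperplane.dev/nodeType:' \
    | sed 's|.*hyperplane\.dev/nodeType:[[:space:]]*||; s|[[:space:]]*$||' \
    | tr -d "\"'" \
    | sort -u   # one line per distinct pool name
}
```

Piping the dry-run command above into rendered_node_pools prints one line per distinct pool. With the staging patch applied, the expectation would be a single line, hyperplane-system-pool. More than one line suggests that some component still uses the base value.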
Check that:
- nodeSelector matches the real environment
The staging deployment that worked used this pattern:
helm --kubeconfig "$KUBECONFIG" --kube-context "$CTX" \
  upgrade --install "$RELEASE" . \
  --namespace "$NS" --create-namespace \
  --values valuesOverride.yaml \
  --values staging-node-selector-patch.yaml \
  --wait --atomic --timeout 15m
What success looked like in staging:
- Helm release status: deployed
- 150.0.5 and app version was 2.1.0
Use a mix of Helm, Kubernetes, and Flink-native checks.
helm --kubeconfig "$KUBECONFIG" --kube-context "$CTX" status "$RELEASE" -n "$NS"
kubectl --kubeconfig "$KUBECONFIG" --context "$CTX" get pods,deploy,sts,svc -n "$NS"
kubectl --kubeconfig "$KUBECONFIG" --context "$CTX" port-forward -n "$NS" svc/flink-jobmanager 8081:8081
# In a second terminal, while the port-forward is still running:
curl -s http://127.0.0.1:8081/overview
curl -s http://127.0.0.1:8081/taskmanagers
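The /overview call can also be turned into a pass/fail check. The following is a minimal sketch, assuming the response is a JSON object with numeric taskmanagers and slots-total fields, as observed in staging. overview_field and check_overview are hypothetical helper names, and the sed-based parsing is a convenience for flat JSON, not a general JSON parser.

```shell
# Sketch: extract a numeric field from the flat JSON object returned by /overview.
# Both function names are hypothetical helpers.
overview_field() {
  # $1 = field name; JSON is read from stdin
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeeds only if the TaskManager count and total slots match the expected values.
check_overview() {
  # $1 = JSON body, $2 = expected TaskManager count, $3 = expected total slots
  local body="$1" tms slots
  tms=$(printf '%s\n' "$body" | overview_field taskmanagers)
  slots=$(printf '%s\n' "$body" | overview_field slots-total)
  [ "$tms" = "$2" ] && [ "$slots" = "$3" ]
}
```

With the port-forward running, check_overview "$(curl -s http://127.0.0.1:8081/overview)" 1 16 would succeed for the staging values reported in this runbook. The expected numbers are parameters, so other environments can pass their own.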
Healthy staging signals were:
- /overview returning valid JSON
- /taskmanagers returning one registered TaskManager
- taskmanagers=1, slots-total=16, flink-version=2.1.0
After the core deployment, we also validated observability.
The final working monitoring pattern included:
- prometheus: 9249
- 9249/TCP
- flink.readiness.rules
- monitoring/prometheus-k8s to list/watch Flink pods
Do not treat the base staging values as production defaults. Review:
- taskmanager.numberOfTaskSlots
- parallelism.default
If validation fails, use Helm history and roll back to the last known-good revision.
helm --kubeconfig "$KUBECONFIG" --kube-context "$CTX" history "$RELEASE" -n "$NS"
helm --kubeconfig "$KUBECONFIG" --kube-context "$CTX" rollback "$RELEASE" <REVISION> -n "$NS" --wait --timeout 15m
Do not call the deployment complete until all of these are true: