What Is Apache Flink, and How Do You Deploy It in an Enterprise Data Stack?

Last updated on
May 12, 2026

What is Apache Flink?

Apache Flink is a distributed engine for low-latency stream processing. It supports event-time processing, exactly-once semantics, and large data volumes. Flink stands out for its fault tolerance and low latency, and for unifying stream and batch processing: a batch job runs as a special case of a streaming job, over a bounded stream. That makes it suitable for use cases ranging from real-time analytics to machine learning, in both business applications and big data analytics. Flexible windowing and rich function APIs let developers customize and tune their pipelines, which shortens the path to insights and decisions.

Why is Apache Flink better on Shakudo?

Deployment Runbook

This runbook is based on what actually worked in staging. It is not a generic ideal-state guide.

Scope

This runbook covers the Helm-based deployment pattern that succeeded in staging and the checks that proved the cluster was healthy afterward.

What Worked in Staging

The successful staging flow used:

  • the Helm chart under stack-components/flink/helm
  • helm dependency build before upgrade
  • an environment-specific override for node selection
  • a release backup before upgrade
  • helm upgrade --install with both the base values file and the environment patch
  • post-deploy validation through Helm, Kubernetes, port-forwarding, and Flink REST endpoints

Required Inputs

Before deployment, confirm you have:

  • target kubeconfig and Kubernetes context
  • target namespace and Helm release name
  • DNS and auth ownership
  • approved image registry strategy
  • storage plan for checkpoints and savepoints
  • agreed CPU, memory, slots, and TaskManager sizing

Step 1 — Prepare the Chart

From the Flink chart directory:

cd stack-components/flink/helm
helm dependency build

Expected result:

  • Helm dependencies resolve without errors
  • the chart is ready for dry-run and upgrade

Step 2 — Apply Environment-Specific Overrides

In staging, the base values pointed to the wrong node pool. The working fix was to keep the main values file and add a second patch file.

Example pattern:

jobmanager:
 nodeSelector:
   hyperplane.dev/nodeType: hyperplane-system-pool
taskmanager:
 nodeSelector:
   hyperplane.dev/nodeType: hyperplane-system-pool

Why this mattered:

  • the original valuesOverride.yaml pointed to hyperplane-stack-component-pool
  • staging nodes were actually labeled hyperplane-system-pool
  • without the patch, pods stayed pending instead of scheduling
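
One way to catch this mismatch before deploying is to count the nodes that actually carry the label value the chart will select on. The sketch below assumes the hyperplane.dev/nodeType label key from the example above and the KUBECONFIG and CTX variables used elsewhere in this runbook; the function name is hypothetical.

```shell
# Count nodes whose hyperplane.dev/nodeType label equals the given pool name.
# A result of 0 means pods selecting that pool would stay Pending.
count_nodes_in_pool() {
  kubectl --kubeconfig "$KUBECONFIG" --context "$CTX" \
    get nodes -l "hyperplane.dev/nodeType=$1" -o name | wc -l | tr -d ' '
}

# Usage: count_nodes_in_pool hyperplane-system-pool
```

Running it for both hyperplane-stack-component-pool and hyperplane-system-pool shows which of the two values matches real nodes in the target cluster.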

Step 3 — Back Up the Current Release

Before any upgrade, capture the live release state.

Example pattern:

kubectl --kubeconfig "$KUBECONFIG" --context "$CTX" \
 get deploy,sts,svc -n "$NS" -l app.kubernetes.io/instance="$RELEASE" -o yaml \
 > /tmp/flink-release-preupgrade.yaml

This gives you a clean rollback reference in addition to Helm history.
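
As an optional complement, the values and rendered manifest that Helm has stored for the live release can be saved as well. This is a sketch under the same variable assumptions (KUBECONFIG, CTX, NS, RELEASE); the function name and the backup directory layout are hypothetical and were not part of the staging flow described here.

```shell
# Save the computed values and the rendered manifest of the live release
# into the directory given as the first argument.
backup_helm_release() {
  backup_dir=$1
  mkdir -p "$backup_dir" || return 1
  # --all includes chart defaults as well as user-supplied values.
  helm --kubeconfig "$KUBECONFIG" --kube-context "$CTX" \
    get values "$RELEASE" -n "$NS" --all -o yaml > "$backup_dir/values.yaml" || return 1
  helm --kubeconfig "$KUBECONFIG" --kube-context "$CTX" \
    get manifest "$RELEASE" -n "$NS" > "$backup_dir/manifest.yaml" || return 1
}

# Usage: backup_helm_release /tmp/flink-release-backup
```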

Step 4 — Dry-Run the Upgrade

Use the same files you plan to deploy.

helm --kubeconfig "$KUBECONFIG" --kube-context "$CTX" \
 upgrade --install "$RELEASE" . \
 --namespace "$NS" \
 --values valuesOverride.yaml \
 --values staging-node-selector-patch.yaml \
 --dry-run --debug

Check that:

  • the rendered nodeSelector matches the real environment
  • no unexpected image, ingress, or scheduling values appear
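
The nodeSelector check can be made less error-prone by listing the distinct pool names that appear in the rendered manifests. The sketch below is a hypothetical helper that reads rendered YAML on stdin; it assumes the label key hyperplane.dev/nodeType from Step 2 and that each selector sits on its own line, as in the example there.

```shell
# List the distinct hyperplane.dev/nodeType values found in rendered manifests
# read from stdin. Quotes are stripped so "pool" and pool compare equal.
rendered_node_pools() {
  grep 'hyperplane.dev/nodeType:' | awk '{print $2}' | tr -d "\"'" | sort -u
}

# Usage (helm template renders the manifests without contacting the cluster):
#   helm template "$RELEASE" . --namespace "$NS" \
#     --values valuesOverride.yaml \
#     --values staging-node-selector-patch.yaml | rendered_node_pools
```

If the output contains more than one pool name, or a name that no node in the target cluster carries, the overrides are not taking effect as intended.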

Step 5 — Deploy with Helm

The staging deployment that worked used this pattern:

helm --kubeconfig "$KUBECONFIG" --kube-context "$CTX" \
 upgrade --install "$RELEASE" . \
 --namespace "$NS" --create-namespace \
 --values valuesOverride.yaml \
 --values staging-node-selector-patch.yaml \
 --wait --atomic --timeout 15m

What success looked like in staging:

  • Helm status moved to deployed
  • JobManager and TaskManager became healthy
  • release revision reached 15
  • chart version was 0.0.5 and app version was 2.1.0

Step 6 — Validate the Rollout

Use a mix of Helm, Kubernetes, and Flink-native checks.

helm --kubeconfig "$KUBECONFIG" --kube-context "$CTX" status "$RELEASE" -n "$NS"
kubectl --kubeconfig "$KUBECONFIG" --context "$CTX" get pods,deploy,sts,svc -n "$NS"
# port-forward blocks until interrupted; run it in a separate terminal
kubectl --kubeconfig "$KUBECONFIG" --context "$CTX" port-forward -n "$NS" svc/flink-jobmanager 8081:8081
curl -s http://127.0.0.1:8081/overview
curl -s http://127.0.0.1:8081/taskmanagers

Healthy staging signals were:

  • JobManager and TaskManager running with zero restarts
  • /overview returning valid JSON
  • /taskmanagers returning one registered TaskManager
  • taskmanagers=1, slots-total=16, flink-version=2.1.0
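
These signals can be asserted rather than read by eye. The sketch below parses the /overview response with sed to avoid extra dependencies. It assumes the response is compact JSON with numeric taskmanagers and slots-total fields, and the helper names are hypothetical. A JSON-aware tool would be more robust if one is available.

```shell
# Print the numeric value of one field from /overview JSON read on stdin.
overview_field() {
  sed -n "s/.*\"$1\" *: *\([0-9][0-9]*\).*/\1/p"
}

# Succeed only if /overview JSON on stdin reports the expected number of
# TaskManagers ($1) and total slots ($2).
check_overview() {
  json=$(cat)
  tms=$(printf '%s\n' "$json" | overview_field taskmanagers)
  slots=$(printf '%s\n' "$json" | overview_field slots-total)
  [ "$tms" = "$1" ] && [ "$slots" = "$2" ]
}

# Usage, with the port-forward from above still running:
#   curl -s http://127.0.0.1:8081/overview | check_overview 1 16 && echo healthy
```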

Step 7 — Validate Monitoring

After the core deployment, we also validated observability.

The final working monitoring pattern included:

  • Flink-native Prometheus reporter enabled
  • named container port prometheus: 9249
  • network policies allowing 9249/TCP
  • PodMonitor scraping the named port
  • Prometheus rule group flink.readiness.rules
  • RBAC allowing monitoring/prometheus-k8s to list/watch Flink pods
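
A quick way to confirm the reporter is serving data, independent of Prometheus, is to fetch the metrics endpoint directly and count Flink metric lines. This sketch assumes the reporter's default flink_ metric name prefix and port 9249 as listed above; the helper name and the POD variable are hypothetical.

```shell
# Count Prometheus exposition lines on stdin that belong to Flink metrics.
# Comment lines (# HELP, # TYPE) do not start with flink_ and are ignored.
count_flink_metrics() {
  grep -c '^flink_' || true
}

# Usage, after forwarding the metrics port of a Flink pod in another terminal:
#   kubectl --kubeconfig "$KUBECONFIG" --context "$CTX" \
#     port-forward -n "$NS" "pod/$POD" 9249:9249
#   curl -s http://127.0.0.1:9249/metrics | count_flink_metrics
```

A count of 0 points at the reporter or the port rather than at the PodMonitor, network policy, or RBAC layers.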

Flink-Specific Settings To Review Before Go-Live

Do not treat the base staging values as production defaults. Review:

  • taskmanager.numberOfTaskSlots
  • parallelism.default
  • JobManager CPU and memory
  • TaskManager CPU and memory
  • checkpoint interval and storage path
  • savepoint path
  • high-availability settings
  • image registry location and support policy
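
To review what the running cluster actually uses, the JobManager's /jobmanager/config REST endpoint can be queried for individual keys. The sketch below assumes the endpoint returns a compact JSON array of key/value objects with string values; the helper name is hypothetical. Keys that are absent from the response print nothing.

```shell
# Print the value of one configuration key from /jobmanager/config JSON on stdin.
flink_config_value() {
  # Escape dots so the key is matched literally inside the sed pattern.
  pattern=$(printf '%s\n' "$1" | sed 's/\./\\./g')
  sed -n "s/.*{\"key\":\"$pattern\",\"value\":\"\([^\"]*\)\"}.*/\1/p"
}

# Usage, with the JobManager port-forward on 8081 running:
#   curl -s http://127.0.0.1:8081/jobmanager/config \
#     | flink_config_value taskmanager.numberOfTaskSlots
```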

Safe Rollback

If validation fails, use Helm history and roll back to the last known-good revision.

helm --kubeconfig "$KUBECONFIG" --kube-context "$CTX" history "$RELEASE" -n "$NS"
helm --kubeconfig "$KUBECONFIG" --kube-context "$CTX" rollback "$RELEASE" <REVISION> -n "$NS" --wait --timeout 15m

Handoff Criteria

Do not call the deployment complete until all of these are true:

  • Helm reports the release as deployed
  • JobManager and TaskManager are healthy
  • REST endpoints respond correctly
  • UI access path works as expected
  • metrics are scraping successfully
  • alert rules are loaded
  • the team has agreed on the next steps for HA, durable state storage, and workload recovery testing

Core Shakudo Features

Own Your AI

Keep data sovereign, protect IP, and avoid vendor lock-in with infra-agnostic deployments.

Faster Time-to-Value

Pre-built templates and automated DevOps accelerate time-to-value.

Flexible with Experts

Operating system and dedicated support ensure seamless adoption of the latest and greatest tools.