Apache Flink Integration | Deploy on Shakudo

Apache Flink Knowledge Base

Apache Flink Overview

Apache Flink is an open-source distributed engine for processing data streams continuously. It is a strong fit when you need fast, stateful, and reliable handling of streams, events, or near-real-time pipelines.

What Apache Flink Is

Flink helps teams process data as it arrives instead of waiting for large scheduled batches.

That makes it useful for:

live event processing
streaming ETL and enrichment
real-time dashboards and KPIs
rule-driven or model-assisted decisions
applications that depend on continuously updated state

What a Standard Deployment Looks Like

Our reference Kubernetes deployment uses separate Flink components for control and execution.

In simple terms:

JobManager controls the cluster and exposes the UI and REST API
TaskManagers provide execution capacity for running jobs
Services and ingress expose the UI and APIs to the right users
Monitoring collects metrics and alerts so the platform can be operated safely

In our staging environment, the reference deployment was:

Helm-managed
deployed in namespace hyperplane-flink
exposed behind SSO at https://flink.staging.canopyhub.io
monitored through Prometheus on port 9249

What We Validated in Staging

The following areas were confirmed with live checks:

Helm release was deployed successfully
JobManager and TaskManager were running
UI and history endpoints were reachable behind auth
REST endpoints such as /overview and /taskmanagers responded correctly
Prometheus targets for both JobManager and TaskManager were up
readiness alert rules were loaded in Prometheus

This means the base platform pattern works well for staging and integration testing.

What Was Still Missing for Full Customer Readiness

A healthy deployment is only the starting point. In staging, the cluster was operational, but several production-grade topics still required follow-up.

The biggest gaps were:

no JobManager high-availability configuration
no durable checkpoint or savepoint storage strategy
no representative workload recovery validation
manual scaling only
container images still coming from a public registry

Where Apache Flink Fits Best

Flink is a good choice when customers need:

continuous data movement between systems
real-time transformations and enrichments
long-running stateful jobs
low-latency processing with clear operational visibility

It is usually not the best first choice for:

simple nightly batch work with no latency pressure
teams that do not yet have a plan for monitoring and recovery
highly restricted environments that have not approved image and storage design

What Customers Should Decide Before Deployment

Before moving into a customer environment, align on:

expected workload type and size
number of TaskManagers and slot plan
CPU and memory expectations
checkpoint and savepoint storage location
HA expectations for the JobManager
auth model and access path for the UI
monitoring and alert-routing requirements
container registry policy for connected or airgapped environments

One-Line Summary

Apache Flink gives customers a strong foundation for real-time data processing, but a production-ready deployment requires clear decisions around resilience, state storage, scaling, and operations.

Getting Started & Usage

This page helps customers move from a healthy deployment to a useful first experience. The goal is not to run a full production workload on day one. The goal is to confirm the cluster is understandable, reachable, and ready for a small representative test.

Start With a Platform Check

Before submitting any workload, confirm the platform basics:

you can reach the Flink UI through the agreed access path
the JobManager is healthy
the expected TaskManagers are registered
the slot count matches your expected starting capacity
recent logs do not show startup failures or repeated restarts

In our staging checks, /overview and /taskmanagers were the fastest way to confirm that the cluster was usable.

What To Look For in the UI or API

At a minimum, confirm:

Flink version is the expected release
taskmanagers count is correct
slots-total equals the number of TaskManagers multiplied by the slots per TaskManager, and slots-available is greater than zero
there are no unexpected failed jobs already present
the cluster is not silently running with zero usable capacity

Our staging validation returned:

taskmanagers=1
slots-total=16
slots-available=16
jobs-running=0

That told us the cluster was healthy but still idle.
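
As a rough illustration, the same check can be scripted. The sketch below assumes the JobManager REST API is reachable without an interactive login (for example through a port-forward, since the staging URL sits behind SSO). The function names and the pass/fail rules in assess_overview are illustrative choices, not part of Flink. The jobs-failed key is a field of the /overview response that is not listed above.

```python
import json
from urllib.request import urlopen


def fetch_overview(base_url, timeout=10):
    """GET <base_url>/overview and return the decoded JSON body."""
    with urlopen(f"{base_url.rstrip('/')}/overview", timeout=timeout) as resp:
        return json.load(resp)


def assess_overview(overview, expected_taskmanagers=1):
    """Return a list of problems found; an empty list means the basics look fine."""
    problems = []
    if overview.get("taskmanagers", 0) < expected_taskmanagers:
        problems.append("fewer TaskManagers registered than expected")
    if overview.get("slots-total", 0) == 0:
        problems.append("cluster reports zero total slots")
    if overview.get("jobs-failed", 0) > 0:
        problems.append("failed jobs are already present")
    return problems


# Values returned by the staging validation described above.
staging = {
    "taskmanagers": 1,
    "slots-total": 16,
    "slots-available": 16,
    "jobs-running": 0,
}
print(assess_overview(staging))  # prints []
```

Against a live cluster, assess_overview(fetch_overview(base_url)) would run the same rules on the real response. The base URL and any authentication handling depend on your access path.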

Your First Recommended Workload

Start with a small, low-risk workload before running business-critical jobs.

Good first tests are:

a simple transformation pipeline
a small event enrichment flow
a state-light streaming job
a short-lived validation job that confirms end-to-end execution

The first workload should help you answer:

can the job be submitted successfully?
does it appear in the UI?
do logs stay clean during startup?
does the output land where you expect?

Suggested First-Use Flow

1. Open the UI and confirm the cluster is healthy.
2. Check the TaskManager count and slot availability.
3. Submit one small representative job through your normal delivery path.
4. Watch the job move into running or finished state.
5. Review JobManager and TaskManager logs.
6. Confirm application output in the downstream system.
7. Capture the result as a go/no-go note for larger workloads.
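
One way to make the "running or finished" check and the go/no-go note repeatable is to map the job state reported by Flink to a simple verdict. The sketch below is an assumption about how a team might group states. The state names are the ones Flink reports for jobs, but the grouping into go, no-go, and wait is illustrative.

```python
# Job states as reported by Flink, grouped for a simple go/no-go note.
GO_STATES = {"RUNNING", "FINISHED"}
NO_GO_STATES = {"FAILING", "FAILED", "CANCELLING", "CANCELED"}


def go_no_go(state):
    """Map a Flink job state to 'go', 'no-go', or 'wait' (illustrative grouping)."""
    state = state.upper()
    if state in GO_STATES:
        return "go"
    if state in NO_GO_STATES:
        return "no-go"
    # INITIALIZING, CREATED, RESTARTING and similar states are not conclusive yet.
    return "wait"


for observed in ("RUNNING", "RESTARTING", "FAILED"):
    print(observed, "->", go_no_go(observed))
```

A job that stays in a "wait" state such as RESTARTING for a long time is worth investigating in the logs before recording a verdict.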

Operational Checks During Early Usage

While the first jobs are running, pay attention to:

CPU and memory pressure
slot consumption
TaskManager stability
restart count
backpressure or slow-processing symptoms
alert noise in your monitoring stack

This is where you begin validating whether the current sizing fits your actual workload.

What We Learned From Staging

Our staging environment proved the base platform and monitoring path, but it did not prove full workload recovery.

That means customers should not assume the following are already solved just because the cluster is up:

checkpoint durability
savepoint handling
JobManager failover behavior
recovery after infrastructure failure
scale behavior under heavy load

Daily Usage Guidance

For normal day-to-day use, keep these habits:

check cluster health before pushing major job changes
keep job and platform changes separate when possible
review recent alerts before high-impact releases
capture savepoints before risky job changes when your operating model supports it
record the expected slot and parallelism impact of each new workload

Recommended Early Success Criteria

A strong first-use milestone looks like this:

the cluster is reachable and healthy
one representative workload runs successfully
logs and metrics look normal
the team understands where to check health, capacity, and alerts
next steps for HA and durable state storage are agreed before production traffic

Administration & Best Practices

This page focuses on the operating decisions that turn a working Flink install into a manageable customer platform. The main lesson from staging was simple: platform health, monitoring, and production readiness are related, but they are not the same thing.

Capacity and Scaling

Flink capacity is shaped by more than pod count alone. Customers should review:

number of TaskManagers
task slots per TaskManager
default parallelism
CPU and memory per JobManager and TaskManager
expected traffic bursts and steady-state load

In staging, the cluster was healthy with:

taskmanagers=1
taskmanager.numberOfTaskSlots=16
parallelism.default=1

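That was enough for platform validation, but not enough to prove customer sizing. With parallelism.default=1, a job that does not set its own parallelism runs each operator as a single subtask, so it exercises only a small part of the 16 available slots.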
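
The slot arithmetic behind these settings can be sketched as follows. The sketch assumes Flink's default slot sharing, under which a job needs roughly as many slots as its highest operator parallelism. The function names and the example job parallelisms are hypothetical.

```python
def total_slots(taskmanagers, slots_per_taskmanager):
    """Total task slots offered by the cluster."""
    return taskmanagers * slots_per_taskmanager


def slots_remaining(total, job_parallelisms):
    """Slots left after placing jobs, assuming one slot per unit of job parallelism."""
    return total - sum(job_parallelisms)


staging_total = total_slots(taskmanagers=1, slots_per_taskmanager=16)
print(staging_total)                            # prints 16
print(slots_remaining(staging_total, [1]))      # prints 15
print(slots_remaining(staging_total, [8, 12]))  # prints -4
```

A negative result means the planned jobs would not fit without more TaskManagers, more slots per TaskManager, or lower parallelism.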

Observability

Good observability should be part of the deployment, not an afterthought.

The staging pattern that worked included:

Flink-native Prometheus metrics
dedicated metrics port 9249
scrape coverage for both JobManager and TaskManager
alert rules for JobManager availability, TaskManager availability, and pod restarts

Best practices:

confirm metrics are coming from Flink, not only from sidecars
alert on workload health and restart behavior
review logs after every upgrade
keep a simple dashboard for cluster health, slot usage, and restart count

Security and Access

A customer-ready Flink deployment should use clear workload identity and controlled access.

Best practices:

use dedicated service accounts for JobManager and TaskManager
avoid unnecessary service-account token mounting
protect the UI with the agreed auth path
review network policies for required ports only
make sure operational access ownership is clear before go-live

In staging, moving away from the default service account was an important improvement.

Change Management

Most operational issues are easier to handle when deployment changes are controlled.

Recommended habits:

always dry-run Helm changes before upgrade
back up the current release before mutating it
separate infrastructure changes from workload changes when possible
capture the last known-good Helm revision
validate UI, REST, pods, and metrics after every rollout

Resilience and State Management

This is the most important production topic for many customer deployments.

Before production, customers should define:

JobManager HA mode
durable storage for checkpoints
durable storage for savepoints
recovery expectations after pod or node failure
rollback and restore expectations for critical jobs

Our staging work showed that these items were still open, which is why the platform was not yet fully customer-ready.

Image and Environment Policy

Container image sourcing matters, especially in restricted environments.

Best practices:

mirror images into an approved internal registry when required
confirm support posture for the chosen image source
avoid treating public-registry defaults as final production policy
document outbound network dependencies early

Recommended Operating Checklist

Use this simple checklist before each major release:

Helm chart and values reviewed
dry-run completed
backup captured
pods healthy after rollout
UI and REST validated
metrics scraping confirmed
alerts loaded
sizing impact reviewed
state and recovery assumptions documented

Practical Bottom Line

A good Flink administrator treats deployment, observability, and recovery design as one operating model. If any of those are missing, the cluster may still start, but it will be harder to run with confidence.

Troubleshooting & FAQ

This page is based on real issues we saw while validating Apache Flink in staging. The format is simple: Problem → Check → Fix.

Problem — Pods Stay Pending After a Helm Upgrade

Check

inspect the rendered nodeSelector
compare it with the labels on the target nodes
review kubectl describe pod scheduling events

What happened in staging

the base values pointed to hyperplane-stack-component-pool
the real nodes were labeled hyperplane-system-pool

Fix

keep the main values file
add an environment-specific override file with the correct node selector
dry-run again before redeploying
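
The mismatch above can be caught before redeploying by comparing the rendered nodeSelector with the labels on a target node. The sketch below is a minimal illustration. The label key example.com/node-pool is hypothetical, because the actual key is not recorded here; only the two pool names come from the staging notes.

```python
def unmatched_selector_keys(node_selector, node_labels):
    """Return selector keys whose value is missing or different on the node."""
    return sorted(
        key for key, value in node_selector.items()
        if node_labels.get(key) != value
    )


LABEL_KEY = "example.com/node-pool"  # hypothetical label key

base_selector = {LABEL_KEY: "hyperplane-stack-component-pool"}
override_selector = {LABEL_KEY: "hyperplane-system-pool"}
node_labels = {LABEL_KEY: "hyperplane-system-pool", "kubernetes.io/os": "linux"}

print(unmatched_selector_keys(base_selector, node_labels))      # prints ['example.com/node-pool']
print(unmatched_selector_keys(override_selector, node_labels))  # prints []
```

A pod is schedulable on a node only when every nodeSelector entry matches a node label exactly, which is why a single wrong pool name leaves pods Pending.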

Problem — The UI Loads, But You Still Do Not Know If the Cluster Is Healthy

Check

query /overview
query /taskmanagers
confirm pod health and restart counts

Fix

do not rely on the login screen or UI shell alone
validate that the JobManager API responds and at least one TaskManager is registered

Problem — Prometheus Is Scraping Sidecar Metrics Instead of Flink Metrics

Check

confirm Flink-native metrics are enabled
look for Prometheus reporter startup in JobManager and TaskManager logs
verify the metrics port is 9249

Fix

enable the Flink Prometheus reporter
add a named container port such as prometheus: 9249
make sure network policies allow access to that port
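
To tell Flink-native metrics from sidecar metrics, one option is to inspect the metric names served on port 9249. The sketch below assumes the Prometheus text exposition format and relies on the flink_ prefix that the Flink Prometheus reporter puts on metric names. The sample payloads and the sidecar metric name are invented for illustration.

```python
def metric_names(exposition_text):
    """Extract metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name ends at the first "{" (labels) or at the first whitespace.
        names.add(line.split("{", 1)[0].split()[0])
    return names


def has_flink_metrics(exposition_text):
    """True if at least one metric carries the flink_ prefix."""
    return any(name.startswith("flink_") for name in metric_names(exposition_text))


# Illustrative payloads, not captured from a real cluster.
flink_sample = """\
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="jobmanager"} 1.0
"""
sidecar_sample = "sidecar_requests_total 42\n"

print(has_flink_metrics(flink_sample))    # prints True
print(has_flink_metrics(sidecar_sample))  # prints False
```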

Problem — Prometheus Targets for Flink Never Become Healthy

Check

inspect the PodMonitor
confirm it points to the named port, not an outdated field
verify Prometheus can discover pods in the Flink namespace

What happened in staging

the PodMonitor used the deprecated targetPort field instead of a named port
Prometheus also lacked RBAC to list/watch Flink pods

Fix

change the PodMonitor to use port: prometheus
grant the monitoring service account permission to list/watch pods in the Flink namespace
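
The PodMonitor part of this fix can be checked mechanically once the manifest is loaded as a dictionary (for example from YAML or from kubectl output as JSON). The sketch below is a hedged illustration. The function name is hypothetical, and it only inspects spec.podMetricsEndpoints, not RBAC or namespace selection.

```python
def podmonitor_issues(manifest):
    """Return a list of endpoint problems found in a PodMonitor manifest dict."""
    issues = []
    endpoints = manifest.get("spec", {}).get("podMetricsEndpoints", [])
    if not endpoints:
        issues.append("no podMetricsEndpoints defined")
    for index, endpoint in enumerate(endpoints):
        if "targetPort" in endpoint:
            issues.append(f"endpoint {index}: uses deprecated targetPort")
        if not endpoint.get("port"):
            issues.append(f"endpoint {index}: no named port set")
    return issues


# Shapes mirroring the staging situation before and after the fix.
before = {"spec": {"podMetricsEndpoints": [{"targetPort": 9249}]}}
after = {"spec": {"podMetricsEndpoints": [{"port": "prometheus"}]}}

print(podmonitor_issues(before))
print(podmonitor_issues(after))  # prints []
```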

Problem — Alert Rules Exist, But They Are Not Active

Check

confirm the PrometheusRule is created in a namespace Prometheus actually watches
verify the rules appear in the active Prometheus rule set

Fix

move or create the PrometheusRule where the monitoring stack can load it
re-check the live rules after applying the change
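
Re-checking the live rules can be done against the Prometheus HTTP API, whose /api/v1/rules endpoint returns the rule groups that are actually loaded. The sketch below works on an already decoded response. The alert names are hypothetical placeholders for whatever your PrometheusRule defines.

```python
def loaded_alert_names(rules_response):
    """Collect the names of alerting rules from a /api/v1/rules response."""
    names = set()
    for group in rules_response.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") == "alerting":
                names.add(rule["name"])
    return names


def missing_alerts(rules_response, expected_names):
    """Return expected alert names that Prometheus has not loaded."""
    return sorted(set(expected_names) - loaded_alert_names(rules_response))


# Hypothetical response and alert names, for illustration only.
response = {
    "status": "success",
    "data": {"groups": [{"name": "flink", "rules": [
        {"name": "FlinkJobManagerDown", "type": "alerting"},
        {"name": "flink:slots:ratio", "type": "recording"},
    ]}]},
}
expected = ["FlinkJobManagerDown", "FlinkTaskManagerDown"]

print(missing_alerts(response, expected))  # prints ['FlinkTaskManagerDown']
```

An empty result means every expected alert is part of the active rule set, which is a stronger signal than the PrometheusRule object merely existing in the cluster.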

Problem — Flink Workloads Use the Default Service Account

Check

inspect the running JobManager and TaskManager pod specs
confirm serviceAccountName is set explicitly

Fix

assign dedicated service accounts to the JobManager and TaskManager
disable automatic token mounting unless it is truly needed
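
A quick way to apply this check is to inspect the pod spec as a dictionary (for example from kubectl output as JSON). The sketch below is illustrative: the function name and the service account name flink-jobmanager are hypothetical, and it looks only at the pod-level automountServiceAccountToken field, not at the setting on the ServiceAccount object itself.

```python
def service_account_issues(pod):
    """Return findings about the service account settings in a pod dict."""
    spec = pod.get("spec", {})
    issues = []
    if spec.get("serviceAccountName", "default") == "default":
        issues.append("pod runs as the default service account")
    if spec.get("automountServiceAccountToken") is not False:
        issues.append("service-account token is auto-mounted")
    return issues


default_pod = {"spec": {"serviceAccountName": "default"}}
hardened_pod = {
    "spec": {
        "serviceAccountName": "flink-jobmanager",  # hypothetical name
        "automountServiceAccountToken": False,
    }
}

print(service_account_issues(default_pod))
print(service_account_issues(hardened_pod))  # prints []
```

The token finding is a prompt to review rather than an automatic failure, since some deployments do need the token mounted.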

Problem — The Deployment Is Up, But It Still Is Not Production-Ready

Check

is HA configured?
are checkpoints and savepoints stored durably?
has a representative workload been tested under failure?
is scaling defined and validated?
are images coming from an approved registry?

Fix

treat these as production-readiness tasks, not optional polish
close them before calling the platform customer-ready

Problem — A Long Helm Command Gets Stuck at `dquote>`

Check

review shell quoting: a dquote> prompt means the shell is still waiting for a closing double quote
confirm the command was pasted as a complete single command

Fix

rerun the command as one clean line
avoid broken multiline quoting during live deployment calls
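
Unbalanced quotes can be detected before a command is pasted into a live session. The sketch below uses Python's shlex module, which follows POSIX-style quoting rules and raises ValueError when a quote is never closed. The release name, chart path, and --set key in the example commands are hypothetical.

```python
import shlex


def has_balanced_quotes(command):
    """True if the command parses cleanly under POSIX-style quoting rules."""
    try:
        shlex.split(command)
    except ValueError:
        # Raised for an unclosed quote or a dangling escape character.
        return False
    return True


# Hypothetical commands; the second one is missing its closing double quote.
good = 'helm upgrade flink ./flink-chart -n hyperplane-flink --set "example.key=value"'
bad = 'helm upgrade flink ./flink-chart -n hyperplane-flink --set "example.key=value'

print(has_balanced_quotes(good))  # prints True
print(has_balanced_quotes(bad))   # prints False
```

shlex approximates shell parsing and does not cover every shell-specific feature, so treat a passing result as a sanity check rather than a guarantee.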

FAQ

Is a working Flink UI enough to say the deployment succeeded?

No. Also validate pods, REST endpoints, registered TaskManagers, metrics, and alerts.

Is one TaskManager enough for production?

Not by default. It may be fine for staging or early validation, but production sizing depends on workload and recovery goals.

Do we need HA before customer production use?

For important customer workloads, yes. A single JobManager with no HA is a major resilience gap.

Do we need checkpoints and savepoints before production?

For stateful jobs, yes. They are central to recovery, planned upgrades, and safe rollback.

Can we use public container images in customer environments?

Sometimes, but many customer environments require mirrored or approved internal registries. Decide this early.

What is the fastest way to validate cluster health after deployment?

Check Helm status, pod health, /overview, /taskmanagers, and Prometheus target health.
