
Rethinking Airflow Monitoring for a Kubernetes-Native World

TL;DR

Moving Airflow to Kubernetes exposed the limits of our existing monitoring. Static, agent-based approaches struggled in a dynamic system. We needed something that adapted automatically, reduced operational overhead, and gave better visibility into workflows.

Kubernetes changes how infrastructure behaves. Workloads are ephemeral, services scale dynamically, and the system is in constant motion. Monitoring approaches built for static environments do not translate well into this world.

We learned that firsthand.

As our Airflow deployment moved onto Kubernetes, the assumptions behind our existing monitoring setup stopped holding. Pods came and went, visibility into workflow behavior became fragmented, and keeping the monitoring system current started to feel like a job of its own.

That tension forced us to rethink not just which tool to use, but what monitoring should mean in a Kubernetes-native environment.

The Problem We Were Trying to Solve

Monitoring Airflow in Kubernetes is not just about tracking CPU or memory. It is about understanding how workflows behave in a constantly changing system.

Tasks fail, retry, and queue. Schedulers lag under load. Pods restart without warning. Traditional host-based monitoring starts to lose context in this kind of environment.

At the same time, we wanted to avoid increasing operational burden. Any solution we chose had to scale with us, not slow us down.

This led us to evaluate two tools: Zabbix and Prometheus.

Where Zabbix Started to Struggle

Zabbix is a powerful and mature monitoring platform. In more traditional setups, it provides strong coverage across infrastructure and applications.

But Kubernetes changes the rules.

Zabbix does offer auto-discovery through Low-Level Discovery, and agentless monitoring is possible for some checks. But for full node-level visibility, the recommended approach still involves deploying agents across nodes. In a system where workloads are constantly shifting, that setup requires upfront template configuration and ongoing filter tuning to avoid performance degradation, adding more operational overhead than we wanted.

We also found that monitoring Airflow required additional customization. There was no straightforward way to plug into Airflow's internal metrics without building custom scripts or integrations.

None of these issues were blockers on their own. But together, they created a growing operational overhead that didn't align with where our platform was heading.

Why Prometheus Fit the Kubernetes Model

Prometheus approached the problem from a completely different angle.

Instead of relying on static configuration, it integrates directly with Kubernetes. Services and pods are discovered automatically. As the system changes, the monitoring system updates with it.
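As an illustration, a minimal Prometheus scrape configuration using Kubernetes pod discovery might look like the sketch below. The `prometheus.io/*` annotation keys are the widely used community convention, not something specific to our setup:

```yaml
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod          # discover every pod in the cluster via the API server
    relabel_configs:
      # keep only pods that opt in via the conventional annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # allow a pod to override the default /metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # carry the namespace and pod name into the stored series
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

With a configuration like this, pods that appear or disappear are added to or dropped from the target list automatically; no one edits the monitoring config when the workload changes.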

The difference was immediate: we spent less time configuring and more time observing.

For Airflow, the integration was also more natural. Metrics could be exposed through exporters and collected without additional glue code. This allowed us to focus on what mattered: DAG performance, task reliability, and system health.
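Concretely, Airflow emits its internal metrics over StatsD, and a common pattern is to run a statsd-exporter that translates them into Prometheus format. A minimal sketch of the Airflow side, assuming Airflow 2.x (where these options live under `[metrics]`; older 1.10 releases used `[scheduler]`) and an exporter reachable under the hypothetical service name `statsd-exporter`:

```ini
# airflow.cfg — enable StatsD emission (Airflow 2.x).
# The host below assumes a statsd-exporter Service named
# "statsd-exporter" in the same namespace; adjust to your cluster.
[metrics]
statsd_on = True
statsd_host = statsd-exporter
statsd_port = 9125
statsd_prefix = airflow
```

The exporter then serves the translated metrics on its own `/metrics` endpoint, which Kubernetes service discovery picks up like any other target.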

Prometheus also brought a stronger alignment with how we were already thinking about infrastructure. It treats monitoring as part of the system, not something layered on top of it.

Scaling Without Friction

As we projected future growth, scalability became a key consideration.

Zabbix can scale, but it tends to do so by adding more resources to a central system. This introduces complexity in database management and tuning as the system grows.

Prometheus fits naturally into a distributed environment like Kubernetes. A single instance can be deployed per cluster without central coordination, and for larger scale, tools like Thanos or Grafana Mimir add long-term storage and multi-cluster aggregation on top of it, without requiring changes to the core setup.
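For the multi-cluster case, Thanos typically runs as a sidecar next to each Prometheus instance, shipping TSDB blocks to object storage. A sketch of that sidecar container, assuming an object-storage config mounted at `/etc/thanos/objstore.yml` (image tag, paths, and volume names here are illustrative):

```yaml
# Sidecar container added to the Prometheus pod (illustrative values).
- name: thanos-sidecar
  image: quay.io/thanos/thanos:v0.34.0
  args:
    - sidecar
    - --tsdb.path=/prometheus                 # data dir shared with Prometheus
    - --prometheus.url=http://localhost:9090  # local Prometheus API
    - --objstore.config-file=/etc/thanos/objstore.yml  # long-term storage target
  volumeMounts:
    - name: prometheus-data
      mountPath: /prometheus
    - name: objstore-config
      mountPath: /etc/thanos
```

The key property is that per-cluster Prometheus instances stay unchanged; aggregation and retention are layered on top rather than baked into a central server.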

This was important for us. We were not just solving today's problem. We were choosing a system that would still work as our platform expanded.

The Operational Reality

One of the most important lessons from this evaluation was that maintenance matters.

In fast-moving environments, even small operational tasks add up. Systems that require constant tuning or manual updates eventually slow teams down.

Prometheus reduced much of that burden. Its ability to automatically discover services and adapt to infrastructure changes meant fewer manual interventions. Once configured, it largely ran on its own.

That simplicity was not just convenient. It was a strategic advantage.

The Decision

In the end, the choice became clear.

Prometheus aligned better with our architecture, our workflows, and our long-term direction. It gave us deeper visibility into Airflow while reducing the effort required to maintain that visibility.

We decided to move forward with Prometheus as our primary monitoring solution.

That decision reflected something larger than a tool swap. It was a shift in how we think about monitoring in a dynamic, container-native environment: from something you configure once and maintain, to something that adapts alongside the system it observes.