Grafana is a leading open source analytics and visualization suite, most commonly used for analyzing time series data. It is also used in many other domains, such as home automation, industrial sensors, process control, and weather monitoring. Grafana connects to a wide range of data sources, including Elasticsearch, Graphite, InfluxDB, Prometheus, PostgreSQL, and MySQL, and integrates smoothly into existing workflows.

The Setup

The Grafana Cloud Hosted Prometheus service is based on Cortex, a CNCF project that provides a highly available, horizontally scalable, multi-tenant Prometheus service with long-term storage. The Cortex architecture is built from individual microservices, each handling a different role such as storage, replication, or querying. Cortex is under heavy development, with new features and performance improvements landing continuously, and Grafana regularly deploys new Cortex releases to its Kubernetes clusters so customers can benefit from upgrades with zero downtime.

On Friday, July 19, Grafana experienced an outage of roughly 30 minutes in its Hosted Prometheus service. Grafana took the outage seriously and fixed the issue as quickly as possible; below, we'll discuss what happened, how they responded, and what they are doing to ensure it doesn't happen again.

The Challenge

Upgrading the Cortex Ingester service requires an extra Ingester replica to achieve zero downtime: in-progress data is handed off to new Ingesters one by one. But Ingesters are large; each pod requires 4 cores and 15GB of RAM, roughly 25 percent of the CPU and memory of a machine in Grafana's Kubernetes clusters. Across the cluster as a whole, there is typically far more than 4 cores and 15GB of RAM going unused, so in aggregate there is room to run the extra Ingesters.
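For a concrete sense of scale, here is a minimal sketch of what such an Ingester Deployment could look like. The names, image reference, arguments, and replica count are assumptions for illustration only (not Grafana's actual manifests); the resource requests match the figures above.

```yaml
# Hypothetical Ingester Deployment sketch; names, image, and args are
# illustrative, not Grafana's real config.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingester
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ingester
  template:
    metadata:
      labels:
        app: ingester
    spec:
      containers:
        - name: ingester
          image: quay.io/cortexproject/cortex:latest   # illustrative image reference
          args: ["-target=ingester"]                    # illustrative: run only the ingester module
          resources:
            requests:
              cpu: "4"        # 4 cores per pod, as described above
              memory: 15Gi    # 15GB per pod, as described above
```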

Most of the time, however, no single machine has 25 percent of its capacity sitting "empty" during normal operation. To solve this, they chose to use Kubernetes Pod Priorities, giving Ingesters a higher priority than the other microservices. When N+1 Ingesters need to run, the scheduler temporarily preempts smaller, lower-priority pods; those pods are rescheduled onto other machines in the cluster, leaving enough room on one machine to run the extra Ingester.
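Attaching a priority is a small change in the pod spec: the pod references a PriorityClass by name, and the scheduler handles preemption. The snippet below is a hypothetical illustration, not Grafana's actual config; the class it references is sketched after the next paragraph.

```yaml
# Hypothetical illustration: a pod that references a higher PriorityClass.
# If no node has room for it, the scheduler may evict (preempt) lower-priority
# pods to free capacity; the evicted pods are rescheduled onto other nodes.
apiVersion: v1
kind: Pod
metadata:
  name: ingester-extra
spec:
  priorityClassName: high        # assumes a PriorityClass named "high" exists (see below)
  containers:
    - name: ingester
      image: quay.io/cortexproject/cortex:latest   # illustrative
      resources:
        requests:
          cpu: "4"
          memory: 15Gi
```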

On Thursday, July 18 (the day before the incident), one of Grafana's engineers deployed four new priority classes to their clusters: critical, high, medium, and low. These priorities had already been running for a week on an internal cluster that carries no customer traffic. The medium priority class was set as the default for pods without an explicit priority, the high priority class was assigned to Ingesters, and the critical priority class was assigned to monitoring jobs such as kube-state-metrics, Prometheus, node-exporter, and Alertmanager.
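A sketch of what those four classes could look like follows. The numeric values are assumptions (only their relative ordering matters), but the structure, including globalDefault on medium, mirrors the description above.

```yaml
# Hypothetical PriorityClass definitions mirroring the setup described above.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical
value: 1000000
description: "Monitoring: kube-state-metrics, Prometheus, node-exporter, Alertmanager."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high
value: 100000
description: "Ingesters."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium
value: 10000
globalDefault: true   # any pod without an explicit priorityClassName gets this class
description: "Default for pods with no explicit priority."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low
value: 1000
description: "Low-priority, preemptible workloads."
```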

The Event

On Friday, July 19, they launched a new Cortex cluster. The new cluster's config did not include the new Pod Priorities, so its pods received the default priority, medium. The Kubernetes cluster did not have enough spare resources to schedule the new cluster's Ingesters, and the existing production Cortex pods had no priority assigned at all, since their config had not yet been updated. Because medium outranks no priority, the new cluster's Ingester preempted an Ingester from the existing production Cortex cluster.

To maintain the desired replica count, the production Cortex cluster then created a replacement Ingester Pod, which also received the default medium priority. Because it outranked the remaining production Ingesters, which had no priority, it preempted one of them in turn, triggering a cascading failure that eventually preempted all Ingester Pods in the production Cortex clusters.

The Root Cause

Ingesters in a Cortex cluster are responsible for holding up to 12 hours of data in memory, which allows the data to be compressed more efficiently before it is written to long-term storage. Incoming data is sharded series-by-series using a Distributed Hash Table (DHT), and each series is replicated to three Ingesters with Dynamo-style quorum consistency, meaning a write must succeed on at least two of the three replicas. Cortex does not write to Ingesters that are shutting down, so when a large number of Ingesters left the DHT at once, writes could no longer reach a quorum of replicas, and they failed.

The Fix

Within the first 4 minutes of the outage, Grafana's new error-budget-based Prometheus alerts fired. Within the next 5 minutes they had diagnosed the problem and scaled up their Kubernetes cluster to accommodate both the existing and new production clusters. With that extra capacity, the Cortex clusters became available again, data was written to the old Ingesters, and the new Ingesters were scheduled successfully. Within a further 10 minutes, they diagnosed and fixed out-of-memory (OOM) errors.
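To illustrate what "error budget based" alerting means in Prometheus terms, here is a hypothetical burn-rate rule in standard Prometheus rule-file syntax. The recording-rule names, SLO target, windows, and thresholds are assumptions for the sketch, not Grafana's actual rules.

```yaml
# Hypothetical error-budget burn-rate alert; rule names and numbers are placeholders.
groups:
  - name: error-budget-sketch
    rules:
      - alert: WriteErrorBudgetBurn
        # Error ratio on the write path over a short and a long window.
        # A 14.4x burn rate spends ~2% of a 30-day error budget in one hour.
        expr: |
          job:write_request_errors:ratio_rate5m > (14.4 * (1 - 0.999))
          and
          job:write_request_errors:ratio_rate1h > (14.4 * (1 - 0.999))
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Hosted Prometheus write path is burning its error budget too fast."
```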

In total, the clusters were down for approximately 26 minutes, and no data was lost: the Ingesters successfully flushed all of the data held in memory to long-term storage. During the outage, customers' Prometheus servers buffered writes using the new Write-Ahead Log (WAL) based remote write code written by Callum Styan at Grafana Labs, and the failed writes were replayed once service was restored.
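For context, the WAL-based remote write path is used automatically in Prometheus 2.8 and later whenever remote_write is configured; nothing special needs to be enabled. The endpoint, credentials, and tuning values below are placeholders, not Grafana Cloud's actual settings.

```yaml
# prometheus.yml (fragment) - hypothetical remote_write configuration.
# Since Prometheus 2.8, remote write reads from the WAL, so samples that fail
# to send during a short outage are retried and replayed rather than dropped.
remote_write:
  - url: https://prometheus.example.com/api/prom/push   # placeholder endpoint
    basic_auth:
      username: "<instance-id>"    # placeholder
      password: "<api-key>"        # placeholder
    queue_config:
      max_shards: 30               # illustrative tuning values
      capacity: 2500
      max_samples_per_send: 500
```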

An outage like this can be prevented by validating Kubernetes config with CriticalHop before pushing changes to the cluster. CriticalHop offers an AI-first Kubernetes Guard that verifies and assesses changes to your Kubernetes environment before they put production at risk.

Conclusion

Grafana quickly learned from the incident and decided not to make "medium" the default priority until their production Ingesters had been assigned high priority.

Another way to prevent this kind of issue is to use the CriticalHop Kubernetes config validator. CriticalHop helps minimize Kubernetes issues by running autonomous checks and config validations and by predicting other risks before they affect the production environment. Its AI-first Kubernetes Guard explores a model of the actions and events that could lead to an outage, simulating the outcomes of changes and applying a growing knowledge base of config scenarios.