Intro

Many organizations are modernizing their existing applications to become more agile and innovate faster. Architectural patterns like microservices enable teams to independently test services and continuously push applications to delivery environments. This approach optimizes team productivity by allowing development teams to experiment and iterate faster. It also allows teams to rapidly scale how they build and run their applications.

As you build new services that need to work together as an application, they need ways to connect, monitor, control, and debug the communication across the entire application. For example, you may need service discovery, application-level metrics, logs, and traces to help debug traffic patterns, traffic shaping, and ways to secure inter-service traffic.


The Setup

Civis Analytics is a data science software and consultancy company working with Fortune 500 companies and the country's largest organizations, including Boeing, Airbnb, FEMA, Verizon, Discovery, and the American Red Cross.

Their Data Science platform helps leading commercial, nonprofit, and government organizations use data to make better decisions by giving data scientists the tools, algorithms, and data they need to complete their workflows and, ultimately, make their organizations data-driven.

They use Kubernetes to run Jupyter Notebooks in the cloud. The product lets clients deploy ad hoc containerized Jupyter Notebooks onto Kubernetes, giving them a single place for all of their exploratory analysis where they can seamlessly connect to their data assets and request the compute needed for their computations, without having to worry about the intricacies of managing application infrastructure.

They have recently released Python and R Container scripts that let clients run Docker containers in the cloud without having to worry about the underlying infrastructure. Clients can schedule these scripts, string them together into a Platform Workflow, request the compute needed to run their containers, and connect to the necessary data assets.

The Challenge

With Kubernetes becoming a core component of their infrastructure, they felt it was the right time to leverage its capabilities for the scripts service. Based on their anticipated usage and future growth plans, they set a goal of a cluster that could scale to 500 nodes, with a stretch goal of 1,000 nodes.

Immediately after migrating the service to Kubernetes, they began to see stability issues with their Kubernetes cluster. Developers reported high latency for kubectl commands while users intermittently reported having difficulty launching notebooks and services.

The Event

They discovered serious issues when the cluster was overloaded with a large number of scripts, which pushed it to a total size of about 290 EC2 instances; their previous usage had never taken them past 80. The overload manifested as high CPU on their master nodes, and further investigation found that the three Kubernetes API server containers were pegged on CPU.
The API servers had no CPU limits, so they consumed the node's CPU, causing resource contention among the other control plane components and rendering the cluster useless. The API server would also exhaust all of the memory on its node before restarting. Once they moved users back to the old infrastructure, the cluster began to stabilize.

The Root Cause

The mistake made here was that they assumed Kubernetes would scale seamlessly. Kubernetes claims it can support clusters with up to 5,000 nodes and 100,000 pods. In the real world, this won't happen out of the box. There are many configurations that need to be adjusted in order to achieve that level of scale.

After combing through the API access logs, they learned that their Datadog and Fluentd Daemonsets were aggressively polling the API server for metadata about K8s services, causing increased load. A Kubernetes Daemonset manages the scheduling and lifecycle of pods such that exactly one pod runs on each node in the cluster. They were running a number of Daemonsets in their cluster: Datadog Agents for monitoring, Sumologic collectors for logs, and kube2iam to manage IAM access from their containers to the underlying EC2 metadata in a way that adhered to the principle of least privilege.

These agents would collect the names of all services in the cluster, and any pod running on the agent's node that was fronted by a service would have its metrics tagged with the service name. With hundreds of Datadog Agents, one per node, each hitting the K8s API every x seconds looking for new services, they were putting unnecessary load on the API. Worse, the tags being collected were not even useful for their monitoring purposes.
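To make that load pattern concrete, here is a minimal sketch of what each per-node agent was effectively doing, written with the Python kubernetes client. The polling interval and function names are illustrative assumptions, not Datadog's actual code; the point is that every node runs its own copy of this loop, so API load grows linearly with node count.

```python
# Conceptual sketch (not Datadog's implementation): each per-node agent
# independently lists every Service in the cluster on a fixed interval.
# Requires the `kubernetes` Python client and in-cluster credentials.
import time

from kubernetes import client, config

POLL_INTERVAL_SECONDS = 15  # hypothetical interval, for illustration only


def poll_services_forever():
    config.load_incluster_config()  # authenticate with the pod's service account
    core_v1 = client.CoreV1Api()
    while True:
        # Every agent on every node issues this LIST against the API server.
        services = core_v1.list_service_for_all_namespaces()
        names = [svc.metadata.name for svc in services.items]
        # ...tag local pod metrics with the service names...
        print(f"saw {len(names)} services")
        time.sleep(POLL_INTERVAL_SECONDS)


if __name__ == "__main__":
    poll_services_forever()
```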

The Fix

After the outage, their initial reaction was to increase the size of their master nodes. They were running three m4.large instances (2 vCPUs, 8 GB of memory) spread across three distinct Availability Zones. For a cluster sized between 251 and 500 nodes on AWS, Kubernetes recommends c4.4xlarge instances (16 vCPUs, 30 GB of memory) for the master nodes. They tested different configurations and eventually settled on m4.4xlarge instances (16 vCPUs, 64 GB of memory).

While increasing the master nodes to m4.4xlarge instances did help them scale up, they still weren’t satisfied because they could not comfortably cross 500 nodes. They thought about using even larger instances for the master nodes, but this approach felt like treating a symptom and not the root cause of the problem. 

In their next load test they removed the Sumologic collector, which is built on Fluentd, and the Datadog Daemonsets. From the subsequent load test it was clear that either Datadog or Sumologic was the culprit for the load. To narrow it down, they ran a follow-up test with Sumologic added back, and the results were similar: the cluster scaled pretty seamlessly to 700+ nodes. That left Datadog as the problem.

They made two other adjustments to the cluster configuration. The first was increasing the API server's rate limit for non-mutating requests from 400 to 1,200; rate limiting is critical for protecting API servers from load. The second was increasing the disk IOPS available to their etcd containers. They saw a significant drop in disk write latency when they moved their master nodes from network-backed disks to local disks.

They then moved to set up Datadog for scale by leveraging the Datadog Cluster Agent, a new product that addresses scale. The Cluster Agent acts as a buffer between the API server and the Datadog pods, so only the Cluster Agent talks to the Kubernetes API. Instead of n Datadog pods, where n is the number of nodes in the cluster, hitting the API, a single Cluster Agent communicates with it. They were also able to leverage Datadog to scrape Prometheus metrics from the cluster's control plane components, removing the overhead of maintaining their own Prometheus setup.
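The underlying pattern is roughly this: a single process queries the API on behalf of every node agent and serves the results from a cache. The sketch below is a simplified Python illustration of that idea; the class and method names are invented for the example and are not Datadog's implementation.

```python
# Simplified illustration of the cluster-agent idea: one process talks to
# the Kubernetes API and caches service names, so per-node agents ask the
# cache instead of the API server. All names here are invented for the sketch.
import threading
import time

from kubernetes import client, config


class ServiceCache:
    def __init__(self, refresh_seconds: int = 30):
        self._refresh_seconds = refresh_seconds
        self._lock = threading.Lock()
        self._service_names: list[str] = []

    def start(self) -> None:
        config.load_incluster_config()
        self._core_v1 = client.CoreV1Api()
        threading.Thread(target=self._refresh_loop, daemon=True).start()

    def _refresh_loop(self) -> None:
        while True:
            # Only this single process issues LIST calls to the API server.
            services = self._core_v1.list_service_for_all_namespaces()
            with self._lock:
                self._service_names = [s.metadata.name for s in services.items]
            time.sleep(self._refresh_seconds)

    def get_service_names(self) -> list[str]:
        # Node agents would call this (e.g. via a local endpoint) instead of
        # querying the Kubernetes API themselves.
        with self._lock:
            return list(self._service_names)
```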

With the Datadog adjustments in place, they reran load tests and saw the cluster easily breeze towards 700+ EC2 instances.

To prevent a Daemonset from running on certain nodes, you can modify the node's taints or the Daemonset's tolerations. This is useful for keeping a Daemonset off specialized nodes that may not have enough resources, as in the sketch below.
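For example, here is a hedged sketch, using the Python kubernetes client, of tainting a specialized node so that Daemonset pods without a matching toleration stay off of it. The node name and taint key are assumptions for illustration.

```python
# Sketch: taint a hypothetical specialized node so that Daemonset pods
# without a matching toleration are not scheduled onto it.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
core_v1 = client.CoreV1Api()

NODE_NAME = "gpu-node-1"  # hypothetical specialized node

# Add a NoSchedule taint to the node; pods must tolerate
# workload=specialized:NoSchedule to be scheduled there.
# (Note: this patch may replace the node's existing taints list.)
core_v1.patch_node(
    NODE_NAME,
    {"spec": {"taints": [
        {"key": "workload", "value": "specialized", "effect": "NoSchedule"}
    ]}},
)

# A Daemonset that *should* run on that node would add this toleration to
# its pod template spec (spec.template.spec.tolerations); Daemonsets
# without it will stay off the tainted node.
toleration = client.V1Toleration(
    key="workload", operator="Equal", value="specialized", effect="NoSchedule"
)
```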

Conclusion

A core paradigm of Kubernetes is auto-discovery. Kubernetes' watch streams let services watch for changes in cluster state and respond to them on demand, without manual intervention from system admins. However, for some use cases this can be expensive: with poor design, greedy services watching for changes can quickly introduce scaling issues. This is a common pattern across a variety of Daemonsets.
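A minimal example of that watch pattern, using the Python kubernetes client to stream pod events from the API server (illustrative only):

```python
# Minimal example of the watch pattern: stream pod events from the API
# server instead of polling. A component like this sees every pod change
# in the cluster, which is powerful but can be expensive at scale.
from kubernetes import client, config, watch

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
core_v1 = client.CoreV1Api()

w = watch.Watch()
for event in w.stream(core_v1.list_pod_for_all_namespaces, timeout_seconds=60):
    pod = event["object"]
    # Every ADDED/MODIFIED/DELETED pod in the whole cluster arrives here;
    # keeping all of this metadata in memory is how agents like kube2iam
    # grow their footprint as the cluster grows.
    print(event["type"], pod.metadata.namespace, pod.metadata.name)
```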

The Datadog Agents were an extreme case, but other Daemonsets have similar issues. Kube2iam, a service for access control to EC2 metadata, watches for changes to all pods in the cluster so that it can adjust access controls. However, as more pods are added to the cluster, the kube2iam containers will run out of memory because they store all of the pods' metadata in memory. The Fluentd Kubernetes Metadata Filter is a popular plugin with the same issue.

Our biggest takeaway from this experience is to be mindful of the third-party resources you run in your Kubernetes cluster, especially Daemonsets: many tools are built for standard usage but have not been tested at scale.