Intro

Containers are a well-established way of packaging an application, and Kubernetes has moved past the early-adopter phase. Today it is widely held that Kubernetes is a cost-effective, ready-made solution that enterprise customers can trust. Its main selling points are the ability to:

  • Tackle scaling, security, observability and deployment challenges at the platform level instead of per application
  • Encourage exploratory development, since managed offerings keep the initial outlay low and bill pay-as-you-go
  • Support low latency and high availability through replication, self-healing and rolling deployments

The Challenge

NU.nl is a top-ranked news site in the Netherlands. When NU.nl's parent company acquired broadcast rights to Champions League football, the development team started looking for a cost-effective, ready-made solution they could trust. The new system had to avoid downtime on match days, when the existing hardware was often strained. NU.nl eventually opted to rebuild its application from the ground up on AWS, and it can now scale easily for major match days and other key events with 80% less IT involvement.

AWS's managed services offer continuous availability and transparent maintenance, with no scheduled downtime or patching for the team to plan around. This weighed heavily in the decision: in this day and age, the team saw no reason to keep planning scheduled downtime for individual cloud instances.

The Setup


Multiple clusters

3 AWS accounts, 3 clusters:

  • osc-nu-prod – production
  • osc-nu-test – test and staging
  • osc-nu-dev – proofing infrastructure changes

To deploy these Kubernetes clusters to AWS, NU.nl is using Kops, an official Kubernetes project for managing production-grade Kubernetes clusters. 
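
kops works from a declarative cluster manifest kept in an S3 state store and turns it into the underlying AWS resources. An abbreviated sketch of such a manifest follows; the cluster name, zone, CIDRs and version are illustrative assumptions, not NU.nl's actual configuration:

  apiVersion: kops.k8s.io/v1alpha2
  kind: Cluster
  metadata:
    name: osc-nu-test.k8s.local       # hypothetical cluster name
  spec:
    cloudProvider: aws
    kubernetesVersion: "1.21.5"
    networkCIDR: 172.20.0.0/16
    subnets:
      - name: eu-west-1a
        zone: eu-west-1a
        cidr: 172.20.32.0/19
        type: Public
    etcdClusters:
      - name: main
        etcdMembers:
          - name: a
            instanceGroup: master-eu-west-1a
      - name: events
        etcdMembers:
          - name: a
            instanceGroup: master-eu-west-1a
    # ... networking, topology, instance groups and IAM settings omitted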

The Event

1. No Memory Limits

A misconfigured ElastAlert was trying to read the entire ElasticSearch index.

2. No CPU Limits

A rapid traffic increase was affecting core components such as the ingress proxy and the kubelet.

(Figure: Prometheus metrics showing the rapid traffic increase affecting the kubelet and ingress.)

(Figure: CPU-burstable pods driving nodes to 100% CPU.)

3. Memory Limits

After upgrading from MongoDB 3 to MongoDB 4, the application's memory footprint increased significantly and CPU-based scaling stopped kicking in. The larger memory footprint caused Pods to be terminated.

The Root Cause

Incident 1

ElastAlert is a useful framework for alerting on spikes, anomalies, or other patterns of interest from data in ElasticSearch.

Like the ElasticSearch cluster it queries, ElastAlert can be brought to its knees under the right conditions, for example when it handles a substantial amount of data relative to the available hardware or when traffic is too high. In this case a bad rule configuration caused it to load an entire 90 GB ElasticSearch index into memory, exhausting the memory of the EC2 instance it ran on.
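
To make that concrete, here is a minimal sketch of a bounded ElastAlert rule; the rule name, index pattern, filter and thresholds are illustrative assumptions, not NU.nl's actual configuration. The point is that the index pattern, filter and timeframe together keep each query small instead of sweeping an entire index.

  # Hypothetical ElastAlert rule; all values are assumptions for illustration.
  name: error-spike
  type: frequency            # alert when too many matching events occur
  index: app-logs-*          # query only the indices this rule actually needs
  num_events: 50             # ...more than 50 matches...
  timeframe:
    minutes: 15              # ...within a bounded 15-minute window
  filter:
    - term:
        level: "ERROR"       # narrow the query instead of scanning everything
  alert:
    - "debug"                # log-only alerter, handy while testing a rule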

Incident 2

The developers were alerted to a few nodes going into a NodeNotReady state. Naturally, they ran kubectl describe on the nodes to find out what was being reported: the kubelet had stopped posting status, which is not uncommon. Checking the cluster metrics in Prometheus showed that CPU utilization had gone up to 100%, affecting the ingress and the kubelet.

Incident 3

The next time, the team reported frequent restarts of the services deployed on Kubernetes. They were unable to understand why the Pods returned an OOMKilled status.

When a container has a maximum memory limit defined in its Pod spec and its memory usage crosses that limit, the container is killed and the Pod reports an OOMKilled status. This happens even when the node has plenty of free memory.
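
To make the mechanism concrete, here is a minimal sketch of a container with a memory request and limit; the names, image and values are assumptions, not the actual Talk configuration. If the process grows past limits.memory, the kernel's OOM killer terminates it and the Pod reports OOMKilled, no matter how much free memory the node still has.

  apiVersion: v1
  kind: Pod
  metadata:
    name: talk-example                # hypothetical name, for illustration only
  spec:
    containers:
      - name: talk
        image: example.org/talk:4.0   # placeholder image
        resources:
          requests:
            memory: "512Mi"           # what the scheduler reserves on a node
          limits:
            memory: "1Gi"             # exceeding this gets the container OOMKilled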

The Fix

Incident 1 - Accidentally trying to load an ElasticSearch index of 90GB

To stop the bleeding, they had to shut down ElastAlert.

Permanent Fixes:

  • Don’t load the entire ElasticSearch index
  • Apply limits on memory (RAM) usage (see the sketch below)
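
One way to apply the second fix cluster-wide is a LimitRange, which gives every container in a namespace a default memory request and limit so a single runaway process cannot eat a whole node. A minimal sketch, assuming a hypothetical monitoring namespace and illustrative values:

  apiVersion: v1
  kind: LimitRange
  metadata:
    name: default-memory-limits
    namespace: monitoring        # assumed namespace for ElastAlert, for illustration
  spec:
    limits:
      - type: Container
        defaultRequest:
          memory: "256Mi"        # applied when a container sets no request
        default:
          memory: "1Gi"          # default limit for containers that set none
        max:
          memory: "2Gi"          # hard ceiling for any single container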

Incident 2 - Rapid traffic increase affecting core components

Permanent Fixes:

  • Reducing the CPU burstable amount of pods (the gap between CPU requests and limits)
  • Increasing Skipper’s resource requests
  • Reserving CPU & memory for the kubelet and kube-system via the kubelet flags below (see the sketch after this list):
      • --kube-reserved
      • --system-reserved
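
Because the clusters are managed with kops, reservations like these can be expressed in the cluster spec's kubelet section, which kops renders into the kubelet's --kube-reserved and --system-reserved flags. A sketch with illustrative values (not NU.nl's actual settings):

  # Excerpt of a kops Cluster spec
  spec:
    kubelet:
      kubeReserved:                   # becomes --kube-reserved
        cpu: "200m"
        memory: "512Mi"
      systemReserved:                 # becomes --system-reserved
        cpu: "100m"
        memory: "256Mi"
      enforceNodeAllocatable: "pods"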

Incident 3 - Application update increasing memory footprint

To stop the memory bleed, they increased the memory limit of Talk pods. 

Permanent fixes:

  • Adjusting CPU requests/limits & HPA thresholds
  • Scaling on both CPU and memory (see the sketch below)
  • Setting the memory limit higher than the request, so that losing one Pod does not immediately push the remaining Pods over their limits in a ‘snowball effect’
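
Scaling on both resources can be expressed in a single HorizontalPodAutoscaler with two metrics. A minimal sketch, assuming a hypothetical talk Deployment and illustrative thresholds (on clusters older than Kubernetes 1.23 the same shape is available as autoscaling/v2beta2):

  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    name: talk                        # hypothetical name for the Talk deployment
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: talk
    minReplicas: 3
    maxReplicas: 20
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70    # scale out before CPU saturates
      - type: Resource
        resource:
          name: memory
          target:
            type: Utilization
            averageUtilization: 75    # also scale out on memory pressure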

Conclusion

Managed cloud services take a great deal of operational work off the team's plate. Running Kubernetes on AWS gives NU.nl better security, strong compliance, and the peace of mind that when the biggest match days roll around, they won’t have any downtime. Infrastructure and tools that the team used to run themselves to scale and provide resiliency are no longer needed. With automated planning and validation tools like kubectl-val reducing the need for manual intervention, accidental downtime is much less of a concern.