Lessons learned while scaling a Kubernetes cluster to 1000 pods on AWS EKS
🤔 Why are we even discussing this topic? It looks like a simple case of increasing the number of pods in the EKS cluster: Karpenter should take care of scaling out my nodes, and the Horizontal Pod Autoscaler (HPA) and ReplicaSet should take care of scaling the pods. If you think it’s that easy, stop 🛑 here. If you believe there is a twist in the tale, continue reading 📖. Things don’t behave as expected, especially when we start to push outside of the normal scale ⚖️.
Challenges
❌ Some common challenges you may observe when you start pushing boundaries:
- AWS resource limits
- You will run out of IP addresses, i.e., IP address exhaustion
- Packet drops
- Control plane performance issues
Workload calculations
📱 Workload calculation for 1000 pods
🖥 1–2 CPUs per pod
🖥 2–4 GB of RAM per pod
🟰 Total (at the upper bound): 2000 CPUs and 4000 GB of RAM
🎬 Let’s start with the low-hanging fruit 🍒, i.e., the issues that were more obvious to me or that I expected to hit.
1️⃣ AWS quota limits: As I am planning to create 1000 pods using Spot Instances, check your quota limit for the particular instance types you plan to use. For example, the All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests quota was set to 96 in my account, which may or may not be enough for this use case. I opened a support ticket with AWS support; they were responsive as always and were able to increase the limit within a few minutes. Check the AWS doc https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html
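If you prefer the CLI to a support ticket, you can check (and request an increase to) the Spot quota with the Service Quotas API. This is a rough sketch: the quota code L-34B43A08 is what I believe maps to the standard Spot Instance Requests quota, so verify it with list-service-quotas before relying on it.

# Look up the current value of the standard Spot Instance Requests quota.
# NOTE: the quota code below is an assumption -- confirm it first with:
#   aws service-quotas list-service-quotas --service-code ec2
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-34B43A08

# Request an increase (you can still open a support ticket instead).
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-34B43A08 \
  --desired-value 500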
2️⃣ The VPC should have a sufficient free IP address pool: Make sure the VPC where you deploy your pods has a sufficiently large IP block. For deploying 1000 pods, start with at least a /21 CIDR block (2048 addresses, or larger depending on your requirements) so that you have well over 1000 free IP addresses, and take future growth into account. IP addresses are also consumed by other resources such as load balancers, or by any sidecars/DaemonSets you run for log forwarding, etc.
3️⃣ Run one EKS cluster per VPC: When running at scale, AWS recommends one EKS cluster per VPC so that Kubernetes is the only consumer of IP addresses within the VPC. Any services outside Kubernetes that consume IP addresses will ultimately limit the number of pods you can deploy into your cluster.
🔨 Less obvious ones, based on your choice of tools
1️⃣ Use of Bottlerocket OS vs. the Amazon Linux EKS-optimized AMI: Bottlerocket is an open-source operating system that is purpose-built for running containers. Many general-purpose operating systems ship with a vast amount of software they never need, which adds overhead on the nodes. Bottlerocket is a minimal, security-focused operating system that includes only what is needed to run containers, which improves resource utilization. So rather than going with the Amazon Linux 2 EKS-optimized AMI, I decided to go with Bottlerocket and saw some performance improvement.
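As a rough sketch, switching Karpenter-provisioned nodes to Bottlerocket is essentially a one-line change in the node template. This assumes the Karpenter v1alpha APIs that were current when I ran this, and that your subnets and security groups are tagged karpenter.sh/discovery with your cluster name (my-eks-cluster below is a placeholder):

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: bottlerocket
spec:
  # Use the Bottlerocket AMI family instead of the default Amazon Linux 2 EKS AMI.
  amiFamily: Bottlerocket
  # Placeholder tags -- adjust to however your VPC resources are tagged.
  subnetSelector:
    karpenter.sh/discovery: my-eks-cluster
  securityGroupSelector:
    karpenter.sh/discovery: my-eks-cluster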
2️⃣ Use of Karpenter as the cluster autoscaler: One of the biggest advantages of Karpenter over Cluster Autoscaler is that Karpenter calls the EC2 API directly to launch or remove nodes and dynamically chooses the best EC2 instance types (as shown in the log below) and compute resources for the workload. Cluster Autoscaler, in contrast, works on Auto Scaling groups that need to be homogeneous, i.e., all the servers in a group must have the same configuration (same CPU and RAM). Karpenter has full knowledge of what AWS offers and is not constrained like Cluster Autoscaler, which has to keep many EC2 Auto Scaling group (ASG) features disabled (predictive scaling, node rebalancing, etc.).
2022-11-23T01:40:51.723Z INFO controller.provisioning launching node with 5
pods requesting {"cpu":"5125m","pods":"7"} from
types m3.2xlarge, c3.2xlarge, c5d.2xlarge, r3.2xlarge, c5.2xlarge
and 211 other(s) {"commit": "470aa83", "provisioner": "default"}
3️⃣ Choose the instance type for the worker nodes: You need to choose the instance type on which you run your workload. Here you will see the advantage of Karpenter, which automatically chooses instance types and doesn’t need them to be homogeneous. You also need to choose the pricing model to optimize cost. EKS supports On-Demand, Savings Plans, Reserved, and Spot Instances (with Spot, your workload can be interrupted and shut down within a 2-minute grace period). I ran a few tests and found that I get almost the same performance with Spot Instances at 70–90% less cost; a Provisioner sketch follows below. Please check this blog for more info https://www.101daysofdevops.com/courses/100-days-of-aws/lessons/day-48/
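To tie the last few points together, here is a minimal Provisioner sketch (again assuming the Karpenter v1alpha5 API): it requests Spot capacity, leaves the instance-type choice wide open so Karpenter can pick from the hundreds of types shown in the log above, and references the Bottlerocket node template from earlier. The CPU limit is a placeholder sized roughly for the 1000-pod workload.

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  providerRef:
    name: bottlerocket          # the AWSNodeTemplate defined above
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]          # add "on-demand" here for fallback capacity
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
  limits:
    resources:
      cpu: "2000"               # rough cap for a 1000-pod, up-to-2-CPU-per-pod workload
  ttlSecondsAfterEmpty: 30      # scale empty nodes back in quickly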
4️⃣ CloudWatch Container Insights: Use Container Insights for centralized logging and metrics. To figure out what’s going on in your application pods, we need an agent that forwards all the logs and metrics to a centralized location for easy searching and analysis. CloudWatch Container Insights provides a single pane to view all the metrics and logs. Container Insights uses Fluent Bit and the CloudWatch agent (as DaemonSets), and AWS has optimized their API call pattern: instead of making periodic list-pods calls against the API server, the agents talk locally to the kubelet endpoint to get pod metadata. Check this doc on how to install Container Insights https://www.eksworkshop.com/intermediate/250_cloudwatch_container_insights/cwcinstall/
🐘 Now the big and major ones
1️⃣ Understand how IP addresses are allocated to EKS pods: EKS assigns VPC IP addresses to your pods. When a worker node instance boots up, it gets an elastic network interface (ENI), and that ENI has a primary IP address from your VPC subnet. Each ENI is capable of supporting multiple IP addresses, and depending on the instance type, the node can attach additional ENIs, each with its own primary and secondary IP addresses. EKS assigns these secondary IP addresses to your pods, so the pods act as first-class citizens of your VPC.
Let’s take a simple example: m5.large supports a maximum of 3 network interfaces and 10 IPv4 addresses per interface. To calculate the maximum number of pods that any given instance type can support:
Max pods = (number of network interfaces for the instance type × (number of IP addresses per network interface − 1)) + 2
Based on the above formula, for m5.large:
Max pods = 3 × (10 − 1) + 2 = 3 × 9 + 2 = 27 + 2 = 29
You can do the complex calculation 😆, check this link where AWS has already figured it out https://github.com/awslabs/amazon-eks-ami/blob/master/files/eni-max-pods.txt, or download the max pods calculator https://docs.aws.amazon.com/eks/latest/userguide/choosing-instance-type.html. If you need more IPs for your pods, you need to switch to a bigger instance size or a different family, e.g., m5.xlarge supports up to 58 pods.
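The max pods calculator mentioned in the doc above is a small shell script; usage looks roughly like this (the flag names are from memory, so double-check them against the script’s --help):

curl -O https://raw.githubusercontent.com/awslabs/amazon-eks-ami/master/files/max-pods-calculator.sh
chmod +x max-pods-calculator.sh

# Max pods for m5.large with a given VPC CNI version
./max-pods-calculator.sh --instance-type m5.large --cni-version 1.9.0-eksbuild.1

# Same instance type with prefix delegation enabled (see the CNI tuning section below)
./max-pods-calculator.sh --instance-type m5.large --cni-version 1.9.0-eksbuild.1 \
  --cni-prefix-delegation-enabled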
2️⃣ Tune your Amazon VPC CNI plugin: The Amazon VPC Container Network Interface (CNI) plugin has configuration parameters you can use to tune its behavior and optimize the number of IP addresses assigned to your worker nodes. The default value of WARM_ENI_TARGET is 1, which tells the CNI to keep one spare ENI, with all of its available IP addresses, warm; as soon as pods start consuming those addresses, the CNI warms up an additional ENI and all of its IP addresses. This default provides a good balance between IP address consumption and the ability to scale pods quickly, but on an m5.large the CNI can end up pre-warming all three ENIs, for a total of 30 IP addresses. The WARM_IP_TARGET parameter instead specifies the number of free IP addresses the CNI keeps available for assignment to pods on the node. More details on both parameters:
WARM_ENI_TARGET
Use the WARM_ENI_TARGET variable to determine how many elastic network interfaces the L-IPAMD keeps available so pods are immediately assigned an IP address when scheduled on a node.
- To prevent depleting available IP addresses in the subnet, check the worker node instance type and the maximum number of network interfaces and private IPv4 addresses per interface. For example, if you set WARM_ENI_TARGET=3 for an m5.xlarge node, then three elastic network interfaces are always attached. The node then assigns 45 IP addresses, 15 per elastic network interface. Because the 45 IP addresses are reserved for this node, those addresses can’t be used for pods that are scheduled on other worker nodes.
- If you expect your application to scale greatly, you can use WARM_ENI_TARGET to accommodate newly scheduled pods quickly.
WARM_IP_TARGET
Use the WARM_IP_TARGET variable to ensure that you always have a defined number of IP addresses in the L-IPAMD’s warm pool.
- For smaller clusters, or clusters with low pod churn, use WARM_IP_TARGET so that only the required number of IP addresses are assigned to the network interface. This prevents a whole ENI’s worth of IP addresses from being reserved but never used.
Reference: https://aws.amazon.com/premiumsupport/knowledge-center/eks-configure-cni-plugin-use-ip-address/
If you want to update these values, create a YAML file (patch.yaml in the command below):
spec:
  template:
    spec:
      containers:
        - name: aws-node
          env:
            - name: WARM_IP_TARGET
              value: "10"
            - name: WARM_ENI_TARGET
              value: "3"
and then run the below command:
kubectl patch daemonset aws-node -n kube-system --patch "$(cat patch.yaml)"
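If you’d rather not maintain a separate patch file, the same environment variables can be set directly on the aws-node DaemonSet (equivalent effect, assuming the default DaemonSet name):

kubectl set env daemonset aws-node -n kube-system \
  WARM_IP_TARGET=10 WARM_ENI_TARGET=3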
Now, why is this parameter tuning important? In our use case, if we use m5.large, it has 2 CPUs, and each pod uses 1 CPU, so we can run at most two application pods at a time. With the default CNI configuration, we were consuming 30 IP addresses per worker node before running any application workload. As you can see, this is a complete waste of IP addresses. By tuning these parameters, you can control the warm pool of IP addresses on each worker node, which prevents IP address exhaustion by stopping the CNI from allocating IP addresses that are simply never going to be consumed.
Also, starting with version 1.9, the VPC CNI plugin offers increased pod density by assigning /28 IP address prefixes (i.e., 16 IPs at a time) to the network interfaces on worker nodes. Previously, the worker node had to acquire an individual IP address for each pod and release it when the pod was torn down. Prefix assignment enables more pods per node and reduces the number of calls to the Amazon EC2 APIs.
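Prefix delegation is also just an environment variable on the CNI DaemonSet; a quick sketch of enabling it (it only applies to newly launched, Nitro-based nodes, so recycle your worker nodes afterwards):

# Requires VPC CNI v1.9+; only affects nodes launched after the change.
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true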
3️⃣ Setting the right resource requests and limits: Once you are done with scalability testing and have come up with an optimal configuration, setting resource requests and limits correctly is critical. You don’t want to set the requests too low: Kubernetes schedules pods based on their requests, so low requests let it pack more pods onto each worker node. If those pods then get busy at the same time, there won’t be enough CPU for each pod to reach its limit, which leads to contention, throttling, and performance degradation in your application.
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
    - name: app
      image: images.my-company.example/app:v4
      resources:
        requests:
          memory: "64Mi"
          cpu: "250m"
        limits:
          memory: "128Mi"
          cpu: "500m"
🆚 Some other considerations
1️⃣ Scalability testing: Testing is always the critical part. Let’s say a customer is running 3 pods, which lets them serve a particular number of requests. Now they plan to bring in new customers and expect the workload to increase 100x, so they scale the number of pods to 300 (3 × 100). But is that realistic, or is that how the application actually scales? This is why testing is important: you need to understand your pods’ behavior. Spend some time figuring out the optimum configuration for your pods by varying the amount of CPU and memory and measuring the number of requests a pod can support; an HPA sketch based on those measurements follows below.
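Once you know roughly how many requests a pod can handle at a given CPU level, you can let the HPA scale on utilization instead of hard-coding 300 replicas. A minimal sketch, where the Deployment name frontend and the 70% target are placeholders to replace with your own measurements:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 3
  maxReplicas: 300
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # derived from your load-test results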
2️⃣ EKS support for IPv6: EKS clusters now support IPv6 https://docs.aws.amazon.com/eks/latest/userguide/cni-ipv6.html. I have reviewed some blog posts, and most of them report some improvement when switching to IPv6. Unfortunately, you can’t switch an existing cluster to IPv6, so you must build a new one. Moving to IPv6 is also a big network change and requires a complete networking revamp, so treat this as food for thought.
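If you do want to experiment, an IPv6 cluster has to be created that way from the start. A hedged eksctl sketch: the cluster name, region, version, and node group are placeholders, and my understanding is that eksctl needs OIDC plus the managed vpc-cni/coredns/kube-proxy add-ons for IPv6, so verify against the eksctl docs.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ipv6-demo            # placeholder cluster name
  region: us-west-2
  version: "1.24"
kubernetesNetworkConfig:
  ipFamily: IPv6             # IPv6 must be chosen at cluster creation time
addons:
  - name: vpc-cni
  - name: coredns
  - name: kube-proxy
iam:
  withOIDC: true
managedNodeGroups:
  - name: ng-1
    instanceType: m5.large
    desiredCapacity: 2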
🙇♂️ Some Assumptions
1️⃣ I am not measuring the performance of the Kubernetes control plane. You can always contact AWS support and pre-scale your control plane.
2️⃣ The main aim of this blog is to give you a frame of reference, not to show you the highest possible level of performance. To squeeze out more performance, you could use a custom scheduler or a different container runtime (e.g., Docker vs. containerd).
⚔ Final Words
Thank you, everyone, for taking the time to read this long blog. Also, thanks to AWS support for their prompt response in increasing the quota limits. The one thing I don’t want to check now is my AWS bill 🤑🤑. If you have any questions, please feel free to reach out to me via https://linktr.ee/prashant.lakhera.