
Kubernetes: upgrading from autoscaling/v2beta1 to autoscaling/v2beta2

We use a HorizontalPodAutoscaler Kubernetes resource to scale the Pods that work off items from our AWS SQS queues. We found the scale-up to be very aggressive and wondered whether the newer API version would help. I couldn’t find any documentation about the syntax change in v2beta2 for object metrics, and since I spent more than an hour working it out from the raw spec, I thought I would put the changes here in case it helps anyone else.

Before (v2beta1):

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: pph-notifications-listener
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pph-notifications-listener
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Object
    object:
      metricName: redacted_qname_sqs_approximatenumberofmessages
      target:
        kind: Namespace
        name: default
      targetValue: 250

After (v2beta2):

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: pph-notifications-listener
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pph-notifications-listener
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Object
    object:
      metric:
        name: redacted_qname_sqs_approximatenumberofmessages
      describedObject:
        kind: Namespace
        name: default
      target:
        type: Value
        value: 250

The HPA gets these metrics from our Kubernetes custom metrics API, which gets them from Prometheus, which gets them via a ServiceMonitor from sqs-exporter, which gets them from CloudWatch. Simple!
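For context, the ServiceMonitor link in that chain looks roughly like the sketch below. This is a minimal sketch only: the Service label (app: sqs-exporter), the port name and the scrape interval are assumptions for illustration, not our actual manifest.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sqs-exporter
  namespace: default
spec:
  selector:
    matchLabels:
      app: sqs-exporter   # assumed label on the sqs-exporter Service
  endpoints:
  - port: metrics         # assumed named port exposing /metrics
    interval: 30s         # assumed scrape interval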

Look how aggressive the v2beta1 scale-up is:

See how, the moment the value goes over the target, Pods are scaled up to the max! The problem is that, because we use EKS, a managed service, the kube-controller-manager runs on a master node we don’t control, so we can’t change key settings like --horizontal-pod-autoscaler-sync-period or --horizontal-pod-autoscaler-downscale-stabilization (ref).
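On a self-managed control plane those flags would be set on the kube-controller-manager itself, for example in its static Pod manifest. The fragment below is a hypothetical illustration (the values are just examples), and it is exactly the kind of change EKS doesn’t let you make.

# Fragment of a kube-controller-manager static Pod manifest (hypothetical; not editable on EKS)
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --horizontal-pod-autoscaler-sync-period=30s              # how often the HPA control loop runs (default 15s)
    - --horizontal-pod-autoscaler-downscale-stabilization=10m  # how long to wait before scaling down (default 5m)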

Update: unfortunately, upgrading to v2beta2 didn’t help with our aggressive scale-up problem:

Figuring out this issue is important: the surge in resource requests makes our cluster grow and shrink needlessly, and the extra Pod churn makes observability harder and generates more logs than necessary.
