Cluster Autoscaler is one of the coolest features of Kubernetes (at least for me). This article explains in depth what happens under the hood when Cluster Autoscaler scales a cluster up or down.

Learn More on how to set up Cluster Autoscaler in Kubernetes running on AWS.

How does scale-up work?

Scale-up creates a watch on the API server looking for all pods. It checks for any unschedulable pods every 10 seconds (configurable by the --scan-interval flag). A pod is unschedulable when the Kubernetes scheduler is unable to find a node that can accommodate it. For example, a pod can request more CPU than is available on any of the cluster nodes. Unschedulable pods are recognized by their PodCondition. Whenever the Kubernetes scheduler fails to find a place to run a pod, it sets the pod's “PodScheduled” condition to false and the reason to “Unschedulable”. If there are any items in the unschedulable pods list, Cluster Autoscaler tries to find a new place to run them.
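
The check described above can be sketched in a few lines of Python. This is only an illustration, not Cluster Autoscaler's actual code (which is written in Go): pods are plain dictionaries shaped like the status.conditions part of a Pod object, the scheduling condition is assumed to be the one the Kubernetes API names PodScheduled, and the function names and sample pods are made up for the example.

```python
def is_unschedulable(pod: dict) -> bool:
    """True if the scheduler has marked the pod as impossible to place."""
    conditions = pod.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "PodScheduled"
        and c.get("status") == "False"
        and c.get("reason") == "Unschedulable"
        for c in conditions
    )


def unschedulable_pods(pods: list[dict]) -> list[dict]:
    """Filter the full pod list down to the pods that need a new home."""
    return [p for p in pods if is_unschedulable(p)]


# Hypothetical sample data: one running pod, one pod the scheduler gave up on,
# and one brand-new pod the scheduler has not looked at yet.
pods = [
    {"metadata": {"name": "web"},
     "status": {"conditions": [{"type": "PodScheduled", "status": "True"}]}},
    {"metadata": {"name": "big-job"},
     "status": {"conditions": [{"type": "PodScheduled", "status": "False",
                                "reason": "Unschedulable"}]}},
    {"metadata": {"name": "new-pod"}, "status": {}},
]

pending = [p["metadata"]["name"] for p in unschedulable_pods(pods)]
print(pending)
```

Note that the brand-new pod is not reported: a pod only counts as unschedulable once the scheduler has tried and failed to place it.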

It is assumed that the underlying cluster runs on top of some kind of node group (on AWS, this is an Auto Scaling Group). Inside a node group, all machines have identical capacity and the same set of assigned labels. Thus, increasing the size of a node group will create a new machine similar to those already in the cluster. The new machine will just not have any user-created pods running (but will have all pods run from the node manifest and daemon sets).

Based on the above assumption, Cluster Autoscaler creates template nodes for each of the node groups and checks if any of the unschedulable pods would fit on a new node. While it may sound similar to what the real scheduler does, it is currently quite simplified and may require multiple iterations before all of the pods are eventually scheduled.
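
To make the template-node idea more concrete, here is a heavily simplified Python sketch. It assumes only CPU, memory and node selectors matter, and it tests each pod on its own against an empty node, which is much less than the real simulation does. The class names, field names and sample values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TemplateNode:
    """An imaginary, empty node that a node group would produce if it grew by one."""
    group: str
    cpu_millis: int
    memory_mib: int
    labels: dict = field(default_factory=dict)


@dataclass
class PendingPod:
    name: str
    cpu_millis: int
    memory_mib: int
    node_selector: dict = field(default_factory=dict)


def fits(pod: PendingPod, node: TemplateNode) -> bool:
    """Check resource requests and node selector against a fresh node."""
    if pod.cpu_millis > node.cpu_millis or pod.memory_mib > node.memory_mib:
        return False
    return all(node.labels.get(k) == v for k, v in pod.node_selector.items())


def groups_that_help(pods: list[PendingPod], templates: list[TemplateNode]) -> dict:
    """For each node group, list the pending pods that one new node could take."""
    return {t.group: [p.name for p in pods if fits(p, t)] for t in templates}


templates = [
    TemplateNode("small", cpu_millis=2000, memory_mib=4096),
    TemplateNode("gpu", cpu_millis=8000, memory_mib=32768, labels={"accel": "gpu"}),
]
pending = [
    PendingPod("web", 500, 512),
    PendingPod("big", 4000, 8192),
    PendingPod("trainer", 1000, 1024, node_selector={"accel": "gpu"}),
]

result = groups_that_help(pending, templates)
print(result)
```

In this made-up example only the "gpu" group can help every pending pod, so growing the "small" group would leave two pods pending.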

It may take some time before the created nodes appear in Kubernetes. It almost entirely depends on the cloud provider and the speed of node provisioning. Cluster Autoscaler expects requested nodes to appear within 15 minutes (configured by --max-node-provision-time flag.) After this time, if they are still unregistered, it stops considering them in simulations and may attempt to scale up a different group if the pods are still pending. It will also attempt to remove any nodes left unregistered after this time.
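
The timeout rule can be illustrated with a small Python sketch. The 15-minute default comes from the text above; everything else (the function name, the idea of tracking request times in a dictionary, the node names) is an assumption made for the example.

```python
from datetime import datetime, timedelta

# Default value of --max-node-provision-time as stated in the article.
MAX_NODE_PROVISION_TIME = timedelta(minutes=15)


def timed_out_requests(requested_at: dict, registered: set, now: datetime,
                       limit: timedelta = MAX_NODE_PROVISION_TIME) -> list:
    """Names of requested nodes that never registered within the limit."""
    return sorted(
        name
        for name, asked in requested_at.items()
        if name not in registered and now - asked > limit
    )


now = datetime(2024, 1, 1, 12, 0)
requested_at = {
    "node-a": now - timedelta(minutes=20),  # registered in time
    "node-b": now - timedelta(minutes=20),  # still missing, too late
    "node-c": now - timedelta(minutes=5),   # still missing, but within limit
}
registered = {"node-a"}

stale = timed_out_requests(requested_at, registered, now)
print(stale)
```

Nodes returned by this check are the ones that would be dropped from the simulation and cleaned up, while the still-pending pods may trigger a scale-up in another group.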

How does scale-down work?

Every 10 seconds (configurable by the --scan-interval flag), if no scale-up is needed, Cluster Autoscaler checks which nodes are unneeded. A node is considered for removal when all of the following conditions hold:

  • The sum of cpu and memory requests of all pods running on this node is smaller than 50% of the node’s allocatable. (Before 1.1.0, node capacity was used instead of allocatable.) Utilization threshold can be configured using --scale-down-utilization-threshold flag.
  • All pods running on the node (except those that run on all nodes by default, like manifest-run pods or pods created by daemonsets) can be moved to other nodes. The following types of pods can prevent CA from removing a node:
    • Pods with restrictive PodDisruptionBudget.
    • Kube-system pods that:
      • are not run on the node by default, *
      • don’t have PDB or their PDB is too restrictive (since CA 0.6)
    • Pods that are not backed by a controller object (so not created by deployment, replica set, job, stateful set etc). *
    • Pods with local storage. *
    • Pods that cannot be moved elsewhere due to various constraints (lack of resources, non-matching node selectors or affinity, matching anti-affinity, etc)
    • Pods that have the following annotation set:
      • "": "false"
    While checking this condition, the new locations of all moved pods are memorized. With that, Cluster Autoscaler knows where each pod can be moved, and which nodes depend on which other nodes in terms of pod migration. Of course, it may happen that eventually the scheduler will place the pods somewhere else.
  • It doesn’t have scale-down disabled annotation:
    • "": "true"
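
The conditions above can be combined into a rough Python sketch. It assumes that CPU and memory are each compared against the node's allocatable separately and that the larger of the two ratios is the one checked against the threshold; the article does not spell out how the two are combined. The "blocking pods" count and the "scale-down disabled" flag are passed in as plain values, and all names are hypothetical.

```python
def utilization(pod_requests: list, allocatable: dict) -> float:
    """Larger of the CPU and memory request ratios (an assumption, see above)."""
    cpu = sum(p["cpu_millis"] for p in pod_requests) / allocatable["cpu_millis"]
    mem = sum(p["memory_mib"] for p in pod_requests) / allocatable["memory_mib"]
    return max(cpu, mem)


def is_scale_down_candidate(pod_requests: list, allocatable: dict,
                            blocking_pods: int = 0,
                            scale_down_disabled: bool = False,
                            threshold: float = 0.5) -> bool:
    """A node is a candidate only if every condition from the list holds."""
    if scale_down_disabled:   # node carries the scale-down-disabled annotation
        return False
    if blocking_pods > 0:     # some pod cannot be moved elsewhere
        return False
    return utilization(pod_requests, allocatable) < threshold


allocatable = {"cpu_millis": 4000, "memory_mib": 8192}
pod_requests = [
    {"cpu_millis": 500, "memory_mib": 1024},
    {"cpu_millis": 250, "memory_mib": 512},
]

print(utilization(pod_requests, allocatable))
print(is_scale_down_candidate(pod_requests, allocatable))
print(is_scale_down_candidate(pod_requests, allocatable, blocking_pods=1))
```

The default threshold of 0.5 corresponds to the 50% figure that --scale-down-utilization-threshold controls.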

If a node is unneeded for more than 10 minutes, it will be deleted. Cluster Autoscaler deletes one non-empty node at a time to reduce the risk of creating new unschedulable pods. The next node may be deleted right after the first one, if it was also unneeded for more than 10 minutes and didn’t rely on the same nodes in simulation, but the two are never deleted together. Empty nodes, on the other hand, can be deleted in bulk, up to 10 nodes at a time (configurable by the --max-empty-bulk-delete flag).
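
A rough Python sketch of one deletion pass, under a few assumptions that the article does not state: empty nodes are handled before non-empty ones within a pass, candidates are picked in name order, and the check that two nodes don't rely on the same nodes in simulation is left out. The 10-minute and 10-node values are the defaults mentioned above; the function and variable names are invented.

```python
from datetime import datetime, timedelta

UNNEEDED_TIME = timedelta(minutes=10)  # how long a node must stay unneeded
MAX_EMPTY_BULK_DELETE = 10             # default of --max-empty-bulk-delete


def plan_deletion(unneeded_since: dict, empty: set, now: datetime) -> list:
    """Pick the nodes to delete in this pass."""
    ripe = sorted(
        name for name, since in unneeded_since.items()
        if now - since > UNNEEDED_TIME
    )
    ripe_empty = [name for name in ripe if name in empty]
    if ripe_empty:
        # Empty nodes carry no pods, so several can go at once.
        return ripe_empty[:MAX_EMPTY_BULK_DELETE]
    # Non-empty nodes are removed one at a time.
    return ripe[:1]


now = datetime(2024, 1, 1, 12, 0)
unneeded_since = {
    "node-a": now - timedelta(minutes=12),  # non-empty, ripe
    "node-b": now - timedelta(minutes=30),  # non-empty, ripe
    "node-c": now - timedelta(minutes=3),   # not unneeded long enough
}

first_pass = plan_deletion(unneeded_since, empty=set(), now=now)
bulk_pass = plan_deletion(unneeded_since, empty={"node-a", "node-b"}, now=now)
print(first_pass, bulk_pass)
```

With no empty nodes, only a single node is chosen even though two are ripe; when both ripe nodes are empty, they are chosen together.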

Also read FAQ on Cluster Autoscaler.

Feel free to ask any questions in the comments section below if you have any. Hope you find this article helpful.