Enhanced autoscale capabilities in HDInsight clusters

sairamyeturi · ‎May 10 2023

Today, we announce HDInsight Autoscale with enhanced capabilities which include improved latency and support for recommissioning node managers in case of load-based autoscale. Together, they improve cluster utilization massively and lower the total cost of ownership (TCO) significantly. Additionally, we also introduce customizable parameters on Autoscale to meet customer needs and bring in more flexibility to tune based on customer preferences.

Note: Effective 17^th May 2023, the enhanced autoscale capabilities are available for HDInsight customers on supported workloads. HDInsight Autoscale currently supports various cluster shapes and was released for general availability on November 7th, 2019.

Improved Latency

Autoscale feature helps customers to leverage elasticity on cloud, and HDInsight autoscale now offers significant improvements in scale-up and scale-down latencies for load-based and schedule based autoscaling. In the enhanced version, the average latency for scaling has been cut down nearly by 3x. With enhanced Autoscale, HDInsight utilizes a new workflow which is fast and reliable, this aids to the improved provisioning workflows, making scaling decisions more effective.

The below numbers indicate the latency improvements with enhanced autoscale. They account for scaling 50 nodes (Spark) cluster:

Scaling	Cluster Type	Old	Enhanced
Scale Up	ESP - Spark	~29 mins	~ 10 mins
Scale Up	Non-ESP - Spark	~25 mins	~ 7 mins
Scale Down	ESP - Spark	~ 15 mins	~ 4 mins
Scale Down	Non-ESP - Spark	~ 11 mins	~ 0.5 mins

Note: Above latency numbers are measured on clusters without custom script actions. These results are just representative of the improvements and actual numbers may vary subjective to customization done by each customer however significant improvements can be observed.

Recommissioning of nodes before scale-up

HDInsight autoscale has introduced support for recommissioning nodes before provisioning new nodes. If there are nodes in decommissioning state waiting to gracefully decommission, autoscale will select those nodes (as per requirement) and recommission them so that they can be utilized to share the increased load. Cluster load is re-evaluated after a cool down period and if needed new nodes are added. This feature significantly reduces the time to add new cluster capacity (as recommission takes seconds). This feature is available in load-based autoscale on supported cluster shapes (Spark and Hive) that used YARN for resource management.

Note: Spark Streaming support with Autoscale is on the roadmap.

How to custom configure?

Following are few configurations that can be tuned to custom configure HDInsight Autoscale as per customer needs.

This is applicable for 4.0 and 5.0 stacks.

Configuration	Description	Default value	Applicable cluster/ Autoscale type	Remarks
yarn.4_0.graceful.decomm.workaround.enable	Enable YARN graceful decommissioning	Loadbased autoscale – True Scheduled autoscale - True	Hadoop/Spark	If this configuration is disabled, YARN puts nodes in Decommissioned state directly from Running state without waiting for the applications using the node to finish. This might lead to applications getting killed abruptly when nodes are decommissioned. Read more about job resiliency in YARN here
yarn.graceful.decomm.timeout	YARN graceful decommissioning timeout in seconds	Hadoop Loadbased – 3600 Spark loadbased – 86400 Spark Scheduled - 1 Hadoop Scheduled – 1	Hadoop/Spark	Graceful decommissioning timeout is best configured according to customer applications. For example – if an application has many mappers and few reducers which can take 4 hours to complete, this configuration needs to be set to more than 4 hours
yarn.max.scale.up.increment	Maximum number of nodes to scale up in one go	200	Hadoop/Spark/Interactive Query	It has been tested with 200 nodes. We do not recommend configuring this to more than 200. It can be set to less than 200 if the customer wants less aggressive scale up
yarn.max.scale.down.increment	Maximum number of nodes to scale up in one go	50	Hadoop/Spark/Interactive Query	Can be set to up to 100
nodemanager.recommission.enabled	Feature to enabled recommissioning of decommissioning NMs before adding new nodes to the cluster	True	Hadoop/Spark load based autoscaling	Disabling this feature can cause underutilization of cluster (there can be nodes in decommissioning state which have no containers running but are waiting for application to finish) even if there is more load in the cluster Note: Applicable for images on 2304280205 or later
UnderProvisioningDiagnoser.time.ms	Time in milliseconds for which cluster needs to under provisioned for scale up to trigger	180000	Hadoop/Spark load based autoscale
OverProvisioningDiagnoser.time.ms	Time in milliseconds for which cluster needs to be overprovisioned for scale down to trigger	180000	Hadoop/Spark load based autoscaling
hdfs.decommission.enable	Decommission datanodes before triggering decommissioning nodemanagers. HDFS does not support any graceful decommission timeout, it’s immediate	True	Hadoop/Spark	Decommissioning datanodes before decommissioning nodemanagers so that particular datanode is not used for storing shuffle data
scaling.recommission.cooldown.ms	Cooldown period after recommission during which no metrics are sampled	120000	Hadoop/Spark load based autoscaling	This cooldown period ensures the cluster has some time to re-distribute the load to the newly recommissioned nodemanagers Note: Applicable for images on 2304280205 or later
scale.down.nodes.with.ams	Scale down nodes where an AM is running	false	Hadoop/Spark	Can be turned on if there are enough reattempts configured for the AM. Useful for cases where there are long running applications (example spark streaming) which can be killed for scaling down cluster if load has reduced Note: Applicable for images on 2304280205 or later

Note

The above configs can be changed using this script run on the headnodes as a script action, please use this readme to understand how to run the script.
Customers are advised to test the configurations on lower environments before moving to production.
How to check image version - https://learn.microsoft.com/en-us/azure/hdinsight/view-hindsight-cluster-image-version

Other Improvements

Master services restart: In the past, Autoscale would restart master services (Hive, Resource manager, etc.) during scale operation to account for all nodes. This was a shortcoming in YARN. This is fixed in the current release. We no longer restart any services during scale operation, thereby, ensuring smooth operation for running applications.
Decommissioning nodes in YARN: Nodes in decommissioning state become untracked by RM when RM fails over [Reference- YARN-10896], Due to this open issue, when Autoscale checks for decommissioned nodes, it was missing these nodes, and these worker nodes became zombies. During scale down autoscale was not able to remove them automatically, in our latest release we fixed this issue on HDInsight.
Zombie Nodes: Several improvements have been made to reduce the zombie node occurrences, HDInsight has enhanced the dependability of Ambari's response to the startup operation, which is responsible for handling the startup of components.
ESP Clusters: The Kerberos principal clean up during scale down operation has been improved, which will prevent issues such as 100% CPU spikes and authentication impediments with AADDS.

Best Practices for HDInsight Autoscale

Ambari DB & Head Node Sizing: Always Size your Ambari DB & Head Node for ESP and NonESP to optimally benefit from Autoscale.
Script actions: In case you are using script actions, during cluster provisioning, the script runs concurrently with other setup and configuration processes. Competition for resources such as CPU time or network bandwidth might cause the script to take longer to finish than it does in your development environment. To minimize the time it takes to run the script, avoid tasks like downloading and compiling applications from the source. Precompile applications and store the binary in Azure Storage. In general, optimize the script actions to have faster scaling latency. It is observed that clusters without script actions scale faster than the ones with script actions.
AADDS: In case of an ESP cluster, you can choose to sync only the groups that need access to the HDInsight clusters. This option of syncing only certain groups is called scoped synchronization. For instructions, see Configure scoped synchronization from Azure AD to your managed domain. It is recommended to use scoped sync to have optimal impact on scaling if you have aggressive scaling needs on your cluster.

Moving to Enhanced Autoscale

If you are having an Existing Cluster - Using Autoscale [Image version has 2205241602 or later]

Perform Disable/Enable Autoscale
Run a script action once enabled, and it moves to Enhanced Autoscale

Any New Cluster Creation post 17^th May 2023 [Image version contains 2304280205 or later]

Enable Autoscale

Note:

To experience recommissioning benefits, we recommend recreating cluster with latest image [2304280205 or later]
How to check image version - https://learn.microsoft.com/en-us/azure/hdinsight/view-hindsight-cluster-image-version
Refer release notes - Release notes for Azure HDInsight | Microsoft Learn

FAQs

How does YARN Graceful Decommissioning work and what are its benefits?

To achieve full elasticity in YARN, we need a decommissioning process which helps to remove existing nodes and down-scale the cluster. There are two ways to decommission a nodemanager – NORMAL or GRACEFUL.

Normal decommissioning means an immediate shutdown.
Graceful decommissioning is the mechanism to decommission NMs while minimizing impact to the running applications. Once a node is in DECOMMISSIONING state, RM won’t schedule new containers on it and will wait for running containers and applications to complete (or until decommissioning timeout exceeded) before transition the node into DECOMMISSIONED.

Nodes once DECOMMISSIONED will be removed from the HDInsight cluster.

How does autoscale work if I configure multiple YARN queues?

HDInsight autoscale is not queue-aware. For scaling up, it considers only the total available capacity vs total pending capacity. Consider this scenario – There are 4 queues configured, each with 25% max capacity. If only one of the queue is running beyond capacity and there is no load on the other queues, the total pending capacity might not be enough to add new node to the cluster as total pending capacity might still be less than total available capacity.
Customer should consider enough elasticity in yarn queue configurations to consider for these kinds of scenarios.

Why spark/hive job takes longer to complete when autoscale is configured?

If graceful timeout is configured to be too small, it might cause early exit of the containers running on a decommissioning node, thereby, triggering retries.

Limitations

Graceful decommissioning is not persistent across RM restarts

If RM restarts (or fails over), the decommissioning nodes will get shutdown immediately without honoring the graceful decommissioning timeout; this is an open YARN limitation.