
Introducing HorovodRunner for Distributed Deep Learning Training


Today, we are excited to introduce HorovodRunner in our Databricks Runtime 5.0 ML!

HorovodRunner provides a simple way to scale up your deep learning training workloads from a single machine to large clusters, reducing overall training time.

Many of our users want to train deep learning models on datasets that do not fit on a single machine, and they want to reduce overall training time. HorovodRunner addresses this need by distributing training across your cluster, processing more data per second and cutting training time from hours to minutes.

As part of Project Hydrogen, an effort to integrate distributed deep learning with Apache Spark, HorovodRunner uses the barrier execution mode introduced in Apache Spark 2.4. This execution mode differs from the generic Spark execution model: it is tailored to distributed training, with fault-tolerance semantics and task-to-task communication suited to jobs that run across every worker node in the cluster.
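
To give a sense of what barrier mode looks like (this is a minimal sketch of Spark's barrier API, not of HorovodRunner's internals), a barrier stage gang-schedules all of its tasks together and lets each task discover its peers before training starts:

from pyspark import BarrierTaskContext

def barrier_fn(iterator):
  # Wait until every task in the stage has reached this point.
  context = BarrierTaskContext.get()
  context.barrier()
  # Each task can see its peers' addresses, e.g. to set up communication between workers.
  yield [info.address for info in context.getTaskInfos()]

# `sc` is an existing SparkContext; barrier() marks the stage for barrier scheduling.
sc.parallelize(range(4), 4).barrier().mapPartitions(barrier_fn).collect()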

In this blog, we describe HorovodRunner and show how its simple API lets you train your deep learning model in a distributed fashion, while Apache Spark handles all the coordination and communication among tasks on the worker nodes in the cluster.

HorovodRunner’s Simple API

Horovod, Uber’s open source distributed training framework, supports TensorFlow, Keras, and PyTorch. HorovodRunner, built on top of Horovod, inherits support for these deep learning frameworks and makes distributed training much easier to run.

Under the hood, HorovodRunner shares code and libraries across machines, configures SSH, and executes the complicated MPI commands required for distributed training.

As a result, data scientists are freed from the burden of these operational requirements and can focus on the tasks at hand—building models, experimenting, and deploying them to production quickly.

HorovodRunner also provides a simple interface for distributing your workloads across a cluster. For example, the snippet below runs the train function on 4 workers. This can help you achieve good scaling of your workloads, accelerate model experimentation, and shorten the time to production.

from sparkdl import HorovodRunner
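# np=4 runs the train function on 4 workers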
hr = HorovodRunner(np=4)
hr.run(train, batch_size=512, epochs=5)

The train method below contains the Horovod training code. The sample code outlines the small changes needed to adapt a single-node workload to Horovod. With a few lines of code changes and HorovodRunner, you can start leveraging the power of a cluster in a matter of minutes.

import horovod.keras as hvd
import keras

def train(batch_size=512, epochs=12):

  # initialize horovod here
  hvd.init()

  model = get_model()

  # split your training and testing data based on
  # Horovod rank and size
  (x_train, y_train), (x_test, y_test) = get_data(hvd.rank(), hvd.size())
  opt = keras.optimizers.Adadelta()

  # Overwrite your optimizer with Horovod Distributed Optimizer
  opt = hvd.DistributedOptimizer(opt)

  # compile your model
  model.compile(loss=keras.losses.categorical_crossentropy, optimizer=opt, metrics=['accuracy'])

  # fit the model
  model.fit(x_train, y_train,
    batch_size=batch_size,
    epochs=epochs,
    verbose=2,
    validation_data=(x_test, y_test))
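
The example above keeps only the essential changes. Horovod’s Keras API also offers optional helpers that are commonly used in practice, such as scaling the learning rate by the number of workers and broadcasting the initial weights from rank 0; the sketch below illustrates these as typical usage, and is not part of the example above.

import horovod.keras as hvd
import keras

# A common Horovod pattern: scale the learning rate by the number of workers.
opt = keras.optimizers.Adadelta(lr=1.0 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

callbacks = [
  # Broadcast initial variables from rank 0 so every worker starts from the same weights.
  hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Pass callbacks=callbacks to model.fit(...) inside train().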

Integrated Workflow on Databricks

HorovodRunner launches Horovod training jobs as Spark jobs, so your development workflow is exactly the same as for other Spark jobs on Databricks. For example, you can check training logs from the Spark UI, as shown in animated Fig 1. below.

https://www.youtube.com/watch?v=tlbK4ODANZU

Or you can just as easily trace an error back to the notebook cell and code that produced it, as shown in animated Fig 2. below.

https://www.youtube.com/watch?v=3uQPrfMKEog

Tools like TensorBoard and Horovod Timeline are also supported within Databricks.

Get started!

To get started, check out the example notebooks that classify the MNIST dataset using TensorFlow, Keras, or PyTorch in Databricks Runtime 5.0 ML! To migrate your single-node workloads to a distributed setting, you can simply follow the steps outlined in this documentation.

Try Databricks today with Apache Spark 2.4 and Databricks Runtime 5.0.

Read More

Get more info on Horovod and HorovodEstimator