Running Elastic Horovod Training
Horovod is a distributed deep learning training framework that helps make the learning process faster and easier.
Horovod is well suited to training jobs on preemptible or spot instances because it provides a dedicated elastic mode. Elastic training lets Horovod scale the number of workers up and down dynamically at runtime.
We created a script for running Horovod with TensorFlow based on the official TensorFlow tutorial and extended it with the help of this Horovod tutorial.
Here's how you can use it with Neu.ro. First, clone the example repository:
$ git clone https://github.com/neuro-inc/mlops-horovod-example.git
Once it's cloned, switch to its local root folder:
$ cd <local-repo-path>
Then, generate an SSH key to allow SSH-based communication between jobs:
$ ssh-keygen -t rsa -b 4096 -f ssh-keys/id_rsa -q -N ""
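Before uploading the keys as secrets, it can be worth confirming that both files were actually written. The helper below is a minimal sketch and not part of the example repository: the function name `check_key_files` is hypothetical, and it assumes the keys live in the `ssh-keys` directory used by the command above.

```shell
# Hypothetical pre-flight check: succeed only if both key files exist and are non-empty.
# Usage: check_key_files [key-directory]   (defaults to ssh-keys)
check_key_files() {
  key_dir="${1:-ssh-keys}"
  for key_file in "$key_dir/id_rsa" "$key_dir/id_rsa.pub"; do
    if [ ! -s "$key_file" ]; then
      echo "missing or empty: $key_file" >&2
      return 1
    fi
  done
  return 0
}
```

For example, `check_key_files ssh-keys && echo "keys ready"` prints the message only when both halves of the key pair are present.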
Now, store the private part of the SSH key. It will be used by the main node to coordinate secondary nodes:
$ neuro secret add horovod-id-rsa @ssh-keys/id_rsa
Then store the public part of the SSH key, which the secondary nodes use to validate the main node:
$ neuro secret add horovod-id-rsa-pub @ssh-keys/id_rsa.pub
To run the main training node, execute the following command:
$ neuro-flow run main
This job will additionally wait for 600 seconds for the secondary nodes to appear (see "--start-timeout 600" in the main job's bash section).
Execute this command a few times to spawn worker nodes:
$ neuro-flow run secondary
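Repeating the command by hand works, but a small loop can do the same. The sketch below assumes that each `neuro-flow run secondary` invocation returns once the job is submitted; if it stays attached to the job's output in your setup, run the command in separate terminals instead. `run_n_times` is a hypothetical helper name.

```shell
# Hypothetical helper: run the given command a fixed number of times.
# Usage: run_n_times <count> <command> [args...]
run_n_times() {
  count="$1"
  shift
  i=0
  while [ "$i" -lt "$count" ]; do
    "$@" || return 1   # stop at the first failed launch
    i=$((i + 1))
  done
}

# Example (commented out so that sourcing this file launches nothing):
# run_n_times 2 neuro-flow run secondary
```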
For this example, at least two workers are needed (see "--num-proc 2" in the main job's bash section). All of these worker nodes will be connected to the Horovod instance. You can also change the number of secondary nodes during the training process. Each of them will be connected and synchronized with the main training process. This is the main feature of Horovod in elastic mode.
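In elastic mode, `horovodrun` learns about the current set of workers by periodically running a host discovery script (passed via `--host-discovery-script`) that prints one `hostname:slots` line per available host. How the example repository implements discovery is not covered here; the sketch below only illustrates the expected output format. The function name `discover_hosts` and the idea of reading hostnames from a plain text file are assumptions made for illustration.

```shell
# Hypothetical discovery helper: print one "<host>:<slots>" line per host in a file.
# Usage: discover_hosts <hosts-file> [slots-per-host]   (slots default to 1)
discover_hosts() {
  hosts_file="$1"
  slots="${2:-1}"
  # Read line by line; the extra test keeps a final line without a trailing newline.
  while IFS= read -r host || [ -n "$host" ]; do
    case "$host" in
      ''|'#'*) continue ;;   # skip blank lines and comments
    esac
    printf '%s:%s\n' "$host" "$slots"
  done < "$hosts_file"
}
```

Adding or removing a hostname in the file changes what the next discovery run reports, which is how new or lost workers become visible to the running job.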