Running Elastic Horovod Training
Horovod is well suited to training on preemptible or spot instances because it has a dedicated elastic mode for this case. Elastic training lets Horovod scale the number of workers up and down dynamically at runtime.
Here's how you can use it with Neu.ro. First, clone the example repository:
$ git clone https://github.com/neuro-inc/mlops-horovod-example.git
Once it's cloned, switch to its local root folder:
$ cd <local-repo-path>
Then, generate an SSH key to allow SSH-based communication between jobs:
$ ssh-keygen -t rsa -b 4096 -f ssh-keys/id_rsa -q -N ""
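If you want the key generation step to be repeatable, it can be wrapped in a small helper. The following is only a sketch: the function name `ensure_keypair` is illustrative and not part of the example repository, and it assumes `ssh-keygen` is available on your machine. It creates the target directory if it is missing, generates the key with the same options as above only when no key exists yet, and checks that the public key matches the private one.

```shell
# Sketch only: ensure_keypair is a hypothetical helper, not part of the repository.
ensure_keypair() {
  local key_dir="${1:-ssh-keys}"
  local key_file="${key_dir}/id_rsa"

  # ssh-keygen does not create missing parent directories.
  mkdir -p "${key_dir}" || return 1

  if [ ! -f "${key_file}" ]; then
    # Same options as the command above: 4096-bit RSA, quiet, empty passphrase.
    ssh-keygen -t rsa -b 4096 -f "${key_file}" -q -N "" || return 1
  fi

  # Keep the private key readable by its owner only.
  chmod 600 "${key_file}" || return 1

  # Derive the public key from the private key and compare it with id_rsa.pub.
  # Only the key type and key data are compared; the trailing comment may differ.
  local derived stored
  derived="$(ssh-keygen -y -f "${key_file}" | cut -d' ' -f1,2)"
  stored="$(cut -d' ' -f1,2 "${key_file}.pub")"
  [ "${derived}" = "${stored}" ]
}
```

Calling `ensure_keypair ssh-keys` from the repository root would then leave the key pair at the paths used in the following steps.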
Now, store the private part of the SSH key as a secret. The main node uses it to connect to and coordinate the secondary nodes:
$ neuro secret add horovod-id-rsa @ssh-keys/id_rsa
Then store the public part of the SSH key, which the secondary nodes use to validate the main node:
$ neuro secret add horovod-id-rsa-pub @ssh-keys/id_rsa.pub
To run the main training node, execute the following command:
$ neuro-flow run main
This job will also wait up to 600 seconds for the secondary nodes to appear (see "--start-timeout 600" in the bash description).
Execute this command a few times to spawn worker nodes:
$ neuro-flow run secondary
For this example, at least two workers are needed (see "--num-proc 2" in the bash description). All of these worker nodes will be connected to the Horovod instance.
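To see how the two flags mentioned above fit together, here is a dry-run sketch that only prints a possible `horovodrun` command line. The values 2 and 600 come from the text; the discovery script path and the training script name are placeholders chosen for illustration, and the actual command in the example repository may look different.

```shell
# Values taken from the text above.
NUM_PROC=2
START_TIMEOUT=600
# Placeholders (assumptions, not taken from the repository).
DISCOVERY_SCRIPT=./discover_hosts.sh
TRAIN_SCRIPT=train.py

horovod_launch_command() {
  # Print the command instead of running it, so no Horovod install is needed.
  printf 'horovodrun --num-proc %s --start-timeout %s --host-discovery-script %s python %s\n' \
    "${NUM_PROC}" "${START_TIMEOUT}" "${DISCOVERY_SCRIPT}" "${TRAIN_SCRIPT}"
}

horovod_launch_command
```

In Horovod's elastic mode, `--host-discovery-script` points at an executable that reports the currently available hosts, and `--start-timeout` bounds how long the launcher waits for the required workers to start.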
You can also change the number of secondary nodes while training is running. Each new node is connected to and synchronized with the main training process. This is the key feature of Horovod in elastic mode.
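Horovod learns about nodes that join or leave through a host discovery script, which prints one `<hostname>:<slots>` line per available host. The sketch below shows the general shape of such a script. It is an assumption for illustration: the file name `hosts.txt`, the function name `discover_hosts`, and the default of one slot per host are hypothetical, and the example repository may discover its secondary nodes in a different way.

```shell
# Sketch only: prints one "<hostname>:<slots>" line per host listed in a file.
discover_hosts() {
  local hosts_file="${1:-hosts.txt}"  # hypothetical file, one hostname per line
  local slots="${2:-1}"               # assumed number of processes per host
  local host

  # No file yet means no hosts are available yet.
  [ -f "${hosts_file}" ] || return 0

  while IFS= read -r host || [ -n "${host}" ]; do
    # Skip blank lines and comment lines.
    case "${host}" in
      ''|'#'*) continue ;;
    esac
    printf '%s:%s\n' "${host}" "${slots}"
  done < "${hosts_file}"
}
```

Saved as an executable script that ends with a call to `discover_hosts`, this could be passed to `horovodrun` through `--host-discovery-script`. Adding or removing a line in the hosts file would then change the set of workers that Horovod sees the next time it runs the script.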