Horovod is a distributed deep learning training framework that makes distributed training faster and easier to set up.
Horovod is a good fit for training jobs that run on preemptible or spot instances, as it provides a dedicated elastic mode for such cases. Elastic training enables Horovod to scale the number of workers up and down dynamically at runtime.
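To illustrate what elastic mode looks like on the training side, here is a minimal sketch based on Horovod's elastic PyTorch API. The model, optimizer, and dataset setup are placeholders, not part of this example's actual training script; the key pieces are the `@hvd.elastic.run` decorator, which re-synchronizes workers after membership changes, and `state.commit()`, which checkpoints state that restored workers roll back to.

```python
import torch
import horovod.torch as hvd

hvd.init()

# Placeholder model and optimizer; the real training script defines its own.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

@hvd.elastic.run
def train(state):
    # Resume from the last committed epoch/batch after any worker change.
    for state.epoch in range(state.epoch, 10):
        # ... run forward/backward passes and optimizer steps here ...
        state.commit()  # checkpoint state so new/restored workers can sync

# Wrap everything that must stay consistent across workers in an elastic state.
state = hvd.elastic.TorchState(model, optimizer, epoch=0)
train(state)
```

When a worker joins or leaves, Horovod reinitializes communication and broadcasts the committed state to all workers, so training continues without a full restart.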
To run the main training node, execute the following command:
$ neuro-flow run main
This job will also wait up to 600 seconds for the secondary nodes to appear (see "--start-timeout 600" in the main job's bash section).
Launching secondary nodes
Run this command several times to spawn worker nodes:
$ neuro-flow run secondary
For this example, at least two workers are needed (see "--num-proc 2" in the main job's bash section). All of these worker nodes will be connected to the Horovod instance.
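For reference, the two flags mentioned above come from Horovod's elastic launcher. A main-job invocation might look like the following sketch; the script names `discover_hosts.sh` and `train.py` are placeholders, and the min/max worker bounds are illustrative:

```shell
# Elastic launch: 2 workers to start, tolerate 1..4 as nodes come and go.
# --host-discovery-script is polled by Horovod to find currently available hosts.
horovodrun --num-proc 2 --min-np 1 --max-np 4 \
  --host-discovery-script ./discover_hosts.sh \
  --start-timeout 600 \
  python train.py
```

The discovery script simply prints the currently reachable hosts (one per line), which is how Horovod learns that secondary nodes have appeared or disappeared.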
You can also change the number of secondary nodes during training. Each of them will connect and synchronize with the main training process; this is the key feature of Horovod's elastic mode.