When a task is created as Distributed, the following happens:
- Valohai creates and queues as many identical executions as described by the task.
- Each of these executions will wait for the rest the group to become online and waiting.
- After the whole group is ready, the members will share connection information between each other. This describes such things as local network addresses or how to connect to each other.
- This collective communication information is written to
/valohai/config/distributed.yamlon each worker.
- Project code is expected to read this information and establish connections as their tooling needs.
- Then each execution will keep on running until:
- somebody manually stops the execution or the task
- the execution code quits
- the task automatically stops the execution e.g. because of a task "On Error" rule like "if one worker gets errors, stop all workers"
The publicly available
valohai-utils Python package contains helpers under
valohai.distributed to use the shared information and to assign unique ranks to workers, but you are also free to parse and use the configuration files how you see fit.
One of the workers will be marked as the master for convenience if
valohai-utils distributed helpers are used but that abstraction is optional. It's simply the first worker that announced itself.
Each execution in a distributed task has their separate input downloading, output uploading and metadata recording like normal. It's common to gather and upload final results from one of the workers (like the master discussed earlier) but this depends on the workload being ran.