By default, a Valohai pipeline stops if any of its nodes errors.
If an execution inside a Task node errors, it will cause the whole node to error. You can change this behavior on Task nodes by defining a different on-error behavior.
The options are:
stop-all: This is the default behavior. If one execution in the Task node fails, the whole node is marked as errored and the pipeline is stopped.
continue: Continue executing the Task node, even if an execution inside the Task errors. The expectation is that at least one of the executions in the Task has been completed successfully.
stop-next: Stops only the nodes that follow the errored node.
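As a rough illustration of the three options above, the following Python sketch models what happens after a Task node finishes. It is a toy model, not Valohai's implementation: the function names `downstream_runs` and `pipeline_stops` are hypothetical, and treating `continue` with zero successful executions as a failure is an assumption based on the expectation stated above.

```python
from typing import List

# The three on-error values described above.
POLICIES = ("stop-all", "continue", "stop-next")


def downstream_runs(on_error: str, execution_ok: List[bool]) -> bool:
    """Toy model: do the nodes that follow this Task node still run?"""
    if on_error not in POLICIES:
        raise ValueError(f"unknown on-error value: {on_error!r}")
    if all(execution_ok):
        # No execution failed, so the on-error setting is never consulted.
        return True
    if on_error == "continue":
        # Assumption: at least one successful execution is required.
        return any(execution_ok)
    # stop-all and stop-next both prevent the following nodes from running.
    return False


def pipeline_stops(on_error: str, execution_ok: List[bool]) -> bool:
    """Toy model: is the whole pipeline stopped, including parallel branches?"""
    if on_error not in POLICIES:
        raise ValueError(f"unknown on-error value: {on_error!r}")
    return on_error == "stop-all" and not all(execution_ok)


if __name__ == "__main__":
    results = [True, False]  # two executions, one of which failed
    for policy in POLICIES:
        print(policy, downstream_runs(policy, results), pipeline_stops(policy, results))
```

In this model, a Task with one failed and one successful execution blocks the following nodes under `stop-next` and `stop-all`, lets them run under `continue`, and stops the whole pipeline only under `stop-all`.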
The example below shows a pipeline with two parallel task nodes.
- train is defined with on-error: stop-next
- train2 is defined with on-error: continue
Each of the task nodes runs 2 executions, and in each of them, one of the executions fails. Using the on-error rules defined in the valohai.yaml, the pipeline won't execute the evaluate node because the train node had one failed execution. But evaluate2 will be executed because the on-error of train2 is set to continue.
The valohai.yaml used for the pipeline looks like:
- pipeline:
    name: Training Pipeline
    nodes:
      - name: preprocess
        type: execution
        step: preprocess-dataset
      - name: train
        type: task
        on-error: stop-next
        step: train-model
        override:
          inputs:
            - name: dataset
      - name: evaluate
        type: execution
        step: batch-inference
      - name: train2
        type: task
        on-error: continue
        step: train-model
        override:
          inputs:
            - name: dataset
      - name: evaluate2
        type: execution
        step: batch-inference
    edges:
      - [preprocess.output.preprocessed_mnist.npz, train.input.dataset]
      - [preprocess.output.preprocessed_mnist.npz, train2.input.dataset]
      - [train.output.model*, evaluate.input.model]
      - [train2.output.model*, evaluate2.input.model]