By default, a Valohai pipeline stops if any of its nodes errors.
If an execution inside a Task node errors, the whole node is marked as errored. You can change this behavior on Task nodes by defining a different on-error behavior.
The options are:
stop-all
: This is the default behavior. If one execution in the Task node fails, the whole node is marked as errored and the pipeline is stopped.
continue
: Continue executing the Task node, even if an execution inside the Task errors. The expectation is that at least one of the executions in the Task completes successfully.
stop-next
: Stops only the nodes that follow the errored node.
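As a minimal sketch of where the setting goes, the fragment below sets on-error on a single Task node. It reuses the train node and the train-model step from the example later in this article. The default value is written out explicitly here only for illustration, and the fragment is not a complete pipeline definition (the edges and the other nodes are left out).

```yaml
# Fragment of the nodes list in a pipeline definition (illustrative only).
nodes:
  - name: train
    type: task
    # One of: stop-all (default), continue, stop-next
    on-error: stop-all
    step: train-model
```

Replacing stop-all with continue or stop-next switches the node to the corresponding behavior described above.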
On-error example
The example below shows a pipeline with two parallel Task nodes.
- train is defined with on-error: stop-next
- train2 is defined with on-error: continue
Each of the Task nodes runs two executions, and in each node one of the executions fails. With the on-error rules defined in the valohai.yaml, the pipeline won't execute the evaluate node, because the train node had one failed execution. But evaluate2 will be executed, because the on-error of train2 is set to continue.
The valohai.yaml used for the pipeline looks like this:
- pipeline:
    name: Training Pipeline
    nodes:
      - name: preprocess
        type: execution
        step: preprocess-dataset
      - name: train
        type: task
        on-error: stop-next
        step: train-model
        override:
          inputs:
            - name: dataset
      - name: evaluate
        type: execution
        step: batch-inference
      - name: train2
        type: task
        on-error: continue
        step: train-model
        override:
          inputs:
            - name: dataset
      - name: evaluate2
        type: execution
        step: batch-inference
    edges:
      - [preprocess.output.preprocessed_mnist.npz, train.input.dataset]
      - [preprocess.output.preprocessed_mnist.npz, train2.input.dataset]
      - [train.output.model*, evaluate.input.model]
      - [train2.output.model*, evaluate2.input.model]