Spot Instances/VMs work like any other execution environment in Valohai. You can choose a spot instance machine type from the drop-down menu.
What are spot instances?
Spot instances are unused virtual machines that cloud providers offer at a cheaper price. For example, AWS doesn't always use 100% of its p2.xlarge capacity, so it offers the unused instances at a discounted price.
Spot instances can be a cost-effective choice for running workloads that can be interrupted.
The downside of spot instances is that your cloud provider might decide that it requires the spot instance for something else and interrupt your job. In this case your code receives a KeyboardInterrupt and Valohai stops running the job, because the machine is being reclaimed by the provider.
Choosing a spot instance type on Valohai
You can run any workload on a spot instance. Just select the right environment when launching your execution from the UI, API, or CLI.
- You can choose to show only spot instance types.
- The "slug" of the environment. This is used to specify the environment when running from the CLI (vh exec run train --adhoc --environment aws-eu-west-1-t3-medium-spot) or from the API ("environment": "aws-eu-west-1-t3-medium-spot").
- Auto restart will queue the job again after it's interrupted. You'll be able to access the outputs of the interrupted execution by using the _restart input that is added automatically to the new job.
Once you launch a job using a spot instance environment:
- Valohai schedules the job and tries to get a spot instance. If one is not immediately available, it keeps trying until it gets one.
- Your job might run to completion, or it might be interrupted by the cloud provider saying "this instance isn't available any more for this price".
- In this case your execution gets a notification (a KeyboardInterrupt that you can react to in your code), and you then have a couple of minutes to wrap up your process before the machine is terminated by the provider.
- In your code, you can expect a KeyboardInterrupt error as a signal that your job is stopping.
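The interruption flow above can be sketched as follows. This is a minimal illustration, not Valohai's own code: the function name run_interruptible, the step_fn callback, and the JSON state file are all hypothetical, and a real training loop would save model weights instead of a step counter.

```python
import json


def run_interruptible(total_steps, step_fn, state_path):
    """Run step_fn for each step; on KeyboardInterrupt, record progress and stop.

    Returns True if all steps finished, False if the run was interrupted.
    """
    completed = 0
    try:
        for step in range(total_steps):
            step_fn(step)
            completed = step + 1
    except KeyboardInterrupt:
        # The spot interruption notice arrives as a KeyboardInterrupt.
        # Keep this handler short: only a couple of minutes remain.
        with open(state_path, "w") as f:
            json.dump({"step": completed}, f)
        return False
    return True
```

The handler only writes a small state file because the remaining time is limited; anything large is better saved during the normal run.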
We strongly suggest that you use Live Outputs to save your checkpoints and other files as soon as you create them. When a spot instance gets interrupted, you might not have enough time to upload a large number of files to the cloud, so it's better to upload them during the normal execution run.
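A minimal sketch of saving a checkpoint as a live output is shown below. It rests on two assumptions that you should verify against the Valohai Live Outputs documentation: that the outputs directory is available through the VH_OUTPUTS_DIR environment variable (falling back to /valohai/outputs), and that marking a finished file as read-only signals that it is ready to upload. The helper name save_live_output is hypothetical.

```python
import os
import stat


def save_live_output(filename, data, outputs_dir=None):
    """Write bytes to the outputs directory and mark the file read-only."""
    if outputs_dir is None:
        # Assumption: Valohai exposes the outputs directory in VH_OUTPUTS_DIR.
        outputs_dir = os.environ.get("VH_OUTPUTS_DIR", "/valohai/outputs")
    os.makedirs(outputs_dir, exist_ok=True)
    path = os.path.join(outputs_dir, filename)
    with open(path, "wb") as f:
        f.write(data)
    # Assumption: a read-only file is treated as finished and uploaded early.
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return path
```

Because the file becomes read-only, each checkpoint needs its own filename (for example, one that includes the step number) rather than overwriting a single file.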
Automatic Restart on Spot Instance Interruption
You can configure each execution to automatically restart if it gets interrupted by a cloud provider. Valohai then requeues the job and continues it when another spot instance becomes available.
- During your execution make sure you upload your checkpoints and files as Live Outputs.
- When a job gets interrupted, you'll receive a KeyboardInterrupt and you'll see a message in the logs: "Spot instance interruption found. Wrapping up here!"
- Valohai will stop the existing job and queue a new job.
- The new job will have a special input called _restart. All of the outputs of the previous execution will be available in this input directory. You can access them like any other inputs in Valohai to choose your latest checkpoint and continue your work.
- Note: When a spot instance gets terminated, the disk is also removed. A restarted execution starts off "fresh", so it is up to your code to check if there are any checkpoints in the _restart input and pick up from where it left off.
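As an illustration of the check described above, the sketch below looks for the newest checkpoint in the _restart input at startup. The checkpoint-<step>.json naming scheme and the function name are hypothetical, and the location of the input (a _restart folder under the directory given by VH_INPUTS_DIR, falling back to /valohai/inputs) is an assumption to confirm in your own execution.

```python
import os
import re

# Hypothetical naming scheme: checkpoint-<step>.json
CHECKPOINT_RE = re.compile(r"checkpoint-(\d+)\.json$")


def find_latest_checkpoint(restart_dir=None):
    """Return the path of the highest-numbered checkpoint, or None if absent."""
    if restart_dir is None:
        # Assumption: inputs are mounted under VH_INPUTS_DIR/<input name>.
        inputs_dir = os.environ.get("VH_INPUTS_DIR", "/valohai/inputs")
        restart_dir = os.path.join(inputs_dir, "_restart")
    best_step, best_path = -1, None
    # os.walk yields nothing if the directory does not exist (a fresh run).
    for root, _dirs, files in os.walk(restart_dir):
        for name in files:
            match = CHECKPOINT_RE.match(name)
            if match and int(match.group(1)) > best_step:
                best_step = int(match.group(1))
                best_path = os.path.join(root, name)
    return best_path
```

A return value of None means there is nothing to resume from, so the code starts from the beginning. The step number is read from the filename rather than the file's modification time, since timestamps of downloaded inputs may not reflect the original write order.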
Testing the automatic restarting
To test that your code works in case the spot instance is interrupted you can use the tools from your cloud provider.
- In AWS you can use the Fault Injection Simulator (FIS) to create an experiment that will stop your spot instance.
- In GCP you can simulate a host maintenance event:
gcloud auth login
gcloud compute instances simulate-maintenance-event <MACHINE-ID> --zone <ZONE>
Spot instance pricing and quotas
AWS
Spot Instance pricing on AWS is dynamic and adjusts based on supply and demand for Spot Instance capacity. Each AWS environment in Valohai has a "max price" setting, which is the maximum hourly rate you're willing to pay. By default this is set to the on-demand instance price.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html
AWS also limits how many running and requested spot instances there can be per account in one Region. See the AWS documentation for more information and for instructions on how to increase the limits if needed.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-limits.html
Google Cloud
Google Cloud Platform has a fixed price for Spot VMs, and that price changes no more than once a month.
https://cloud.google.com/spot-vms
As with any other VMs in GCP, you will need CPU, disk, and GPU quota for spot instances. It is recommended to request preemptible quota for spot instances to avoid consuming your standard quotas.
https://cloud.google.com/compute/docs/instances/spot#quotas
Microsoft Azure
Pricing on Azure Spot Virtual Machines is variable, based on the region and machine type. Microsoft has made pricing available on their website.
https://docs.microsoft.com/en-us/azure/virtual-machines/spot-vms
Azure has separate vCPU quotas for Spot and Standard Virtual Machines.
https://docs.microsoft.com/en-us/azure/quotas/per-vm-quota-requests