Valohai executions are kept in the queued state until a machine becomes available to run the job. It's not uncommon to wait for 3-5 minutes for your cloud provider (AWS, GCP, Azure) to provision a new machine for execution.
An execution can stay in the queued state for longer than that for any of the following reasons:
- Your organization's quota for that machine type is full and your cloud provider doesn't allow us to create any more parallel machines.
- Your Valohai administrator has defined a per-user quota that limits how many machines each user can run in parallel.
- The selected spot instance type is not available right now.
- The cloud provider is having availability issues with a specific machine or GPU type.
As soon as one machine frees up, the job will be picked up.
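The reasons listed above can be read as a series of checks that must all pass before a machine can be started for a queued job. The following is only a minimal sketch of that reasoning in Python. It is not Valohai's scheduler or API; the `EnvironmentState` class, its fields, and the `queue_reason` function are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EnvironmentState:
    """Hypothetical snapshot of one environment (not a Valohai API object)."""
    org_quota: int            # max parallel machines the cloud provider allows
    org_running: int          # machines currently running for the organization
    per_user_quota: int       # max parallel machines per user, set by the admin
    user_running: int         # machines currently running for this user
    instance_available: bool  # False if the spot or GPU type cannot be provisioned


def queue_reason(state: EnvironmentState) -> Optional[str]:
    """Return why a job would stay queued, or None if a machine can be started."""
    if state.org_running >= state.org_quota:
        return "organization quota for this machine type is full"
    if state.user_running >= state.per_user_quota:
        return "per-user quota is full"
    if not state.instance_available:
        return "machine type is not available from the cloud provider right now"
    return None


# Example: the user already runs the maximum number of machines allowed.
state = EnvironmentState(org_quota=10, org_running=4,
                         per_user_quota=2, user_running=2,
                         instance_available=True)
print(queue_reason(state))

# Once one of the user's machines frees up, nothing blocks the job anymore.
state.user_running -= 1
print(queue_reason(state))
```

In this simplified model, spot unavailability and provider-side availability issues are both folded into the single `instance_available` flag, since from the user's point of view the effect is the same: the job waits until a machine can be provisioned.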
Your organization admin can go to the Manage organization / Environments page and open the status tab of the relevant environment to see any scaling errors reported by your cloud provider.