Most of your Valohai machine learning jobs have a set of input files that are downloaded from cloud object storage such as AWS S3, Azure Blob Storage, or Google Cloud Storage. This data is downloaded to each of the virtual machines that run your jobs.
- By default, each Valohai worker (virtual machine) will have its own cache where the downloaded data is stored.
- When the machine is no longer used (after a configurable grace period) it gets scaled down, and with it, the local cache gets removed.
- The next time a machine gets scaled up it will download the input files to its own cache.
Valohai also lets you set up a shared network cache between several worker machines.
In this case, the input data is stored on an NFS or SMB network mount, and the workers fetch it from there instead of re-downloading it from cloud storage every time.
This is useful, for example, when:
- You have large datasets (50GB+) that you access often from different workers.
- You’re running Valohai Tasks where you have multiple parallel (GPU) instances that download the same dataset from cloud object storage.
- You have terabytes of data that take a long time to download from your object storage.
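The difference between per-worker caches and a shared network cache can be illustrated with a small sketch. This is not Valohai's implementation; it is a minimal model, assuming a cache is just a directory and a download is an arbitrary callable. The names `fetch_input`, `simulate`, and the key `datasets/train.csv` are hypothetical and exist only for this illustration.

```python
import tempfile
import uuid
from pathlib import Path
from typing import Callable


def fetch_input(key: str, cache_dir: Path, download: Callable[[str, Path], None]) -> Path:
    """Return a path to the input file, downloading only on a cache miss."""
    cached = cache_dir / key
    if cached.exists():
        return cached  # cache hit: nothing is fetched from cloud storage
    cached.parent.mkdir(parents=True, exist_ok=True)
    # Write to a unique temporary name, then rename, so that other workers
    # reading the same directory never see a partially written file.
    tmp = cached.with_name(f"{cached.name}.{uuid.uuid4().hex}.part")
    download(key, tmp)
    tmp.replace(cached)
    return cached


def simulate(shared: bool) -> int:
    """Count cloud downloads when two workers request the same input file."""
    downloads = []

    def fake_download(key: str, dest: Path) -> None:
        # Stand-in for a real object storage download (hypothetical).
        downloads.append(key)
        dest.write_bytes(b"example data")

    # A temporary directory stands in for the NFS/SMB mount or the local disk.
    shared_dir = Path(tempfile.mkdtemp())
    for _worker in range(2):
        cache_dir = shared_dir if shared else Path(tempfile.mkdtemp())
        fetch_input("datasets/train.csv", cache_dir, fake_download)
    return len(downloads)


if __name__ == "__main__":
    print("downloads with per-worker caches:", simulate(shared=False))
    print("downloads with a shared cache:", simulate(shared=True))
```

In this model, two workers with their own caches each download the file once, while two workers pointing at the same directory download it only once in total. The sketch runs the two workers one after the other, so two workers that miss the cache at the same moment would still both download; on a real network mount, read throughput and locking behavior would also affect the actual benefit.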