- Valohai downloads data files from your private cloud storage. Data can come from, for example, AWS S3, Azure Storage, GCP Cloud Storage, or a public source (HTTP/HTTPS).
- Valohai handles authentication with your cloud storage as well as downloading, uploading, and caching data.
- This means that you don’t need to manage keys, handle authentication, or use tools like boto3, gsutil, or BlobClient in your code. Instead, you can always treat the data as local files.
- All Valohai machines have a local directory /valohai/inputs/ where all your datasets are downloaded. Each input gets its own subdirectory, for example /valohai/inputs/images/ and /valohai/inputs/model/.
- Each step in your valohai.yaml can contain one or multiple input definitions, and each input can contain one or multiple files. For example, in a batch inference step you could have a trained model file and a set of images you want to run the inference on (see the sketch after this list).
- Each input in valohai.yaml can have a default value. These values can be overridden any time you run a new execution, for example to change the set of images you want to run batch inference on.
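For illustration, a batch inference step with two inputs could be defined roughly like this. The step name, image, bucket paths, and the list-style default for the images input are placeholder assumptions, not part of the original example:

- step:
    name: batch-inference
    image: tensorflow/tensorflow:2.6.1-gpu
    command: python predict.py
    inputs:
      - name: model
        default: s3://mybucket/models/model.pkl
      - name: images
        default:
          - s3://mybucket/images/001.jpg
          - s3://mybucket/images/002.jpg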
Read files from /valohai/inputs/
Start by configuring the inputs for your step in valohai.yaml and updating your code to read the data from Valohai inputs rather than directly from your cloud storage.
import os
import pandas as pd

# Get the location of the Valohai inputs directory
VH_INPUTS_DIR = os.getenv('VH_INPUTS_DIR', '.inputs')

# Build the path to an individual input file
# e.g. /valohai/inputs/<input-name>/<filename.ext>
path_to_file = os.path.join(VH_INPUTS_DIR, 'myinput/mydata.csv')
df = pd.read_csv(path_to_file)
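Because a single input can hold several files, you can also iterate over everything Valohai downloaded for it. A minimal sketch, assuming an input named images (the input name is a placeholder):

import os

VH_INPUTS_DIR = os.getenv('VH_INPUTS_DIR', '.inputs')

# Every file of the "images" input lands in that input's own directory,
# e.g. /valohai/inputs/images/
images_dir = os.path.join(VH_INPUTS_DIR, 'images')
for filename in os.listdir(images_dir):
    file_path = os.path.join(images_dir, filename)
    print(f'Found input file: {file_path}')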
Create a valohai.yaml configuration file and define your step in it:
- step:
    name: train
    image: tensorflow/tensorflow:2.6.1-gpu
    command: python myfile.py
    inputs:
      - name: myinput
        default: s3://bucket/mydata.csv
Access data from databases and data warehouses
You can also query data from BigQuery, MongoDB, Redshift, Snowflake, Bigtable, and other databases and data warehouses. These sources are not accessed through Valohai inputs; instead, run your existing query code on Valohai to fetch data from them.
As database contents change over time, we recommend saving the query results as a file in Valohai outputs. This gives you a snapshot of the query result, so you can later reproduce your jobs with exactly the same data (see the sketch below).
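As an illustration only, such a snapshot could be written roughly like this. The DataFrame here stands in for whatever your own database client returns; the column names and file name are made-up placeholders:

import os

import pandas as pd

# Placeholder result: in practice this comes from your own query code
# (e.g. a Snowflake, Redshift, or BigQuery client returning a DataFrame)
df = pd.DataFrame({'user_id': [1, 2], 'events': [10, 3]})

# Write the result under the Valohai outputs directory so the execution
# keeps a versioned snapshot of the query result
VH_OUTPUTS_DIR = os.getenv('VH_OUTPUTS_DIR', '.outputs')
os.makedirs(VH_OUTPUTS_DIR, exist_ok=True)
df.to_csv(os.path.join(VH_OUTPUTS_DIR, 'query_snapshot.csv'), index=False)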
Save files to /valohai/outputs/
- Any file(s) that you want to save, version, track, and access after the execution should be saved as Valohai outputs.
- Valohai will upload all files to your private cloud storage and version those files.
- Each output will be available under the execution’s Outputs tab and in the project’s Data tab. From there you can download the file or copy the link to it.
- When creating another execution, you can pass in the datum:// address of an output file, or use a cloud-specific address (e.g. s3://, gs://, azure://).
import os

# Get the location of the Valohai outputs directory
outputs_dir = os.getenv("VH_OUTPUTS_DIR")
# Anything saved under /valohai/outputs/ is uploaded to your cloud storage and versioned
save_path = os.path.join(outputs_dir, "mymodel.pkl")
model.save_model(save_path)
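In a later step, that uploaded file could then be used as an input by its datum:// address, for example like this (the step definition and datum ID below are made-up placeholders):

- step:
    name: batch-inference
    image: tensorflow/tensorflow:2.6.1-gpu
    command: python predict.py
    inputs:
      - name: model
        default: datum://01234567-89ab-cdef-0123-456789abcdef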