- Valohai downloads data files from your private cloud storage. Data can come, for example, from AWS S3, Azure Storage, GCP Cloud Storage, or a public source (HTTP/HTTPS).
- Valohai handles authentication with your cloud storage as well as downloading, uploading, and caching data.
- This means that you don’t need to manage keys or authentication, or use tools like boto3, gsutil, or BlobClient in your code. Instead, you can always treat the data as local files.
- All Valohai machines have a local directory /valohai/inputs/ where all your datasets are downloaded. Each input gets its own subdirectory, for example /valohai/inputs/images/ and /valohai/inputs/model/.
- Each step in your valohai.yaml can contain one or multiple input definitions, and each input can contain one or multiple files. For example, a batch inference step could have a trained model file and a set of images you want to run the inference on (see the sketch after this list).
- Each input in valohai.yaml can have a default value. These defaults can be overridden any time you run a new execution, for example to change the set of images you want to run batch inference on.
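As a sketch of what such a multi-input step could look like in valohai.yaml (the step name, input names, and default URLs here are only illustrative, and the wildcard default assumes your storage supports listing the matching objects):
- step:
    name: batch-inference
    image: tensorflow/tensorflow:2.6.1-gpu
    command: python predict.py
    inputs:
      - name: model
        default: s3://bucket/model.h5
      - name: images
        default: s3://bucket/images/*.jpg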
Read files from /valohai/inputs/
Start by configuring inputs and updating your code to read the data from Valohai inputs, rather than directly from your cloud storage or local files.
Using Python and the valohai-utils toolkit
import csv

import valohai

# Define inputs available for this step and their default location
# The default location can be overridden when you create a new execution (UI, API or CLI)
my_data = {
    'myinput': 's3://bucket/mydata.csv'
}

# Create a step 'train' with a set of inputs
valohai.prepare(step="train", default_inputs=my_data)

# Open the CSV file from Valohai inputs
with open(valohai.inputs("myinput").path()) as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
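If an input contains several files (for example a set of images), you can iterate over all of them. A minimal sketch, assuming valohai-utils exposes a paths() helper and an input named images:
import valohai

# Iterate over every file downloaded for the 'images' input
# (assumes a multi-file input and the valohai-utils paths() helper)
for image_path in valohai.inputs("images").paths():
    print(image_path)  # e.g. /valohai/inputs/images/<filename.ext>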
Generate or update your existing YAML file by running:
vh yaml step myfile.py
Using Python
Start by configuring inputs to your step in valohai.yaml and updating your code to read the data from Valohai inputs, rather than directly from your cloud storage.
import os

import pandas as pd

# Get the location of the Valohai inputs directory
VH_INPUTS_DIR = os.getenv('VH_INPUTS_DIR', '.inputs')

# Get the path to your individual input file
# e.g. /valohai/inputs/<input-name>/<filename.ext>
path_to_file = os.path.join(VH_INPUTS_DIR, 'myinput/mydata.csv')
df = pd.read_csv(path_to_file)
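If an input contains several files, you can list everything downloaded into that input’s directory. A minimal sketch using only the standard library, assuming an input named images:
import os

VH_INPUTS_DIR = os.getenv('VH_INPUTS_DIR', '.inputs')

# Every file of the 'images' input is downloaded into the same directory
images_dir = os.path.join(VH_INPUTS_DIR, 'images')
for filename in os.listdir(images_dir):
    file_path = os.path.join(images_dir, filename)
    print(file_path)  # e.g. /valohai/inputs/images/<filename.ext>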
Create a valohai.yaml configuration file and define your step in it:
- step:
    name: train
    image: tensorflow/tensorflow:2.6.1-gpu
    command: python myfile.py
    inputs:
      - name: myinput
        default: s3://bucket/mydata.csv
Using R
Start by configuring inputs to your step in valohai.yaml and updating your code to read the data from Valohai inputs, rather than directly from your cloud storage.
# Get the location of Valohai inputs directory
vh_inputs_dir <- Sys.getenv("VH_INPUTS_DIR", unset = ".inputs")
# Get the path to your individual inputs file
# e.g. /valohai/inputs/<input-name>/<filename.ext>
path_to_file <- file.path(vh_inputs_dir, "myinput/mydata.csv")
import_df <- read.csv(path_to_file, stringsAsFactors = FALSE)
Create a valohai.yaml configuration file and define your step in it:
- step:
    name: train
    image: rocker/r-base  # any Docker image with R installed
    command: Rscript myfile.R
    inputs:
      - name: myinput
        default: s3://bucket/mydata.csv
Access data from databases and data warehouses
You can also query data from sources like BigQuery, MongoDB, RedShift, Snowflake, BigTable, and other databases and data warehouses. These are not accessed through Valohai inputs; instead, run your existing query code on Valohai to fetch data from these sources.
As database contents change over time, we recommend saving the query results as a file in Valohai outputs. This gives you a snapshot of the query result, so you can later reproduce your jobs with exactly the same data.
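A minimal sketch of this pattern, assuming a SQLAlchemy connection string in a DATABASE_URL environment variable and a hypothetical daily_events table; the query and the connector library are up to you:
import os

import pandas as pd
import sqlalchemy

# Connect to your database (the connection string here is only a placeholder)
engine = sqlalchemy.create_engine(os.environ["DATABASE_URL"])

# Run the query and load the result into a DataFrame
df = pd.read_sql("SELECT * FROM daily_events", engine)

# Save a snapshot of the query result as a Valohai output,
# so the job can be reproduced later with the exact same data
df.to_csv("/valohai/outputs/daily_events.csv", index=False)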
Save files to /valohai/outputs/
- Any file(s) that you want to save, version, track, and access after the execution should be saved as Valohai outputs.
- Valohai uploads all output files to your private cloud storage and versions them.
- Each output is available under the execution’s Outputs tab and in the project’s Data tab. From there you can download the file or copy a link to it.
- When creating another execution you can pass in the datum:// address of an output file, or use a cloud-specific address (e.g. s3://, gs://, azure://); see the example at the end of this section.
An example of how to save a plot figure to Valohai outputs and have it uploaded to your cloud storage:
Using Python and the valohai-utils toolkit
plt.savefig(valohai.outputs().path("plot.png"))
Using Python
plt.savefig("/valohai/outputs/plot.png")
Using R
# Get the location of Valohai outputs directory
vh_outputs_path <- Sys.getenv("VH_OUTPUTS_DIR", unset = ".outputs")
# Define a filepath in Valohai outputs directory
# e.g. /valohai/outputs/<filename.ext>
out_path <- file.path(vh_outputs_path, "mydata.csv")
write.csv(output, file = out_path)
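Once a file has been uploaded as an output, it gets a datum:// address that you can pass to a later execution as an input. A sketch of what that could look like in valohai.yaml (the step name, script, and datum ID are only placeholders):
- step:
    name: evaluate
    image: tensorflow/tensorflow:2.6.1-gpu
    command: python evaluate.py
    inputs:
      - name: model
        default: datum://<datum-id>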