Valohai inputs are data files that are fetched from your private cloud storage and made available during an execution. Data can come from, for example, AWS S3, Azure Blob Storage, Google Cloud Storage, or a public source (HTTP/HTTPS).
Valohai handles authentication with your cloud storage as well as downloading, uploading, and caching data. This means you don’t need to manage keys or authentication, or use tools like boto3, gsutil, or BlobClient in your code. Instead, you can always treat the data as local files.
Where are the input files saved?
All Valohai machines have a local directory (e.g. /valohai/inputs/) where all your datasets are downloaded to. Each input will have its own directory, for example /valohai/inputs/images/ and /valohai/inputs/model/.
Each Valohai execution has an environment variable VH_INPUTS_DIR that stores the location of the inputs directory on that worker.
How are inputs defined?
Each step in your valohai.yaml can contain one or multiple input definitions, and each input can contain one or multiple files. For example, in a batch inference step, you could have a trained model file and a set of images you want to run the inference on.
Each input in valohai.yaml can have a default value. These defaults can be overridden any time you run a new execution, for example to change the set of images you want to run batch inference on.
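A minimal sketch of such a batch inference step in valohai.yaml could look like the following; the step name, script, bucket, and input names are hypothetical examples:
- step:
    name: batch-inference
    image: tensorflow/tensorflow:2.6.1-gpu
    command: python predict.py
    inputs:
      # a single trained model file
      - name: model
        default: s3://my-bucket/models/model.h5
      # the images to run inference on (the wildcard syntax is covered later in this article)
      - name: images
        default: s3://my-bucket/dataset/images/*.jpg
Creating a new execution from the UI, CLI, or API lets you point either input at a different set of files without editing the YAML.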
How can I add inputs from my private cloud data store?
You can connect private data stores to your Valohai projects.
If a connected store contains files that Valohai doesn’t know about, such as files you have uploaded there yourself, you can use the following syntax to refer to them:
- Azure Blob Storage: azure://{account_name}/{container_name}/{blob_name}
- Google Cloud Storage: gs://{bucket}/{key}
- Amazon S3: s3://{bucket}/{key}
- OpenStack Swift: swift://{project}/{container}/{key}
Wildcards are also supported for downloading multiple files:
- s3://my-bucket/dataset/images/*.jpg for all .jpg (JPEG) files in that directory
- s3://my-bucket/dataset/image-sets/**.jpg for all .jpg (JPEG) files, recursing into subdirectories
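As a sketch, any of these URIs, wildcards included, can be used as an input default in a step’s inputs section of valohai.yaml; the input name and bucket below are placeholders:
    inputs:
      - name: images
        # downloads every matching .jpg as part of the 'images' input
        default: s3://my-bucket/dataset/images/*.jpg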
You can also interpolate execution parameters into input URIs: s3://my-bucket/dataset/images/{parameter:user-id}/*.jpeg would replace {parameter:user-id} with the value of the parameter user-id during an execution.
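A sketch of how this could be wired together in valohai.yaml, assuming a string parameter named user-id (all names, paths, and defaults here are examples only):
- step:
    name: batch-inference
    image: tensorflow/tensorflow:2.6.1-gpu
    command: python predict.py
    parameters:
      # the value of this parameter is substituted into the input URI below
      - name: user-id
        type: string
        default: example-user
    inputs:
      - name: images
        default: s3://my-bucket/dataset/images/{parameter:user-id}/*.jpeg
Changing the user-id parameter for a new execution changes the interpolated URI, and therefore which files are downloaded.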
Each file that you output from an execution will be uploaded to your private data store and receive a datum link that you can use as an input in another execution.
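For example, an input default (or an execution-time override) could point at such an output through its datum link; the identifier below is only a placeholder:
    inputs:
      - name: model
        # reuse a file produced by an earlier execution
        default: datum://<datum-id>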
Define an input in valohai-utils
import csv

import valohai

# Define the inputs available for this step and their default location
# The default location can be overridden when you create a new execution (UI, API or CLI)
default_inputs = {
    'myinput': 's3://bucket/mydata.csv'
}

# Create a step 'train' in valohai.yaml with a set of inputs
valohai.prepare(step="train", image="tensorflow/tensorflow:2.6.1-gpu", default_inputs=default_inputs)

# Open the CSV file from Valohai inputs
with open(valohai.inputs("myinput").path()) as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
Generate or update your existing valohai.yaml file with:
vh yaml step myfile.py
Define an input in Python without valohai-utils
import os
import pandas as pd

# Get the location of the Valohai inputs directory
VH_INPUTS_DIR = os.getenv('VH_INPUTS_DIR', '.inputs')

# Get the path to your individual input file
# e.g. /valohai/inputs/<input-name>/<filename.ext>
path_to_file = os.path.join(VH_INPUTS_DIR, 'myinput/mydata.csv')

df = pd.read_csv(path_to_file)
Create a valohai.yaml configuration file and define your step in it:
- step:
    name: train
    image: tensorflow/tensorflow:2.6.1-gpu
    command: python myfile.py
    inputs:
      - name: myinput
        default: s3://bucket/mydata.csv
Define an input in R
# Get the location of Valohai inputs directory
vh_inputs_dir <- Sys.getenv("VH_INPUTS_DIR", unset = ".inputs")
# Get the path to your individual inputs file
# e.g. /valohai/inputs/<input-name>/<filename.ext>
path_to_file <- file.path(vh_inputs_dir, "myinput/mydata.csv")
import_df <- read.csv(path_to_file, stringsAsFactors = F)
Create a valohai.yaml configuration file and define your step in it:
- step:
    name: train
    image: r-base:latest
    command: Rscript myfile.R
    inputs:
      - name: myinput
        default: s3://bucket/mydata.csv