Recap on datums (single files)
Valohai uses datum identifiers to track the files that have been saved as execution outputs. With the datum://<datum-id> syntax, you can use these identifiers to define inputs for other executions in the same project or in a project that shares a data store. You can also set aliases for datums to easily access the latest version of your model without needing to change your code.
Files uploaded to Valohai via the UI also get a datum identifier, just like files saved as execution outputs. For files in AWS S3 storage, you can also use datum adoption to create the identifier without having to separately upload the files or run them through an execution.
Working with datasets
Datum URLs and datum aliases are useful when you have just one file. However, if you have, for example, extracted and preprocessed a larger set of data and would like to use it as an input in another execution, listing every datum URL is not very convenient. Moreover, if part of the data changes later, you will need to remember to update the inputs with the new datum URLs.
Datasets are a Valohai feature that makes it easier to work with and track collections of files. You can use them as inputs in your executions, create new versions and aliases, and keep track of changes made to the dataset.
Creating a dataset
Web UI
To create a dataset, follow the steps below.
- Go to the Data tab of your project
- Select the Dataset tab
- Click on the Create dataset button
- Choose a Name and Owner for the dataset
- To share the dataset with your team, mark your organization as the owner.
- Click on the Create button
Creating a dataset version
Web UI
To add data to your dataset, you will need to create a new version.
- Click on the dataset name
- Click on the Create new version button
- Choose one or more datums to add to the dataset.
- You can filter the list by, for example, filename, tags or data store.
- Click on the Add or Add Selected button
- Give a name to the dataset version
- Click on the Save new version button
- Until you click the save button, you can freely add datums to and remove datums from the dataset
Note that it is not possible to edit dataset versions after creation. However, it is possible to use an existing dataset version as the base for a new one. Just click on the three dots in the Dataset Versions table and choose "Create new version from this version". You can both add and remove files from the new version.
Programmatically
You can also create a new dataset version from your execution outputs by adding a .metadata.json sidecar file to them.
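import json

import valohai

# Sidecar metadata: add this output to the given dataset version
metadata = {
    "valohai.dataset-versions": ["dataset://<dataset-name>/<dataset-version-name>"]
}

# `model` is assumed to be a trained model object defined earlier in the script
save_path = valohai.outputs().path('model.h5')
model.save(save_path)

# The sidecar file is named after the output file, with .metadata.json appended
metadata_path = valohai.outputs().path('model.h5.metadata.json')
with open(metadata_path, 'w') as outfile:
    json.dump(metadata, outfile)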
If the dataset, here defined as <dataset-name>, does not exist, a new one will be created.
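The sidecar pattern above can be separated from the model-saving code. The sketch below is a minimal illustration that works on any file path; `write_dataset_sidecar` is a hypothetical helper name, not part of the Valohai tooling.

```python
import json
from pathlib import Path


def write_dataset_sidecar(output_file, dataset_version_uri):
    """Write `<output_file>.metadata.json` next to an already saved output file.

    Hypothetical helper for illustration; not part of valohai-utils.
    """
    output_file = Path(output_file)
    # model.h5 -> model.h5.metadata.json (the full filename is kept)
    sidecar = output_file.with_name(output_file.name + ".metadata.json")
    metadata = {"valohai.dataset-versions": [dataset_version_uri]}
    sidecar.write_text(json.dumps(metadata))
    return sidecar
```

Inside an execution, the helper would be called with the path returned by valohai.outputs().path('model.h5') and a dataset://<dataset-name>/<dataset-version-name> URI.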
As in the UI, you can also create new versions based on existing ones programmatically with the .metadata.json sidecar file.
To add files to an existing dataset, use the following metadata definition:
metadata = {
    "valohai.dataset-versions": [{
        'uri': "dataset://<dataset-name>/<new-dataset-version-name>",
        'from': "dataset://<dataset-name>/<original-dataset-version-name>",
    }]
}
To create a new dataset version by removing files from an existing one, use the syntax shown below.
metadata = {
    "valohai.dataset-versions": [{
        'uri': "dataset://<dataset-name>/<new-dataset-version-name>",
        'from': "dataset://<dataset-name>/<original-dataset-version-name>",
        'start_fresh': False,
        'exclude': ['exclude1.csv', 'exclude2.csv']
    }]
}
- You should give the filenames of the datums to be excluded in a list.
- start_fresh: false means that all the datums except for those listed in exclude will be included in the new dataset version.
- start_fresh: true will exclude all the files from the original dataset. The advantage is that you can then set the original dataset as the previous version of the new dataset without having to include any of its datums.
- This implies that a dataset version can have several Next Versions.
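The options above can be assembled with a small function. This is only a sketch: `derived_version_metadata` is a hypothetical name, and rejecting the combination of start_fresh with exclude is a choice made here because the text does not describe that combination.

```python
def derived_version_metadata(dataset, new_version, base_version,
                             exclude=(), start_fresh=False):
    """Build sidecar metadata for a dataset version derived from another one.

    Hypothetical helper; the keys mirror the metadata definitions shown above.
    """
    if start_fresh and exclude:
        # Assumption: this combination is not covered in the text, so refuse it.
        raise ValueError("use either start_fresh=True or exclude, not both")

    entry = {
        "uri": f"dataset://{dataset}/{new_version}",
        "from": f"dataset://{dataset}/{base_version}",
    }
    if start_fresh:
        # Keep the lineage to the base version but carry over none of its files.
        entry["start_fresh"] = True
    elif exclude:
        # Carry over everything except the listed filenames.
        entry["start_fresh"] = False
        entry["exclude"] = list(exclude)
    return {"valohai.dataset-versions": [entry]}
```

The returned dictionary is what would be written with json.dump into the .metadata.json sidecar file of an output.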
By using the API
Check the API basics documentation to learn how to work with the Valohai APIs.
You can create a new dataset version by sending a POST request to https://app.valohai.com/api/v0/dataset-versions/. The dataset, version name and files to include are defined in the request body.
{
    "name": "<version-name>",
    "dataset": "<dataset-UUID>",
    "files": [
        {"datum": "<datum-UUID>"},
        {"datum": "<datum-UUID>"},
        {"datum": "<datum-UUID>"}
    ]
}
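As a rough sketch, the request could be prepared in Python with the standard library as shown below. The function name is hypothetical, and the `Authorization: Token ...` header is an assumption about the authentication scheme; check the API basics documentation for the exact header. The function only builds the request and does not send it.

```python
import json
import urllib.request

API_URL = "https://app.valohai.com/api/v0/dataset-versions/"


def build_dataset_version_request(token, dataset_uuid, version_name, datum_uuids):
    """Prepare (but do not send) the POST request that creates a dataset version."""
    payload = {
        "name": version_name,
        "dataset": dataset_uuid,
        "files": [{"datum": datum_uuid} for datum_uuid in datum_uuids],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: token-based auth header; verify against the API docs.
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send the request.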
Using datasets as inputs
Similarly to datum:// links, you can use dataset:// URLs as inputs in your executions. Instead of a single file, all the files in the dataset version will be made available under the specified input. Below is an example of what using a dataset as an input looks like in your valohai.yaml; you can also use the same URL in the URL field in the UI.
inputs:
  - name: dataset
    default: dataset://<dataset-name>/<dataset-version-name>
    optional: false
Instead of the dataset and version names, you can also use the respective UUIDs.
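Because a dataset input contains several files, the execution code usually needs to iterate over them. The sketch below assumes that input files are placed in a directory named after the input under the path given by the VH_INPUTS_DIR environment variable, defaulting to /valohai/inputs; both the layout and the helper name `list_input_files` are assumptions to verify against your environment.

```python
import os
from pathlib import Path


def list_input_files(input_name, inputs_dir=None):
    """Return all files found under the directory of one input, sorted by path.

    Assumption: inputs live in <VH_INPUTS_DIR>/<input-name>/ inside the execution.
    """
    root = Path(inputs_dir or os.environ.get("VH_INPUTS_DIR", "/valohai/inputs"))
    input_dir = root / input_name
    # rglob also picks up files in subdirectories, if the dataset has any
    return sorted(p for p in input_dir.rglob("*") if p.is_file())
```

With the valohai.yaml example above, list_input_files("dataset") would return the paths of the files in the chosen dataset version.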
Dataset version alias
You can create aliases for your dataset versions to avoid having to change your code every time you create a new version. Instead of using the version name, you can refer to the version by its alias. For example, instead of using dataset://<dataset-name>/<dataset-version-name> in the input, you could write dataset://<dataset-name>/myalias.
Note that the alias dataset://<dataset-name>/latest will always point to the latest version in your dataset. You don't need to create this alias yourself.
Web UI
To create an alias in the UI, open the Aliases tab under the dataset and click on the Create new dataset version alias button.
To set the alias to point to another dataset version, click on the Edit button. You can track the alias history in the UI.
Programmatically
The .metadata.json sidecar file can be used to create or update a dataset version alias. The example below will create a new dataset version based on a previous version and then set the alias myalias to point to that new version. If the alias does not exist, it will be created.
metadata = {
    "valohai.dataset-versions": [{
        'uri': "dataset://<dataset-name>/<new-dataset-version-name>",
        'from': "dataset://<dataset-name>/<original-dataset-version-name>",
        'targeting_aliases': ['myalias']
    }]
}
Instead of the alias name, you can also use the full URI, i.e. dataset://<dataset-name>/myalias instead of myalias.
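Since both forms are accepted, a script that receives aliases from different sources may want to normalize them before writing the sidecar file. The following is a minimal sketch; `full_alias_uri` is a hypothetical helper, and rejecting URIs that point to a different dataset is a design choice made here, not documented behaviour.

```python
def full_alias_uri(dataset, alias):
    """Return the full dataset:// URI for an alias given as a name or a URI."""
    prefix = f"dataset://{dataset}/"
    if alias.startswith("dataset://"):
        # Already a full URI: only accept it if it belongs to this dataset.
        if not alias.startswith(prefix):
            raise ValueError(f"{alias} does not belong to dataset {dataset}")
        return alias
    return prefix + alias


def normalized_aliases(dataset, aliases):
    """Normalize a mixed list of alias names and URIs for 'targeting_aliases'."""
    return [full_alias_uri(dataset, alias) for alias in aliases]
```

The resulting list can be used as the value of 'targeting_aliases' in the metadata definition shown above.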