The most convenient way to refer to dataset versions is through dataset version aliases. However, dataset versions track changes in your dataset, so each of them requires a unique name. You can freely choose the names and naming conventions that best suit your use case.
This example shows how to use the project name and the execution ID as the dataset version name when creating a dataset version programmatically. It also shows how to update a version alias so that it points to the newly created dataset version.
The latest alias exists for each dataset and always points to the latest version automatically.
Find the project name and the execution ID
Each Valohai execution contains a set of configuration files. You can use the execution.json file to get both the project name and the execution ID. The project name also contains the organization name, separated by the "/" character, which is not allowed in dataset or dataset version names. Make sure to remove it, for example by keeping only the part after the "/".
import valohai
import json
import pandas as pd
# Read the execution details from the configuration file
with open('/valohai/config/execution.json') as f:
    exec_details = json.load(f)
# Get the project name and the execution ID
project_name = exec_details['valohai.project-name'].split('/')[1]
exec_id = exec_details['valohai.execution-id']
# Use the project name and execution ID in the dataset version name
dataset_version_name = f"dataset://dataset-version-naming/{project_name}_{exec_id}"
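The naming step above can also be wrapped in a small helper. This is a minimal sketch, not part of the Valohai API: the function name build_version_uri and the sample values are hypothetical, and only the two dictionary keys come from the snippet above. It uses rsplit so that a project name without an organization prefix is handled as well.

```python
def build_version_uri(exec_details, dataset_name):
    """Return a dataset version URI named after the project and execution."""
    full_name = exec_details["valohai.project-name"]
    # Keep only the part after the last "/" so the organization prefix is dropped;
    # a name without "/" is returned unchanged.
    project_name = full_name.rsplit("/", 1)[-1]
    exec_id = exec_details["valohai.execution-id"]
    return f"dataset://{dataset_name}/{project_name}_{exec_id}"


# Hypothetical execution details, shaped like the keys read above.
sample = {
    "valohai.project-name": "my-org/my-project",
    "valohai.execution-id": "0123abcd",
}
uri = build_version_uri(sample, "dataset-version-naming")
print(uri)
```

Keeping the logic in one function makes it easy to reuse the same naming convention across several steps of a pipeline.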
Create the metadata
We will use the .metadata.json sidecar file to create the dataset version and to update the dataset version alias called production. If you have several dataset version aliases you want to update, you can provide them as a list. Remember that the alias latest is updated automatically, so you do not have to add it to the list!
metadata = {
    "valohai.dataset-versions": [{
        "uri": dataset_version_name,
        "targeting_aliases": ["production"]
    }]
}
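When several aliases should point to the new version, the same structure can be built with a longer list. The following is a sketch under stated assumptions: build_metadata is a hypothetical helper, and staging is a made-up alias used only for illustration. Because latest is maintained automatically, the helper drops it if it is passed by mistake.

```python
import json


def build_metadata(version_uri, aliases):
    """Build the sidecar metadata for one dataset version and its aliases."""
    # "latest" is updated automatically, so it is filtered out of the list.
    targeting = [alias for alias in aliases if alias != "latest"]
    return {
        "valohai.dataset-versions": [{
            "uri": version_uri,
            "targeting_aliases": targeting,
        }]
    }


# Hypothetical URI and aliases for illustration only.
example = build_metadata(
    "dataset://dataset-version-naming/my-project_0123abcd",
    ["production", "staging", "latest"],
)
print(json.dumps(example, indent=2))
```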
This is a very simple example. See the Introduction to datasets article to learn how to create new datasets based on existing ones. You can also use .metadata.json files to attach tags and arbitrary metadata to output files.
Save the metadata with your files to create the dataset version
Finally, we'll save the output file together with its .metadata.json sidecar file. If you have several files, each of them needs its own sidecar file.
# Example DataFrame
d = {'Color': ["red", "green"], 'Number': [1, 2]}
df = pd.DataFrame(data=d)
# Save the file to /valohai/outputs/
file_path = valohai.outputs().path('data.csv')
df.to_csv(file_path, encoding='utf-8', index=False)
# Saving the .metadata.json sidecar file.
metadata_path = valohai.outputs().path('data.csv.metadata.json')
with open(metadata_path, 'w') as outfile:
    json.dump(metadata, outfile)
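For several output files, the sidecar step can be repeated in a loop. The sketch below assumes that every file should belong to the same dataset version, so it writes the same metadata next to each file. The helper write_sidecar, the file names, and the URI are hypothetical, and a temporary directory stands in for /valohai/outputs/ so the snippet can run outside an execution.

```python
import json
import tempfile
from pathlib import Path


def write_sidecar(output_file, metadata):
    """Write <output_file>.metadata.json next to the given output file."""
    output_file = Path(output_file)
    sidecar = output_file.with_name(output_file.name + ".metadata.json")
    with open(sidecar, "w") as outfile:
        json.dump(metadata, outfile)
    return sidecar


# Hypothetical metadata, shaped like the dictionary created earlier.
metadata = {
    "valohai.dataset-versions": [{
        "uri": "dataset://dataset-version-naming/my-project_0123abcd",
        "targeting_aliases": ["production"],
    }]
}

# A temporary directory stands in for /valohai/outputs/ in this sketch.
out_dir = Path(tempfile.mkdtemp())
sidecars = []
for name in ["train.csv", "test.csv"]:
    data_file = out_dir / name
    data_file.write_text("Color,Number\nred,1\n", encoding="utf-8")
    sidecars.append(write_sidecar(data_file, metadata))

print(sorted(path.name for path in sidecars))
```

Inside an execution, the data files would instead be saved under the path returned by valohai.outputs().path(...), as in the example above.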