You can access Snowflake from your Valohai executions to run queries.
Store your Snowflake credentials in Valohai
You can authenticate to your Snowflake database with a username and password combination. In addition, you will need your Snowflake account identifier.
Start by setting up the connection details as environment variables:
- Open your project at app.valohai.com
- Go to project Settings and open the Env Variables tab
- Create the following environment variables:
Name | Value | Secret
---|---|---
USERNAME | Your Snowflake username | No
PASSWORD | Your Snowflake password | Yes
ACCOUNT_IDENTIFIER | Your Snowflake account identifier | No
Make sure to mark the password as secret by checking the Secret checkbox!
These environment variables will now be available in all executions run inside this project. This means that anyone with access to the project can run queries using these credentials inside their executions.
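Inside an execution, these values can then be read with Python's standard `os` module, for example:

```python
import os

# Environment variables defined in the project settings are available
# to the execution as regular environment variables.
user = os.environ['USERNAME']
account = os.environ['ACCOUNT_IDENTIFIER']
print(f"Will connect to Snowflake account {account} as {user}")
```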
Install the Snowflake connector in your container
To run queries on your Snowflake database, you will need the connector installed. You can do that via `pip install` in the `command` section of your valohai.yaml, or alternatively you can add it to your custom Docker image (a sketch of which follows the example below).
The Snowflake Python connector requires Python 3.6, 3.7, 3.8, or 3.9. In addition to the connector, you should also install the corresponding dependencies recommended by Snowflake; these depend on the Python version you're using.
The example below installs the dependencies recommended for Python 3.9 and then installs version 2.8.1 of the connector in the container.
```yaml
- step:
    name: snowflake-connect
    image: python:3.9
    command:
      - pip install -r https://raw.githubusercontent.com/snowflakedb/snowflake-connector-python/v2.8.1/tested_requirements/requirements_39.reqs
      - pip install snowflake-connector-python==2.8.1
      - python validate.py
```
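If you'd rather bake the connector into a custom Docker image, a minimal Dockerfile sketch (assuming the same Python 3.9 base image and connector version as above) could look like this:

```dockerfile
# Sketch of a custom image with the Snowflake connector preinstalled;
# versions mirror the valohai.yaml example above.
FROM python:3.9

RUN pip install -r https://raw.githubusercontent.com/snowflakedb/snowflake-connector-python/v2.8.1/tested_requirements/requirements_39.reqs \
 && pip install snowflake-connector-python==2.8.1
```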
To learn more about installing the connector and the dependencies, please refer to Snowflake's documentation.
Connect to Snowflake and run queries in your executions
To connect to your Snowflake database, you'll need to import `snowflake.connector`. You will also need the `os` library for reading the environment variables.
The example below creates an example database in an existing warehouse, creates an example table with data in it, and finally reads the contents of the table. If you have an existing database, feel free to run queries on it instead.
```python
import os

import snowflake.connector

# Create the connection by using the credentials stored in the environment variables
conn = snowflake.connector.connect(
    user=os.environ['USERNAME'],
    password=os.environ['PASSWORD'],
    account=os.environ['ACCOUNT_IDENTIFIER']
)
cs = conn.cursor()
try:
    # Use an existing warehouse
    cs.execute("USE WAREHOUSE tiny_warehouse_mg")
    # Create a new database in the warehouse
    cs.execute("CREATE DATABASE IF NOT EXISTS mydatabase")
    # Use the database (here "mydatabase")
    cs.execute("USE DATABASE mydatabase")
    # Create example_table and insert data
    cs.execute(
        "CREATE OR REPLACE TABLE "
        "example_table(col1 integer, col2 string)")
    cs.execute(
        "INSERT INTO example_table(col1, col2) VALUES "
        "(123, 'test string1'), "
        "(456, 'test string2')")
    # Read the table
    cs.execute("SELECT * FROM example_table")
    result = cs.fetchall()
    print("Contents of example_table:")
    for row in result:
        print(row)
finally:
    cs.close()
    conn.close()
```
Maintaining reproducibility
As your data lives and changes in Snowflake, so will your query results. Running a query today will return a certain result, but running the same query next week might return a different one.
It's strongly recommended that you save the query result, or a preprocessed dataset, to `/valohai/outputs/` to keep a snapshot of the exact query result before you run any training with that data.
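As a minimal sketch (reusing the warehouse, database, and table names from the example above; the output file name is hypothetical), snapshotting a query result as a CSV could look like this:

```python
import csv
import os

import snowflake.connector

conn = snowflake.connector.connect(
    user=os.environ['USERNAME'],
    password=os.environ['PASSWORD'],
    account=os.environ['ACCOUNT_IDENTIFIER']
)
cs = conn.cursor()
try:
    # Reuse the warehouse, database, and table from the earlier example
    cs.execute("USE WAREHOUSE tiny_warehouse_mg")
    cs.execute("USE DATABASE mydatabase")
    cs.execute("SELECT * FROM example_table")
    rows = cs.fetchall()
finally:
    cs.close()
    conn.close()

# Files written to /valohai/outputs/ are uploaded to your data store
# when the execution finishes.
with open('/valohai/outputs/example_table_snapshot.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['col1', 'col2'])  # header matching the example table
    writer.writerows(rows)
```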
For example, your pipeline could look like the one below (a YAML sketch follows the list):

- Fetch data from Snowflake and preprocess it. Save the preprocessed data to `/valohai/outputs/` so it gets saved to your data store.
- Use the preprocessed data from the data store in any further training jobs.
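As a sketch, such a pipeline could be defined in your valohai.yaml roughly as below. The step names (`preprocess-data`, `train-model`) and the input name (`dataset`) are hypothetical; they should match the steps and inputs defined in your own valohai.yaml:

```yaml
- pipeline:
    name: snowflake-training
    nodes:
      - name: preprocess
        type: execution
        step: preprocess-data
      - name: train
        type: execution
        step: train-model
    edges:
      # Pass the preprocessed outputs of the first node
      # as inputs to the training node
      - [preprocess.output.*, train.input.dataset]
```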
If you later need to reproduce that training or inspect the actual data your model was trained on, you can easily rerun it on Valohai or download the dataset instead of relying on that day's query result.