The Compute and Data Layer of Valohai can be deployed to your on-premises environment. This enables you to:
- Use your own on-premises machines to run machine learning jobs.
- Use your own cloud storage for storing training artifacts, like trained models, preprocessed datasets, visualizations, etc.
- Mount local data to your on-premises workers.
- Access databases and data warehouses directly from the workers, which are inside your network.
Valohai doesn't have direct access to the on-premises machine that executes the machine learning jobs. Instead, it communicates with a separate static virtual machine in your on-premises environment that's responsible for storing the job queue, job states, and short-term logs.
Installing the Valohai worker manually
The Valohai agent (Peon) is responsible for fetching new jobs, writing logs, and updating the job states for Valohai. If your server is running on Ubuntu, you can simply use the `bup` installer to install all the required dependencies; with other Linux distributions, you'll need to do that manually. You can get the full list of everything `bup` will install from your Valohai contact. Moreover, if you already have some of the dependencies installed, you can use the `--only` flag to install only the missing ones. For example, `--only=*peon*` will install only the Valohai agent and no other dependencies.
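As an illustration of how the `--only` glob pattern selects components, here is a minimal sketch; the component names below are made up for the example, the real list comes from your Valohai contact:

```shell
#!/bin/sh
# Hypothetical component names -- only those matching the glob are selected.
PATTERN='*peon*'
for component in docker nvidia-drivers valohai-peon python3; do
  case "$component" in
    $PATTERN) echo "install: $component" ;;
    *)        echo "skip:    $component" ;;
  esac
done
```

Here only `valohai-peon` matches the pattern, so only the agent would be installed.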
The installer needs the following details:
- `queue-name`: the name that this on-premises machine will use.
- `queue-address`: the DNS name assigned to the queue in your subscription.
- `redis-password`: the password that your queue uses. This is usually stored in your cloud provider's Secret Manager.
- `url`: the download URL for the Valohai worker agent.
The queue name is a name that you define to add that instance to a queue group. Each machine can have its own queue, but we recommend using the same queue name on all machines that have the same configuration and are used for the same purpose. For example, a set of identical GPU machines could all share a queue name such as `onprem-gpu`.
Make sure the following are installed on the machine:
- Python 3.8+
- Docker (remember to choose the correct distribution: https://docs.docker.com/engine/install/)
- Nvidia drivers
- nvidia-docker
Note that the Nvidia drivers and nvidia-docker are only needed if you plan to use the GPU on the machine. Verify that they work by launching a GPU-enabled container locally and running `nvidia-smi` inside the container.
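That check can be sketched as a small script; the CUDA image tag below is only an example, so pick one that is compatible with your installed driver version:

```shell
#!/bin/sh
# Sanity-check the GPU stack; skips gracefully on machines without a GPU.
if command -v nvidia-smi >/dev/null 2>&1; then
  # Example image tag; any CUDA base image matching your driver works.
  nvidia-docker run --rm nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
else
  echo "nvidia-smi not found; skipping GPU check"
fi
```

If `nvidia-smi` prints the GPU table from inside the container, the drivers and nvidia-docker are working.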
Peon (the Valohai agent) expects to call either `docker` or `nvidia-docker`, both without arguments. It doesn't natively support `docker --runtime=nvidia` yet. To fix that, you should install a wrapper script for `nvidia-docker`:

```shell
cd /usr/local/bin
curl -fsSL https://raw.githubusercontent.com/NVIDIA/nvidia-docker/master/nvidia-docker > nvidia-docker
chmod u+x nvidia-docker
```
Download and install Peon manually
Make sure you have both `wget` and `tar` installed on the machine. You can get the `<URL>` from your Valohai contact.

```shell
wget <URL>
mkdir peon
tar -C peon/ -xvf peon.tar
pip install peon/*.whl
```
Next, create a Peon configuration in `/etc/peon.config`. Make sure you replace `QUEUES` with your `queue-name`, and fill in `REDIS_URL` with your `redis-password` and the `queue-address`. The password should be stored in the Secret Manager/Key Vault in your cloud account.
Note that the `DOCKER_COMMAND` is either `docker` or `nvidia-docker`, depending on your installation.
```
CLOUD=none
DOCKER_COMMAND=nvidia-docker
INSTALLATION_TYPE=private-worker
QUEUES=<queue-name>
REDIS_URL=rediss://:<redis-password>@<queue-address>:63790
ALLOW_MOUNTS=true
```
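As a sketch of how the `REDIS_URL` value is assembled from the two details, assuming the password-only Redis URL form over TLS (`rediss://`) on port 63790; the password and address below are placeholders:

```shell
#!/bin/sh
# Placeholder values -- read the real password from your Secret Manager /
# Key Vault instead of hardcoding it.
REDIS_PASSWORD='s3cr3t'
QUEUE_ADDRESS='queue.example.com'

# Password-only Redis URL: note the empty username before the colon.
REDIS_URL="rediss://:${REDIS_PASSWORD}@${QUEUE_ADDRESS}:63790"
echo "$REDIS_URL"
# prints rediss://:s3cr3t@queue.example.com:63790
```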
You will also need to create the service file `/etc/systemd/system/peon.service` for the Valohai agent. The `ExecStart` should point to the local installation, such as `/home/valohai/.local/bin/valohai-peon`. In addition, the `User` and `Group` should be replaced by those that are relevant in your case.
```
[Unit]
Description=Valohai Peon Service
After=network.target

[Service]
Environment=LC_ALL=C.UTF-8 LANG=C.UTF-8
EnvironmentFile=/etc/peon.config
ExecStart=/home/valohai/.local/bin/valohai-peon
User=valohai
Group=valohai
Restart=on-failure

[Install]
WantedBy=multi-user.target
```
Lastly, you should also create the Peon cleanup service and a timer for it. This service takes care of cleaning the cached inputs and Docker images to avoid running out of disk space on the machine.

Create the file `/etc/systemd/system/peon-clean.service`. Remember that `ExecStart` should point to the local installation, and the `User` and `Group` should be replaced by the relevant ones here as well.
```
[Unit]
Description=Valohai Peon Cleanup
After=network.target

[Service]
Type=oneshot
Environment=LC_ALL=C.UTF-8 LANG=C.UTF-8
EnvironmentFile=/etc/peon.config
ExecStart=/home/valohai/.local/bin/valohai-peon clean
User=valohai
Group=valohai

[Install]
WantedBy=multi-user.target
```
The cleaning service will also need a timer. Copy-paste the following into `/etc/systemd/system/peon-clean.timer`:
```
[Unit]
Description=Valohai Peon Cleanup Timer
Requires=peon-clean.service

[Timer]
# Every 10 minutes.
OnCalendar=*:0/10
Persistent=true

[Install]
WantedBy=timers.target
```
Make sure that the `User` defined in the files (here `valohai`) has Docker control rights. You can add them by running the command:

```shell
sudo usermod -aG docker <User>
```
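To confirm the change took effect, a small helper sketch follows; `valohai` is the example user from the unit files above, and note that new group membership is only picked up at the user's next login:

```shell
#!/bin/sh
# Returns success if the given user is a member of the docker group.
in_docker_group() {
  id -nG "$1" 2>/dev/null | grep -qw docker
}

if in_docker_group valohai; then
  echo "valohai can control Docker"
else
  echo "valohai is not in the docker group yet (or needs to log in again)"
fi
```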
Now you can reload the unit files and start the services.

```shell
systemctl daemon-reload
systemctl start peon
systemctl start peon-clean
systemctl start peon-clean.timer
```
If the services are failing to start, check `systemctl status peon` and `journalctl -u peon` for errors, and try using `/usr/bin/env python3 -m peon.cli` in the `ExecStart` field in both the `peon.service` and the `peon-clean.service` files.
Once everything works as expected, enable the services to start automatically at boot.

```shell
systemctl enable peon
systemctl enable peon-clean
```