The Compute and Data Layer of Valohai can be deployed to your on-premises environment. This enables you to:
- Use your own on-premises machines to run machine learning jobs.
- Use your own cloud storage for storing training artifacts, such as trained models, preprocessed datasets, and visualizations.
- Mount local data to your on-premises workers.
- Access databases and data warehouses directly from the workers, which are inside your network.
Valohai doesn't have direct access to the on-premises machine that executes the machine learning jobs. Instead, it communicates with a separate static virtual machine in your on-premises environment that's responsible for storing the job queue, job states, and short-term logs.
Installing the Valohai worker manually
The Valohai agent (Peon) is responsible for fetching new jobs, writing logs, and updating job states for Valohai. If your server is running Ubuntu, you can simply use the `peon-bringup` script (aka `bup`) to install all the required dependencies. With other Linux distributions, you'll need to install them manually.

You can get the full list of dependencies that `bup` will install from your Valohai contact. Moreover, if you already have some of them installed, you can use the `--only` option to install just the missing ones. For example, `--only=*peon*` will install only the Valohai agent and no other dependencies.

`bup` needs the following values:
- `queue-name`: the name that this on-premises machine will use.
- `queue-address`: the DNS name assigned to the queue in your subscription.
- `redis-password`: the password that your queue uses. This is usually stored in your cloud provider's Secret Manager.
- `url`: the download URL for the Valohai worker agent.
The queue name is a name that you define to add that instance to a queue group. For example:
- myorg-onprem-1
- myorg-onprem-machine-name
- myorg-onprem-gpus
- myorg-onprem-gpus-prod
Each machine can have its own queue, but we recommend using the same queue name on all machines that have the same configuration and are used for the same purpose.
Requirements
- Python 3.8+
- Nvidia drivers
- Docker
- Remember to choose the correct distribution: https://docs.docker.com/engine/install/
- nvidia-docker
Note that the Nvidia drivers and nvidia-docker are only needed if you plan to use the GPU on the machine. Verify that they work by launching a GPU-enabled container locally and running `nvidia-smi` inside the container.
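For example, with the NVIDIA container toolkit installed, a quick check could look like the following (the CUDA image tag is illustrative; any CUDA base image works):

```shell
# Runs nvidia-smi inside a throwaway GPU-enabled container.
# If the drivers and container runtime are set up correctly, this prints
# the same GPU table as running nvidia-smi directly on the host.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```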
nvidia-docker
Peon (the Valohai agent) expects to call either `docker` or `nvidia-docker`, both without arguments. It doesn't natively support `docker --runtime=nvidia` yet. To fix that, install a wrapper script for `nvidia-docker`:
```
cd /usr/local/bin
curl -fsSL https://raw.githubusercontent.com/NVIDIA/nvidia-docker/master/nvidia-docker > nvidia-docker
chmod a+x nvidia-docker
```
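As a quick sanity check that the wrapper is on the `PATH` and executable, you can invoke it with a harmless subcommand (the wrapper forwards commands other than `run`/`create` straight to `docker`):

```shell
# Should print the same output as `docker version`.
nvidia-docker version
```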
Download and install Peon manually
Make sure you have both `wget` and `tar` installed on the machine. You can get the `<URL>` from your Valohai contact.
```
wget <URL>
mkdir peon
tar -C peon/ -xvf peon.tar
pip install peon/*.whl
```
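Before wiring up systemd, you can sanity-check the installation by invoking the agent module directly (assuming the CLI supports the conventional `--help` flag; the exact console-script path depends on how pip installed it):

```shell
# Confirm the peon package is importable by the system Python.
/usr/bin/env python3 -m peon.cli --help

# Locate the installed console script; this is the path that
# ExecStart in the systemd service files should point to.
which valohai-peon
```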
Next, create a Peon configuration in `/etc/peon.config`. Make sure you replace the `QUEUES` field with your `queue-name`, and fill in `REDIS_URL` with your `redis-password` and `queue-address`. The password should be stored in the Secret Manager/Key Vault of your cloud account.
Note that `DOCKER_COMMAND` is either `docker` or `nvidia-docker`, depending on your installation.
```
CLOUD=none
DOCKER_COMMAND=nvidia-docker
INSTALLATION_TYPE=private-worker
QUEUES=<queue-name>
REDIS_URL=rediss://:<redis-password>@<queue-address>:63790
ALLOW_MOUNTS=true
```
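If you are unsure which part of the URL goes where, it decomposes into an empty username, the password, the queue host, and the port. The values below are purely illustrative placeholders (and the scheme may differ in your setup; the decomposition is the same):

```shell
# Illustrative decomposition of a REDIS_URL of the expected shape
# (placeholder password and host, not real credentials).
REDIS_URL='rediss://:s3cret@queue.example.com:63790'
rest="${REDIS_URL#*://:}"        # strip scheme and empty username
password="${rest%%@*}"           # everything before '@'
hostport="${rest#*@}"            # everything after '@'
echo "password=$password host=${hostport%:*} port=${hostport##*:}"
```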
You will also need to create the service file `/etc/systemd/system/peon.service` for the Valohai agent. `ExecStart` should point to the local installation, such as `/home/valohai/.local/bin/valohai-peon` or `/usr/local/bin/valohai-peon`. In addition, `User` and `Group` should be replaced with the ones relevant in your case.
```
[Unit]
Description=Valohai Peon Service
After=network.target

[Service]
Environment=LC_ALL=C.UTF-8 LANG=C.UTF-8
EnvironmentFile=/etc/peon.config
ExecStart=/home/valohai/.local/bin/valohai-peon
User=valohai
Group=valohai
Restart=on-failure

[Install]
WantedBy=multi-user.target
```
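Before starting anything, you can ask systemd to lint the unit file with the standard `systemd-analyze verify` subcommand, which reports unknown directives and missing executables:

```shell
# Static checks on the unit file, e.g. that the ExecStart binary exists.
systemd-analyze verify /etc/systemd/system/peon.service
```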
Lastly, you should also create the Peon cleanup service and a timer for it. This service takes care of cleaning cached inputs and Docker images to avoid running out of disk space on the machine.

Create the file `/etc/systemd/system/peon-clean.service`. Remember that `ExecStart` should point to the local installation, and `User` and `Group` should be replaced with the relevant ones here as well.
```
[Unit]
Description=Valohai Peon Cleanup
After=network.target

[Service]
Type=oneshot
Environment=LC_ALL=C.UTF-8 LANG=C.UTF-8
EnvironmentFile=/etc/peon.config
ExecStart=/home/valohai/.local/bin/valohai-peon clean
User=valohai
Group=valohai

[Install]
WantedBy=multi-user.target
```
The cleanup service also needs a timer. Copy-paste the following into `/etc/systemd/system/peon-clean.timer`.
```
[Unit]
Description=Valohai Peon Cleanup Timer
Requires=peon-clean.service

[Timer]
# Every 10 minutes.
OnCalendar=*:0/10
Persistent=true

[Install]
WantedBy=timers.target
```
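If you want to double-check the `OnCalendar` expression, the standard `systemd-analyze calendar` tool prints its normalized form and when the timer would next elapse:

```shell
# Shows the normalized expression and the next trigger times.
systemd-analyze calendar "*:0/10"
```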
Make sure that the `User` defined in the files (here `valohai`) has Docker control rights. You can add them by running:

```
sudo usermod -aG docker <User>
```
Now you can reload the unit files and start the services.

```
systemctl daemon-reload
systemctl start peon
systemctl start peon-clean
systemctl start peon-clean.timer
```
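After starting the services, you can confirm they are healthy with standard systemd tooling:

```shell
# Current state of the agent and the cleanup timer.
systemctl status peon peon-clean.timer

# Recent agent logs; useful when the worker doesn't pick up jobs.
journalctl -u peon --since "10 minutes ago"
```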
If the services fail to start, try using `/usr/bin/env python3 -m peon.cli` in the `ExecStart` field of both the `peon.service` and `peon-clean.service` files.
Once everything works as expected, enable the services to start automatically at boot. Note that the timer unit must be enabled separately for the cleanup schedule to survive a reboot.

```
systemctl enable peon
systemctl enable peon-clean
systemctl enable peon-clean.timer
```