This guide walks you through how to prepare your AWS account for a self-hosted Valohai installation, which AWS resources will be set up, and which access control permissions need to be configured.
Valohai Components
Valohai Roi - EC2 Instance
- Runs core Valohai applications like the web app, scaling services, API services, and deployment image builder.
- End users access the web application running on this instance (port 8000).
PostgreSQL
- A relational database that contains user data and saves execution details, such as which worker type was used, what commands were run, which Docker image was used, which inputs were used, and what the launch configuration was.
ElastiCache (Redis)
- Stores information about the job queue and short-term execution logs so they can be shown on the web app and API in real-time. Each job is connected to a queue.
- The workers fetch jobs from the Redis job queue based on their queue name (e.g. machines that belong to the queue `t3.medium` will fetch only jobs that are marked for that queue).
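To make the queue-per-machine-type model concrete, here is a minimal illustrative sketch (this is not Valohai's internal protocol) using the Python `redis` client; the endpoint, queue name, and job payload are hypothetical:

```python
# Illustrative sketch only, not Valohai's internal protocol: each worker type has
# its own queue, and a worker pops jobs only from the queue that matches its
# machine type. The endpoint, queue name, and job payload are hypothetical.
import json

import redis

r = redis.Redis(host="valohai-queue.xxxxxx.cache.amazonaws.com", port=6379)  # placeholder endpoint

# A job is pushed onto the queue for t3.medium machines...
r.lpush("jobs:t3.medium", json.dumps({"execution_id": 123, "command": "python train.py"}))

# ...and only a worker listening on that queue will pick it up.
queue_name, payload = r.brpop("jobs:t3.medium")
print(queue_name, json.loads(payload))
```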
S3 bucket
- Valohai stores Git commit snapshots in S3 to maintain reproducibility. Worker instances download the user code archives from this storage.
- Real-time logs are moved to persistent storage after the target execution finishes.
Valohai Workers - EC2 instances
- Workers are VM instances of different types that Valohai launches for user-initiated and scheduled machine learning executions.
- In Valohai you choose an `Environment` that defines what type of worker will be used for the execution. Each environment has a queue name that defines which queue it should pick up jobs from. Each VM instance belongs to one queue.
- There should be one queue for each VM instance configuration type, e.g. a queue for all `t3.medium` machines, a queue for all `p2.xlarge` machines, etc.
AWS Configuration
VPC
Valohai can be deployed either in your existing VPC, or in a new separate VPC.
Security Groups
| Name | Inbound | Outbound |
|---|---|---|
| valohai-sg-workers | Allow SSH connections from admins for debugging purposes. | Block outbound access if ML jobs are not allowed to access the public internet. |
| valohai-sg-master | Port 22, source: IP of the admin who will perform the installation. Port 80, source: valohai-sg-loadbalancer. | |
| valohai-sg-database | Port 5432, source: valohai-sg-master. | |
| valohai-sg-queue | Port 6379, source: valohai-sg-master, valohai-sg-workers. | |
| valohai-sg-loadbalancer | Port 443, source: 0.0.0.0/0. | All traffic |
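As an illustration of how these rules can be created, here is a minimal sketch using boto3 that opens port 5432 on valohai-sg-database to traffic from valohai-sg-master, matching the table above; the security group IDs and region are placeholders:

```python
# Minimal sketch, assuming boto3 and that both security groups already exist:
# open port 5432 on valohai-sg-database to traffic coming from valohai-sg-master,
# as in the table above. The group IDs and region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder: valohai-sg-database
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 5432,
            "ToPort": 5432,
            "UserIdGroupPairs": [
                {"GroupId": "sg-0fedcba9876543210", "Description": "valohai-sg-master"}  # placeholder
            ],
        }
    ],
)
```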
IAM
ValohaiWorker - IAM Role
- The default role for all EC2 instances that Valohai launches for ML jobs.
- This is the minimum requirement; you can add further permissions, for example Redshift data access: https://help.valohai.com/hc/en-us/articles/4421460555409-Access-AWS-Redshift-from-an-execution
- This way you don't have to worry about key creation and rotation; instead, user code can access the machine credentials through the AWS instance metadata service (see the sketch after the policy below).
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "1",
"Effect": "Allow",
"Action": "autoscaling:SetInstanceProtection",
"Resource": "*"
},
{
"Sid": "2",
"Effect": "Allow",
"Action": "ec2:DescribeInstances",
"Resource": "*"
}
]
}
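As an example of the last point above, here is a minimal sketch of user code running on a worker that creates a boto3 client without any explicit keys; the credentials are resolved from the instance metadata, i.e. from the ValohaiWorker role. The Redshift Data API call assumes the role has been extended with the Redshift permissions described in the linked article, and the cluster, database, and user names are placeholders:

```python
# Minimal sketch of user code running on a Valohai worker: the boto3 client is
# created without any access keys, so credentials are resolved from the EC2
# instance metadata service, i.e. from the ValohaiWorker role attached to the
# machine. The Redshift Data API call assumes the role has been extended with
# Redshift permissions; cluster, database, and user are placeholders.
import boto3

client = boto3.client("redshift-data", region_name="eu-west-1")  # no keys needed on the worker

response = client.execute_statement(
    ClusterIdentifier="my-redshift-cluster",  # placeholder
    Database="analytics",                     # placeholder
    DbUser="valohai",                         # placeholder
    Sql="SELECT 1",
)
print(response["Id"])  # statement id; fetch results with get_statement_result
```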
ValohaiMaster - IAM User
- Used for creating and scaling the EC2 resources for ML jobs launched by users.
- It also has access to the default Valohai S3 bucket.
- It can access secrets from AWS Secrets Manager that are tagged with `valohai = 1` (see the example after the policy below).
{
"Version" : "2012-10-17",
"Statement" : [
{
"Sid" : "2",
"Effect" : "Allow",
"Action" : [
"ec2:DescribeInstances",
"ec2:DescribeVpcs",
"ec2:DescribeKeyPairs",
"ec2:DescribeImages",
"ec2:DescribeSecurityGroups",
"ec2:DescribeSubnets",
"ec2:DescribeInstanceTypes",
"ec2:DescribeLaunchTemplates",
"ec2:DescribeLaunchTemplateVersions",
"ec2:DescribeInstanceAttribute",
"ec2:CreateTags",
"ec2:DescribeInternetGateways",
"ec2:DescribeRouteTables",
"autoscaling:DescribeAutoScalingGroups",
"autoscaling:DescribeScalingActivities"
],
"Resource" : "*"
},
{
"Sid" : "AllowUpdatingSpotLaunchTemplates",
"Effect" : "Allow",
"Action" : [
"ec2:CreateLaunchTemplate",
"ec2:CreateLaunchTemplateVersion",
"ec2:ModifyLaunchTemplate",
"ec2:RunInstances",
"ec2:TerminateInstances",
"ec2:RebootInstances",
"autoscaling:UpdateAutoScalingGroup",
"autoscaling:CreateOrUpdateTags",
"autoscaling:SetDesiredCapacity",
"autoscaling:CreateAutoScalingGroup"
],
"Resource" : "*",
"Condition" : {
"ForAllValues:StringEquals" : {
"aws:ResourceTag/valohai" : "1"
}
}
},
{
"Sid" : "ServiceLinkedRole",
"Effect" : "Allow",
"Action" : "iam:CreateServiceLinkedRole",
"Resource" : "arn:aws:iam::*:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling"
},
{
"Sid" : "4",
"Effect" : "Allow",
"Action" : [
"iam:PassRole",
"iam:GetRole"
],
"Resource" : "arn:aws:iam::ACCOUNT-ID:role/ValohaiWorkerRole"
},
{
"Sid" : "0",
"Effect" : "Allow",
"Action" : [
"secretsmanager:GetResourcePolicy",
"secretsmanager:GetSecretValue",
"secretsmanager:DescribeSecret",
"secretsmanager:ListSecretVersionIds"
],
"Resource" : "*",
"Condition" : {
"StringEquals" : {
"secretsmanager:ResourceTag/valohai" : "1"
}
}
},
{
"Action" : "secretsmanager:GetRandomPassword",
"Resource" : "*",
"Effect" : "Allow",
"Sid" : "1"
},
{
"Effect" : "Allow",
"Action" : "s3:*",
"Resource" : [
"arn:aws:s3:::your S3 bucket name",
"arn:aws:s3:::your S3 bucket name/*"
]
},
{
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"logs:DescribeLogStreams",
"logs:DescribeLogGroups"
],
"Resource": [
"arn:aws:logs:*:*:log-group:*",
"arn:aws:logs:*:*:log-group:*:log-stream:",
"arn:aws:logs:*:*:log-group:*:log-stream:*"
]
}
]
}
The last statement block is only required for CloudWatch. If you are not using CloudWatch, you can remove it completely.
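Because of the `secretsmanager:ResourceTag/valohai` condition in the policy above, the ValohaiMaster user can only read secrets that carry the tag `valohai = 1`. Here is a minimal sketch of creating such a secret with boto3; the secret name and value are placeholders:

```python
# Minimal sketch, assuming boto3 and admin credentials: create a secret with the
# tag valohai = 1 so that the ValohaiMaster user is allowed to read it.
# The secret name and value are placeholders.
import boto3

secretsmanager = boto3.client("secretsmanager", region_name="eu-west-1")

secretsmanager.create_secret(
    Name="my-training-db-password",           # placeholder
    SecretString="super-secret-value",        # placeholder
    Tags=[{"Key": "valohai", "Value": "1"}],  # required by the policy condition above
)
```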
ValohaiMultiPartUploadRole - IAM Role
Used to upload files larger than 5 GB to the S3 bucket. See https://aws.amazon.com/about-aws/whats-new/2010/11/10/Amazon-S3-Introducing-Multipart-Upload/ and the example after the policy below.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Stmt1503921756000",
"Effect": "Allow",
"Action": [
"s3:AbortMultipartUpload",
"s3:GetObject",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:ListBucketVersions",
"s3:ListMultipartUploadParts",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::your S3 bucket name",
"arn:aws:s3:::your S3 bucket name/*"
]
}
]
}
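A single S3 PUT is limited to 5 GB, so larger files have to be uploaded in parts, which is what the permissions above allow. Here is a minimal sketch using boto3's managed transfer, which switches to multipart uploads automatically; the bucket, key, and local file names are placeholders:

```python
# Minimal sketch, assuming boto3: a single S3 PUT is limited to 5 GB, so boto3's
# managed transfer splits larger files into multipart uploads, which is what the
# permissions above allow. The bucket, key, and local file are placeholders.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,  # upload in 64 MB parts
)

s3.upload_file(
    Filename="model-weights.tar.gz",   # placeholder local file
    Bucket="yourbucketname-valohai",   # placeholder bucket
    Key="data/model-weights.tar.gz",   # placeholder key
    Config=config,
)
```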
Optional: setting up a Kubernetes cluster
Follow this guide only if you are planning to deploy models from Valohai to a Kubernetes cluster.
https://help.valohai.com/hc/en-us/articles/5260094506129-Using-an-existing-Kubernetes-cluster-as-a-deployment-target-for-real-time-inference
Core Valohai Resources
Your configuration will depend on your organization's requirements. The table below describes the minimum configuration needed for Valohai.
| Resource | Details |
|---|---|
| EC2 | Name: valohai-roi, Security group: valohai-sg-master, OS: Ubuntu 22.04 LTS, Instance: m5a.xlarge, Storage: 32 GB |
| S3 | Name: yourbucketname-valohai. Block all public access. |
| RDS | Name: valohai-psql, Class: db.t2.large, Security group: valohai-sg-database, Port: 5432, Public accessibility: No, Engine version: 14.2 |
| ElastiCache (Redis) | Name: valohai-queue, Node type: cache.m3.xlarge, Number of nodes: 1, Engine version: 6.2 |
| EC2 Load Balancer | The Valohai web application is served at port 8000 on the EC2 instance. HTTP/2 enabled. |
| DNS name | Provide a DNS name that points at the load balancer (used for the web application, e.g. valohai.yourdomain.net). |
| EKS cluster | Valohai can either deploy to an existing EKS cluster, or you can provision a new deployment cluster for Valohai. Details at: https://help.valohai.com/hc/en-us/articles/5260094506129-Using-an-existing-Kubernetes-cluster-as-a-deployment-target-for-real-time-inference |
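As an illustration of the load balancer row above, here is a minimal boto3 sketch that forwards HTTPS traffic on port 443 to the Roi instance on port 8000; all ARNs, IDs, and the certificate are placeholders, and your existing tooling (console, Terraform, etc.) may differ:

```python
# Minimal sketch, assuming boto3 and an existing Application Load Balancer:
# forward HTTPS (443) traffic to the Valohai Roi EC2 instance on port 8000.
# All ARNs, IDs, and names below are placeholders.
import boto3

elbv2 = boto3.client("elbv2", region_name="eu-west-1")

# Target group pointing at port 8000 on the Roi instance.
target_group = elbv2.create_target_group(
    Name="valohai-roi-tg",                  # placeholder
    Protocol="HTTP",
    Port=8000,
    VpcId="vpc-0123456789abcdef0",          # placeholder
    TargetType="instance",
    HealthCheckPath="/",
)["TargetGroups"][0]

elbv2.register_targets(
    TargetGroupArn=target_group["TargetGroupArn"],
    Targets=[{"Id": "i-0123456789abcdef0"}],  # placeholder: the valohai-roi instance
)

# HTTPS listener on the load balancer that forwards to the target group.
elbv2.create_listener(
    LoadBalancerArn="arn:aws:elasticloadbalancing:...",   # placeholder
    Protocol="HTTPS",
    Port=443,
    Certificates=[{"CertificateArn": "arn:aws:acm:..."}],  # placeholder
    DefaultActions=[{"Type": "forward", "TargetGroupArn": target_group["TargetGroupArn"]}],
)
```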