Install Reducto on EKS using Terraform.
The project creates a Helm release for Reducto on EKS in the `reducto` namespace and creates the following required dependencies:
- RDS instance
- S3 bucket
- KEDA (for autoscaling Reducto workers in-cluster)
- Cluster node autoscaling (Karpenter is configured, but you can use any cluster autoscaling tool)
- AWS Load Balancer Controller or Ingress NGINX (but you can use any ingress controller)

This project demonstrates a fully working cluster needed to run Reducto. Cloudflare is not a requirement; it is used here to set up TLS along with cert-manager.
For upgrade instructions and release notes, see MIGRATION_GUIDE.md.
Requirements:

| Name | Version |
|---|---|
| terraform | >= 1.2.0 |
| aws | 6.28.0 |
| helm | 3.1.1 |
| kubectl | 1.19.0 |
| kubernetes | 3.0.1 |
| null | 3.2.4 |
| random | 3.8.0 |
Providers:

| Name | Version |
|---|---|
| aws | 6.28.0 |
| helm | 3.1.1 |
| kubectl | 1.19.0 |
| kubernetes | 3.0.1 |
| random | 3.8.0 |
Modules:

| Name | Source | Version |
|---|---|---|
| ebs_csi_irsa_role | terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts | v6.4.0 |
| eks | terraform-aws-modules/eks/aws | 21.15.1 |
| karpenter | terraform-aws-modules/eks/aws//modules/karpenter | 21.12.0 |
| load_balancer_controller_irsa_role | terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts | v6.4.0 |
| rds | terraform-aws-modules/rds/aws | 7.1.0 |
| rds_proxy | terraform-aws-modules/rds-proxy/aws | 4.2.1 |
| rds_proxy_sg | terraform-aws-modules/security-group/aws | 5.2 |
| rds_sg | terraform-aws-modules/security-group/aws | 5.2.0 |
| vpc | terraform-aws-modules/vpc/aws | 6.6.0 |
| vpc_cni_irsa_role | terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts | v6.4.0 |
Inputs:

| Name | Description | Type | Default | Required |
|---|---|---|---|---|
| cloudflare_api_token | Cloudflare API token for Cert Manager to use DNS solver for issuing TLS certificates | string | n/a | yes |
| cluster_endpoint_public_access | Enable public access to the EKS cluster API endpoint | bool | true | no |
| cluster_endpoint_public_access_cidrs | List of CIDR blocks allowed to access the public EKS API endpoint | list(string) | [ | no |
| cluster_name | Name of the EKS cluster and prefix for related resources | string | "reducto-ai" | no |
| datadog_api_key | Datadog API key | string | "" | no |
| datadog_site | Datadog site | string | "us3.datadoghq.com" | no |
| db_deletion_protection | Enable deletion protection for RDS database to prevent accidental deletion | bool | true | no |
| db_instance_class | Instance class for Reducto Postgres database | string | "db.t4g.medium" | no |
| db_multi_az | Enable Multi-AZ deployment for RDS database for high availability | bool | true | no |
| db_username | Postgres DB username | string | "reducto" | no |
| enable_gpu_managed_node_group | Whether to create the GPU managed node group (system_gpu) for GPU workloads | bool | false | no |
| enable_nvidia_device_plugin | Whether to install the NVIDIA device plugin for GPU support | bool | false | no |
| enable_otel_collector | Whether to deploy the OpenTelemetry Collector on the cluster | bool | false | no |
| enable_reducto | Whether to deploy the Reducto application via Helm | bool | true | no |
| enable_vllm_stack | Whether to deploy the vLLM stack on the cluster | bool | false | no |
| helm_release_timeout | Timeout in seconds for Helm release operations | number | 900 | no |
| otel_auth_token | Auth token used by the OpenTelemetry collector | string | "" | no |
| otel_datadog_api_key | Datadog API key used by the OpenTelemetry collector exporter | string | "admin" | no |
| otel_host | FQDN for exposing the OpenTelemetry Collector | string | "" | no |
| private_subnets | List of private subnet CIDRs | list(string) | [] | no |
| public_subnets | List of public subnet CIDRs | list(string) | [] | no |
| reducto_helm_chart | Path to Helm chart on OCI registry | string | "oci://registry.reducto.ai/reducto-api/reducto" | no |
| reducto_helm_chart_version | Reducto Helm chart version | string | "1.11.32" | no |
| reducto_helm_repo_password | Password for Helm registry for Reducto Helm chart | string | n/a | yes |
| reducto_helm_repo_username | Username for Helm registry for Reducto Helm chart | string | n/a | yes |
| reducto_host | Full host DNS for Reducto (example: reducto.mydomain.com) | string | n/a | yes |
| region | AWS region where resources will be created | string | "us-east-1" | no |
| slack_webhook_url | Slack Webhook URL for Alertmanager | string | n/a | yes |
| vllm_stack_hf_token | Hugging Face API token used by the vLLM stack for model access | string | "" | no |
| vpc_cidr | CIDR block for the VPC | string | "10.125.0.0/16" | no |
Outputs:

| Name | Description |
|---|---|
| cluster_certificate_authority_data | Base64 encoded certificate data required to communicate with the cluster |
| cluster_endpoint | Endpoint for EKS control plane |
| cluster_name | Name of the EKS cluster |
| cluster_security_group_id | Security group ID attached to the EKS cluster |
| configure_kubectl | Command to configure kubectl for the EKS cluster |
| db_instance_endpoint | Connection endpoint for the RDS instance |
| db_instance_name | Name of the RDS database |
| db_proxy_arn | ARN of the RDS Proxy |
| db_proxy_endpoint | Connection endpoint for the RDS Proxy |
| oidc_provider_arn | ARN of the OIDC Provider for EKS |
| private_subnets | List of IDs of private subnets |
| public_subnets | List of IDs of public subnets |
| reducto_host | Hostname where Reducto is accessible |
| reducto_iam_role_arn | ARN of the IAM role for Reducto service account |
| region | AWS region where resources are deployed |
| s3_bucket_arn | ARN of the S3 bucket for Reducto storage |
| s3_bucket_name | Name of the S3 bucket for Reducto storage |
| vpc_id | ID of the VPC |
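After apply, kubectl can be configured from the `configure_kubectl` output. With the default `cluster_name` and `region` variable values, the command it wraps is equivalent to:

```shell
# Update local kubeconfig for the provisioned cluster (defaults shown)
aws eks update-kubeconfig --region us-east-1 --name reducto-ai
```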
To obtain or inspect the Helm chart and the available configuration in `values.yaml`:

```shell
# Log in to the registry
helm registry login registry.reducto.ai \
  --username <your-username> \
  --password <your-password>

# Pull the latest Helm chart
helm pull oci://registry.reducto.ai/reducto-api/reducto
```
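Alternatively, the chart's default values can be printed without extracting the pulled archive (requires the registry login above):

```shell
# Print the chart's default values.yaml to stdout
helm show values oci://registry.reducto.ai/reducto-api/reducto
```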
All workloads are created only in private subnets, including the NLB for ingress-nginx.
For bootstrapping the cluster, both public and private endpoints are enabled. Public endpoint access can be restricted or removed after provisioning:
- Remove the public endpoint: `cluster_endpoint_public_access = false`
- Restrict the public endpoint: `cluster_endpoint_public_access_cidrs = [vpc_cidr]`
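For example, in `terraform.tfvars` (the CIDR shown is this project's default `vpc_cidr`):

```hcl
# Restrict the public EKS API endpoint to the VPC CIDR after bootstrapping
cluster_endpoint_public_access       = true
cluster_endpoint_public_access_cidrs = ["10.125.0.0/16"]

# Or disable the public endpoint entirely
# cluster_endpoint_public_access = false
```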
To use an S3 bucket for Terraform state, create a bucket and update `backend.tf`. Alternatively, you can skip this and run `terraform plan` and `terraform apply` with a locally managed `terraform.tfstate` file for testing purposes.
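If you do use an S3 backend, a minimal `backend.tf` sketch looks like the following; the bucket name, key, and region are placeholders, not values defined by this project:

```hcl
terraform {
  backend "s3" {
    bucket  = "my-terraform-state-bucket" # placeholder: your pre-created state bucket
    key     = "reducto/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true
  }
}
```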
Make sure `variables.tf` has the configuration you want, such as restricting the EKS public endpoint, avoiding VPC CIDR collisions, or changing the database instance class.
Create `terraform.tfvars` with the following contents:

```hcl
reducto_helm_repo_username = "todo"
reducto_helm_repo_password = "todo"
reducto_host               = "reducto.example.com"
cloudflare_api_token       = "token"

# For alerting
slack_webhook_url = "todo"
```
Apply Terraform:

```shell
terraform init
terraform plan
terraform apply
```
Cloudflare DNS is used to obtain a TLS certificate from Let's Encrypt via cert-manager using the DNS-01 solver.
Find the private LB hostname created by the cluster for the NGINX Ingress Controller, and create a CNAME record on Cloudflare for the hostname provided in `reducto_host` that points to the LB hostname.
Reducto will then be accessible on the ingress-nginx NLB via the hostname configured in `reducto_host`.
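The LB hostname can be read from the ingress-nginx controller Service; the namespace and Service name below assume a default ingress-nginx Helm install and may differ in this project:

```shell
# Print the NLB hostname assigned to the ingress controller Service
kubectl get svc ingress-nginx-controller -n ingress-nginx \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```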
To check Reducto service health without a public endpoint, port-forward local port 4567 to the Reducto service:

```shell
kubectl port-forward service/reducto-reducto-http 4567:80 -n reducto

# Access Reducto
curl localhost:4567
```
For Karpenter to request Spot instances, create the service-linked role:

```shell
aws iam create-service-linked-role --aws-service-name spot.amazonaws.com
```

To run `terraform destroy`, comment out the `lifecycle` block in `reducto-bucket.tf` and remove deletion protection from the DB.
You can remove deletion protection by setting `db_deletion_protection = false` and running `terraform apply`.
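For example, overriding the variable on the command line rather than editing `terraform.tfvars`:

```shell
# Remove DB deletion protection first, then destroy
terraform apply -var="db_deletion_protection=false"
terraform destroy
```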
`terraform destroy` may not finish because the VPC will contain resources created outside of Terraform management:
- The NLB for the NGINX controller, created by the AWS Load Balancer Controller
- EKS nodes provisioned by Karpenter autoscaling
- A non-empty S3 bucket

Alongside `terraform destroy`, you will need to manually delete the above resources from the AWS console.
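A sketch of the same cleanup with the AWS CLI, assuming you look up the actual resource identifiers first (the placeholders below come from the `s3_bucket_name` output and the NLB created by the controller):

```shell
# Empty the Reducto bucket so the bucket resource can be destroyed
aws s3 rm s3://<bucket-name> --recursive

# Delete the NLB created by the AWS Load Balancer Controller
aws elbv2 delete-load-balancer --load-balancer-arn <nlb-arn>
```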
To customize the NLB configuration:
- See the AWS Load Balancer Controller annotations for Service, and the Ingress NGINX Helm chart configuration.
- For NLB TLS termination with an ACM SSL certificate (without cert-manager), configure the target port in `values/ingress-nginx-controller.yaml`:

```yaml
service:
  targetPorts:
    https: http
```
The Reducto internal job queue length is a good indicator of overall worker health, and the 5xx metric from the Reducto ingress is a good indicator of API health.
The PrometheusRule in `manifests/prometheus/rules/01-reducto.yaml` monitors the internal queue length and 5xx metrics. When the queue does not drain for a long duration, or the API returns 5xx statuses for a long duration, alerts are sent to the configured Slack channel.
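A hypothetical PrometheusRule alert in the same shape; the metric name, threshold, and duration here are illustrative assumptions, not the actual contents of `manifests/prometheus/rules/01-reducto.yaml`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: reducto-queue-alerts
  namespace: reducto
spec:
  groups:
    - name: reducto
      rules:
        - alert: ReductoQueueStuck
          # hypothetical metric name, for illustration only
          expr: reducto_job_queue_length > 100
          for: 30m
          labels:
            severity: critical
          annotations:
            summary: "Reducto job queue has not drained for 30 minutes"
```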
