reductoai/reducto-onprem-infra

Reducto

Install Reducto on EKS using Terraform.

Reducto on-prem Architecture

Overview

The project creates a Helm release for Reducto on EKS in the reducto namespace, along with the following required dependencies:

  1. RDS instance
  2. S3 bucket
  3. KEDA (for autoscaling Reducto workers in-cluster)
  4. Cluster node autoscaling (Karpenter is configured, but any cluster autoscaling tool can be used)
  5. AWS Load Balancer Controller or Ingress NGINX (any ingress controller can be used)

This project demonstrates a fully working cluster with everything needed to run Reducto. Cloudflare is not a requirement; it is used here, together with cert-manager, to set up TLS.

Upgrades

For upgrade instructions and release notes, see MIGRATION_GUIDE.md.

Terraform Documentation

Requirements

Name Version
terraform >= 1.2.0
aws 6.28.0
helm 3.1.1
kubectl 1.19.0
kubernetes 3.0.1
null 3.2.4
random 3.8.0

Providers

Name Version
aws 6.28.0
helm 3.1.1
kubectl 1.19.0
kubernetes 3.0.1
random 3.8.0

Modules

Name Source Version
ebs_csi_irsa_role terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts v6.4.0
eks terraform-aws-modules/eks/aws 21.15.1
karpenter terraform-aws-modules/eks/aws//modules/karpenter 21.12.0
load_balancer_controller_irsa_role terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts v6.4.0
rds terraform-aws-modules/rds/aws 7.1.0
rds_proxy terraform-aws-modules/rds-proxy/aws 4.2.1
rds_proxy_sg terraform-aws-modules/security-group/aws 5.2
rds_sg terraform-aws-modules/security-group/aws 5.2.0
vpc terraform-aws-modules/vpc/aws 6.6.0
vpc_cni_irsa_role terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts v6.4.0

Resources

Name Type
aws_db_subnet_group.default resource
aws_iam_role.rds_enhanced_monitoring resource
aws_iam_role.reducto resource
aws_iam_role_policy.reducto resource
aws_iam_role_policy_attachment.rds_enhanced_monitoring resource
aws_s3_bucket.reducto_storage resource
aws_s3_bucket_lifecycle_configuration.reducto_storage_lifecycle resource
aws_s3_bucket_public_access_block.reducto_storage_public_access_block resource
aws_secretsmanager_secret.superuser resource
aws_secretsmanager_secret_version.superuser resource
aws_security_group_rule.allow_all_cluster_and_nodes_traffic resource
aws_security_group_rule.allow_all_cluster_and_nodes_traffic_ingress resource
aws_security_group_rule.allow_all_intra_node_traffic resource
aws_security_group_rule.allow_eks_cluster_access_from_vpc resource
aws_security_group_rule.webhook_admission_inbound resource
aws_security_group_rule.webhook_admission_outbound resource
helm_release.aws_load_balancer_controller resource
helm_release.cert_manager resource
helm_release.datadog resource
helm_release.ingress_nginx resource
helm_release.karpenter resource
helm_release.karpenter-crd resource
helm_release.keda resource
helm_release.kube_prometheus_stack resource
helm_release.nvidia_device_plugin resource
helm_release.opentelemetry_collector resource
helm_release.prometheus_crds resource
helm_release.reducto resource
helm_release.telegraf resource
helm_release.vllm_stack resource
kubectl_manifest.cloudflare_api_secret resource
kubectl_manifest.cluster_issuer resource
kubectl_manifest.cluster_issuer_staging resource
kubectl_manifest.cluster_manifests resource
kubectl_manifest.datadog_secret resource
kubectl_manifest.karpenter_node_class resource
kubectl_manifest.karpenter_node_pool resource
kubectl_manifest.monitoring_ns resource
kubectl_manifest.otel_auth_secret resource
kubectl_manifest.otel_datadog_secret resource
kubectl_manifest.prometheus_rules resource
kubectl_manifest.telegraf resource
kubectl_manifest.telegraf_sm resource
kubernetes_secret_v1.hf_token resource
random_password.db_password resource
random_string.secret_suffix resource
aws_availability_zones.available data source
aws_eks_cluster_auth.eks data source
aws_iam_policy_document.rds_enhanced_monitoring data source
aws_iam_policy_document.reducto data source
kubectl_filename_list.cluster_manifests data source
kubectl_filename_list.prometheus_rules data source

Inputs

Name Description Type Default Required
cloudflare_api_token Cloudflare API token for Cert Manager to use DNS solver for issuing TLS certificates string n/a yes
cluster_endpoint_public_access Enable public access to the EKS cluster API endpoint bool true no
cluster_endpoint_public_access_cidrs List of CIDR blocks allowed to access the public EKS API endpoint list(string) ["0.0.0.0/0"] no
cluster_name Name of the EKS cluster and prefix for related resources string "reducto-ai" no
datadog_api_key Datadog API key string "" no
datadog_site Datadog site string "us3.datadoghq.com" no
db_deletion_protection Enable deletion protection for RDS database to prevent accidental deletion bool true no
db_instance_class Instance class for Reducto Postgres database string "db.t4g.medium" no
db_multi_az Enable Multi-AZ deployment for RDS database for high availability bool true no
db_username Postgres DB username string "reducto" no
enable_gpu_managed_node_group Whether to create the GPU managed node group (system_gpu) for GPU workloads bool false no
enable_nvidia_device_plugin Whether to install the NVIDIA device plugin for GPU support bool false no
enable_otel_collector Whether to deploy the OpenTelemetry Collector on the cluster bool false no
enable_reducto Whether to deploy the Reducto application via Helm bool true no
enable_vllm_stack Whether to deploy the vLLM stack on the cluster bool false no
helm_release_timeout Timeout in seconds for Helm release operations number 900 no
otel_auth_token Auth token used by the OpenTelemetry collector string "" no
otel_datadog_api_key Datadog API key used by the OpenTelemetry collector exporter string "admin" no
otel_host FQDN for exposing the OpenTelemetry Collector string "" no
private_subnets List of private subnets CIDRs list(string) [] no
public_subnets List of public subnets CIDRs list(string) [] no
reducto_helm_chart Path to Helm Chart on OCI registry string "oci://registry.reducto.ai/reducto-api/reducto" no
reducto_helm_chart_version Reducto Helm Chart version string "1.11.32" no
reducto_helm_repo_password Password for Helm Registry for Reducto Helm Chart string n/a yes
reducto_helm_repo_username Username for Helm Registry for Reducto Helm Chart string n/a yes
reducto_host Full host DNS for Reducto (Example: reducto.mydomain.com) string n/a yes
region AWS region where resources will be created string "us-east-1" no
slack_webhook_url Slack Webhook URL for Alertmanager string n/a yes
vllm_stack_hf_token Hugging Face API token used by the vLLM stack for model access string "" no
vpc_cidr CIDR block for the VPC string "10.125.0.0/16" no

Outputs

Name Description
cluster_certificate_authority_data Base64 encoded certificate data required to communicate with the cluster
cluster_endpoint Endpoint for EKS control plane
cluster_name Name of the EKS cluster
cluster_security_group_id Security group ID attached to the EKS cluster
configure_kubectl Command to configure kubectl for the EKS cluster
db_instance_endpoint Connection endpoint for the RDS instance
db_instance_name Name of the RDS database
db_proxy_arn ARN of the RDS Proxy
db_proxy_endpoint Connection endpoint for the RDS Proxy
oidc_provider_arn ARN of the OIDC Provider for EKS
private_subnets List of IDs of private subnets
public_subnets List of IDs of public subnets
reducto_host Hostname where Reducto is accessible
reducto_iam_role_arn ARN of the IAM role for Reducto service account
region AWS region where resources are deployed
s3_bucket_arn ARN of the S3 bucket for Reducto storage
s3_bucket_name Name of the S3 bucket for Reducto storage
vpc_id ID of the VPC

Helm Chart

To obtain or inspect the Helm chart and the configuration options available in values.yaml:

# Login
helm registry login registry.reducto.ai \
    --username <your-username>  \
    --password <your-password>

# Get latest Helm Chart
helm pull oci://registry.reducto.ai/reducto-api/reducto
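
To read the chart's default values without unpacking the pulled archive, `helm show values` can be pointed at the same OCI reference. The sketch below wraps that call in a small function; the function name is illustrative, and the version shown in the usage comment is simply the default of reducto_helm_chart_version from the Inputs table.

```shell
# Print the default values.yaml for a given chart reference and version.
# Arguments: $1 = chart reference, $2 = chart version.
# (show_reducto_values is a hypothetical helper name, not part of this repo.)
show_reducto_values() {
    helm show values "$1" --version "$2"
}

# Usage, after `helm registry login`:
#   show_reducto_values oci://registry.reducto.ai/reducto-api/reducto 1.11.32 > values.yaml
```

Pinning the version keeps the inspected values in step with the chart version that Terraform deploys.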

Security

All workloads are created in private subnets only, including the NLB for ingress-nginx.

To bootstrap the cluster, both the public and private endpoints are enabled. After provisioning, public endpoint access can be removed or restricted:

  1. Remove the public endpoint: set cluster_endpoint_public_access = false.
  2. Restrict the public endpoint: set cluster_endpoint_public_access_cidrs = [ vpc_cidr ].
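
Terraform also reads input variables from `TF_VAR_<name>` environment variables, so the endpoint can be tightened after bootstrap without editing variables.tf. This is a minimal sketch: the CIDR 203.0.113.0/24 is a documentation placeholder and not a value from this project.

```shell
# Restrict the public EKS endpoint to one CIDR block.
# List-typed variables are passed as an HCL expression in a string.
# 203.0.113.0/24 is a placeholder; substitute your own range.
export TF_VAR_cluster_endpoint_public_access_cidrs='["203.0.113.0/24"]'

# Alternatively, remove the public endpoint entirely:
# export TF_VAR_cluster_endpoint_public_access=false

# Then review and apply the change:
#   terraform plan
#   terraform apply
```

Once the public endpoint is removed, Terraform and kubectl need network access to the VPC to reach the cluster API.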

Terraform State

To use a bucket for Terraform state, create a bucket and update backend.tf.

Alternatively, skip this step and run terraform plan and apply with a locally managed terraform.tfstate file, which is adequate for testing.
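
Instead of hard-coding the bucket in backend.tf, the backend settings can be supplied when running `terraform init`, using Terraform's partial backend configuration. The sketch below assumes backend.tf declares an `s3` backend; the function name, bucket name, and state key are placeholders.

```shell
# Initialise Terraform against a remote S3 state bucket.
# Arguments: $1 = bucket name, $2 = state object key, $3 = bucket region.
# Assumes backend.tf declares an "s3" backend (not confirmed by this README).
init_remote_state() {
    terraform init \
        -backend-config="bucket=$1" \
        -backend-config="key=$2" \
        -backend-config="region=$3"
}

# Usage (all three values are placeholders):
#   init_remote_state my-tf-state-bucket reducto/terraform.tfstate us-east-1
```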

Configuration

Review variables.tf and confirm that the defaults suit your environment, for example whether to restrict the EKS public endpoint, whether the VPC CIDR collides with existing networks, and which database instance class to use.

Create terraform.tfvars with the following contents:

reducto_helm_repo_username = "todo"
reducto_helm_repo_password = "todo"
reducto_host = "reducto.example.com"
cloudflare_api_token = "token"

# For alerting
slack_webhook_url = "todo"
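
A quick pre-flight check catches a missing required variable before `terraform plan` prompts for it. The sketch below looks for the five inputs marked as required in the Inputs table; the function name is illustrative, and the check only detects that an assignment exists, not that its value is valid.

```shell
# Report required variables that are not assigned in a tfvars file.
# Argument: $1 = path to the tfvars file (defaults to terraform.tfvars).
# Returns 0 when all are present, 1 otherwise.
# (check_required_tfvars is a hypothetical helper name.)
check_required_tfvars() {
    file="${1:-terraform.tfvars}"
    missing=0
    for name in cloudflare_api_token reducto_helm_repo_username \
                reducto_helm_repo_password reducto_host slack_webhook_url; do
        # Match "name =" at the start of a line, allowing leading spaces.
        if ! grep -Eq "^[[:space:]]*${name}[[:space:]]*=" "$file"; then
            echo "missing: ${name}"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_required_tfvars terraform.tfvars && terraform plan
```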

Provisioning

Apply Terraform

terraform init
terraform plan
terraform apply
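
After a successful apply, the configure_kubectl output listed under Outputs holds the command for pointing kubectl at the new cluster. The sketch below reads that output and runs it. It assumes the output is a single complete shell command, which this README does not show, so the command is printed before it is executed. The function name is illustrative.

```shell
# Read the kubectl setup command from Terraform outputs and run it.
# Assumes the configure_kubectl output is one complete shell command.
# (configure_kubectl_from_output is a hypothetical helper name.)
configure_kubectl_from_output() {
    cmd=$(terraform output -raw configure_kubectl) || return 1
    echo "Running: $cmd"
    eval "$cmd"
}

# Usage:
#   configure_kubectl_from_output
#   kubectl get pods -n reducto
```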

Configure Cloudflare DNS

Cloudflare DNS is used to obtain a TLS certificate from Let's Encrypt via cert-manager using the dns01 solver.

Find the hostname of the private load balancer that the cluster created for the Nginx Ingress Controller, then create a CNAME record on Cloudflare for the hostname set in reducto_host, pointing to that load balancer hostname.
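
The load balancer hostname can be read from the ingress controller's Service status. In the sketch below the function name is illustrative, and the namespace and Service name in the usage comment are the upstream ingress-nginx chart defaults, assumed here rather than taken from this project. Check `kubectl get svc -A` if they differ.

```shell
# Print the load balancer hostname assigned to a Service.
# Arguments: $1 = namespace, $2 = Service name.
# (ingress_lb_hostname is a hypothetical helper name.)
ingress_lb_hostname() {
    kubectl get service "$2" --namespace "$1" \
        --output jsonpath='{.status.loadBalancer.ingress[0].hostname}'
}

# Usage (namespace and Service name are assumed chart defaults):
#   ingress_lb_hostname ingress-nginx ingress-nginx-controller
```

The printed value is the target of the CNAME record.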

Access Reducto

Reducto will be accessible on the ingress-nginx NLB via the hostname configured in reducto_host.

To check Reducto service health without a public endpoint, forward local port 4567 to the Reducto service:

kubectl port-forward service/reducto-reducto-http 4567:80 -n reducto

# Access Reducto
curl localhost:4567

New AWS account

For Karpenter to request spot instances, create the service-linked role:

aws iam create-service-linked-role --aws-service-name spot.amazonaws.com
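
Creating the role fails if it already exists, so a repeatable setup script can check first. The sketch below assumes the standard role name AWS assigns for spot.amazonaws.com, AWSServiceRoleForEC2Spot. The function name is illustrative.

```shell
# Create the EC2 Spot service-linked role only when it is absent.
# (ensure_spot_service_linked_role is a hypothetical helper name.)
ensure_spot_service_linked_role() {
    if aws iam get-role --role-name AWSServiceRoleForEC2Spot >/dev/null 2>&1; then
        echo "spot service-linked role already exists"
    else
        aws iam create-service-linked-role --aws-service-name spot.amazonaws.com
    fi
}

# Usage:
#   ensure_spot_service_linked_role
```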

Notes on Destroy

Before running terraform destroy, comment out the lifecycle block in reducto-bucket.tf and remove deletion protection from the database.

You can remove deletion protection by setting db_deletion_protection = false and running terraform apply.

terraform destroy may not finish, because some resources are created outside of Terraform management or block deletion:

  • The NLB for the nginx controller, created by the AWS Load Balancer Controller
  • EKS nodes launched by Karpenter autoscaling
  • Objects in the S3 bucket (a non-empty bucket cannot be deleted)

Alongside terraform destroy, you will need to delete the resources above manually, for example from the AWS console.
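
Two of these preparation steps can be scripted before the destroy itself. The sketch below reads the bucket name from the s3_bucket_name output, disables deletion protection, and empties the bucket. The function name is illustrative. The lifecycle block still has to be commented out by hand, and `aws s3 rm` removes only current objects, so a versioned bucket would need its old versions deleted separately.

```shell
# Prepare the stack for `terraform destroy`.
# (prepare_destroy is a hypothetical helper name.)
prepare_destroy() {
    # Capture the bucket name while the Terraform outputs still exist.
    bucket=$(terraform output -raw s3_bucket_name) || return 1
    # Turn off RDS deletion protection (Terraform asks for approval).
    terraform apply -var db_deletion_protection=false || return 1
    # Delete all current objects so the bucket can be removed.
    aws s3 rm "s3://${bucket}" --recursive
}

# Usage:
#   prepare_destroy && terraform destroy
```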

Notes on NLB for Nginx

To customize NLB configuration:

Monitoring

The length of Reducto's internal job queue is a good indicator of overall worker health, and the 5xx metric from the Reducto ingress is a good indicator of API health.

The PrometheusRule in manifests/prometheus/rules/01-reducto.yaml monitors the internal queue length and the 5xx metrics. When the queue does not shrink for a long duration, or the API returns 5xx statuses for a long duration, alerts are sent to the configured Slack channel.
