From e90e3f97924105e42f8f269c4d4d7046e765d064 Mon Sep 17 00:00:00 2001 From: Drew Minnear Date: Wed, 8 Apr 2026 15:35:24 -0400 Subject: [PATCH] add trilio cr pattern to sandbox --- .../trilio-continuous-recovery/_index.md | 112 +++++++++++ .../cr-operations.md | 103 +++++++++++ .../getting-started.md | 174 ++++++++++++++++++ .../prerequisites.md | 44 +++++ .../troubleshooting.md | 134 ++++++++++++++ .../_markup/render-codeblock-mermaid.html | 4 + layouts/_default/baseof.html | 6 + static/css/custom.css | 29 +++ 8 files changed, 606 insertions(+) create mode 100644 content/patterns/trilio-continuous-recovery/_index.md create mode 100644 content/patterns/trilio-continuous-recovery/cr-operations.md create mode 100644 content/patterns/trilio-continuous-recovery/getting-started.md create mode 100644 content/patterns/trilio-continuous-recovery/prerequisites.md create mode 100644 content/patterns/trilio-continuous-recovery/troubleshooting.md create mode 100644 layouts/_default/_markup/render-codeblock-mermaid.html diff --git a/content/patterns/trilio-continuous-recovery/_index.md b/content/patterns/trilio-continuous-recovery/_index.md new file mode 100644 index 000000000..d2b335608 --- /dev/null +++ b/content/patterns/trilio-continuous-recovery/_index.md @@ -0,0 +1,112 @@ +--- +title: Trilio Continuous Restore +date: 2026-04-08 +tier: sandbox +summary: A demonstration of Trilio Continuous Restore for stateful applications +rh_products: + - Red{nbsp}Hat OpenShift Container Platform + - Red{nbsp}Hat OpenShift GitOps + - Red{nbsp}Hat Advanced Cluster Management +partners: + - Trilio +industries: + - General +aliases: /trilio-cr/ +links: + github: https://github.com/trilio-demo/trilio-continuous-restore + install: getting-started + bugs: https://github.com/trilio-demo/trilio-continuous-restore/issues + feedback: https://docs.google.com/forms/d/e/1FAIpQLScI76b6tD1WyPu2-d_9CCVDr3Fu5jYERthqLKJDUGwqBg7Vcg/viewform +--- + +# Trilio Continuous Restore — Red{nbsp}Hat Validated 
Pattern + +## Overview + +This Validated Pattern delivers an automated, GitOps-driven Disaster Recovery (DR) solution for stateful applications running on Red{nbsp}Hat OpenShift. By integrating [Trilio for Kubernetes](https://trilio.io) with the [Red{nbsp}Hat Validated Patterns framework](https://validatedpatterns.io), it provides: + +- **Automated backup** of stateful workloads on the primary (hub) cluster +- **Continuous Restore**: Trilio's accelerated Recovery Time Objective (RTO) DR path, which continuously pre-stages backup data on the DR cluster so that recovery requires only metadata retrieval, not a full data transfer +- **Automated DR testing**: the full backup-to-restore lifecycle runs as a scheduled, self-healing GitOps workflow with no human intervention after initial setup +- **Multi-cluster lifecycle management** through Red{nbsp}Hat Advanced Cluster Management (ACM) + +### Use case + +The pattern targets organizations that need a documented, repeatable DR posture for Kubernetes-native workloads, particularly those that must demonstrate RTO and Recovery Point Objective (RPO) targets through regular, automated DR tests rather than annual manual exercises. + +A WordPress + MySQL deployment is included as a representative stateful application. It serves as the reference workload for the full backup, restore, and URL-rewrite lifecycle. 
+ +--- + +## Architecture + +```mermaid +graph TD + subgraph Git["Git (Source of Truth)"] + values["values-hub.yaml\nvalues-secondary.yaml\ncharts/"] + end + + subgraph Hub["Hub Cluster (primary)"] + ACM["ACM"] + ArgoCD["ArgoCD"] + Vault["HashiCorp Vault + ESO"] + Trilio_Hub["Trilio Operator + TVM"] + CronJob["Imperative CronJob\n(DR lifecycle automation)"] + end + + subgraph Spoke["DR Cluster (secondary)"] + Trilio_Spoke["Trilio Operator + TVM"] + EventTarget["EventTarget pod\n(pre-stages PVCs)"] + ConsistentSet["ConsistentSet\n(restore point)"] + end + + S3["Shared S3 Bucket"] + + Git -->|GitOps sync| ArgoCD + ArgoCD --> Trilio_Hub + Vault -->|S3 creds + license| Trilio_Hub + Trilio_Hub -->|backups| S3 + ACM -->|provisions| Spoke + S3 -->|EventTarget polls| EventTarget + EventTarget --> ConsistentSet + CronJob -->|restore from ConsistentSet| ConsistentSet +``` + +### Component roles + +| Component | Where | Role | +|-----------|-------|------| +| Trilio Operator | Hub + Spoke | Installed through Operator Lifecycle Manager (OLM) from the `certified-operators` catalog, channel `5.3.x` | +| TrilioVaultManager | Hub + Spoke | Trilio operand Custom Resource (CR); manages the Trilio data plane | +| Red{nbsp}Hat OpenShift | Hub + Spoke | Container orchestration platform; provides OLM, storage, networking, and the GitOps operator substrate | +| Red{nbsp}Hat OpenShift GitOps (ArgoCD) | Hub + Spoke | GitOps sync engine; all configuration is driven from Git | +| Red{nbsp}Hat Advanced Cluster Management (ACM) | Hub | Cluster lifecycle, policy enforcement, and spoke provisioning | +| Validated Patterns Imperative CronJob | Hub + Spoke | Runs the automated DR lifecycle on a 10-minute schedule | +| BackupTarget | Hub + Spoke | Points to the shared S3 bucket; the spoke BackupTarget has the EventTarget flag set | +| BackupPlan | Hub | Defines backup scope (wordpress namespace), quiesce/unquiesce hooks, and retention | +| CR BackupPlan | Hub | Continuous Restore variant of 
BackupPlan; drives pre-staging on the spoke | +| EventTarget pod | Spoke | Watches the shared S3 bucket for new backups; pre-stages Persistent Volume Claims (PVCs) locally | +| ConsistentSet | Spoke | Cluster-scoped CR representing a fully pre-staged restore point | +| HashiCorp Vault and External Secrets Operator (ESO) | Hub | Secret management; S3 credentials and Trilio license are never stored in Git | + +### How Continuous Restore works + +1. The hub creates a backup using the CR BackupPlan and writes it to the shared S3 storage. +2. The EventTarget pod on the spoke detects the new backup and begins copying volume data locally — ahead of any DR event. +3. When the spoke's imperative job detects an Available ConsistentSet, it submits a Restore CR. Because the data is already local, only backup metadata is fetched — resulting in significantly lower RTO than a standard on-demand restore. +4. The post-restore Hook CR rewrites WordPress database URLs to the DR cluster's ingress domain. + +## Links + +- [Trilio for Kubernetes documentation](https://docs.trilio.io/kubernetes) +- [Red{nbsp}Hat Validated Patterns](https://validatedpatterns.io) +- [Validated Patterns imperative framework](https://validatedpatterns.io/learn/imperative-actions/) +- [Red{nbsp}Hat Advanced Cluster Management (ACM)](https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes) +- [External Secrets Operator](https://external-secrets.io) + +## Next steps + +- [Prerequisites](prerequisites) +- [Getting started](getting-started) +- [CR operations](cr-operations) +- [Troubleshooting](troubleshooting) diff --git a/content/patterns/trilio-continuous-recovery/cr-operations.md b/content/patterns/trilio-continuous-recovery/cr-operations.md new file mode 100644 index 000000000..b3fa6a70b --- /dev/null +++ b/content/patterns/trilio-continuous-recovery/cr-operations.md @@ -0,0 +1,103 @@ +--- +title: CR operations +weight: 30 +aliases: /trilio-cr/cr-operations/ +--- + 
+## Operations + +### Monitoring DR status + +```bash +# Hub — all phases +make dr-status + +# Spoke — ConsistentSet and restore status (run on spoke context) +oc get configmap trilio-cr-status -n imperative -o yaml +``` + +### Automated DR lifecycle + +The imperative framework runs continuously on a 10-minute schedule with no manual intervention required. The full lifecycle from a standing start (hub up, spoke just joined) to a completed Continuous Restore typically completes within 30–45 minutes. + +**Hub job sequence:** + +| Job | What it does | Skips when | +|-----|-------------|------------| +| `trilio-enable-cr` | Creates CR BackupPlan + ContinuousRestore Policy | CR BackupPlan already Available | +| `trilio-cr-backup` | Creates a backup against the CR BackupPlan | Available CR backup exists | +| `trilio-backup` | Creates a standard backup | Available backup exists | +| `trilio-restore-standard` | Restores to `wordpress-restore` on hub | Completed restore exists | +| `trilio-e2e-status` | Writes status ConfigMap; fails until all phases pass | — (always runs) | + +**Spoke job sequence (per DR cluster):** + +| Job | What it does | Skips when | +|-----|-------------|------------| +| `trilio-cr-status` | Validates ConsistentSet available; writes status ConfigMap | — (always runs; fails until Available) | +| `trilio-cr-restore` | Restores from latest ConsistentSet to `wordpress-restore` | Completed restore exists | + +### Manual backup + +To trigger a backup outside the automated schedule: + +```bash +ansible-navigator run ansible/playbooks/dr-backup.yaml +``` + +### Manual DR restore + +**Standard restore** (from a named backup): + +```bash +ansible-navigator run ansible/playbooks/dr-restore.yaml \ + -e restore_method=backup \ + -e restore_namespace= +``` + +**Continuous Restore** (from a pre-staged ConsistentSet on the DR cluster — accelerated RTO): + +```bash +ansible-navigator run ansible/playbooks/dr-restore.yaml \ + -e restore_method=consistentset \ + -e 
restore_namespace= +``` + +Both commands discover the cluster ingress domain automatically and apply the Route hostname transform. + +### Offboarding a spoke + +```bash +# Step 1 — on the hub context +make unlabel-spoke CLUSTER= + +# Step 2 — on the spoke context +make offboard-spoke CLUSTER= +``` + +### Uninstalling the pattern + +```bash +# On the hub context +make offboard-hub +``` + +> Save your HashiCorp Vault root token and unseal keys before running `offboard-hub`. They are stored in the `imperative` namespace which is removed during offboard. + +--- + +## Ansible playbook reference + +| Playbook | When to use | Key inputs | +|----------|-------------|------------| +| `dr-backup.yaml` | Trigger a manual backup on the hub | — | +| `dr-restore.yaml` | Manual restore (backup or ConsistentSet method) | `restore_method`, `restore_namespace`, `source_backup` (optional) | +| `validate-trilio.yaml` | Pre/post-change Trilio health validation | — | +| `offboard-spoke.yaml` | Remove spoke-side Trilio resources | `cluster_name` | +| `offboard-hub.yaml` | Full hub pattern teardown | — | + +Playbooks are run by using `ansible-navigator`: + +```bash +ansible-navigator run ansible/playbooks/.yaml [-e key=value ...] +``` diff --git a/content/patterns/trilio-continuous-recovery/getting-started.md b/content/patterns/trilio-continuous-recovery/getting-started.md new file mode 100644 index 000000000..ad6cd01ca --- /dev/null +++ b/content/patterns/trilio-continuous-recovery/getting-started.md @@ -0,0 +1,174 @@ +--- +title: Getting started +weight: 20 +aliases: /trilio-cr/getting-started/ +--- + +# Deploying the pattern + +## Deployment + +### 1. Clone the repository + +```bash +git clone https://github.com/trilio-demo/trilio-continuous-restore +cd trilio-continuous-restore +``` + +### 2. 
Configure S3 bucket details + +Edit `values-hub.yaml` and `values-secondary.yaml` to set your S3 bucket name and region: + +```yaml +# In both values-hub.yaml and values-secondary.yaml, under the trilio-operand app overrides: +overrides: + - name: backupTarget.bucketName + value: + - name: backupTarget.region + value: # for example, us-east-1 +``` + +### 3. Populate secrets + +Create a local secrets file in your home directory from the template: + +```bash +cp values-secret.yaml.template ~/values-secret-trilio-continuous-restore.yaml +``` + +Edit `~/values-secret-trilio-continuous-restore.yaml` and fill in your credentials: + +```yaml +secrets: + - name: trilio-license + vaultPrefixes: + - global + fields: + - name: key + value: # single unbroken line, no escape characters + + - name: trilio-s3 + vaultPrefixes: + - global + fields: + - name: accessKey + value: + - name: secretKey + value: +``` + +> Always edit the copy in your home directory, never the repo's `values-secret.yaml.template`, so that secrets are never committed to Git. + +### 4. Install the pattern + +```bash +./pattern.sh make install +``` + +This command: +1. Bootstraps HashiCorp Vault and loads secrets from `~/values-secret-trilio-continuous-restore.yaml` +2. Installs the Validated Patterns operator on the hub +3. Creates the `ValidatedPattern` CR, which triggers ArgoCD to deploy all hub components + +Monitor progress in the ArgoCD UI or by running: + +```bash +oc get application -n openshift-gitops +``` + +All applications should reach `Synced / Healthy` within 10–15 minutes. 
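
The status check above can be narrowed to the applications that are still converging. The following is a minimal sketch, not part of the pattern repository: the helper names `list_apps` and `not_ready` are hypothetical, and it assumes the applications live in `openshift-gitops` as in the command above. The column paths are the standard Argo CD Application status fields.

```bash
#!/usr/bin/env bash
# Sketch: report Argo CD Applications that are not yet Synced and Healthy.
# Helper names are illustrative; nothing here ships with the pattern.

# list_apps prints one "NAME SYNC HEALTH" triple per Application.
# Requires an `oc` session logged in to the cluster.
list_apps() {
  oc get applications.argoproj.io -n openshift-gitops --no-headers \
    -o custom-columns='NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status'
}

# not_ready reads those triples on stdin and prints the names of
# applications whose sync status or health status is not yet final.
not_ready() {
  awk '$2 != "Synced" || $3 != "Healthy" { print $1 }'
}

# Usage on a live cluster:
#   list_apps | not_ready
```

An empty result from `list_apps | not_ready` means every application has reached `Synced / Healthy`.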
+ +**Alternative: manual secret population by using `oc`** + +To write or rotate secrets directly in HashiCorp Vault without re-running `./pattern.sh make install`: + +```bash +# Extract Vault root token +VAULT_TOKEN=$(oc get secret vaultkeys -n imperative \ + -o jsonpath='{.data.vault_data_json}' | \ + base64 -d | python3 -c "import sys,json; print(json.load(sys.stdin)['root_token'])") + +# Write Trilio license +oc exec -n vault vault-0 -- env VAULT_TOKEN="$VAULT_TOKEN" \ + vault kv put secret/global/trilio-license key="" + +# Write S3 credentials +oc exec -n vault vault-0 -- env VAULT_TOKEN="$VAULT_TOKEN" \ + vault kv put secret/global/trilio-s3 accessKey="" secretKey="" +``` + +You can also reload secrets from `~/values-secret-trilio-continuous-restore.yaml` by running: + +```bash +./pattern.sh make load-secrets +``` + +### 5. Verify hub deployment + +Check that Trilio is healthy: + +```bash +oc get triliovaultmanager -n trilio-system +# STATUS should be Deployed or Updated + +oc get target -n trilio-system +# STATUS should be Available +``` + +Check the end-to-end DR status (updated automatically by the imperative framework): + +```bash +make dr-status +``` + +On the initial run, `trilio-enable-cr` and `trilio-backup` complete within the first two CronJob cycles (about 20 minutes), and the standard restore follows. When all phases report `PASS`, the hub is fully operational. + +--- + +## Spoke (DR cluster) onboarding + +### 1. Import the DR cluster into ACM + +Import the DR cluster through the ACM console or the `oc` CLI. Note the cluster name assigned during import. + +### 2. Label and onboard + +```bash +make onboard-spoke CLUSTER= +``` + +This labels the cluster with `clusterGroup=secondary`, which triggers ACM to deploy the spoke configuration through ArgoCD. 
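
For reference, the labeling step might be reproduced by hand roughly as follows. This is a sketch under the assumption that the label is applied to the ACM `ManagedCluster` resource on the hub; the `make onboard-spoke` target may do more than this. The helper name `label_spoke_cmd` is hypothetical, and it only prints the command so that it can be reviewed before being run.

```bash
#!/usr/bin/env bash
# Sketch: print (do not run) the oc command that labels a ManagedCluster
# with clusterGroup=secondary. label_spoke_cmd is a hypothetical helper.
label_spoke_cmd() {
  local cluster="${1:-}"
  if [ -z "$cluster" ]; then
    echo "usage: label_spoke_cmd CLUSTER_NAME" >&2
    return 1
  fi
  printf 'oc label managedcluster %s clusterGroup=secondary --overwrite\n' "$cluster"
}

# To confirm the label afterwards (hub context):
#   oc get managedcluster -l clusterGroup=secondary
```

Printing rather than executing keeps the helper safe to run against any context; pipe the output to `bash` only after checking the cluster name.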
+ +After running `make onboard-spoke`, kick the spoke-side ArgoCD application to sync immediately (run on the spoke cluster context): + +```bash +oc patch application.argoproj.io main-trilio-continuous-restore-secondary \ + -n openshift-gitops --type merge \ + -p '{"operation":{"sync":{}}}' +``` + +### 3. Monitor spoke onboarding + +```bash +make spoke-status CLUSTER= +``` + +Expected progression: +1. Trilio operator installs (OLM subscription) +2. TrilioVaultManager deploys (ESO delivers S3 + license secrets) +3. BackupTarget becomes Available (EventTarget pod starts) +4. ConsistentSets begin appearing as hub backups are detected (~10–20 minutes after the hub's CR backup completes) +5. Spoke imperative restore runs automatically after the first ConsistentSet is Available + +The full spoke onboarding sequence typically takes 15–25 minutes from label application to a running TrilioVaultManager. The imperative restore adds another 30–45 minutes on top of that for the first ConsistentSet to appear and the restore to complete. + +### Known: trilio-operand OutOfSync on spoke after onboarding + +ArgoCD may show `trilio-operand` as `OutOfSync / Missing` immediately after spoke onboarding. This is a CRD timing issue — ArgoCD attempts to sync the TrilioVaultManager CR before the Trilio operator has finished registering its Custom Resource Definitions (CRDs). + +The `SkipDryRunOnMissingResource=true` sync option is set in `values-secondary.yaml` to handle this automatically. 
If the issue persists after 5–10 minutes, manually refresh the ArgoCD application: + +```bash +oc patch application trilio-operand -n main-trilio-continuous-restore-secondary \ + --type merge -p '{"operation":{"sync":{}}}' +``` diff --git a/content/patterns/trilio-continuous-recovery/prerequisites.md b/content/patterns/trilio-continuous-recovery/prerequisites.md new file mode 100644 index 000000000..3ac3c873d --- /dev/null +++ b/content/patterns/trilio-continuous-recovery/prerequisites.md @@ -0,0 +1,44 @@ +--- +title: Prerequisites +weight: 10 +aliases: /trilio-cr/prerequisites/ +--- + +## Prerequisites + +### Clusters + +| Cluster | Role | Minimum size | +|---------|------|-------------| +| Hub | Primary; runs ACM, HashiCorp Vault, ArgoCD, Trilio | 3 worker nodes, 8 vCPU / 32 GB each | +| DR Spoke | Disaster Recovery target | 3 worker nodes, 8 vCPU / 32 GB each | + +Both clusters must: +- Run Red{nbsp}Hat OpenShift 4.18 or later +- Have network access to the shared S3 bucket +- Be reachable by ACM on the hub + +### S3 storage + +A single S3-compatible bucket accessible from both clusters. Required values: +- Bucket name +- Bucket region (must match the bucket's actual region — always set this explicitly) +- Access key and secret key with read/write permissions on the bucket + +### Trilio license + +A valid Trilio for Kubernetes license key. This pattern supports Trilio for Kubernetes version 5.3.0 and later. Obtain a license from [trilio.io](https://trilio.io) or your Trilio representative. 
+ +### Tooling (hub workstation) + +- `oc` CLI logged in to the hub cluster with cluster-admin +- `ansible-navigator` (for manual DR operations) +- `make` +- `git` +- `python3` +- `rhvp.cluster_utils` Ansible collection (for `make install`): + + ```bash + ansible-galaxy collection install community.okd kubernetes.core \ + https://github.com/validatedpatterns/rhvp.cluster_utils/releases/download/v0.0.6/rhvp-cluster_utils-0.0.6.tar.gz + ``` diff --git a/content/patterns/trilio-continuous-recovery/troubleshooting.md b/content/patterns/trilio-continuous-recovery/troubleshooting.md new file mode 100644 index 000000000..3683d63e3 --- /dev/null +++ b/content/patterns/trilio-continuous-recovery/troubleshooting.md @@ -0,0 +1,134 @@ +--- +title: Troubleshooting +weight: 40 +aliases: /trilio-cr/troubleshooting/ +--- + +## Troubleshooting + +### Trilio operator not installing + +```bash +oc get subscription k8s-triliovault -n trilio-system -o yaml +oc get installplan -n trilio-system +``` + +Check that the `certified-operators` CatalogSource is healthy: + +```bash +oc get catalogsource -n openshift-marketplace +``` + +### TrilioVaultManager not reaching Deployed or Updated + +```bash +oc get triliovaultmanager -n trilio-system -o yaml +oc logs -n trilio-system -l app=k8s-triliovault-operator --tail=50 +``` + +Common cause: the license Secret has not been created yet. Check External Secrets Operator (ESO) ExternalSecret status: + +```bash +oc get externalsecret -n trilio-system +``` + +### BackupTarget stuck in Failed + +```bash +oc get target -n trilio-system -o yaml +``` + +Common causes: +- S3 credentials are incorrect or the Secret has not been created by ESO yet +- `backupTarget.region` does not match the bucket's actual region — always set it explicitly + +### No ConsistentSets appearing on the spoke + +1. Verify the EventTarget pod is running: `oc get pods -n trilio-system | grep event` +2. Verify the spoke BackupTarget is Available: `oc get target -n trilio-system` +3. 
Verify at least one Available backup exists on the hub using the CR BackupPlan: `oc get backup -n wordpress` +4. Check that hub and spoke are running the same Trilio version: `oc get csv -n trilio-system` + +### Imperative jobs stuck in Init:Error + +```bash +# View logs from the failing init container +oc logs -n imperative -c + +# List init containers in order +oc get pod -n imperative -o jsonpath='{.spec.initContainers[*].name}' +``` + +The init container name matches the job name (e.g., `trilio-backup`). Each init container runs one playbook; a failure stops all subsequent jobs. + +### Spoke ArgoCD not syncing after values-secondary.yaml changes + +The spoke application has no automated sync. Kick it manually on the spoke context: + +```bash +oc patch application.argoproj.io main-trilio-continuous-restore-secondary \ + -n openshift-gitops --type merge \ + -p '{"operation":{"sync":{}}}' +``` + +### BackupTarget or TrilioVaultManager perpetually OutOfSync in ArgoCD + +Trilio continuously writes status fields to its own Custom Resources. ArgoCD detects these writes as drift and marks the application `OutOfSync` — even though the configuration is correct. This is expected behavior and does not indicate a problem. + +The Helm chart includes a `ServerSideDiff=true` annotation on Trilio CR templates to suppress this. If you see persistent `OutOfSync` without any configuration changes, verify the annotation is present: + +```bash +oc get application trilio-operand -n openshift-gitops -o jsonpath='{.spec.syncPolicy}' +``` + +### Secrets written to Vault after ArgoCD has already synced + +If ESO ExternalSecrets were created before the Vault secrets were populated, they may be in a `SecretSyncedError` state. 
Force an immediate re-sync: + +```bash +oc annotate externalsecret trilio-s3-credentials -n trilio-system \ + force-sync=$(date +%s) --overwrite +oc annotate externalsecret trilio-license -n trilio-system \ + force-sync=$(date +%s) --overwrite +``` + +Wait 30 seconds and re-check: + +```bash +oc get externalsecret -n trilio-system +``` + +### Vault root token — how to extract + +The Vault root token and unseal keys are stored in the `vaultkeys` Secret in the `imperative` namespace. Extract the root token: + +```bash +VAULT_TOKEN=$(oc get secret vaultkeys -n imperative \ + -o jsonpath='{.data.vault_data_json}' | \ + base64 -d | python3 -c "import sys,json; print(json.load(sys.stdin)['root_token'])") +echo $VAULT_TOKEN +``` + +> Save the root token and unseal keys before running `offboard-hub` — the `imperative` namespace is deleted during offboard and the Secret is lost. + +--- + +## Operational notes + +### Secret values must be plain text + +Secrets written to HashiCorp Vault must be plain text values, not Base64-encoded. ESO handles Base64 encoding when creating Kubernetes Secrets. If values are pre-encoded, ESO double-encodes them and Trilio receives garbled data, causing the BackupTarget to stay in `Failed` state. + +### TrilioVaultManager healthy states + +Both `Deployed` and `Updated` are healthy TrilioVaultManager states. `Updated` indicates a recent upgrade completed successfully. Monitoring scripts and health checks should accept either value. + +### Imperative job update lag + +When a configuration change is pushed to Git, there is a delay before the imperative CronJob picks it up: + +1. ArgoCD polls Git every ~3 minutes and updates the ConfigMap +2. The CronJob runs every 10 minutes — the next pod starts at the next scheduled tick +3. The pod must mount the updated ConfigMap before the playbook runs + +**Total lag: typically 15–30 minutes from `git push` to effect.** This is normal behavior, not a failure. 
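
The scheduling part of that lag can be estimated with simple arithmetic. The sketch below assumes the CronJob uses a `*/10 * * * *` schedule aligned to the clock and that ArgoCD polls every 180 seconds; both values are taken from the approximate figures above, not read from the cluster, and the function names are illustrative.

```bash
#!/usr/bin/env bash
# Sketch: estimate how long a pushed change waits before the imperative
# CronJob can act on it. Intervals are assumptions, not cluster values.

# secs_to_next_tick prints the seconds from the given epoch time to the
# next 10-minute boundary (a tick exactly on the boundary waits 600 s).
secs_to_next_tick() {
  local now="$1"
  echo $(( 600 - now % 600 ))
}

# worst_case_lag adds one full ArgoCD poll interval (180 s) to one full
# CronJob interval (600 s). Playbook run time comes on top of this.
worst_case_lag() {
  echo $(( 180 + 600 ))
}

# Usage:
#   secs_to_next_tick "$(date +%s)"
#   worst_case_lag        # 780 seconds, about 13 minutes
```

The scheduling wait alone is therefore at most about 13 minutes; the rest of the 15 to 30 minutes quoted above is presumably spent in the job's init containers, which run one playbook each in sequence.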
diff --git a/layouts/_default/_markup/render-codeblock-mermaid.html b/layouts/_default/_markup/render-codeblock-mermaid.html new file mode 100644 index 000000000..42e8abfd8 --- /dev/null +++ b/layouts/_default/_markup/render-codeblock-mermaid.html @@ -0,0 +1,4 @@ +
+  {{ .Inner | htmlEscape | safeHTML }}
+
+{{ .Page.Store.Set "hasMermaid" true }} diff --git a/layouts/_default/baseof.html b/layouts/_default/baseof.html index 76707add3..5e39f200f 100644 --- a/layouts/_default/baseof.html +++ b/layouts/_default/baseof.html @@ -11,6 +11,12 @@ + {{ if .Store.Get "hasMermaid" }} + + {{ end }}