Commit 3dbf6c2

add trilio cr pattern to sandbox
1 parent 2306441 commit 3dbf6c2

8 files changed

Lines changed: 591 additions & 0 deletions

Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
1+
---
2+
title: Trilio Continuous Restore
3+
date: 2026-04-08
4+
tier: sandbox
5+
summary: A demonstration of Trilio Continuous Restore for stateful applications
6+
rh_products:
7+
- Red Hat OpenShift Container Platform
8+
- Red Hat OpenShift GitOps
9+
- Red Hat Advanced Cluster Management
10+
partners:
11+
- Trilio
12+
industries:
13+
- General
14+
aliases: /trilio-cr/
15+
links:
16+
github: https://github.com/trilio-demo/trilio-continuous-restore
17+
install: getting-started
18+
bugs: https://github.com/trilio-demo/trilio-continuous-restore/issues
19+
feedback: https://docs.google.com/forms/d/e/1FAIpQLScI76b6tD1WyPu2-d_9CCVDr3Fu5jYERthqLKJDUGwqBg7Vcg/viewform
20+
---
21+
22+
# Trilio Continuous Restore — Red Hat Validated Pattern
23+
24+
## Overview
25+
26+
This Validated Pattern delivers an automated, GitOps-driven Disaster Recovery (DR) solution for stateful applications running on Red Hat OpenShift. It integrates [Trilio for Kubernetes](https://trilio.io) with the [Red Hat Validated Patterns framework](https://validatedpatterns.io) to provide:
27+
28+
- **Automated backup** of stateful workloads on the primary (hub) cluster
29+
- **Continuous Restore** — Trilio's accelerated Recovery Time Objective (RTO) DR path that continuously pre-stages backup data on the DR cluster so that recovery requires only metadata retrieval, not a full data transfer
30+
- **Automated DR testing** — the full backup-to-restore lifecycle runs as a scheduled, self-healing GitOps workflow with no human intervention after initial setup
31+
- **Multi-cluster lifecycle management** via Red Hat Advanced Cluster Management (ACM)
32+
33+
### Use Case
34+
35+
The pattern targets organisations that need a documented, repeatable DR posture for Kubernetes-native workloads — particularly those that must demonstrate RTO/Recovery Point Objective (RPO) targets through regular, automated DR tests rather than annual manual exercises.
36+
37+
A WordPress + MySQL deployment is included as a representative stateful application. It serves as the reference workload for the full backup, restore, and URL-rewrite lifecycle.
38+
39+
---
40+
41+
## Architecture
42+
43+
```mermaid
44+
graph TD
45+
subgraph Git["Git (Source of Truth)"]
46+
values["values-hub.yaml\nvalues-secondary.yaml\ncharts/"]
47+
end
48+
49+
subgraph Hub["Hub Cluster (primary)"]
50+
ACM["ACM"]
51+
ArgoCD["ArgoCD"]
52+
Vault["HashiCorp Vault + ESO"]
53+
Trilio_Hub["Trilio Operator + TVM"]
54+
CronJob["Imperative CronJob\n(DR lifecycle automation)"]
55+
end
56+
57+
subgraph Spoke["DR Cluster (secondary)"]
58+
Trilio_Spoke["Trilio Operator + TVM"]
59+
EventTarget["EventTarget pod\n(pre-stages PVCs)"]
60+
ConsistentSet["ConsistentSet\n(restore point)"]
61+
end
62+
63+
S3["Shared S3 Bucket"]
64+
65+
Git -->|GitOps sync| ArgoCD
66+
ArgoCD --> Trilio_Hub
67+
Vault -->|S3 creds + license| Trilio_Hub
68+
Trilio_Hub -->|backups| S3
69+
ACM -->|provisions| Spoke
70+
S3 -->|EventTarget polls| EventTarget
71+
EventTarget --> ConsistentSet
72+
CronJob -->|restore from ConsistentSet| ConsistentSet
73+
```
74+
75+
### Component Roles
76+
77+
| Component | Where | Role |
78+
|-----------|-------|------|
79+
| Trilio Operator | Hub + Spoke | Installed via Operator Lifecycle Manager (OLM) from the `certified-operators` catalog, channel `5.3.x` |
80+
| TrilioVaultManager | Hub + Spoke | Trilio operand Custom Resource (CR); manages the Trilio data plane |
81+
| Red Hat OpenShift | Hub + Spoke | Container orchestration platform; provides OLM, storage, networking, and the GitOps operator substrate |
82+
| Red Hat OpenShift GitOps (ArgoCD) | Hub + Spoke | GitOps sync engine; all configuration is driven from Git |
83+
| Red Hat Advanced Cluster Management (ACM) | Hub | Cluster lifecycle, policy enforcement, and spoke provisioning |
84+
| Validated Patterns Imperative CronJob | Hub + Spoke | Runs the automated DR lifecycle on a 10-minute schedule |
85+
| BackupTarget | Hub + Spoke | Points to the shared S3 bucket; the spoke BackupTarget has the EventTarget flag set |
86+
| BackupPlan | Hub | Defines backup scope (wordpress namespace), quiesce/unquiesce hooks, and retention |
87+
| CR BackupPlan | Hub | Continuous Restore variant of BackupPlan; drives pre-staging on the spoke |
88+
| EventTarget pod | Spoke | Watches the shared S3 bucket for new backups; pre-stages Persistent Volume Claims (PVCs) locally |
89+
| ConsistentSet | Spoke | Cluster-scoped CR representing a fully pre-staged restore point |
90+
| HashiCorp Vault and External Secrets Operator (ESO) | Hub | Secret management; S3 credentials and Trilio license are never stored in Git |
91+
92+
### How Continuous Restore Works
93+
94+
1. The hub creates a backup using the CR BackupPlan and writes it to the shared S3 storage.
95+
2. The EventTarget pod on the spoke detects the new backup and begins copying volume data locally — ahead of any DR event.
96+
3. When the spoke's imperative job detects an Available ConsistentSet, it submits a Restore CR. Because the data is already local, only backup metadata is fetched — resulting in significantly lower RTO than a standard on-demand restore.
97+
4. The post-restore Hook CR rewrites WordPress database URLs to the DR cluster's ingress domain.
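
Step 3 is essentially a poll-until-ready loop. The following is a minimal sketch of that idea, not the pattern's actual job code: `wait_for_available` and the status command passed to it are hypothetical names, and matching on the literal string `Available` is an assumption based on the wording of the steps above.

```bash
# Sketch only: poll a status command until it prints "Available" or attempts run out.
# Usage: wait_for_available <max_attempts> <sleep_seconds> <command...>
wait_for_available() {
    max_attempts=$1
    sleep_seconds=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # "$@" is any command that prints the ConsistentSet status (hypothetical).
        status=$("$@" 2>/dev/null) || status=""
        if [ "$status" = "Available" ]; then
            return 0
        fi
        attempt=$((attempt + 1))
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$sleep_seconds"
        fi
    done
    return 1
}

# Example (hypothetical helper): wait_for_available 6 600 consistentset_status \
#     && echo "ConsistentSet ready: submit the Restore CR"
```

Keeping the status query as a separate command leaves the loop independent of how the ConsistentSet status is actually read.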
98+
99+
## Links
100+
101+
- [Trilio for Kubernetes documentation](https://docs.trilio.io/kubernetes)
102+
- [Red Hat Validated Patterns](https://validatedpatterns.io)
103+
- [Validated Patterns imperative framework](https://validatedpatterns.io/learn/imperative-actions/)
104+
- [Red Hat Advanced Cluster Management (ACM)](https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes)
105+
- [External Secrets Operator](https://external-secrets.io)
Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
1+
---
2+
title: CR Operations
3+
weight: 30
4+
aliases: /trilio-cr/cr-operations/
5+
---
6+
7+
## Operations
8+
9+
### Monitoring DR status
10+
11+
```bash
12+
# Hub — all phases
13+
make dr-status
14+
15+
# Spoke — ConsistentSet and restore status (run on spoke context)
16+
oc get configmap trilio-cr-status -n imperative -o yaml
17+
```
18+
19+
### Automated DR lifecycle
20+
21+
The imperative framework runs every 10 minutes and requires no manual intervention. From a standing start (hub up, spoke just joined), the full lifecycle through a completed Continuous Restore typically takes 30–45 minutes.
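
As a rough worked example, the 30–45 minute window corresponds to between three and five runs of the 10-minute schedule. The helper name below is illustrative only.

```bash
# Sketch: how many scheduled runs fit into a time window, rounding up.
# Usage: cycles_needed <window_minutes> <interval_minutes>
cycles_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

cycles_needed 30 10   # prints 3 (lower bound of the typical window)
cycles_needed 45 10   # prints 5 (upper bound of the typical window)
```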
22+
23+
**Hub job sequence:**
24+
25+
| Job | What it does | Skips when |
26+
|-----|-------------|------------|
27+
| `trilio-enable-cr` | Creates CR BackupPlan + ContinuousRestore Policy | CR BackupPlan already Available |
28+
| `trilio-cr-backup` | Creates a backup against the CR BackupPlan | Available CR backup exists |
29+
| `trilio-backup` | Creates a standard backup | Available backup exists |
30+
| `trilio-restore-standard` | Restores to `wordpress-restore` on hub | Completed restore exists |
31+
| `trilio-e2e-status` | Writes status ConfigMap; fails until all phases pass | — (always runs) |
32+
33+
**Spoke job sequence (per DR cluster):**
34+
35+
| Job | What it does | Skips when |
36+
|-----|-------------|------------|
37+
| `trilio-cr-status` | Validates ConsistentSet available; writes status ConfigMap | — (always runs; fails until Available) |
38+
| `trilio-cr-restore` | Restores from latest ConsistentSet to `wordpress-restore` | Completed restore exists |
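
The "Skips when" column in both tables describes an idempotency guard: each job first checks whether its result already exists and does work only when it does not. A minimal sketch of that guard follows; `run_unless` and the check/action commands are hypothetical and do not reflect how the playbooks implement it.

```bash
# Sketch: run an action only when its check reports the result is missing.
# Usage: run_unless <check_command> <action_command>
run_unless() {
    check=$1
    action=$2
    if "$check"; then
        # Check succeeded: the desired state already exists, so do nothing.
        echo "skip: $action (already satisfied)"
    else
        echo "run: $action"
        "$action"
    fi
}
```

Because every run re-evaluates the check, a job that failed part-way is simply retried on the next scheduled cycle.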
39+
40+
### Manual backup
41+
42+
To trigger a backup outside the automated schedule:
43+
44+
```bash
45+
ansible-navigator run ansible/playbooks/dr-backup.yaml
46+
```
47+
48+
### Manual DR restore
49+
50+
**Standard restore** (from a named backup):
51+
52+
```bash
53+
ansible-navigator run ansible/playbooks/dr-restore.yaml \
54+
-e restore_method=backup \
55+
-e restore_namespace=<target-namespace>
56+
```
57+
58+
**Continuous Restore** (from a pre-staged ConsistentSet on the DR cluster — accelerated RTO):
59+
60+
```bash
61+
ansible-navigator run ansible/playbooks/dr-restore.yaml \
62+
-e restore_method=consistentset \
63+
-e restore_namespace=<target-namespace>
64+
```
65+
66+
Both commands discover the cluster ingress domain automatically and apply the Route hostname transform.
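
When wrapping these commands in a script, it can help to reject an unsupported `restore_method` before the playbook runs. The sketch below assumes the only accepted values are the two shown above; `valid_restore_method` is a hypothetical helper, and the playbook may well perform its own validation.

```bash
# Sketch: accept only the two restore_method values documented above.
valid_restore_method() {
    case "$1" in
        backup|consistentset) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper usage:
#   valid_restore_method "$method" && \
#     ansible-navigator run ansible/playbooks/dr-restore.yaml \
#       -e restore_method="$method" -e restore_namespace="$namespace"
```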
67+
68+
### Offboarding a spoke
69+
70+
```bash
71+
# Step 1 — on the hub context
72+
make unlabel-spoke CLUSTER=<acm-cluster-name>
73+
74+
# Step 2 — on the spoke context
75+
make offboard-spoke CLUSTER=<acm-cluster-name>
76+
```
77+
78+
### Uninstalling the pattern
79+
80+
```bash
81+
# On the hub context
82+
make offboard-hub
83+
```
84+
85+
> Save your HashiCorp Vault root token and unseal keys before running `offboard-hub`. They are stored in the `imperative` namespace, which is removed during offboarding.
86+
87+
---
88+
89+
## Ansible Playbook Reference
90+
91+
| Playbook | When to use | Key inputs |
92+
|----------|-------------|------------|
93+
| `dr-backup.yaml` | Trigger a manual backup on the hub ||
94+
| `dr-restore.yaml` | Manual restore (backup or ConsistentSet method) | `restore_method`, `restore_namespace`, `source_backup` (optional) |
95+
| `validate-trilio.yaml` | Pre/post-change Trilio health validation ||
96+
| `offboard-spoke.yaml` | Remove spoke-side Trilio resources | `cluster_name` |
97+
| `offboard-hub.yaml` | Full hub pattern teardown ||
98+
99+
Playbooks are run via `ansible-navigator`:
100+
101+
```bash
102+
ansible-navigator run ansible/playbooks/<playbook>.yaml [-e key=value ...]
103+
```
Lines changed: 166 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,166 @@
1+
---
2+
title: Getting Started
3+
weight: 20
4+
aliases: /trilio-cr/getting-started/
5+
---
6+
7+
## Deployment
8+
9+
### 1. Clone the repository
10+
11+
```bash
12+
git clone https://github.com/trilio-demo/trilio-continuous-restore
13+
cd trilio-continuous-restore
14+
```
15+
16+
### 2. Configure S3 bucket details
17+
18+
Edit `values-hub.yaml` and `values-secondary.yaml` to set your S3 bucket name and region:
19+
20+
```yaml
21+
# In both values-hub.yaml and values-secondary.yaml, under the trilio-operand app overrides:
22+
overrides:
23+
- name: backupTarget.bucketName
24+
value: <your-bucket-name>
25+
- name: backupTarget.region
26+
value: <your-bucket-region> # e.g. us-east-1
27+
```
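
Before installing, a quick check can confirm that no `<your-...>` placeholder is left in the values files. This is a sketch that assumes the placeholders follow the `<your-...>` form used above; `find_placeholders` is a hypothetical helper name.

```bash
# Sketch: print (with line numbers) any line still containing a <your-...> placeholder.
# Exit status is 0 when placeholders remain, 1 when none are found.
find_placeholders() {
    grep -n '<your-[a-z0-9-]*>' "$1"
}

# Example:
#   find_placeholders values-hub.yaml && echo "values-hub.yaml still needs editing"
```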
28+
29+
### 3. Populate secrets
30+
31+
Create `values-secret.yaml` from the template:
32+
33+
```bash
34+
cp values-secret.yaml.template values-secret.yaml
35+
```
36+
37+
Edit `values-secret.yaml` and fill in your credentials:
38+
39+
```yaml
40+
secrets:
41+
- name: trilio-license
42+
vaultPrefixes:
43+
- global
44+
fields:
45+
- name: key
46+
value: <your-trilio-license-key> # single unbroken line, no escape characters
47+
48+
- name: trilio-s3
49+
vaultPrefixes:
50+
- global
51+
fields:
52+
- name: accessKey
53+
value: <your-s3-access-key>
54+
- name: secretKey
55+
value: <your-s3-secret-key>
56+
```
57+
58+
> `values-secret.yaml` is listed in `.gitignore` and must never be committed to Git.
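
The "single unbroken line, no escape characters" requirement on the license key can be checked before it is loaded. The sketch below interprets "no escape characters" as "no backslashes", which is an assumption; `license_looks_valid` is a hypothetical helper and does not verify the license itself.

```bash
# Sketch: basic shape check for a license string (not a validity check).
license_looks_valid() {
    newline='
'
    case "$1" in
        "")            return 1 ;;  # empty value
        *"$newline"*)  return 1 ;;  # spans more than one line
        *\\*)          return 1 ;;  # contains a backslash (possible escape sequence)
    esac
    return 0
}

# Example:
#   license_looks_valid "$LICENSE" || echo "license key is empty, multi-line, or escaped"
```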
59+
60+
### 4. Install the pattern
61+
62+
```bash
63+
make install
64+
```
65+
66+
This command:
67+
1. Bootstraps HashiCorp Vault and loads secrets from `values-secret.yaml`
68+
2. Installs the Validated Patterns operator on the hub
69+
3. Creates the `ValidatedPattern` CR which triggers ArgoCD to deploy all hub components
70+
71+
Monitor progress in the ArgoCD UI or via:
72+
73+
```bash
74+
oc get applications.argoproj.io -n openshift-gitops
75+
```
76+
77+
All applications should reach `Synced / Healthy` within 10–15 minutes.
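
To list only the applications that have not yet converged, the sync and health fields can be filtered with a small helper. This is a sketch: `not_ready_apps` is a hypothetical name, while `.status.sync.status` and `.status.health.status` are standard fields of the Argo CD Application resource.

```bash
# Sketch: read "name sync health" lines on stdin; print names that are not Synced/Healthy.
not_ready_apps() {
    awk '$2 != "Synced" || $3 != "Healthy" { print $1 }'
}

# Usage against a cluster:
#   oc get applications.argoproj.io -n openshift-gitops \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.sync.status}{" "}{.status.health.status}{"\n"}{end}' \
#     | not_ready_apps
```

An empty result means every application in the namespace reports `Synced` and `Healthy`.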
78+
79+
**Alternative: manual secret population via `oc`**
80+
81+
To write or rotate secrets directly in HashiCorp Vault without re-running `make install`:
82+
83+
```bash
84+
# Extract Vault root token
85+
VAULT_TOKEN=$(oc get secret vaultkeys -n imperative \
86+
-o jsonpath='{.data.vault_data_json}' | \
87+
base64 -d | python3 -c "import sys,json; print(json.load(sys.stdin)['root_token'])")
88+
89+
# Write Trilio license
90+
oc exec -n vault vault-0 -- env VAULT_TOKEN="$VAULT_TOKEN" \
91+
vault kv put secret/global/trilio-license key="<your-license-key>"
92+
93+
# Write S3 credentials
94+
oc exec -n vault vault-0 -- env VAULT_TOKEN="$VAULT_TOKEN" \
95+
vault kv put secret/global/trilio-s3 accessKey="<key>" secretKey="<secret>"
96+
```
97+
98+
### 5. Verify hub deployment
99+
100+
Check that Trilio is healthy:
101+
102+
```bash
103+
oc get triliovaultmanager -n trilio-system
104+
# STATUS should be Deployed or Updated
105+
106+
oc get target -n trilio-system
107+
# STATUS should be Available
108+
```
109+
110+
Check the end-to-end DR status (updated automatically by the imperative framework):
111+
112+
```bash
113+
make dr-status
114+
```
115+
116+
On the initial run, `trilio-enable-cr` and `trilio-backup` complete within the first two CronJob cycles (about 20 minutes), and the standard restore follows. The hub is fully operational once every phase reports `PASS`.
117+
118+
---
119+
120+
## Spoke (DR Cluster) Onboarding
121+
122+
### 1. Import the DR cluster into ACM
123+
124+
Import the DR cluster via the ACM console or `oc` CLI. Note the cluster name assigned during import.
125+
126+
### 2. Label and onboard
127+
128+
```bash
129+
make onboard-spoke CLUSTER=<acm-cluster-name>
130+
```
131+
132+
This labels the cluster with `clusterGroup=secondary`, which triggers ACM to deploy the spoke configuration via ArgoCD.
133+
134+
After running `make onboard-spoke`, trigger an immediate sync of the spoke-side ArgoCD application (run on the spoke cluster context):
135+
136+
```bash
137+
oc patch application.argoproj.io main-trilio-continuous-restore-secondary \
138+
-n openshift-gitops --type merge \
139+
-p '{"operation":{"sync":{}}}'
140+
```
141+
142+
### 3. Monitor spoke onboarding
143+
144+
```bash
145+
make spoke-status CLUSTER=<acm-cluster-name>
146+
```
147+
148+
Expected progression:
149+
1. Trilio operator installs (OLM subscription)
150+
2. TrilioVaultManager deploys (ESO delivers S3 + license secrets)
151+
3. BackupTarget becomes Available (EventTarget pod starts)
152+
4. ConsistentSets begin appearing as hub backups are detected (~10–20 minutes after the hub's CR backup completes)
153+
5. Spoke imperative restore runs automatically once the first ConsistentSet is Available
154+
155+
The full spoke onboarding sequence typically takes 15–25 minutes from label application to a running TrilioVaultManager. The imperative restore adds another 30–45 minutes on top of that for the first ConsistentSet to appear and the restore to complete.
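
Adding the two ranges gives a rough end-to-end estimate from labelling the cluster to a completed first restore. The helper below is only a worked sum of the figures quoted above.

```bash
# Sketch: add two minute ranges to estimate the end-to-end window.
# Usage: add_ranges <min1> <max1> <min2> <max2>
add_ranges() {
    echo "$(( $1 + $3 ))-$(( $2 + $4 )) minutes"
}

add_ranges 15 25 30 45   # prints "45-70 minutes"
```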
156+
157+
### Known issue: trilio-operand OutOfSync on spoke after onboarding
158+
159+
ArgoCD may show `trilio-operand` as `OutOfSync / Missing` immediately after spoke onboarding. This is a CRD timing issue — ArgoCD attempts to sync the TrilioVaultManager CR before the Trilio operator has finished registering its Custom Resource Definitions (CRDs).
160+
161+
The `SkipDryRunOnMissingResource=true` sync option is set in `values-secondary.yaml` to handle this automatically. If the issue persists after 5–10 minutes, manually trigger a sync of the ArgoCD application:
162+
163+
```bash
164+
oc patch application.argoproj.io trilio-operand -n main-trilio-continuous-restore-secondary \
165+
--type merge -p '{"operation":{"sync":{}}}'
166+
```
