<!-- content/patterns/trilio-continuous-recovery/_index.md -->
---
title: Trilio Continuous Restore
date: 2026-04-08
tier: sandbox
summary: A demonstration of Trilio Continuous Restore for stateful applications
rh_products:
- Red{nbsp}Hat OpenShift Container Platform
- Red{nbsp}Hat OpenShift GitOps
- Red{nbsp}Hat Advanced Cluster Management
partners:
- Trilio
industries:
- General
aliases: /trilio-cr/
links:
github: https://github.com/trilio-demo/trilio-continuous-restore
install: getting-started
bugs: https://github.com/trilio-demo/trilio-continuous-restore/issues
feedback: https://docs.google.com/forms/d/e/1FAIpQLScI76b6tD1WyPu2-d_9CCVDr3Fu5jYERthqLKJDUGwqBg7Vcg/viewform
---

# Trilio Continuous Restore — Red{nbsp}Hat Validated Pattern

## Overview

This Validated Pattern delivers an automated, GitOps-driven Disaster Recovery (DR) solution for stateful applications running on Red{nbsp}Hat OpenShift. By integrating [Trilio for Kubernetes](https://trilio.io) with the [Red{nbsp}Hat Validated Patterns framework](https://validatedpatterns.io), the pattern delivers:

- **Automated backup** of stateful workloads on the primary (hub) cluster
- **Continuous Restore** — Trilio's accelerated Recovery Time Objective (RTO) DR path that continuously pre-stages backup data on the DR cluster so that recovery requires only metadata retrieval, not a full data transfer
- **Automated DR testing** — the full backup-to-restore lifecycle runs as a scheduled, self-healing GitOps workflow with no human intervention after initial setup
- **Multi-cluster lifecycle management** through Red{nbsp}Hat Advanced Cluster Management (ACM)

### Use case

The pattern targets organizations that need a documented, repeatable DR posture for Kubernetes-native workloads, particularly those that must demonstrate RTO and Recovery Point Objective (RPO) targets through regular, automated DR tests rather than annual manual exercises.

A WordPress + MySQL deployment is included as a representative stateful application. It serves as the reference workload for the full backup, restore, and URL-rewrite lifecycle.

---

## Architecture

```mermaid
graph TD
subgraph Git["Git (Source of Truth)"]
values["values-hub.yaml\nvalues-secondary.yaml\ncharts/"]
end

subgraph Hub["Hub Cluster (primary)"]
ACM["ACM"]
ArgoCD["ArgoCD"]
Vault["HashiCorp Vault + ESO"]
Trilio_Hub["Trilio Operator + TVM"]
CronJob["Imperative CronJob\n(DR lifecycle automation)"]
end

subgraph Spoke["DR Cluster (secondary)"]
Trilio_Spoke["Trilio Operator + TVM"]
EventTarget["EventTarget pod\n(pre-stages PVCs)"]
ConsistentSet["ConsistentSet\n(restore point)"]
end

S3["Shared S3 Bucket"]

Git -->|GitOps sync| ArgoCD
ArgoCD --> Trilio_Hub
Vault -->|S3 creds + license| Trilio_Hub
Trilio_Hub -->|backups| S3
ACM -->|provisions| Spoke
S3 -->|EventTarget polls| EventTarget
EventTarget --> ConsistentSet
CronJob -->|restore from ConsistentSet| ConsistentSet
```

### Component roles

| Component | Where | Role |
|-----------|-------|------|
| Trilio Operator | Hub + Spoke | Installed through Operator Lifecycle Manager (OLM) from the `certified-operators` catalog, channel `5.3.x` |
| TrilioVaultManager | Hub + Spoke | Trilio operand Custom Resource (CR); manages the Trilio data plane |
| Red{nbsp}Hat OpenShift | Hub + Spoke | Container orchestration platform; provides OLM, storage, networking, and the GitOps operator substrate |
| Red{nbsp}Hat OpenShift GitOps (ArgoCD) | Hub + Spoke | GitOps sync engine; all configuration is driven from Git |
| Red{nbsp}Hat Advanced Cluster Management (ACM) | Hub | Cluster lifecycle, policy enforcement, and spoke provisioning |
| Validated Patterns Imperative CronJob | Hub + Spoke | Runs the automated DR lifecycle on a 10-minute schedule |
| BackupTarget | Hub + Spoke | Points to the shared S3 bucket; the spoke BackupTarget has the EventTarget flag set |
| BackupPlan | Hub | Defines backup scope (the `wordpress` namespace), quiesce/unquiesce hooks, and retention |
| CR BackupPlan | Hub | Continuous Restore variant of BackupPlan; drives pre-staging on the spoke |
| EventTarget pod | Spoke | Watches the shared S3 bucket for new backups; pre-stages Persistent Volume Claims (PVCs) locally |
| ConsistentSet | Spoke | Cluster-scoped CR representing a fully pre-staged restore point |
| HashiCorp Vault and External Secrets Operator (ESO) | Hub | Secret management; S3 credentials and Trilio license are never stored in Git |
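
As a hedged illustration of the BackupTarget entry above, a `Target` CR might look like the following. Field names approximate the Trilio API and may differ from what the pattern's chart actually renders:

```yaml
# Illustrative sketch only; field names approximate the Trilio Target API
# and may differ from the values the pattern's chart renders.
apiVersion: triliovault.trilio.io/v1
kind: Target
metadata:
  name: demo-s3-target        # hypothetical name
  namespace: trilio-system
spec:
  type: ObjectStore
  vendor: AWS
  objectStoreCredentials:
    url: https://s3.amazonaws.com
    bucketName: <your-bucket-name>
    region: us-east-1
    credentialSecrets:
      - name: trilio-s3       # delivered by ESO from Vault, never stored in Git
```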

### How Continuous Restore works

1. The hub creates a backup using the CR BackupPlan and writes it to the shared S3 storage.
2. The EventTarget pod on the spoke detects the new backup and begins copying volume data locally — ahead of any DR event.
3. When the spoke's imperative job detects an Available ConsistentSet, it submits a Restore CR. Because the data is already local, only backup metadata is fetched — resulting in significantly lower RTO than a standard on-demand restore.
4. The post-restore Hook CR rewrites WordPress database URLs to the DR cluster's ingress domain.
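
Step 3 can be pictured as a Restore CR like the sketch below. This is illustrative only; consult the Trilio documentation for the authoritative schema of ConsistentSet-driven restores:

```yaml
# Illustrative sketch; field names are assumptions, not the verified
# Trilio Restore schema for ConsistentSet-based recovery.
apiVersion: triliovault.trilio.io/v1
kind: Restore
metadata:
  name: wordpress-cr-restore      # hypothetical name
  namespace: trilio-system
spec:
  source:
    type: Backup                  # data is served from the pre-staged local copy
    backup:
      name: <latest-cr-backup>
  restoreNamespace: wordpress-restore
```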

## Links

- [Trilio for Kubernetes documentation](https://docs.trilio.io/kubernetes)
- [Red{nbsp}Hat Validated Patterns](https://validatedpatterns.io)
- [Validated Patterns imperative framework](https://validatedpatterns.io/learn/imperative-actions/)
- [Red{nbsp}Hat Advanced Cluster Management (ACM)](https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes)
- [External Secrets Operator](https://external-secrets.io)

## Next steps

- [Prerequisites](prerequisites)
- [Getting started](getting-started)
- [CR operations](cr-operations)
- [Troubleshooting](troubleshooting)
<!-- content/patterns/trilio-continuous-recovery/cr-operations.md -->
---
title: CR operations
weight: 30
aliases: /trilio-cr/cr-operations/
---

## Operations

### Monitoring DR status

```bash
# Hub — all phases
make dr-status

# Spoke — ConsistentSet and restore status (run on spoke context)
oc get configmap trilio-cr-status -n imperative -o yaml
```

### Automated DR lifecycle

The imperative framework runs continuously on a 10-minute schedule with no manual intervention required. The full lifecycle from a standing start (hub up, spoke just joined) to a completed Continuous Restore typically completes within 30–45 minutes.

**Hub job sequence:**

| Job | What it does | Skips when |
|-----|-------------|------------|
| `trilio-enable-cr` | Creates CR BackupPlan + ContinuousRestore Policy | CR BackupPlan already Available |
| `trilio-cr-backup` | Creates a backup against the CR BackupPlan | Available CR backup exists |
| `trilio-backup` | Creates a standard backup | Available backup exists |
| `trilio-restore-standard` | Restores to `wordpress-restore` on hub | Completed restore exists |
| `trilio-e2e-status` | Writes status ConfigMap; fails until all phases pass | — (always runs) |

**Spoke job sequence (per DR cluster):**

| Job | What it does | Skips when |
|-----|-------------|------------|
| `trilio-cr-status` | Validates ConsistentSet available; writes status ConfigMap | — (always runs; fails until Available) |
| `trilio-cr-restore` | Restores from latest ConsistentSet to `wordpress-restore` | Completed restore exists |
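
The "Skips when" column is implemented as a simple idempotency guard at the start of each job. A minimal sketch of that pattern, with the resource count stubbed in (the shipped jobs query the cluster instead):

```shell
# Idempotency-guard sketch (illustrative; not the shipped job code).
# should_skip COUNT: skip when at least one completed object exists.
should_skip() {
  [ "${1:-0}" -gt 0 ]
}

completed_restores=0  # the real job would derive this from the cluster,
                      # e.g. by counting Completed restore CRs with oc
if should_skip "$completed_restores"; then
  echo "completed restore exists; skipping"
else
  echo "no completed restore; running job"
fi
```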

### Manual backup

To trigger a backup outside the automated schedule:

```bash
ansible-navigator run ansible/playbooks/dr-backup.yaml
```

### Manual DR restore

**Standard restore** (from a named backup):

```bash
ansible-navigator run ansible/playbooks/dr-restore.yaml \
-e restore_method=backup \
-e restore_namespace=<target-namespace>
```

**Continuous Restore** (from a pre-staged ConsistentSet on the DR cluster — accelerated RTO):

```bash
ansible-navigator run ansible/playbooks/dr-restore.yaml \
-e restore_method=consistentset \
-e restore_namespace=<target-namespace>
```

Both commands discover the cluster ingress domain automatically and apply the Route hostname transform.
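
The hostname transform can be illustrated offline. The domains below are hypothetical; the playbooks discover the real ingress domains from the clusters:

```shell
# Offline sketch of the Route hostname transform (hypothetical domains).
PRIMARY_DOMAIN="apps.hub.example.com"
DR_DOMAIN="apps.dr.example.com"

# Rewrite a URL from the primary ingress domain to the DR ingress domain.
transform_url() {
  printf '%s\n' "$1" | sed "s/${PRIMARY_DOMAIN}/${DR_DOMAIN}/"
}

transform_url "https://wordpress.apps.hub.example.com/wp-login.php"
# prints https://wordpress.apps.dr.example.com/wp-login.php
```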

### Offboarding a spoke

```bash
# Step 1 — on the hub context
make unlabel-spoke CLUSTER=<acm-cluster-name>

# Step 2 — on the spoke context
make offboard-spoke CLUSTER=<acm-cluster-name>
```

### Uninstalling the pattern

```bash
# On the hub context
make offboard-hub
```

> Save your HashiCorp Vault root token and unseal keys before running `offboard-hub`. They are stored in the `imperative` namespace, which is removed during offboarding.

---

## Ansible playbook reference

| Playbook | When to use | Key inputs |
|----------|-------------|------------|
| `dr-backup.yaml` | Trigger a manual backup on the hub | — |
| `dr-restore.yaml` | Manual restore (backup or ConsistentSet method) | `restore_method`, `restore_namespace`, `source_backup` (optional) |
| `validate-trilio.yaml` | Pre/post-change Trilio health validation | — |
| `offboard-spoke.yaml` | Remove spoke-side Trilio resources | `cluster_name` |
| `offboard-hub.yaml` | Full hub pattern teardown | — |

Run playbooks with `ansible-navigator`:

```bash
ansible-navigator run ansible/playbooks/<playbook>.yaml [-e key=value ...]
```
<!-- content/patterns/trilio-continuous-recovery/getting-started.md -->
---
title: Getting started
weight: 20
aliases: /trilio-cr/getting-started/
---

# Deploying the pattern

## Deployment

### 1. Clone the repository

```bash
git clone https://github.com/trilio-demo/trilio-continuous-restore
cd trilio-continuous-restore
```

### 2. Configure S3 bucket details

Edit `values-hub.yaml` and `values-secondary.yaml` to set your S3 bucket name and region:

```yaml
# In both values-hub.yaml and values-secondary.yaml, under the trilio-operand app overrides:
overrides:
- name: backupTarget.bucketName
value: <your-bucket-name>
- name: backupTarget.region
value: <your-bucket-region> # for example, us-east-1
```

### 3. Populate secrets

Create `values-secret.yaml` from the template:

```bash
cp values-secret.yaml.template ~/values-secret-trilio-continuous-restore.yaml
```

Edit `~/values-secret-trilio-continuous-restore.yaml` and fill in your credentials:

```yaml
secrets:
- name: trilio-license
vaultPrefixes:
- global
fields:
- name: key
value: <your-trilio-license-key> # single unbroken line, no escape characters

- name: trilio-s3
vaultPrefixes:
- global
fields:
- name: accessKey
value: <your-s3-access-key>
- name: secretKey
value: <your-s3-secret-key>
```

> Always edit the copy in your home directory, never the repo's `values-secret.yaml.template`, so that credentials are never committed to Git.

### 4. Install the pattern

```bash
./pattern.sh make install
```

This command:
1. Bootstraps HashiCorp Vault and loads secrets from `~/values-secret-trilio-continuous-restore.yaml`
2. Installs the Validated Patterns operator on the hub
3. Creates the `ValidatedPattern` CR which triggers ArgoCD to deploy all hub components

Monitor progress in the ArgoCD UI or by running:

```bash
oc get application -n openshift-gitops
```

All applications should reach `Synced / Healthy` within 10–15 minutes.
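
To poll from a script instead of the ArgoCD UI, one hedged approach is a helper that checks a sync/health report (the jsonpath query in the comment is an assumption about your ArgoCD layout):

```shell
# healthy_report LINES: true only when every "SYNC/HEALTH" line reads
# Synced/Healthy. Sketch only; adapt the query to your environment.
healthy_report() {
  ! printf '%s\n' "$1" | grep -qv '^Synced/Healthy$'
}

# In-cluster the report would come from something like:
#   oc get application -n openshift-gitops -o jsonpath=\
#     '{range .items[*]}{.status.sync.status}/{.status.health.status}{"\n"}{end}'
report="Synced/Healthy
Synced/Healthy"
healthy_report "$report" && echo "all applications ready"
```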

**Alternative: manual secret population by using `oc`**

To write or rotate secrets directly in HashiCorp Vault without re-running `./pattern.sh make install`:

```bash
# Extract Vault root token
VAULT_TOKEN=$(oc get secret vaultkeys -n imperative \
-o jsonpath='{.data.vault_data_json}' | \
base64 -d | python3 -c "import sys,json; print(json.load(sys.stdin)['root_token'])")

# Write Trilio license
oc exec -n vault vault-0 -- env VAULT_TOKEN=$VAULT_TOKEN \
vault kv put secret/global/trilio-license key="<your-license-key>"

# Write S3 credentials
oc exec -n vault vault-0 -- env VAULT_TOKEN=$VAULT_TOKEN \
vault kv put secret/global/trilio-s3 accessKey="<key>" secretKey="<secret>"
```
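
The token-extraction pipeline above can be exercised offline against sample data to confirm its shape (the sample JSON stands in for the real `vault_data_json` payload):

```shell
# Offline illustration of the vaultkeys decoding pipeline; the sample
# mimics the base64-encoded vault_data_json field of the real secret.
sample=$(printf '{"root_token": "hvs.sample"}' | base64)
printf '%s' "$sample" | base64 -d | \
  python3 -c "import sys,json; print(json.load(sys.stdin)['root_token'])"
# prints hvs.sample
```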

You can also reload secrets from `~/values-secret-trilio-continuous-restore.yaml` by running:

```bash
./pattern.sh make load-secrets
```

### 5. Verify hub deployment

Check that Trilio is healthy:

```bash
oc get triliovaultmanager -n trilio-system
# STATUS should be Deployed or Updated

oc get target -n trilio-system
# STATUS should be Available
```

Check the end-to-end DR status (updated automatically by the imperative framework):

```bash
make dr-status
```

On the initial run, `trilio-enable-cr` and `trilio-backup` complete within the first two CronJob cycles (~20 minutes), and the standard restore follows. When all phases report `PASS`, the hub is fully operational.

---

## Spoke (DR cluster) onboarding

### 1. Import the DR cluster into ACM

Import the DR cluster through the ACM console or the `oc` CLI. Note the cluster name assigned during import.

### 2. Label and onboard

```bash
make onboard-spoke CLUSTER=<acm-cluster-name>
```

This labels the cluster with `clusterGroup=secondary`, which triggers ACM to deploy the spoke configuration through ArgoCD.

After running `make onboard-spoke`, kick the spoke-side ArgoCD application to sync immediately (run on the spoke cluster context):

```bash
oc patch application.argoproj.io main-trilio-continuous-restore-secondary \
-n openshift-gitops --type merge \
-p '{"operation":{"sync":{}}}'
```

### 3. Monitor spoke onboarding

```bash
make spoke-status CLUSTER=<acm-cluster-name>
```

Expected progression:
1. Trilio operator installs (OLM subscription)
2. TrilioVaultManager deploys (ESO delivers S3 + license secrets)
3. BackupTarget becomes Available (EventTarget pod starts)
4. ConsistentSets begin appearing as hub backups are detected (~10–20 minutes after the hub's CR backup completes)
5. Spoke imperative restore runs automatically after the first ConsistentSet is Available

The full spoke onboarding sequence typically takes 15–25 minutes from label application to a running TrilioVaultManager. The imperative restore adds another 30–45 minutes on top of that for the first ConsistentSet to appear and the restore to complete.

### Known issue: `trilio-operand` OutOfSync on spoke after onboarding

ArgoCD may show `trilio-operand` as `OutOfSync / Missing` immediately after spoke onboarding. This is a CRD timing issue — ArgoCD attempts to sync the TrilioVaultManager CR before the Trilio operator has finished registering its Custom Resource Definitions (CRDs).

The `SkipDryRunOnMissingResource=true` sync option is set in `values-secondary.yaml` to handle this automatically. If the issue persists after 5–10 minutes, manually refresh the ArgoCD application:

```bash
oc patch application trilio-operand -n main-trilio-continuous-restore-secondary \
--type merge -p '{"operation":{"sync":{}}}'
```