| 1 | +# etcd/raft Migration Operations |
| 2 | + |
| 3 | +## Scope |
| 4 | + |
| 5 | +This runbook covers the supported migration path from the legacy HashiCorp Raft |
| 6 | +runtime to the default `etcd/raft` runtime. |
| 7 | + |
| 8 | +Use it when: |
| 9 | + |
| 10 | +1. migrating an existing cluster that already has HashiCorp-backed data dirs |
| 11 | +2. validating a fresh etcd-backed deployment before production cutover |
| 12 | +3. rolling out nodes with the repo's default runtime and operational scripts |
| 13 | + |
| 14 | +This runbook does not support mixed-engine Raft groups. A single group must run |
| 15 | +entirely on one runtime at a time. |
| 16 | + |
| 17 | +## Current Defaults |
| 18 | + |
| 19 | +As of the etcd rollout: |
| 20 | + |
| 21 | +1. `go run .` starts with `--raftEngine=etcd` by default |
| 22 | +2. Jepsen workloads default to `--raft-engine etcd` |
| 23 | +3. `scripts/rolling-update.sh` defaults `RAFT_ENGINE=etcd` |
| 24 | + |
| 25 | +If you still operate a legacy HashiCorp cluster, set the engine explicitly and |
| 26 | +keep using the original data dirs until the offline migration below is complete. |
| 27 | + |
| 28 | +## Preconditions |
| 29 | + |
| 30 | +Before migrating an existing cluster: |
| 31 | + |
| 32 | +1. Confirm the source cluster is healthy and fully replicated. |
| 33 | +2. Inventory every Raft group, node ID, and advertised raft address. |
| 34 | +3. Record the Redis and S3 routing maps if you use leader-local routing. |
| 35 | +4. Stop application writes and plan for a full maintenance window. |
| 36 | +5. Back up the existing data dirs before creating the new etcd dirs. |
| 37 | + |
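The backup in step 5 can be scripted so that every node is archived the same way. The following is a minimal sketch, not part of the repo's tooling: `backup_data_dir` is a hypothetical helper and the archive location is an assumption. It works best when the node's process is stopped or writes are already quiesced.

```bash
# backup_data_dir SRC_DIR ARCHIVE
# Archives SRC_DIR as a gzip tarball and checks that the archive is readable.
backup_data_dir() (
  src="$1"
  archive="$2"
  [ -d "$src" ] || { echo "missing data dir: $src" >&2; exit 1; }
  # -C stores paths relative to the parent directory of the data dir.
  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || exit 1
  # Listing the archive fails if it is truncated or corrupt.
  tar -tzf "$archive" > /dev/null || exit 1
  echo "backed up $src -> $archive"
)

# Example with a hypothetical backup location:
# backup_data_dir /var/lib/elastickv/n1 /var/backups/elastickv-n1.tar.gz
```

Keeping the archive outside both the old and the new data dir trees means a later cleanup of either tree cannot remove it by accident.
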
| 38 | +## Important Constraints |
| 39 | + |
| 40 | +1. Elastickv writes a `raft-engine` marker into each data dir. |
| 41 | +2. Reusing the same data dir across `hashicorp` and `etcd` is intentionally rejected. |
| 42 | +3. The supported migration is offline. Stop the old cluster first, seed fresh |
| 43 | + etcd dirs, then start the new cluster. |
| 44 | +4. Rollback means restarting the old HashiCorp cluster from the untouched old |
| 45 | + dirs. In-place and mixed-engine rollback are not supported. |
| 46 | + |
| 47 | +## Fresh Cluster Bootstrap |
| 48 | + |
| 49 | +For a brand-new cluster, use the default runtime directly: |
| 50 | + |
| 51 | +```bash |
| 52 | +go run . \ |
| 53 | + --address "10.0.0.11:50051" \ |
| 54 | + --redisAddress "10.0.0.11:6379" \ |
| 55 | + --raftId "n1" \ |
| 56 | + --raftBootstrapMembers "n1=10.0.0.11:50051,n2=10.0.0.12:50051,n3=10.0.0.13:50051" |
| 57 | +``` |
| 58 | + |
| 59 | +Start one process per node with the same `--raftBootstrapMembers` value, that |
| 60 | +node's own `--address`, `--redisAddress`, and `--raftId`, and a node-local `--raftDataDir`. |
| 61 | + |
| 62 | +## Offline Migration from HashiCorp Raft |
| 63 | + |
| 64 | +### Single-group cluster |
| 65 | + |
| 66 | +1. Stop every Elastickv process in the cluster. |
| 67 | +2. Keep the old HashiCorp data dirs intact. |
| 68 | +3. Create a fresh etcd data dir for each node. |
| 69 | +4. Run the migrator once per node: |
| 70 | + |
| 71 | +```bash |
| 72 | +go run ./cmd/etcd-raft-migrate \ |
| 73 | + --fsm-store /var/lib/elastickv/n1/fsm.db \ |
| 74 | + --dest /var/lib/elastickv-etcd/n1 \ |
| 75 | + --peers n1=10.0.0.11:50051,n2=10.0.0.12:50051,n3=10.0.0.13:50051 |
| 76 | +``` |
| 77 | + |
| 78 | +5. Repeat for `n2`, `n3`, and every other cluster member, using that node's own `--fsm-store` and `--dest` paths and the same peer map. |
| 79 | +6. Start the new cluster with `--raftDataDir` pointing at the new etcd dir. |
| 80 | + |
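Before running the migrator on a node, it can help to confirm that the source FSM store exists and that the destination is really fresh. The sketch below is a hypothetical pre-flight helper (`preflight_migrate` is not part of the repo). It checks the filesystem only and makes no assumptions about what the migrator itself validates.

```bash
# preflight_migrate FSM_STORE DEST_DIR
# Succeeds only if the source FSM store exists and DEST_DIR is absent or empty.
preflight_migrate() (
  fsm_store="$1"
  dest="$2"
  [ -f "$fsm_store" ] || { echo "FSM store not found: $fsm_store" >&2; exit 1; }
  # A non-empty destination may already hold data from another run or engine.
  if [ -d "$dest" ] && [ -n "$(ls -A "$dest")" ]; then
    echo "destination is not empty: $dest" >&2
    exit 1
  fi
  echo "ok: $fsm_store -> $dest"
)

# Example using the paths from the migrator command above:
# preflight_migrate /var/lib/elastickv/n1/fsm.db /var/lib/elastickv-etcd/n1
```
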
| 81 | +### Multi-group cluster |
| 82 | + |
| 83 | +Run the same procedure once per node and once per group directory. For example, |
| 84 | +if a node has `group-1` and `group-2`, migrate both: |
| 85 | + |
| 86 | +```bash |
| 87 | +go run ./cmd/etcd-raft-migrate \ |
| 88 | + --fsm-store /var/lib/elastickv/n1/group-1/fsm.db \ |
| 89 | + --dest /var/lib/elastickv-etcd/n1/group-1 \ |
| 90 | + --peers n1=10.0.0.11:50051,n2=10.0.0.12:50051,n3=10.0.0.13:50051 |
| 91 | + |
| 92 | +go run ./cmd/etcd-raft-migrate \ |
| 93 | + --fsm-store /var/lib/elastickv/n1/group-2/fsm.db \ |
| 94 | + --dest /var/lib/elastickv-etcd/n1/group-2 \ |
| 95 | + --peers n1=10.0.0.11:50061,n2=10.0.0.12:50061,n3=10.0.0.13:50061 |
| 96 | +``` |
| 97 | + |
| 98 | +Use the group-specific raft addresses for each peer set. |
| 99 | + |
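With more than a couple of groups, generating the commands from a small table reduces the chance of pairing a group with the wrong peer addresses. The following is a sketch under stated assumptions: `print_migrate_cmd` is a hypothetical helper, the directory layout follows the example above, and the commands are only printed so they can be reviewed before anything runs.

```bash
# print_migrate_cmd NODE GROUP PEERS
# Prints (does not run) the migrator command for one group on one node.
print_migrate_cmd() {
  printf 'go run ./cmd/etcd-raft-migrate --fsm-store %s --dest %s --peers %s\n' \
    "/var/lib/elastickv/$1/$2/fsm.db" \
    "/var/lib/elastickv-etcd/$1/$2" \
    "$3"
}

# One "group peers" pair per line; the values mirror the example above.
while read -r group peers; do
  print_migrate_cmd n1 "$group" "$peers"
done <<'EOF'
group-1 n1=10.0.0.11:50051,n2=10.0.0.12:50051,n3=10.0.0.13:50051
group-2 n1=10.0.0.11:50061,n2=10.0.0.12:50061,n3=10.0.0.13:50061
EOF
```

Printing instead of executing keeps the step reviewable. Once the output matches the inventory taken during the preconditions, the lines can be run on the node as they are.
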
| 100 | +## Startup Validation |
| 101 | + |
| 102 | +After migration: |
| 103 | + |
| 104 | +1. Start all nodes with `--raftEngine etcd` or rely on the default. |
| 105 | +2. Verify the cluster forms with the expected membership. |
| 106 | +3. Confirm reads and writes succeed through the primary protocol you use. |
| 107 | + |
| 108 | +Useful checks: |
| 109 | + |
| 110 | +```bash |
| 111 | +go run ./cmd/raftadmin 10.0.0.11:50051 state |
| 112 | +go run ./cmd/raftadmin 10.0.0.11:50051 leader |
| 113 | +go run ./cmd/client/client.go |
| 114 | +``` |
| 115 | + |
| 116 | +For Redis: |
| 117 | + |
| 118 | +```bash |
| 119 | +redis-cli -h 10.0.0.11 -p 6379 SET migration:smoke ok |
| 120 | +redis-cli -h 10.0.0.12 -p 6379 GET migration:smoke |
| 121 | +``` |
| 122 | + |
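The Redis check above can be extended to read the smoke key back from every node. This is a minimal sketch: `smoke_read` and the `REDIS_CLI` override are hypothetical names, and it assumes every node serves Redis on port 6379 as in the example.

```bash
# Path to redis-cli; overridable so the loop can be exercised with a stub.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# smoke_read KEY EXPECTED HOST...
# Reads KEY from every host on port 6379 and fails on the first mismatch.
smoke_read() (
  key="$1"; expected="$2"; shift 2
  for host in "$@"; do
    got=$("$REDIS_CLI" -h "$host" -p 6379 GET "$key") || exit 1
    if [ "$got" != "$expected" ]; then
      echo "$host: got '$got', want '$expected'" >&2
      exit 1
    fi
    echo "$host: ok"
  done
)

# Example, after the SET shown above:
# smoke_read migration:smoke ok 10.0.0.11 10.0.0.12 10.0.0.13
```
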
| 123 | +## CI and Jepsen Validation |
| 124 | + |
| 125 | +The repo now validates the default etcd runtime in CI: |
| 126 | + |
| 127 | +1. `.github/workflows/jepsen-test.yml` launches a 3-node etcd-backed cluster |
| 128 | +2. Jepsen local workloads default to `--raft-engine etcd` |
| 129 | + |
| 130 | +Before production rollout, run at least: |
| 131 | + |
| 132 | +```bash |
| 133 | +GOCACHE=$(pwd)/.cache GOTMPDIR=$(pwd)/.cache/tmp go test ./... |
| 134 | +GOCACHE=$(pwd)/.cache GOLANGCI_LINT_CACHE=$(pwd)/.golangci-cache golangci-lint run ./... --timeout=5m |
| 135 | +``` |
| 136 | + |
| 137 | +For local Jepsen validation: |
| 138 | + |
| 139 | +```bash |
| 140 | +cd jepsen |
| 141 | +HOME=$(pwd)/tmp-home \ |
| 142 | +LEIN_HOME=$(pwd)/.lein \ |
| 143 | +LEIN_JVM_OPTS="-Duser.home=$(pwd)/tmp-home" \ |
| 144 | +/tmp/lein run -m elastickv.redis-workload --local --raft-engine etcd |
| 145 | +``` |
| 146 | + |
| 147 | +## Rolling Operations After Cutover |
| 148 | + |
| 149 | +After a cluster is already on etcd: |
| 150 | + |
| 151 | +1. `scripts/rolling-update.sh` defaults to `RAFT_ENGINE=etcd` |
| 152 | +2. `cmd/raftadmin` can be used for `add_voter`, `remove_server`, and leadership transfer |
| 153 | +3. Keep `RAFT_ENGINE=hashicorp` only for legacy nodes that have not migrated yet |
| 154 | + |
| 155 | +## Rollback |
| 156 | + |
| 157 | +Rollback is operationally simple, but it requires the original HashiCorp data dirs to be preserved untouched: |
| 158 | + |
| 159 | +1. Stop the etcd-backed cluster. |
| 160 | +2. Do not reuse the etcd data dirs with the HashiCorp runtime. |
| 161 | +3. Restart the old HashiCorp cluster from the original pre-migration dirs. |
| 162 | + |
| 163 | +If the old dirs were modified or discarded, rollback is no longer supported. |