[controller] Mark rollback versions as ROLLED_BACK instead of ERROR by misyel · Pull Request #2688 · linkedin/venice

misyel · 2026-04-02T22:19:20Z

Problem Statement

When a version is rolled back in Venice, the controller marks it as ERROR. This is semantically incorrect — a rollback is an intentional operational action, not a failure. It also causes the rolled-back version to be deleted immediately by the backup version cleanup service (since ERROR versions pass canDelete()), removing the ability to roll forward to that version if the rollback was premature.

There is no way to distinguish between a version that genuinely failed and one that was intentionally rolled back, and no retention window to allow operators to reverse a rollback decision.

Solution

Introduce a new ROLLED_BACK version status that accurately represents versions that were previously ONLINE but have been rolled back.

Status tracking:

Add ROLLED_BACK(8) to the VersionStatus enum and gRPC proto
Child controllers mark the previous version as ROLLED_BACK instead of ERROR during rollback
Parent controller polls child regions (with exponential backoff) to set the parent version to ROLLED_BACK (full rollback) or PARTIALLY_ONLINE (partial/region-filtered rollback)

Retention-based cleanup:

ROLLED_BACK versions are retained for 24 hours (configurable) before deletion, giving operators a window to roll forward
A dedicated cleanupRolledBackVersions method runs independently of the normal backup cleanup path
ROLLED_BACK versions are fully excluded from the normal cleanup path — version count checks, repush chain logic, broken chain fallback, and AbstractStore.retrieveVersionsToDelete all skip them
If all delete attempts fail, the method returns false so normal cleanup isn't suppressed for the cycle

Code changes

Added new code behind a config. If so list the config names and their default values in the PR description.
Introduced new log lines.
- Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

Code has no race conditions or thread safety issues.
Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

New unit tests added.
New integration tests added.
Modified or extended existing tests.
Verified backward compatibility (if applicable).

New unit tests

Does this PR introduce any user-facing or breaking changes?

No. You can skip the rest of this section.
Yes. Clearly explain the behavior change and its impact.

Copilot

Pull request overview

This PR introduces a new ROLLED_BACK version status to distinguish intentional rollbacks from genuine failures, and updates controller rollback + cleanup logic so rolled-back versions are retained (instead of being immediately treated as deletable ERROR versions).

Changes:

Added ROLLED_BACK to VersionStatus and controller gRPC proto enum.
Updated child rollback logic to mark the previous current version as ROLLED_BACK, and added parent polling to set the parent version status to ROLLED_BACK or PARTIALLY_ONLINE.
Updated backup version cleanup to exclude rolled-back versions from normal deletion and to delete them via a separate retention-based path; added unit tests for this behavior.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java	Child controller rollback marks the rolled-back version as `ROLLED_BACK` (was `ERROR`).
services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java	Parent controller polls child regions post-rollback and updates parent version status to `ROLLED_BACK`/`PARTIALLY_ONLINE`.
services/venice-controller/src/main/java/com/linkedin/venice/controller/StoreBackupVersionCleanupService.java	Adds a rolled-back retention cleanup path and excludes rolled-back versions from normal retention logic.
services/venice-controller/src/test/java/com/linkedin/venice/controller/TestStoreBackupVersionCleanupService.java	Adds unit tests covering rolled-back retention behavior and exclusion from normal cleanup readiness/deletion.
internal/venice-common/src/main/java/com/linkedin/venice/meta/VersionStatus.java	Adds `ROLLED_BACK(8)` and a helper `isVersionRolledBack`.
internal/venice-common/src/main/java/com/linkedin/venice/meta/AbstractStore.java	Prevents `ROLLED_BACK` versions from being deleted via `retrieveVersionsToDelete`.
internal/venice-common/src/main/proto/controller/ControllerGrpcRequestContext.proto	Adds `ROLLED_BACK = 8` to `VersionStatusGrpc`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

mynameborat · 2026-04-10T21:00:23Z

I think the idea to have ROLLED_BACK state separately is fine. I am worried about another code path and use case to account for in the StoreBackupVersionCleanupService. Can we not consolidate the cleanup behavior to any other versions?

Please share any insights into this decision of separating out.

mynameborat · 2026-04-10T22:17:34Z

+  /**
+   * Version was previously ONLINE but has been rolled back. It is no longer serving read requests.
+   */
+  ROLLED_BACK(8);


During a mixed deployment window, an old controller reading a store with status 8 will hit a value it doesn't recognize.

What does the deserialization path (presumably VersionStatus.fromValue(int)) do with an unknown value today — does it throw, return null, or have a fallback?

Can you trace through the deserialization path and confirm it's safe, or add an UNKNOWN fallback for forward compatibility before rolling this out?

Good catch - it will cause deserialization errors since there is no fallback value. I defaulted to NOT_CREATED for de-serialization exceptions because adding an UNKNOWN status will also run into this issue unless UNKNOWN is deployed fully before ROLLED_BACK is deployed

majisourav99 · 2026-04-10T22:57:58Z

+   * Retention time in milliseconds for rolled-back versions before they are eligible for deletion.
+   * Defaults to 24 hours.
+   */
+  public static final String CONTROLLER_ROLLED_BACK_VERSION_RETENTION_MS =


may be we want to consolidate the min retention time with this ? otherwise there will be too many retention times

Good callout on config proliferation. I considered consolidating with MIN_CLEANUP_DELAY_MS but they serve different purposes:

MIN_CLEANUP_DELAY_MS (default 1h) is a safety floor on the entire cleanup path — it prevents any backup version from being deleted before routers/servers have time to switch to the new current version. It's a store-wide gate, not a per-version retention. If we align this one to 24hr too, we won't be able to delete any store version for 24hrs which may be too long

ROLLED_BACK_VERSION_RETENTION_MS (default 24h) is a per-version retention window — it keeps rolled-back versions around so operators have time to reverse a premature rollback.

Copilot

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

misyel · 2026-04-13T22:18:16Z

I think the idea to have ROLLED_BACK state separately is fine. I am worried about another code path and use case to account for in the StoreBackupVersionCleanupService. Can we not consolidate the cleanup behavior to any other versions?

@mynameborat Thanks for raising this — the concern about maintaining a separate code path is fair.

Here's how the existing cleanup paths work today and why ROLLED_BACK doesn't fit cleanly into either:

Path 1 — Immediate deletion (canDelete() → ERROR/KILLED): Versions with canDelete() == true are collected first and deleted after only the minBackupVersionCleanupDelay (default 1h). This is too aggressive for ROLLED_BACK as DVC needs time to rebootstrap the future version and CDC needs time to get the new checkpoints
Path 2 — Standard backup retention (ONLINE versions): Gated by multiple preconditions: readiness count (>1 non-current version), per-store or default retention (7 days), repush chain logic, and optional router/server metadata validation. This is too long for ROLLED_BACK (7 days vs the 24h we want), and the readiness count / repush chain logic doesn't apply — a rolled-back version isn't a "backup" in the same sense
Current approach — Dedicated ROLLED_BACK retention: A separate cleanupRolledBackVersions method with its own configurable retention (default 24h) that runs before the standard paths. ROLLED_BACK is excluded from the normal paths via filters in the readiness count, repush chain and broken chain fallback

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-13T22:44:42Z

+      if (VersionStatus.isVersionRolledBack(version.getStatus())) {
+        // ROLLED_BACK versions are retained and cleaned up by StoreBackupVersionCleanupService
+        // with a dedicated retention period. Skip them here so the retention window is honored.
+        // Note: if the backup version retention-based cleanup service is disabled for the cluster,
+        // ROLLED_BACK versions will not be automatically deleted by this path. The admin tool's
+        // deleteOldVersion can still be used for manual cleanup in that case.
+        continue;
+      } else if (VersionStatus.canDelete(version.getStatus())) { // ERROR and KILLED versions are always deleted


ROLLED_BACK versions are unconditionally skipped from retrieveVersionsToDelete(), but the only automatic cleanup path for them appears to be StoreBackupVersionCleanupService, which is gated by controller.backup.version.retention.based.cleanup.enabled (default false). With that config disabled, rolled-back versions may never be deleted automatically and can accumulate indefinitely. Consider either not skipping ROLLED_BACK here when the cleanup service is disabled, or introducing/enabling a separate cleanup path/config specifically for rolled-back retention.

Suggested change

if (VersionStatus.isVersionRolledBack(version.getStatus())) {

// ROLLED_BACK versions are retained and cleaned up by StoreBackupVersionCleanupService

// with a dedicated retention period. Skip them here so the retention window is honored.

// Note: if the backup version retention-based cleanup service is disabled for the cluster,

// ROLLED_BACK versions will not be automatically deleted by this path. The admin tool's

// deleteOldVersion can still be used for manual cleanup in that case.

continue;

} else if (VersionStatus.canDelete(version.getStatus())) { // ERROR and KILLED versions are always deleted

if (VersionStatus.canDelete(version.getStatus()) || VersionStatus.isVersionRolledBack(version.getStatus())) {

// ERROR and KILLED versions are always deleted.

// Also include ROLLED_BACK versions here so they are not retained indefinitely when

// backup version retention-based cleanup is disabled.

Copilot · 2026-04-13T22:44:43Z

+    Set<String> targetedRegions = parseRegionsFilterList(regionFilter);
+    if (!targetedRegions.isEmpty()) {
+      Set<String> unknownRegions = new HashSet<>(targetedRegions);
+      unknownRegions.removeAll(controllerClients.keySet());
+      if (!unknownRegions.isEmpty()) {
+        LOGGER.warn(
+            "Region filter for rollback of store {} contains unknown regions: {}. Known regions: {}",
+            storeName,
+            unknownRegions,
+            controllerClients.keySet());
+        targetedRegions.removeAll(unknownRegions);
+      }
+    }
+
+    // Always poll ALL regions to detect cumulative rollback state. An operator may roll back
+    // regions one at a time; polling only targeted regions would miss that all regions are now
+    // rolled back and incorrectly leave the parent as PARTIALLY_ONLINE.
+    Set<String> allRegions = new HashSet<>(controllerClients.keySet());
+
+    // For targeted regions, the admin message was already consumed — if they're unreachable during
+    // polling, we can safely assume they rolled back. For non-targeted regions (being checked for
+    // cumulative rollback state from prior rollbacks), we cannot make that assumption.
+    Set<String> assumeRolledBackIfUnreachable = targetedRegions.isEmpty() ? allRegions : targetedRegions;
+


If regionFilter only contains unknown regions, targetedRegions becomes empty after removal, but assumeRolledBackIfUnreachable then falls back to allRegions. That makes an invalid/unknown filter behave like a full-cluster rollback when polling (counting unreachable/error regions as rolled back), which can incorrectly update the parent status. Consider treating the 'filter provided but no known regions remain' case as an error/early return, and only using the allRegions fallback when the filter is actually empty.

Copilot · 2026-04-13T22:44:43Z

+    // Otherwise mark as PARTIALLY_ONLINE.
+    VersionStatus parentStatus = rolledBackRegionCount == allRegions.size() ? ROLLED_BACK : PARTIALLY_ONLINE;
+
+    HelixVeniceClusterResources resources = getVeniceHelixAdmin().getHelixVeniceClusterResources(clusterName);
+    try (AutoCloseableLock ignore = resources.getClusterLockManager().createStoreWriteLock(storeName)) {
+      ReadWriteStoreRepository repository = resources.getStoreMetadataRepository();
+      Store store = repository.getStore(storeName);
+      store.updateVersionStatus(rolledBackVersionNum, parentStatus);
+      repository.updateStore(store);
+      LOGGER.info(
+          "Updated parent store {} version {} status to {} after rollback ({}/{} regions confirmed ROLLED_BACK)",
+          storeName,
+          rolledBackVersionNum,
+          parentStatus,
+          rolledBackRegionCount,
+          allRegions.size());
+    } catch (Exception e) {
+      LOGGER.error(
+          "Failed to update parent store {} version {} status to {} after rollback",
+          storeName,
+          rolledBackVersionNum,
+          parentStatus,


updateParentVersionStatusAfterRollback() sets the parent status to PARTIALLY_ONLINE whenever not all regions report ROLLED_BACK. If polling returns 0 (e.g., rollback didn’t actually take effect anywhere yet, or ZK propagation lag exceeded retries), this will still downgrade the parent version status from ONLINE to PARTIALLY_ONLINE. Consider only setting PARTIALLY_ONLINE when at least one region confirms ROLLED_BACK; otherwise keep the existing status and log that rollback confirmation was not observed.

Suggested change

// Otherwise mark as PARTIALLY_ONLINE.

VersionStatus parentStatus = rolledBackRegionCount == allRegions.size() ? ROLLED_BACK : PARTIALLY_ONLINE;

HelixVeniceClusterResources resources = getVeniceHelixAdmin().getHelixVeniceClusterResources(clusterName);

try (AutoCloseableLock ignore = resources.getClusterLockManager().createStoreWriteLock(storeName)) {

ReadWriteStoreRepository repository = resources.getStoreMetadataRepository();

Store store = repository.getStore(storeName);

store.updateVersionStatus(rolledBackVersionNum, parentStatus);

repository.updateStore(store);

LOGGER.info(

"Updated parent store {} version {} status to {} after rollback ({}/{} regions confirmed ROLLED_BACK)",

storeName,

rolledBackVersionNum,

parentStatus,

rolledBackRegionCount,

allRegions.size());

} catch (Exception e) {

LOGGER.error(

"Failed to update parent store {} version {} status to {} after rollback",

storeName,

rolledBackVersionNum,

parentStatus,

// If some, but not all, regions are rolled back, mark as PARTIALLY_ONLINE.

// If no region confirmed rollback, keep the existing parent status unchanged.

HelixVeniceClusterResources resources = getVeniceHelixAdmin().getHelixVeniceClusterResources(clusterName);

try (AutoCloseableLock ignore = resources.getClusterLockManager().createStoreWriteLock(storeName)) {

ReadWriteStoreRepository repository = resources.getStoreMetadataRepository();

Store store = repository.getStore(storeName);

if (rolledBackRegionCount == allRegions.size()) {

VersionStatus parentStatus = ROLLED_BACK;

store.updateVersionStatus(rolledBackVersionNum, parentStatus);

repository.updateStore(store);

LOGGER.info(

"Updated parent store {} version {} status to {} after rollback ({}/{} regions confirmed ROLLED_BACK)",

storeName,

rolledBackVersionNum,

parentStatus,

rolledBackRegionCount,

allRegions.size());

} else if (rolledBackRegionCount > 0) {

VersionStatus parentStatus = PARTIALLY_ONLINE;

store.updateVersionStatus(rolledBackVersionNum, parentStatus);

repository.updateStore(store);

LOGGER.info(

"Updated parent store {} version {} status to {} after rollback ({}/{} regions confirmed ROLLED_BACK)",

storeName,

rolledBackVersionNum,

parentStatus,

rolledBackRegionCount,

allRegions.size());

} else {

LOGGER.warn(

"No child region confirmed ROLLED_BACK for store {} version {} after rollback; leaving parent version status unchanged",

storeName,

rolledBackVersionNum);

}

} catch (Exception e) {

LOGGER.error(

"Failed to update parent store {} version {} status after rollback ({} of {} regions confirmed ROLLED_BACK)",

storeName,

rolledBackVersionNum,

rolledBackRegionCount,

allRegions.size(),

…opilot comments Adds `checkRollbackOriginVersionCapacityForNewPush` which rejects a new push while any ROLLED_BACK (or parent-side rollback-origin PARTIALLY_ONLINE) version is still within its retention window. The check runs on both the parent (`VeniceParentHelixAdmin.incrementVersionIdempotent`) and the child (`VeniceHelixAdmin.addVersion`), so the rejection surfaces synchronously to the VPJ instead of failing only in async admin-message consumption on the child. Integration tests cover both the block (within retention) and the release (after retention expires) paths. Also addresses Copilot comments on PR linkedin#2688: - linkedin#11 `assumeRolledBackIfUnreachable` is now empty for full-cluster rollbacks so unreachable regions don't inflate the ROLLED_BACK count. - linkedin#14 If the region filter contains only unknown regions, skip the parent status update instead of falling through into full-cluster behavior. - linkedin#15 If zero regions confirm ROLLED_BACK, leave parent status unchanged rather than downgrading to PARTIALLY_ONLINE on no evidence.

misyel requested a review from Copilot April 2, 2026 22:20

Copilot started reviewing on behalf of misyel April 2, 2026 22:20 View session

Copilot AI reviewed Apr 2, 2026

View reviewed changes

misyel marked this pull request as ready for review April 3, 2026 17:27

Copilot AI review requested due to automatic review settings April 3, 2026 17:27

Copilot started reviewing on behalf of misyel April 3, 2026 17:28 View session

Copilot AI reviewed Apr 3, 2026

View reviewed changes

mynameborat requested changes Apr 10, 2026

View reviewed changes

majisourav99 reviewed Apr 10, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings April 13, 2026 21:38

Copilot started reviewing on behalf of misyel April 13, 2026 21:38 View session

Copilot AI reviewed Apr 13, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings April 13, 2026 22:39

Copilot started reviewing on behalf of misyel April 13, 2026 22:40 View session

Copilot AI reviewed Apr 13, 2026

View reviewed changes

misyel added 10 commits April 22, 2026 11:30

mark roll back versions as ROLLED_BACK instead of ERROR

e36f48b

revert unneeded files

d1a7917

use executeWithMaxAttemptAndExponentialBackoff

fbcf5da

address copilot comments

4868d54

fix failing test

084f10b

address review comments

eca2625

handle poll failures for target rollback regions

193b719

fix failing tests

3fa8447

address copilot comments and fix metrics test

52b8439

misyel force-pushed the mkwong/rolledback-status branch from c3a5e87 to 9292ce9 Compare April 22, 2026 23:43

Conversation

misyel commented Apr 2, 2026

Problem Statement

Solution

Code changes

Concurrency-Specific Checks

How was this PR tested?

Does this PR introduce any user-facing or breaking changes?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mynameborat commented Apr 10, 2026

Uh oh!

mynameborat Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

misyel Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

majisourav99 Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

misyel Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

misyel commented Apr 13, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants