Skip to content

[controller] Mark rollback versions as ROLLED_BACK instead of ERROR#2688

Open
misyel wants to merge 10 commits intolinkedin:mainfrom
misyel:mkwong/rolledback-status
Open

[controller] Mark rollback versions as ROLLED_BACK instead of ERROR#2688
misyel wants to merge 10 commits intolinkedin:mainfrom
misyel:mkwong/rolledback-status

Conversation

@misyel
Copy link
Copy Markdown
Contributor

@misyel misyel commented Apr 2, 2026

Problem Statement

When a version is rolled back in Venice, the controller marks it as ERROR. This is semantically incorrect — a rollback is an intentional operational action, not a failure. It also causes the rolled-back version to be deleted immediately by the backup version cleanup service (since ERROR versions pass canDelete()), removing the ability to roll forward to that version if the rollback was premature.

There is no way to distinguish between a version that genuinely failed and one that was intentionally rolled back, and no retention window to allow operators to reverse a rollback decision.

Solution

Introduce a new ROLLED_BACK version status that accurately represents versions that were previously ONLINE but have been rolled back.

Status tracking:

  • Add ROLLED_BACK(8) to the VersionStatus enum and gRPC proto
  • Child controllers mark the previous version as ROLLED_BACK instead of ERROR during rollback
  • Parent controller polls child regions (with exponential backoff) to set the parent version to ROLLED_BACK (full rollback) or PARTIALLY_ONLINE (partial/region-filtered rollback)

Retention-based cleanup:

  • ROLLED_BACK versions are retained for 24 hours (configurable) before deletion, giving operators a window to roll forward
  • A dedicated cleanupRolledBackVersions method runs independently of the normal backup cleanup path
  • ROLLED_BACK versions are fully excluded from the normal cleanup path — version count checks, repush chain logic, broken chain fallback, and AbstractStore.retrieveVersionsToDelete all skip them
  • If all delete attempts fail, the method returns false so normal cleanup isn't suppressed for the cycle

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

New unit tests

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new ROLLED_BACK version status to distinguish intentional rollbacks from genuine failures, and updates controller rollback + cleanup logic so rolled-back versions are retained (instead of being immediately treated as deletable ERROR versions).

Changes:

  • Added ROLLED_BACK to VersionStatus and controller gRPC proto enum.
  • Updated child rollback logic to mark the previous current version as ROLLED_BACK, and added parent polling to set the parent version status to ROLLED_BACK or PARTIALLY_ONLINE.
  • Updated backup version cleanup to exclude rolled-back versions from normal deletion and to delete them via a separate retention-based path; added unit tests for this behavior.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java Child controller rollback marks the rolled-back version as ROLLED_BACK (was ERROR).
services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java Parent controller polls child regions post-rollback and updates parent version status to ROLLED_BACK/PARTIALLY_ONLINE.
services/venice-controller/src/main/java/com/linkedin/venice/controller/StoreBackupVersionCleanupService.java Adds a rolled-back retention cleanup path and excludes rolled-back versions from normal retention logic.
services/venice-controller/src/test/java/com/linkedin/venice/controller/TestStoreBackupVersionCleanupService.java Adds unit tests covering rolled-back retention behavior and exclusion from normal cleanup readiness/deletion.
internal/venice-common/src/main/java/com/linkedin/venice/meta/VersionStatus.java Adds ROLLED_BACK(8) and a helper isVersionRolledBack.
internal/venice-common/src/main/java/com/linkedin/venice/meta/AbstractStore.java Prevents ROLLED_BACK versions from being deleted via retrieveVersionsToDelete.
internal/venice-common/src/main/proto/controller/ControllerGrpcRequestContext.proto Adds ROLLED_BACK = 8 to VersionStatusGrpc.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@misyel misyel marked this pull request as ready for review April 3, 2026 17:27
Copilot AI review requested due to automatic review settings April 3, 2026 17:27
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/venice-common/src/main/java/com/linkedin/venice/meta/VersionStatus.java Outdated
@mynameborat
Copy link
Copy Markdown
Contributor

I think the idea to have ROLLED_BACK state separately is fine. I am worried about another code path and use case to account for in the StoreBackupVersionCleanupService. Can we not consolidate the cleanup behavior to any other versions?

Please share any insights into this decision of separating out.

/**
* Version was previously ONLINE but has been rolled back. It is no longer serving read requests.
*/
ROLLED_BACK(8);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During a mixed deployment window, an old controller reading a store with status 8 will hit a value it doesn't recognize.

What does the deserialization path (presumably VersionStatus.fromValue(int)) do with an unknown value today — does it throw, return null, or have a fallback?

Can you trace through the deserialization path and confirm it's safe, or add an UNKNOWN fallback for forward compatibility before rolling this out?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - it will cause deserialization errors since there is no fallback value. I defaulted to NOT_CREATED for de-serialization exceptions because adding an UNKNOWN status will also run into this issue unless UNKNOWN is deployed fully before ROLLED_BACK is deployed

* Retention time in milliseconds for rolled-back versions before they are eligible for deletion.
* Defaults to 24 hours.
*/
public static final String CONTROLLER_ROLLED_BACK_VERSION_RETENTION_MS =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be we want to consolidate the min retention time with this ? otherwise there will be too many retention times

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good callout on config proliferation. I considered consolidating with MIN_CLEANUP_DELAY_MS but they serve different purposes:

  • MIN_CLEANUP_DELAY_MS (default 1h) is a safety floor on the entire cleanup path — it prevents any backup version from being deleted before routers/servers have time to switch to the new current version. It's a store-wide gate, not a per-version retention. If we align this one to 24hr too, we won't be able to delete any store version for 24hrs which may be too long
  • ROLLED_BACK_VERSION_RETENTION_MS (default 24h) is a per-version retention window — it keeps rolled-back versions around so operators have time to reverse a premature rollback.

Copilot AI review requested due to automatic review settings April 13, 2026 21:38
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@misyel
Copy link
Copy Markdown
Contributor Author

misyel commented Apr 13, 2026

I think the idea to have ROLLED_BACK state separately is fine. I am worried about another code path and use case to account for in the StoreBackupVersionCleanupService. Can we not consolidate the cleanup behavior to any other versions?

@mynameborat Thanks for raising this — the concern about maintaining a separate code path is fair.

Here's how the existing cleanup paths work today and why ROLLED_BACK doesn't fit cleanly into either:

  • Path 1 — Immediate deletion (canDelete() → ERROR/KILLED): Versions with canDelete() == true are collected first and deleted after only the minBackupVersionCleanupDelay (default 1h). This is too aggressive for ROLLED_BACK as DVC needs time to rebootstrap the future version and CDC needs time to get the new checkpoints

  • Path 2 — Standard backup retention (ONLINE versions): Gated by multiple preconditions: readiness count (>1 non-current version), per-store or default retention (7 days), repush chain logic, and optional router/server metadata validation. This is too long for ROLLED_BACK (7 days vs the 24h we want), and the readiness count / repush chain logic doesn't apply — a rolled-back version isn't a "backup" in the same sense

  • Current approach — Dedicated ROLLED_BACK retention: A separate cleanupRolledBackVersions method with its own configurable retention (default 24h) that runs before the standard paths. ROLLED_BACK is excluded from the normal paths via filters in the readiness count, repush chain and broken chain fallback

Copilot AI review requested due to automatic review settings April 13, 2026 22:39
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +328 to +335
if (VersionStatus.isVersionRolledBack(version.getStatus())) {
// ROLLED_BACK versions are retained and cleaned up by StoreBackupVersionCleanupService
// with a dedicated retention period. Skip them here so the retention window is honored.
// Note: if the backup version retention-based cleanup service is disabled for the cluster,
// ROLLED_BACK versions will not be automatically deleted by this path. The admin tool's
// deleteOldVersion can still be used for manual cleanup in that case.
continue;
} else if (VersionStatus.canDelete(version.getStatus())) { // ERROR and KILLED versions are always deleted
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ROLLED_BACK versions are unconditionally skipped from retrieveVersionsToDelete(), but the only automatic cleanup path for them appears to be StoreBackupVersionCleanupService, which is gated by controller.backup.version.retention.based.cleanup.enabled (default false). With that config disabled, rolled-back versions may never be deleted automatically and can accumulate indefinitely. Consider either not skipping ROLLED_BACK here when the cleanup service is disabled, or introducing/enabling a separate cleanup path/config specifically for rolled-back retention.

Suggested change
if (VersionStatus.isVersionRolledBack(version.getStatus())) {
// ROLLED_BACK versions are retained and cleaned up by StoreBackupVersionCleanupService
// with a dedicated retention period. Skip them here so the retention window is honored.
// Note: if the backup version retention-based cleanup service is disabled for the cluster,
// ROLLED_BACK versions will not be automatically deleted by this path. The admin tool's
// deleteOldVersion can still be used for manual cleanup in that case.
continue;
} else if (VersionStatus.canDelete(version.getStatus())) { // ERROR and KILLED versions are always deleted
if (VersionStatus.canDelete(version.getStatus()) || VersionStatus.isVersionRolledBack(version.getStatus())) {
// ERROR and KILLED versions are always deleted.
// Also include ROLLED_BACK versions here so they are not retained indefinitely when
// backup version retention-based cleanup is disabled.

Copilot uses AI. Check for mistakes.
Comment on lines +2514 to +2537
Set<String> targetedRegions = parseRegionsFilterList(regionFilter);
if (!targetedRegions.isEmpty()) {
Set<String> unknownRegions = new HashSet<>(targetedRegions);
unknownRegions.removeAll(controllerClients.keySet());
if (!unknownRegions.isEmpty()) {
LOGGER.warn(
"Region filter for rollback of store {} contains unknown regions: {}. Known regions: {}",
storeName,
unknownRegions,
controllerClients.keySet());
targetedRegions.removeAll(unknownRegions);
}
}

// Always poll ALL regions to detect cumulative rollback state. An operator may roll back
// regions one at a time; polling only targeted regions would miss that all regions are now
// rolled back and incorrectly leave the parent as PARTIALLY_ONLINE.
Set<String> allRegions = new HashSet<>(controllerClients.keySet());

// For targeted regions, the admin message was already consumed — if they're unreachable during
// polling, we can safely assume they rolled back. For non-targeted regions (being checked for
// cumulative rollback state from prior rollbacks), we cannot make that assumption.
Set<String> assumeRolledBackIfUnreachable = targetedRegions.isEmpty() ? allRegions : targetedRegions;

Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If regionFilter only contains unknown regions, targetedRegions becomes empty after removal, but assumeRolledBackIfUnreachable then falls back to allRegions. That makes an invalid/unknown filter behave like a full-cluster rollback when polling (counting unreachable/error regions as rolled back), which can incorrectly update the parent status. Consider treating the 'filter provided but no known regions remain' case as an error/early return, and only using the allRegions fallback when the filter is actually empty.

Copilot uses AI. Check for mistakes.
Comment on lines +2581 to +2602
// Otherwise mark as PARTIALLY_ONLINE.
VersionStatus parentStatus = rolledBackRegionCount == allRegions.size() ? ROLLED_BACK : PARTIALLY_ONLINE;

HelixVeniceClusterResources resources = getVeniceHelixAdmin().getHelixVeniceClusterResources(clusterName);
try (AutoCloseableLock ignore = resources.getClusterLockManager().createStoreWriteLock(storeName)) {
ReadWriteStoreRepository repository = resources.getStoreMetadataRepository();
Store store = repository.getStore(storeName);
store.updateVersionStatus(rolledBackVersionNum, parentStatus);
repository.updateStore(store);
LOGGER.info(
"Updated parent store {} version {} status to {} after rollback ({}/{} regions confirmed ROLLED_BACK)",
storeName,
rolledBackVersionNum,
parentStatus,
rolledBackRegionCount,
allRegions.size());
} catch (Exception e) {
LOGGER.error(
"Failed to update parent store {} version {} status to {} after rollback",
storeName,
rolledBackVersionNum,
parentStatus,
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updateParentVersionStatusAfterRollback() sets the parent status to PARTIALLY_ONLINE whenever not all regions report ROLLED_BACK. If polling returns 0 (e.g., rollback didn’t actually take effect anywhere yet, or ZK propagation lag exceeded retries), this will still downgrade the parent version status from ONLINE to PARTIALLY_ONLINE. Consider only setting PARTIALLY_ONLINE when at least one region confirms ROLLED_BACK; otherwise keep the existing status and log that rollback confirmation was not observed.

Suggested change
// Otherwise mark as PARTIALLY_ONLINE.
VersionStatus parentStatus = rolledBackRegionCount == allRegions.size() ? ROLLED_BACK : PARTIALLY_ONLINE;
HelixVeniceClusterResources resources = getVeniceHelixAdmin().getHelixVeniceClusterResources(clusterName);
try (AutoCloseableLock ignore = resources.getClusterLockManager().createStoreWriteLock(storeName)) {
ReadWriteStoreRepository repository = resources.getStoreMetadataRepository();
Store store = repository.getStore(storeName);
store.updateVersionStatus(rolledBackVersionNum, parentStatus);
repository.updateStore(store);
LOGGER.info(
"Updated parent store {} version {} status to {} after rollback ({}/{} regions confirmed ROLLED_BACK)",
storeName,
rolledBackVersionNum,
parentStatus,
rolledBackRegionCount,
allRegions.size());
} catch (Exception e) {
LOGGER.error(
"Failed to update parent store {} version {} status to {} after rollback",
storeName,
rolledBackVersionNum,
parentStatus,
// If some, but not all, regions are rolled back, mark as PARTIALLY_ONLINE.
// If no region confirmed rollback, keep the existing parent status unchanged.
HelixVeniceClusterResources resources = getVeniceHelixAdmin().getHelixVeniceClusterResources(clusterName);
try (AutoCloseableLock ignore = resources.getClusterLockManager().createStoreWriteLock(storeName)) {
ReadWriteStoreRepository repository = resources.getStoreMetadataRepository();
Store store = repository.getStore(storeName);
if (rolledBackRegionCount == allRegions.size()) {
VersionStatus parentStatus = ROLLED_BACK;
store.updateVersionStatus(rolledBackVersionNum, parentStatus);
repository.updateStore(store);
LOGGER.info(
"Updated parent store {} version {} status to {} after rollback ({}/{} regions confirmed ROLLED_BACK)",
storeName,
rolledBackVersionNum,
parentStatus,
rolledBackRegionCount,
allRegions.size());
} else if (rolledBackRegionCount > 0) {
VersionStatus parentStatus = PARTIALLY_ONLINE;
store.updateVersionStatus(rolledBackVersionNum, parentStatus);
repository.updateStore(store);
LOGGER.info(
"Updated parent store {} version {} status to {} after rollback ({}/{} regions confirmed ROLLED_BACK)",
storeName,
rolledBackVersionNum,
parentStatus,
rolledBackRegionCount,
allRegions.size());
} else {
LOGGER.warn(
"No child region confirmed ROLLED_BACK for store {} version {} after rollback; leaving parent version status unchanged",
storeName,
rolledBackVersionNum);
}
} catch (Exception e) {
LOGGER.error(
"Failed to update parent store {} version {} status after rollback ({} of {} regions confirmed ROLLED_BACK)",
storeName,
rolledBackVersionNum,
rolledBackRegionCount,
allRegions.size(),

Copilot uses AI. Check for mistakes.
misyel added 10 commits April 22, 2026 11:30
…opilot

comments

Adds `checkRollbackOriginVersionCapacityForNewPush` which rejects a new push
while any ROLLED_BACK (or parent-side rollback-origin PARTIALLY_ONLINE)
version
is still within its retention window. The check runs on both the parent
(`VeniceParentHelixAdmin.incrementVersionIdempotent`) and the child
(`VeniceHelixAdmin.addVersion`), so the rejection surfaces synchronously to
the VPJ instead of failing only in async admin-message consumption on the
child.

Integration tests cover both the block (within retention) and the release
(after retention expires) paths.

Also addresses Copilot comments on PR linkedin#2688:
- linkedin#11 `assumeRolledBackIfUnreachable` is now empty for full-cluster rollbacks
  so unreachable regions don't inflate the ROLLED_BACK count.
- linkedin#14 If the region filter contains only unknown regions, skip the parent
  status update instead of falling through into full-cluster behavior.
- linkedin#15 If zero regions confirm ROLLED_BACK, leave parent status unchanged
  rather than downgrading to PARTIALLY_ONLINE on no evidence.
@misyel misyel force-pushed the mkwong/rolledback-status branch from c3a5e87 to 9292ce9 Compare April 22, 2026 23:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants