[controller] Mark rollback versions as ROLLED_BACK instead of ERROR#2688
[controller] Mark rollback versions as ROLLED_BACK instead of ERROR#2688misyel wants to merge 10 commits intolinkedin:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces a new ROLLED_BACK version status to distinguish intentional rollbacks from genuine failures, and updates controller rollback + cleanup logic so rolled-back versions are retained (instead of being immediately treated as deletable ERROR versions).
Changes:
- Added
ROLLED_BACKtoVersionStatusand controller gRPC proto enum. - Updated child rollback logic to mark the previous current version as
ROLLED_BACK, and added parent polling to set the parent version status toROLLED_BACKorPARTIALLY_ONLINE. - Updated backup version cleanup to exclude rolled-back versions from normal deletion and to delete them via a separate retention-based path; added unit tests for this behavior.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java | Child controller rollback marks the rolled-back version as ROLLED_BACK (was ERROR). |
| services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java | Parent controller polls child regions post-rollback and updates parent version status to ROLLED_BACK/PARTIALLY_ONLINE. |
| services/venice-controller/src/main/java/com/linkedin/venice/controller/StoreBackupVersionCleanupService.java | Adds a rolled-back retention cleanup path and excludes rolled-back versions from normal retention logic. |
| services/venice-controller/src/test/java/com/linkedin/venice/controller/TestStoreBackupVersionCleanupService.java | Adds unit tests covering rolled-back retention behavior and exclusion from normal cleanup readiness/deletion. |
| internal/venice-common/src/main/java/com/linkedin/venice/meta/VersionStatus.java | Adds ROLLED_BACK(8) and a helper isVersionRolledBack. |
| internal/venice-common/src/main/java/com/linkedin/venice/meta/AbstractStore.java | Prevents ROLLED_BACK versions from being deleted via retrieveVersionsToDelete. |
| internal/venice-common/src/main/proto/controller/ControllerGrpcRequestContext.proto | Adds ROLLED_BACK = 8 to VersionStatusGrpc. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
I think the idea to have Please share any insights into this decision of separating out. |
| /** | ||
| * Version was previously ONLINE but has been rolled back. It is no longer serving read requests. | ||
| */ | ||
| ROLLED_BACK(8); |
There was a problem hiding this comment.
During a mixed deployment window, an old controller reading a store with status 8 will hit a value it doesn't recognize.
What does the deserialization path (presumably VersionStatus.fromValue(int)) do with an unknown value today — does it throw, return null, or have a fallback?
Can you trace through the deserialization path and confirm it's safe, or add an UNKNOWN fallback for forward compatibility before rolling this out?
There was a problem hiding this comment.
Good catch - it will cause deserialization errors since there is no fallback value. I defaulted to NOT_CREATED for de-serialization exceptions because adding an UNKNOWN status will also run into this issue unless UNKNOWN is deployed fully before ROLLED_BACK is deployed
| * Retention time in milliseconds for rolled-back versions before they are eligible for deletion. | ||
| * Defaults to 24 hours. | ||
| */ | ||
| public static final String CONTROLLER_ROLLED_BACK_VERSION_RETENTION_MS = |
There was a problem hiding this comment.
may be we want to consolidate the min retention time with this ? otherwise there will be too many retention times
There was a problem hiding this comment.
Good callout on config proliferation. I considered consolidating with MIN_CLEANUP_DELAY_MS but they serve different purposes:
- MIN_CLEANUP_DELAY_MS (default 1h) is a safety floor on the entire cleanup path — it prevents any backup version from being deleted before routers/servers have time to switch to the new current version. It's a store-wide gate, not a per-version retention. If we align this one to 24hr too, we won't be able to delete any store version for 24hrs which may be too long
- ROLLED_BACK_VERSION_RETENTION_MS (default 24h) is a per-version retention window — it keeps rolled-back versions around so operators have time to reverse a premature rollback.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
@mynameborat Thanks for raising this — the concern about maintaining a separate code path is fair. Here's how the existing cleanup paths work today and why ROLLED_BACK doesn't fit cleanly into either:
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (VersionStatus.isVersionRolledBack(version.getStatus())) { | ||
| // ROLLED_BACK versions are retained and cleaned up by StoreBackupVersionCleanupService | ||
| // with a dedicated retention period. Skip them here so the retention window is honored. | ||
| // Note: if the backup version retention-based cleanup service is disabled for the cluster, | ||
| // ROLLED_BACK versions will not be automatically deleted by this path. The admin tool's | ||
| // deleteOldVersion can still be used for manual cleanup in that case. | ||
| continue; | ||
| } else if (VersionStatus.canDelete(version.getStatus())) { // ERROR and KILLED versions are always deleted |
There was a problem hiding this comment.
ROLLED_BACK versions are unconditionally skipped from retrieveVersionsToDelete(), but the only automatic cleanup path for them appears to be StoreBackupVersionCleanupService, which is gated by controller.backup.version.retention.based.cleanup.enabled (default false). With that config disabled, rolled-back versions may never be deleted automatically and can accumulate indefinitely. Consider either not skipping ROLLED_BACK here when the cleanup service is disabled, or introducing/enabling a separate cleanup path/config specifically for rolled-back retention.
| if (VersionStatus.isVersionRolledBack(version.getStatus())) { | |
| // ROLLED_BACK versions are retained and cleaned up by StoreBackupVersionCleanupService | |
| // with a dedicated retention period. Skip them here so the retention window is honored. | |
| // Note: if the backup version retention-based cleanup service is disabled for the cluster, | |
| // ROLLED_BACK versions will not be automatically deleted by this path. The admin tool's | |
| // deleteOldVersion can still be used for manual cleanup in that case. | |
| continue; | |
| } else if (VersionStatus.canDelete(version.getStatus())) { // ERROR and KILLED versions are always deleted | |
| if (VersionStatus.canDelete(version.getStatus()) || VersionStatus.isVersionRolledBack(version.getStatus())) { | |
| // ERROR and KILLED versions are always deleted. | |
| // Also include ROLLED_BACK versions here so they are not retained indefinitely when | |
| // backup version retention-based cleanup is disabled. |
| Set<String> targetedRegions = parseRegionsFilterList(regionFilter); | ||
| if (!targetedRegions.isEmpty()) { | ||
| Set<String> unknownRegions = new HashSet<>(targetedRegions); | ||
| unknownRegions.removeAll(controllerClients.keySet()); | ||
| if (!unknownRegions.isEmpty()) { | ||
| LOGGER.warn( | ||
| "Region filter for rollback of store {} contains unknown regions: {}. Known regions: {}", | ||
| storeName, | ||
| unknownRegions, | ||
| controllerClients.keySet()); | ||
| targetedRegions.removeAll(unknownRegions); | ||
| } | ||
| } | ||
|
|
||
| // Always poll ALL regions to detect cumulative rollback state. An operator may roll back | ||
| // regions one at a time; polling only targeted regions would miss that all regions are now | ||
| // rolled back and incorrectly leave the parent as PARTIALLY_ONLINE. | ||
| Set<String> allRegions = new HashSet<>(controllerClients.keySet()); | ||
|
|
||
| // For targeted regions, the admin message was already consumed — if they're unreachable during | ||
| // polling, we can safely assume they rolled back. For non-targeted regions (being checked for | ||
| // cumulative rollback state from prior rollbacks), we cannot make that assumption. | ||
| Set<String> assumeRolledBackIfUnreachable = targetedRegions.isEmpty() ? allRegions : targetedRegions; | ||
|
|
There was a problem hiding this comment.
If regionFilter only contains unknown regions, targetedRegions becomes empty after removal, but assumeRolledBackIfUnreachable then falls back to allRegions. That makes an invalid/unknown filter behave like a full-cluster rollback when polling (counting unreachable/error regions as rolled back), which can incorrectly update the parent status. Consider treating the 'filter provided but no known regions remain' case as an error/early return, and only using the allRegions fallback when the filter is actually empty.
| // Otherwise mark as PARTIALLY_ONLINE. | ||
| VersionStatus parentStatus = rolledBackRegionCount == allRegions.size() ? ROLLED_BACK : PARTIALLY_ONLINE; | ||
|
|
||
| HelixVeniceClusterResources resources = getVeniceHelixAdmin().getHelixVeniceClusterResources(clusterName); | ||
| try (AutoCloseableLock ignore = resources.getClusterLockManager().createStoreWriteLock(storeName)) { | ||
| ReadWriteStoreRepository repository = resources.getStoreMetadataRepository(); | ||
| Store store = repository.getStore(storeName); | ||
| store.updateVersionStatus(rolledBackVersionNum, parentStatus); | ||
| repository.updateStore(store); | ||
| LOGGER.info( | ||
| "Updated parent store {} version {} status to {} after rollback ({}/{} regions confirmed ROLLED_BACK)", | ||
| storeName, | ||
| rolledBackVersionNum, | ||
| parentStatus, | ||
| rolledBackRegionCount, | ||
| allRegions.size()); | ||
| } catch (Exception e) { | ||
| LOGGER.error( | ||
| "Failed to update parent store {} version {} status to {} after rollback", | ||
| storeName, | ||
| rolledBackVersionNum, | ||
| parentStatus, |
There was a problem hiding this comment.
updateParentVersionStatusAfterRollback() sets the parent status to PARTIALLY_ONLINE whenever not all regions report ROLLED_BACK. If polling returns 0 (e.g., rollback didn’t actually take effect anywhere yet, or ZK propagation lag exceeded retries), this will still downgrade the parent version status from ONLINE to PARTIALLY_ONLINE. Consider only setting PARTIALLY_ONLINE when at least one region confirms ROLLED_BACK; otherwise keep the existing status and log that rollback confirmation was not observed.
| // Otherwise mark as PARTIALLY_ONLINE. | |
| VersionStatus parentStatus = rolledBackRegionCount == allRegions.size() ? ROLLED_BACK : PARTIALLY_ONLINE; | |
| HelixVeniceClusterResources resources = getVeniceHelixAdmin().getHelixVeniceClusterResources(clusterName); | |
| try (AutoCloseableLock ignore = resources.getClusterLockManager().createStoreWriteLock(storeName)) { | |
| ReadWriteStoreRepository repository = resources.getStoreMetadataRepository(); | |
| Store store = repository.getStore(storeName); | |
| store.updateVersionStatus(rolledBackVersionNum, parentStatus); | |
| repository.updateStore(store); | |
| LOGGER.info( | |
| "Updated parent store {} version {} status to {} after rollback ({}/{} regions confirmed ROLLED_BACK)", | |
| storeName, | |
| rolledBackVersionNum, | |
| parentStatus, | |
| rolledBackRegionCount, | |
| allRegions.size()); | |
| } catch (Exception e) { | |
| LOGGER.error( | |
| "Failed to update parent store {} version {} status to {} after rollback", | |
| storeName, | |
| rolledBackVersionNum, | |
| parentStatus, | |
| // If some, but not all, regions are rolled back, mark as PARTIALLY_ONLINE. | |
| // If no region confirmed rollback, keep the existing parent status unchanged. | |
| HelixVeniceClusterResources resources = getVeniceHelixAdmin().getHelixVeniceClusterResources(clusterName); | |
| try (AutoCloseableLock ignore = resources.getClusterLockManager().createStoreWriteLock(storeName)) { | |
| ReadWriteStoreRepository repository = resources.getStoreMetadataRepository(); | |
| Store store = repository.getStore(storeName); | |
| if (rolledBackRegionCount == allRegions.size()) { | |
| VersionStatus parentStatus = ROLLED_BACK; | |
| store.updateVersionStatus(rolledBackVersionNum, parentStatus); | |
| repository.updateStore(store); | |
| LOGGER.info( | |
| "Updated parent store {} version {} status to {} after rollback ({}/{} regions confirmed ROLLED_BACK)", | |
| storeName, | |
| rolledBackVersionNum, | |
| parentStatus, | |
| rolledBackRegionCount, | |
| allRegions.size()); | |
| } else if (rolledBackRegionCount > 0) { | |
| VersionStatus parentStatus = PARTIALLY_ONLINE; | |
| store.updateVersionStatus(rolledBackVersionNum, parentStatus); | |
| repository.updateStore(store); | |
| LOGGER.info( | |
| "Updated parent store {} version {} status to {} after rollback ({}/{} regions confirmed ROLLED_BACK)", | |
| storeName, | |
| rolledBackVersionNum, | |
| parentStatus, | |
| rolledBackRegionCount, | |
| allRegions.size()); | |
| } else { | |
| LOGGER.warn( | |
| "No child region confirmed ROLLED_BACK for store {} version {} after rollback; leaving parent version status unchanged", | |
| storeName, | |
| rolledBackVersionNum); | |
| } | |
| } catch (Exception e) { | |
| LOGGER.error( | |
| "Failed to update parent store {} version {} status after rollback ({} of {} regions confirmed ROLLED_BACK)", | |
| storeName, | |
| rolledBackVersionNum, | |
| rolledBackRegionCount, | |
| allRegions.size(), |
…opilot comments Adds `checkRollbackOriginVersionCapacityForNewPush` which rejects a new push while any ROLLED_BACK (or parent-side rollback-origin PARTIALLY_ONLINE) version is still within its retention window. The check runs on both the parent (`VeniceParentHelixAdmin.incrementVersionIdempotent`) and the child (`VeniceHelixAdmin.addVersion`), so the rejection surfaces synchronously to the VPJ instead of failing only in async admin-message consumption on the child. Integration tests cover both the block (within retention) and the release (after retention expires) paths. Also addresses Copilot comments on PR linkedin#2688: - linkedin#11 `assumeRolledBackIfUnreachable` is now empty for full-cluster rollbacks so unreachable regions don't inflate the ROLLED_BACK count. - linkedin#14 If the region filter contains only unknown regions, skip the parent status update instead of falling through into full-cluster behavior. - linkedin#15 If zero regions confirm ROLLED_BACK, leave parent status unchanged rather than downgrading to PARTIALLY_ONLINE on no evidence.
c3a5e87 to
9292ce9
Compare
Problem Statement
When a version is rolled back in Venice, the controller marks it as ERROR. This is semantically incorrect — a rollback is an intentional operational action, not a failure. It also causes the rolled-back version to be deleted immediately by the backup version cleanup service (since ERROR versions pass canDelete()), removing the ability to roll forward to that version if the rollback was premature.
There is no way to distinguish between a version that genuinely failed and one that was intentionally rolled back, and no retention window to allow operators to reverse a rollback decision.
Solution
Introduce a new ROLLED_BACK version status that accurately represents versions that were previously ONLINE but have been rolled back.
Status tracking:
Retention-based cleanup:
Code changes
Concurrency-Specific Checks
Both reviewer and PR author to verify
synchronized,RWLock) are used where needed.ConcurrentHashMap,CopyOnWriteArrayList).How was this PR tested?
New unit tests
Does this PR introduce any user-facing or breaking changes?