feat: gate chainlink by validator mode#1227
Conversation
📝 WalkthroughWalkthroughAdds coordination-mode guards to HTTP handlers to reject write-like RPCs on replicas and short-circuits account/transaction ensure paths in replica mode. Refactors Chainlink into a runtime-backed lifecycle model with enable_primary/disable and runtime_fetch_cloner gating, and implements cooperative shutdown for FetchCloner and RemoteAccountProvider (including SubMuxClient and pubsub mocks). Reorders validator/replicator startup to run ordered primary/standby readiness sequences. Extensive tests and test-helper updates validate gating, lifecycle toggles, shutdowns, and sequencing. Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
magicblock-api/src/magic_validator.rs (1)
879-901:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftDon't treat
mode_tx.send(...)as a completed Primary transition.This only queues
SchedulerMode::Primary. The new HTTP/chainlink gates readCoordinationMode::current(), so startup can continue pastenable_primary()while the old mode is still visible. Please make the coordination-mode switch synchronous here, or wait for an acknowledgment before enabling primary-only behavior.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@magicblock-api/src/magic_validator.rs` around lines 879 - 901, The code currently only enqueues SchedulerMode::Primary via mode_tx.send(...) which doesn't guarantee CoordinationMode::current() has switched before calling chainlink.enable_primary(); change this to perform a synchronous coordination-mode transition or wait for an explicit acknowledgment: either (a) invoke the coordination-mode switch synchronously (so CoordinationMode::current() reflects Primary before calling chainlink.enable_primary()), or (b) extend the scheduler handshake to return/await an ack after processing SchedulerMode::Primary (e.g., send a request that the scheduler replies to) and only call chainlink.enable_primary() after receiving that ack; update the logic around mode_tx.send, SchedulerMode::Primary, and chainlink.enable_primary to ensure the coordination mode is visible before enabling primary-only behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@magicblock-replicator/src/service/context.rs`:
- Around line 244-249: The promotion/demotion path currently enqueues
SchedulerMode inside enter_primary_mode()/enter_replica_mode() but silently
drops send failures, so the rest of
run_primary_readiness_sequence()/run_replica_readiness_sequence() can proceed
while CoordinationMode::current() is stale or the receiver is gone; change
enter_primary_mode and enter_replica_mode to return Result<(), Error> (or a
suitable error type) and have them propagate the failure of mode_tx.send(...)
instead of ignoring it, then update the calls in run_primary_readiness_sequence
and the analogous replica sequence to invoke those methods through the fallible
closures you already pass (e.g. replace || self.enter_primary_mode() with a
closure that returns the Result), so the readiness sequence aborts on
send/send-failure and the mode switch is externally visible; ensure
SchedulerMode send failures are converted into the same error type you return so
callers can handle them.
---
Outside diff comments:
In `@magicblock-api/src/magic_validator.rs`:
- Around line 879-901: The code currently only enqueues SchedulerMode::Primary
via mode_tx.send(...) which doesn't guarantee CoordinationMode::current() has
switched before calling chainlink.enable_primary(); change this to perform a
synchronous coordination-mode transition or wait for an explicit acknowledgment:
either (a) invoke the coordination-mode switch synchronously (so
CoordinationMode::current() reflects Primary before calling
chainlink.enable_primary()), or (b) extend the scheduler handshake to
return/await an ack after processing SchedulerMode::Primary (e.g., send a
request that the scheduler replies to) and only call chainlink.enable_primary()
after receiving that ack; update the logic around mode_tx.send,
SchedulerMode::Primary, and chainlink.enable_primary to ensure the coordination
mode is visible before enabling primary-only behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 43480cb3-ed7b-4cd8-8ece-e9efe4b990f5
📒 Files selected for processing (10)
magicblock-aperture/src/requests/http/mod.rsmagicblock-aperture/src/requests/http/send_transaction.rsmagicblock-aperture/src/requests/http/simulate_transaction.rsmagicblock-api/src/magic_validator.rsmagicblock-chainlink/src/chainlink/mod.rsmagicblock-chainlink/tests/10_non_primary_gating.rsmagicblock-chainlink/tests/utils/test_context.rsmagicblock-replicator/src/error.rsmagicblock-replicator/src/service/context.rsmagicblock-replicator/src/service/mod.rs
| run_primary_readiness_sequence( | ||
| || self.scheduler.wait_for_idle(), | ||
| || self.chainlink.reset_accounts_bank().map_err(Into::into), | ||
| || self.enter_primary_mode(), | ||
| || self.chainlink.enable_primary().map_err(Into::into), | ||
| ) |
There was a problem hiding this comment.
Promotion/demotion still race on stale coordination-mode visibility.
These helpers continue after enter_primary_mode() / enter_replica_mode(), but those methods only enqueue SchedulerMode on mode_tx and currently drop send failures. Since the new HTTP and chainlink gates consult CoordinationMode::current(), the node can keep behaving in the old mode—or proceed after the receiver is gone—while reset/enable/disable and consumer setup continue. Make the mode switch fallible and externally visible before running the rest of the sequence.
Also applies to: 264-268
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@magicblock-replicator/src/service/context.rs` around lines 244 - 249, The
promotion/demotion path currently enqueues SchedulerMode inside
enter_primary_mode()/enter_replica_mode() but silently drops send failures, so
the rest of run_primary_readiness_sequence()/run_replica_readiness_sequence()
can proceed while CoordinationMode::current() is stale or the receiver is gone;
change enter_primary_mode and enter_replica_mode to return Result<(), Error> (or
a suitable error type) and have them propagate the failure of mode_tx.send(...)
instead of ignoring it, then update the calls in run_primary_readiness_sequence
and the analogous replica sequence to invoke those methods through the fallible
closures you already pass (e.g. replace || self.enter_primary_mode() with a
closure that returns the Result), so the readiness sequence aborts on
send/send-failure and the mode switch is externally visible; ensure
SchedulerMode send failures are converted into the same error type you return so
callers can handle them.
* master: fix: CommitFinalize post-commit actions (#1220)
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@magicblock-api/src/magic_validator.rs`:
- Around line 133-135: When calling enable_primary().await before
send_scheduler_primary().await, ensure you roll back the primary enable if
send_scheduler_primary() fails: after a failed send_scheduler_primary(), call
the complementary disable_primary().await (or the appropriate shutdown/disable
function corresponding to enable_primary) to revert the partial startup, log any
rollback errors, and then return the original send error; apply the same
rollback pattern to the other occurrence around lines 912-933 where
enable_primary/send_scheduler_primary are paired.
In `@magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs`:
- Around line 308-313: The fail_all_pending_clones helper currently calls
.expect() on the pending_clones mutex which can panic during shutdown; change it
to handle a poisoned mutex instead—either change fail_all_pending_clones to
return Result<(), E> (propagate an error) or recover the lock with
lock().unwrap_or_else(|poisoned| poisoned.into_inner()) and log the poisoning
before proceeding; update uses of pending_clones in fail_all_pending_clones and
any callers accordingly and ensure you reference the pending_clones Mutex and
the fail_all_pending_clones function name when making the change.
In `@magicblock-chainlink/src/chainlink/mod.rs`:
- Around line 610-620: The method runtime_fetch_cloner currently acquires
self.runtime.lock().await before checking lifecycle_state, which can block; add
a fast-path check using
ChainlinkLifecycleState::from_u8(self.lifecycle_state.load(Ordering::SeqCst)) !=
ChainlinkLifecycleState::Enabled to return None immediately if not Enabled, then
after acquiring the runtime guard re-check the lifecycle_state to avoid a race
handing out a FetchCloner from runtime.fetch_cloner.clone(); update
runtime_fetch_cloner to perform (1) an early lifecycle check, (2) lock
self.runtime, and (3) a second lifecycle check before returning the cloned
FetchCloner.
In `@magicblock-chainlink/src/remote_account_provider/mod.rs`:
- Around line 871-884: The teardown code currently calls .expect("... lock
poisoned") on the removed_account_rx and fetching_accounts mutexes (via
removed_account_rx.lock() and fetching_accounts.lock()), which can panic;
replace these expect() calls with handling that recovers from a poisoned mutex
or propagates the error: use match or .lock().map_err(...) and either call
e.into_inner() to retrieve the inner value and continue cleanup, or
return/propagate the error from the teardown function (or log and continue) so
the process does not abort; update the blocks referencing removed_account_rx and
fetching_accounts to use this non-panicking behavior.
- Around line 824-899: Shutdown does not stop the fire-and-forget
spawn_deferred_pubsub_clients task, allowing late connections to reattach
clients via SubMuxClient::add_client; fix it by having
spawn_deferred_pubsub_clients return and store a JoinHandle (e.g.,
deferred_pubsub_task_handle) on the RemoteAccountProvider struct and/or make
that task check the existing shutdown_token before attaching new clients
(SubMuxClient::add_client), then in RemoteAccountProvider::shutdown() cancel the
shutdown_token (already present) and also take/timeout/abort and await the
deferred_pubsub_task_handle similar to account_updates_task_handle and
active_subscriptions_task_handle so the background task cannot resubscribe after
shutdown.
In `@magicblock-chainlink/src/submux/mod.rs`:
- Around line 948-964: The shutdown() function currently uses .expect() on three
mutexes (program_subs, dedup_cache, debounce_states) which can panic if
poisoned; change these to non-panicking handling: replace each
.lock().expect("...") call with code that matches on the Result from .lock(),
and either (a) recover the poisoned mutex by calling e.into_inner() and proceed
(optionally logging the poison), or (b) convert the poison into a
RemoteAccountProviderResult error and return early—choose one consistent
approach; ensure
AccountSubscriptionTask::Shutdown.process(self.clients_snapshot()).await and
shutdown_token.cancel() remain unaffected and include contextual logging when
recovering or returning an error so shutdown() does not abort due to a poisoned
mutex.
In `@magicblock-chainlink/tests/10_non_primary_gating.rs`:
- Around line 144-150: The loop currently iterates modes but never applies them
— change the test to actually configure the startup/coordination mode per
iteration by either adding a parameter to new_endpoint_chainlink(mode) or by
calling a setter on the returned chainlink (e.g., invoke a method like
chainlink.set_mode(mode) or configure the coordination/startup before running
assertions); ensure you use the mode string from the loop when
creating/configuring the endpoint so that each iteration exercises Replica,
ReplicaOnly, and StandBy behavior prior to calling
assert_no_runtime_or_remote_work and assert_public_calls_do_not_start_runtime.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 3b4dbb6b-a8d1-4ece-b4ad-4f096f6c9596
📒 Files selected for processing (19)
magicblock-api/src/magic_validator.rsmagicblock-chainlink/src/chainlink/errors.rsmagicblock-chainlink/src/chainlink/fetch_cloner/mod.rsmagicblock-chainlink/src/chainlink/fetch_cloner/tests.rsmagicblock-chainlink/src/chainlink/mod.rsmagicblock-chainlink/src/remote_account_provider/chain_laser_actor/actor.rsmagicblock-chainlink/src/remote_account_provider/chain_laser_actor/stream_manager.rsmagicblock-chainlink/src/remote_account_provider/chain_pubsub_actor.rsmagicblock-chainlink/src/remote_account_provider/chain_pubsub_client.rsmagicblock-chainlink/src/remote_account_provider/chain_updates_client.rsmagicblock-chainlink/src/remote_account_provider/endpoint.rsmagicblock-chainlink/src/remote_account_provider/mod.rsmagicblock-chainlink/src/remote_account_provider/tests.rsmagicblock-chainlink/src/submux/mod.rsmagicblock-chainlink/tests/09_waiter_reconciliation_race.rsmagicblock-chainlink/tests/10_non_primary_gating.rsmagicblock-chainlink/tests/11_lifecycle_shutdown.rsmagicblock-chainlink/tests/utils/test_context.rsmagicblock-replicator/src/service/context.rs
| enable_primary().await?; | ||
| send_scheduler_primary().await?; | ||
| Ok(()) |
There was a problem hiding this comment.
Add rollback if scheduler-primary publish fails after enable_primary().
Current ordering can return Err with Chainlink already enabled, leaving a partial startup state (runtime active, scheduler not in primary mode).
Suggested rollback pattern
async fn run_standalone_primary_start_sequence<
ResetBank,
EnablePrimary,
EnablePrimaryFuture,
+ DisablePrimary,
+ DisablePrimaryFuture,
SendSchedulerPrimary,
SendSchedulerPrimaryFuture,
>(
should_reset_accounts_bank: bool,
reset_bank: ResetBank,
enable_primary: EnablePrimary,
+ disable_primary: DisablePrimary,
send_scheduler_primary: SendSchedulerPrimary,
) -> ApiResult<()>
where
ResetBank: FnOnce() -> ApiResult<()>,
EnablePrimary: FnOnce() -> EnablePrimaryFuture,
EnablePrimaryFuture: Future<Output = ApiResult<()>>,
+ DisablePrimary: FnOnce() -> DisablePrimaryFuture,
+ DisablePrimaryFuture: Future<Output = ApiResult<()>>,
SendSchedulerPrimary: FnOnce() -> SendSchedulerPrimaryFuture,
SendSchedulerPrimaryFuture: Future<Output = ApiResult<()>>,
{
if should_reset_accounts_bank {
reset_bank()?;
}
enable_primary().await?;
- send_scheduler_primary().await?;
+ if let Err(err) = send_scheduler_primary().await {
+ let _ = disable_primary().await; // best-effort rollback
+ return Err(err);
+ }
Ok(())
} run_standalone_primary_start_sequence(
!self.config.accountsdb.reset,
|| {
let step_start = Instant::now();
self.chainlink.reset_accounts_bank()?;
log_timing("startup", "reset_accounts_bank", step_start);
Ok(())
},
|| async {
self.chainlink.enable_primary().await.map_err(Into::into)
},
+ || async {
+ self.chainlink.disable().await.map_err(Into::into)
+ },
|| async {
self.mode_tx.send(SchedulerMode::Primary).await.map_err(
|e| {
ApiError::FailedToSendModeSwitch(format!(
"Failed to send primary mode to scheduler: {e}"
))
},
)
},
)Also applies to: 912-933
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@magicblock-api/src/magic_validator.rs` around lines 133 - 135, When calling
enable_primary().await before send_scheduler_primary().await, ensure you roll
back the primary enable if send_scheduler_primary() fails: after a failed
send_scheduler_primary(), call the complementary disable_primary().await (or the
appropriate shutdown/disable function corresponding to enable_primary) to revert
the partial startup, log any rollback errors, and then return the original send
error; apply the same rollback pattern to the other occurrence around lines
912-933 where enable_primary/send_scheduler_primary are paired.
| fn fail_all_pending_clones(&self) { | ||
| let pending = { | ||
| let mut map = self | ||
| .pending_clones | ||
| .lock() | ||
| .expect("pending_clones mutex poisoned"); |
There was a problem hiding this comment.
Don’t panic in the shutdown path.
Line 313 can crash validator teardown if pending_clones was poisoned. This helper should return an error or explicitly recover the inner state instead of aborting the process. As per coding guidelines, Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs` around lines 308 -
313, The fail_all_pending_clones helper currently calls .expect() on the
pending_clones mutex which can panic during shutdown; change it to handle a
poisoned mutex instead—either change fail_all_pending_clones to return
Result<(), E> (propagate an error) or recover the lock with
lock().unwrap_or_else(|poisoned| poisoned.into_inner()) and log the poisoning
before proceeding; update uses of pending_clones in fail_all_pending_clones and
any callers accordingly and ensure you reference the pending_clones Mutex and
the fail_all_pending_clones function name when making the change.
| async fn runtime_fetch_cloner( | ||
| &self, | ||
| ) -> Option<Arc<FetchCloner<T, U, V, C>>> { | ||
| let runtime = self.runtime.lock().await; | ||
| if ChainlinkLifecycleState::from_u8( | ||
| self.lifecycle_state.load(Ordering::SeqCst), | ||
| ) != ChainlinkLifecycleState::Enabled | ||
| { | ||
| return None; | ||
| } | ||
| runtime.as_ref().map(|runtime| runtime.fetch_cloner.clone()) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Check the file exists and get basic info
wc -l magicblock-chainlink/src/chainlink/mod.rsRepository: magicblock-labs/magicblock-validator
Length of output: 126
🏁 Script executed:
# View the runtime_fetch_cloner method (around lines 610-620)
sed -n '600,630p' magicblock-chainlink/src/chainlink/mod.rsRepository: magicblock-labs/magicblock-validator
Length of output: 1066
🏁 Script executed:
# View the enable_primary method (around lines 697-775)
sed -n '690,780p' magicblock-chainlink/src/chainlink/mod.rsRepository: magicblock-labs/magicblock-validator
Length of output: 3058
🏁 Script executed:
# Search for any .unwrap() or .expect() calls in the file
rg -n '\.(unwrap|expect)\(' magicblock-chainlink/src/chainlink/mod.rsRepository: magicblock-labs/magicblock-validator
Length of output: 62
Check lifecycle state before acquiring the runtime mutex to avoid blocking on state transitions.
runtime_fetch_cloner() blocks waiting for self.runtime.lock() before checking whether the lifecycle is Enabled. Since enable_primary() holds that same lock during the async create_runtime_from_build_config() call, any callers of fetch_accounts() or ensure_accounts() will stall in the Starting state instead of immediately returning None. Add a fast-path check of lifecycle_state before locking, then re-check after acquiring the guard to prevent a concurrent demotion from handing out a FetchCloner.
Suggested fix
async fn runtime_fetch_cloner(
&self,
) -> Option<Arc<FetchCloner<T, U, V, C>>> {
+ if ChainlinkLifecycleState::from_u8(
+ self.lifecycle_state.load(Ordering::SeqCst),
+ ) != ChainlinkLifecycleState::Enabled
+ {
+ return None;
+ }
let runtime = self.runtime.lock().await;
if ChainlinkLifecycleState::from_u8(
self.lifecycle_state.load(Ordering::SeqCst),
) != ChainlinkLifecycleState::Enabled
{
return None;
}
runtime.as_ref().map(|runtime| runtime.fetch_cloner.clone())
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@magicblock-chainlink/src/chainlink/mod.rs` around lines 610 - 620, The method
runtime_fetch_cloner currently acquires self.runtime.lock().await before
checking lifecycle_state, which can block; add a fast-path check using
ChainlinkLifecycleState::from_u8(self.lifecycle_state.load(Ordering::SeqCst)) !=
ChainlinkLifecycleState::Enabled to return None immediately if not Enabled, then
after acquiring the runtime guard re-check the lifecycle_state to avoid a race
handing out a FetchCloner from runtime.fetch_cloner.clone(); update
runtime_fetch_cloner to perform (1) an early lifecycle check, (2) lock
self.runtime, and (3) a second lifecycle check before returning the cloned
FetchCloner.
| pub async fn shutdown(&self) -> RemoteAccountProviderResult<()> { | ||
| self.shutdown_token.cancel(); | ||
|
|
||
| let shutdown_result = match self.pubsub_client.shutdown().await { | ||
| Ok(()) => Ok(()), | ||
| Err(err) => { | ||
| warn!(error = %err, "Remote account pubsub shutdown failed"); | ||
| Err(err) | ||
| } | ||
| }; | ||
|
|
||
| if let Some(mut handle) = | ||
| self.account_updates_task_handle.lock().await.take() | ||
| { | ||
| match time::timeout(Duration::from_secs(2), &mut handle).await { | ||
| Ok(Ok(())) => {} | ||
| Ok(Err(err)) => { | ||
| warn!(error = %err, "Account updates task failed during shutdown"); | ||
| } | ||
| Err(_) => { | ||
| warn!( | ||
| "Account updates task did not stop in time; aborting" | ||
| ); | ||
| handle.abort(); | ||
| let _ = handle.await; | ||
| } | ||
| } | ||
| } | ||
|
|
||
| if let Some(mut handle) = | ||
| self.active_subscriptions_task_handle.lock().await.take() | ||
| { | ||
| match time::timeout(Duration::from_secs(2), &mut handle).await { | ||
| Ok(Ok(())) => {} | ||
| Ok(Err(err)) => { | ||
| warn!(error = %err, "Active subscriptions task failed during shutdown"); | ||
| } | ||
| Err(_) => { | ||
| warn!( | ||
| "Active subscriptions task did not stop in time; aborting" | ||
| ); | ||
| handle.abort(); | ||
| let _ = handle.await; | ||
| } | ||
| } | ||
| } | ||
|
|
||
| if let Some(mut removed_account_rx) = self | ||
| .removed_account_rx | ||
| .lock() | ||
| .expect("removed_account_rx lock poisoned") | ||
| .take() | ||
| { | ||
| while removed_account_rx.try_recv().is_ok() {} | ||
| } | ||
|
|
||
| { | ||
| let mut fetching = self | ||
| .fetching_accounts | ||
| .lock() | ||
| .expect("fetching_accounts lock poisoned"); | ||
| for (_, state) in fetching.drain() { | ||
| for waiter in state.waiters { | ||
| let _ = waiter.send(Err( | ||
| RemoteAccountProviderError::AccountSubscriptionsTaskFailed( | ||
| "remote account provider shut down".to_string(), | ||
| ), | ||
| )); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| self.subscription_ownership.lock().await.clear(); | ||
| self.subscription_key_locks.lock().await.clear(); | ||
|
|
||
| shutdown_result |
There was a problem hiding this comment.
Shutdown still leaves deferred pubsub attachment alive.
spawn_deferred_pubsub_clients() is fire-and-forget, and this shutdown path never cancels or joins it. A connection that resolves after Line 825 can still attach a new client and resubscribe accounts/programs through SubMuxClient::add_client(), recreating remote side effects after the provider was disabled.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@magicblock-chainlink/src/remote_account_provider/mod.rs` around lines 824 -
899, Shutdown does not stop the fire-and-forget spawn_deferred_pubsub_clients
task, allowing late connections to reattach clients via
SubMuxClient::add_client; fix it by having spawn_deferred_pubsub_clients return
and store a JoinHandle (e.g., deferred_pubsub_task_handle) on the
RemoteAccountProvider struct and/or make that task check the existing
shutdown_token before attaching new clients (SubMuxClient::add_client), then in
RemoteAccountProvider::shutdown() cancel the shutdown_token (already present)
and also take/timeout/abort and await the deferred_pubsub_task_handle similar to
account_updates_task_handle and active_subscriptions_task_handle so the
background task cannot resubscribe after shutdown.
| if let Some(mut removed_account_rx) = self | ||
| .removed_account_rx | ||
| .lock() | ||
| .expect("removed_account_rx lock poisoned") | ||
| .take() | ||
| { | ||
| while removed_account_rx.try_recv().is_ok() {} | ||
| } | ||
|
|
||
| { | ||
| let mut fetching = self | ||
| .fetching_accounts | ||
| .lock() | ||
| .expect("fetching_accounts lock poisoned"); |
There was a problem hiding this comment.
Avoid expect() during provider teardown.
Lines 874 and 884 can turn a best-effort shutdown into a process panic if either mutex was poisoned. Propagate the failure or explicitly recover the inner state instead. As per coding guidelines, Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@magicblock-chainlink/src/remote_account_provider/mod.rs` around lines 871 -
884, The teardown code currently calls .expect("... lock poisoned") on the
removed_account_rx and fetching_accounts mutexes (via removed_account_rx.lock()
and fetching_accounts.lock()), which can panic; replace these expect() calls
with handling that recovers from a poisoned mutex or propagates the error: use
match or .lock().map_err(...) and either call e.into_inner() to retrieve the
inner value and continue cleanup, or return/propagate the error from the
teardown function (or log and continue) so the process does not abort; update
the blocks referencing removed_account_rx and fetching_accounts to use this
non-panicking behavior.
| pub async fn shutdown(&self) -> RemoteAccountProviderResult<()> { | ||
| self.shutdown_token.cancel(); | ||
| let result = AccountSubscriptionTask::Shutdown | ||
| .process(self.clients_snapshot()) | ||
| .await; | ||
| self.program_subs | ||
| .lock() | ||
| .expect("program_subs lock poisoned") | ||
| .clear(); | ||
| self.dedup_cache | ||
| .lock() | ||
| .expect("dedup_cache lock poisoned") | ||
| .clear(); | ||
| self.debounce_states | ||
| .lock() | ||
| .expect("debounce_states lock poisoned") | ||
| .clear(); |
There was a problem hiding this comment.
Don’t let mux shutdown panic on poisoned state.
Lines 955, 959, and 963 can abort shutdown instead of completing it. Return an error or explicitly recover the poisoned mutex state here. As per coding guidelines, Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@magicblock-chainlink/src/submux/mod.rs` around lines 948 - 964, The
shutdown() function currently uses .expect() on three mutexes (program_subs,
dedup_cache, debounce_states) which can panic if poisoned; change these to
non-panicking handling: replace each .lock().expect("...") call with code that
matches on the Result from .lock(), and either (a) recover the poisoned mutex by
calling e.into_inner() and proceed (optionally logging the poison), or (b)
convert the poison into a RemoteAccountProviderResult error and return
early—choose one consistent approach; ensure
AccountSubscriptionTask::Shutdown.process(self.clients_snapshot()).await and
shutdown_token.cancel() remain unaffected and include contextual logging when
recovering or returning an error so shutdown() does not abort due to a poisoned
mutex.
| for mode in ["Replica", "ReplicaOnly", "StandBy"] { | ||
| let (chainlink, bank) = new_endpoint_chainlink().await; | ||
| assert_no_runtime_or_remote_work(&chainlink).await; | ||
| assert_public_calls_do_not_start_runtime(&chainlink, &bank).await; | ||
| assert_no_runtime_or_remote_work(&chainlink).await; | ||
| drop((mode, chainlink)); | ||
| } |
There was a problem hiding this comment.
Mode loop does not actually test different startup modes.
mode is never applied (only dropped), so this runs the same scenario three times and does not validate Replica/ReplicaOnly/StandBy-specific behavior. Please set the coordination/startup mode per iteration (or parameterize new_endpoint_chainlink with mode) before assertions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@magicblock-chainlink/tests/10_non_primary_gating.rs` around lines 144 - 150,
The loop currently iterates modes but never applies them — change the test to
actually configure the startup/coordination mode per iteration by either adding
a parameter to new_endpoint_chainlink(mode) or by calling a setter on the
returned chainlink (e.g., invoke a method like chainlink.set_mode(mode) or
configure the coordination/startup before running assertions); ensure you use
the mode string from the loop when creating/configuring the endpoint so that
each iteration exercises Replica, ReplicaOnly, and StandBy behavior prior to
calling assert_no_runtime_or_remote_work and
assert_public_calls_do_not_start_runtime.
Summary
Make Chainlink and RPC behavior coordination-mode aware so replica/standby validators stay local-only until they become primary.
PARTIALLY ADDRESSES: #1203
Details
This PR tightens the boundary between primary and non-primary validator behavior. Remote Chainlink sync is now allowed only while the validator is in primary mode; non-primary callers either no-op or read local state. That prevents standby/replica nodes from fetching, cloning, subscribing, evicting, or accepting writes before promotion.
Follow-up: full Chainlink teardown
One caveat is that this PR gates Chainlink side effects in non-primary modes, but it does not
fully tear down the already-created Chainlink provider runtime when demoting. Similarly, if a
validator starts in a non-primary mode, Chainlink provider construction does still happen during
validator initialization: WebSocket/gRPC provider objects, subscription bookkeeping, and
related background machinery may exist and receive updates even though those updates are not
allowed to mutate local state or trigger remote-sync side effects.
Fully shutting down and recreating the Chainlink runtime on mode transitions — including
avoiding provider startup entirely until promotion to primary — is a larger lifecycle refactor.
It needs explicit ownership for provider tasks/streams, bounded shutdown, fresh startup before
publishing primary mode, and additional lifecycle tests. To keep this PR manageable and focused, that work will be addressed in a follow-up PR.
Steps when Switching Modes
Switching to Primary Mode
For standalone startup, the validator performs account-bank cleanup before publishing primary mode. For replicated validators, the same cleanup runs during promotion after replica replay work is idle.
Important ordering note: this PR does not wait for every Chainlink account/program/clock subscription to be fully established before publishing
Primaryor accepting primary traffic. Provider construction happens earlier during validator initialization, while actual remote sync/subscription side effects remain gated until the validator is primary.Primary. At this point primary traffic can be accepted.chainlink.enable_primary(). The current implementation keeps this lightweight: it marks/logs the lifecycle boundary and relies on theCoordinationMode::Primarygate to allow remote Chainlink work.Switching from Primary Mode
When a primary demotes into standby/replica behavior, the order is intentionally reversed around externally visible behavior: stop being primary first, then perform standby setup.
Replicaimmediately at the start of standby transition.chainlink.disable()to mark/log the non-primary lifecycle boundary.AccountsDb, undelegation subscription setup is skipped, and removed-account eviction notifications are ignored instead of submitting eviction transactions.sendTransactionandsimulateTransaction) reject while non-primary, so they cannot trigger transaction preparation, account ensure, signature-cache mutation, or scheduler simulation.magicblock-aperture
RPC dispatch now checks the current coordination mode.
sendTransactionis rejected while non-primary before it can affect the transaction cache, andsimulateTransactionis rejected before transaction preparation, account ensuring, or scheduler simulation. Account reads remain usable against local state. Tests cover non-primary transaction RPC rejection and local account reads.magicblock-chainlink
Chainlink now has a central remote-sync gate based on
CoordinationMode, plus lightweight lifecycle methods for primary enablement and disablement. Public side-effecting APIs skip remote work in replica mode, and tests verify ensure/fetch/undelegation behavior remains side-effect free while non-primary.magicblock-api
Standalone validators now reset the account bank only as part of primary readiness, immediately before entering primary mode and enabling Chainlink lifecycle behavior. Replicated validators skip startup reset and defer cleanup until promotion.
magicblock-replicator
Promotion ordering is documented and covered: wait for scheduler idle, reset account bank, enter primary mode, then enable Chainlink lifecycle behavior. Demotion now switches to replica mode before standby setup can block or retry, preventing stale primary behavior from continuing during transition. Replica-only behavior is covered to ensure it cannot reach the primary path.
Summary by CodeRabbit
Bug Fixes
Refactor
New Features
Tests