feat: gate chainlink by validator mode by thlorenz · Pull Request #1227 · magicblock-labs/magicblock-validator

thlorenz · 2026-05-20T06:28:07Z

Summary

Make Chainlink and RPC behavior coordination-mode aware so replica/standby validators stay local-only until they become primary.

Reject non-primary transaction RPCs before transaction preparation, signature-cache mutation, account ensuring, or scheduler simulation.
Keep account reads available in replica mode without triggering remote Chainlink work.
Move account-bank reset and Chainlink enablement to primary-readiness transitions.
Switch out of primary mode before standby setup during demotion.
Add coverage for RPC side-effect prevention, Chainlink non-primary gating, and replication lifecycle ordering.

PARTIALLY ADDRESSES: #1203

Details

This PR tightens the boundary between primary and non-primary validator behavior. Remote Chainlink sync is now allowed only while the validator is in primary mode; non-primary callers either no-op or read local state. That prevents standby/replica nodes from fetching, cloning, subscribing, evicting, or accepting writes before promotion.

Follow-up: full Chainlink teardown

One caveat is that this PR gates Chainlink side effects in non-primary modes, but it does not
fully tear down the already-created Chainlink provider runtime when demoting. Similarly, if a
validator starts in a non-primary mode, Chainlink provider construction does still happen during
validator initialization: WebSocket/gRPC provider objects, subscription bookkeeping, and
related background machinery may exist and receive updates even though those updates are not
allowed to mutate local state or trigger remote-sync side effects.

Fully shutting down and recreating the Chainlink runtime on mode transitions — including
avoiding provider startup entirely until promotion to primary — is a larger lifecycle refactor.
It needs explicit ownership for provider tasks/streams, bounded shutdown, fresh startup before
publishing primary mode, and additional lifecycle tests. To keep this PR manageable and focused, that work will be addressed in a follow-up PR.

Steps when Switching Modes

Switching to Primary Mode

For standalone startup, the validator performs account-bank cleanup before publishing primary mode. For replicated validators, the same cleanup runs during promotion after replica replay work is idle.

Important ordering note: this PR does not wait for every Chainlink account/program/clock subscription to be fully established before publishing Primary or accepting primary traffic. Provider construction happens earlier during validator initialization, while actual remote sync/subscription side effects remain gated until the validator is primary.

During validator initialization, Chainlink builds the remote account provider from configured endpoints, attaches WebSocket or gRPC/Laserstream pubsub clients, wires the FetchCloner, creates the removed-account subscription channel, and prepares program/account subscription management. These objects may exist before primary mode, but their side-effecting paths are mode-gated.
During promotion, wait for the scheduler to become idle so no replica/standby work is still being processed.
Reset the account bank as primary-readiness cleanup. This removes non-delegated, non-blacklisted accounts from the local bank while preserving delegated, undelegating, ephemeral, and protected accounts.
Switch the scheduler/global coordination state to Primary. At this point primary traffic can be accepted.
Call chainlink.enable_primary(). The current implementation keeps this lightweight: it marks/logs the lifecycle boundary and relies on the CoordinationMode::Primary gate to allow remote Chainlink work.
Once primary mode is visible, Chainlink remote paths are allowed again. Account ensure/fetch can use the configured remote account provider, clone missing state, subscribe to monitored accounts, set up program subscriptions, process clock/account updates, reconcile active account subscriptions, and submit eviction/undelegation-related work when those paths are requested.
In other words, this PR guarantees that Chainlink side effects cannot happen before primary mode; it does not add a blocking warm-up phase that pre-establishes all subscriptions before primary mode is published.

Switching from Primary Mode

When a primary demotes into standby/replica behavior, the order is intentionally reversed around externally visible behavior: stop being primary first, then perform standby setup.

Switch the scheduler/global coordination state to Replica immediately at the start of standby transition.
Call chainlink.disable() to mark/log the non-primary lifecycle boundary.
From that point, Chainlink side-effecting entry points are gated: account ensure returns no-op results, fetches read local AccountsDb, undelegation subscription setup is skipped, and removed-account eviction notifications are ignored instead of submitting eviction transactions.
Transaction RPCs (sendTransaction and simulateTransaction) reject while non-primary, so they cannot trigger transaction preparation, account ensure, signature-cache mutation, or scheduler simulation.
Account read RPCs remain local-only and do not invoke remote Chainlink ensure.
Standby setup then creates the replica consumer and lock watcher. If that setup blocks or retries, the validator is already non-primary and cannot keep accepting primary-only Chainlink/RPC work.
Existing provider objects, WebSocket/gRPC clients, program subscriptions, clock subscriptions, and subscription bookkeeping are not fully torn down by this PR; instead, their mutation/remote-sync effects are blocked by the coordination-mode gates until the validator becomes primary again.

magicblock-aperture

RPC dispatch now checks the current coordination mode. sendTransaction is rejected while non-primary before it can affect the transaction cache, and simulateTransaction is rejected before transaction preparation, account ensuring, or scheduler simulation. Account reads remain usable against local state. Tests cover non-primary transaction RPC rejection and local account reads.

magicblock-chainlink

Chainlink now has a central remote-sync gate based on CoordinationMode, plus lightweight lifecycle methods for primary enablement and disablement. Public side-effecting APIs skip remote work in replica mode, and tests verify ensure/fetch/undelegation behavior remains side-effect free while non-primary.

magicblock-api

Standalone validators now reset the account bank only as part of primary readiness, immediately before entering primary mode and enabling Chainlink lifecycle behavior. Replicated validators skip startup reset and defer cleanup until promotion.

magicblock-replicator

Promotion ordering is documented and covered: wait for scheduler idle, reset account bank, enter primary mode, then enable Chainlink lifecycle behavior. Demotion now switches to replica mode before standby setup can block or retry, preventing stale primary behavior from continuing during transition. Replica-only behavior is covered to ensure it cannot reach the primary path.

Summary by CodeRabbit

Bug Fixes
- Write-like RPCs (send/simulate transactions) are rejected outside primary mode; replica reads return local account data.
Refactor
- Startup/role-transition ordering and lifecycle sequencing for primary/replica modes improved.
New Features
- Graceful shutdown and lifecycle controls added for remote sync and subscription components.
Tests
- Expanded tests covering non-primary gating, lifecycle enable/disable, and shutdown behavior.

* master: feat: improve restart time (#1224) fix: joinset result ordering (#1214) feat: check AML risks (#1134) chore: restart metrics on info + restart optimizations (#1221) .github/workflows: Migrate workflows to Blacksmith runners (#1219) fix: race/order violation of intents (#1217)

coderabbitai · 2026-05-20T06:28:20Z

📝 Walkthrough

Walkthrough

Adds coordination-mode guards to HTTP handlers to reject write-like RPCs on replicas and short-circuits account/transaction ensure paths in replica mode. Refactors Chainlink into a runtime-backed lifecycle model with enable_primary/disable and runtime_fetch_cloner gating, and implements cooperative shutdown for FetchCloner and RemoteAccountProvider (including SubMuxClient and pubsub mocks). Reorders validator/replicator startup to run ordered primary/standby readiness sequences. Extensive tests and test-helper updates validate gating, lifecycle toggles, shutdowns, and sequencing.

Suggested reviewers

GabrielePicco
snawaz

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch thlorenz/replica-aware-chainlink

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

magicblock-api/src/magic_validator.rs (1)
879-901: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't treat mode_tx.send(...) as a completed Primary transition.

This only queues SchedulerMode::Primary. The new HTTP/chainlink gates read CoordinationMode::current(), so startup can continue past enable_primary() while the old mode is still visible. Please make the coordination-mode switch synchronous here, or wait for an acknowledgment before enabling primary-only behavior.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@magicblock-api/src/magic_validator.rs` around lines 879 - 901, The code
currently only enqueues SchedulerMode::Primary via mode_tx.send(...) which
doesn't guarantee CoordinationMode::current() has switched before calling
chainlink.enable_primary(); change this to perform a synchronous
coordination-mode transition or wait for an explicit acknowledgment: either (a)
invoke the coordination-mode switch synchronously (so
CoordinationMode::current() reflects Primary before calling
chainlink.enable_primary()), or (b) extend the scheduler handshake to
return/await an ack after processing SchedulerMode::Primary (e.g., send a
request that the scheduler replies to) and only call chainlink.enable_primary()
after receiving that ack; update the logic around mode_tx.send,
SchedulerMode::Primary, and chainlink.enable_primary to ensure the coordination
mode is visible before enabling primary-only behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@magicblock-replicator/src/service/context.rs`:
- Around line 244-249: The promotion/demotion path currently enqueues
SchedulerMode inside enter_primary_mode()/enter_replica_mode() but silently
drops send failures, so the rest of
run_primary_readiness_sequence()/run_replica_readiness_sequence() can proceed
while CoordinationMode::current() is stale or the receiver is gone; change
enter_primary_mode and enter_replica_mode to return Result<(), Error> (or a
suitable error type) and have them propagate the failure of mode_tx.send(...)
instead of ignoring it, then update the calls in run_primary_readiness_sequence
and the analogous replica sequence to invoke those methods through the fallible
closures you already pass (e.g. replace || self.enter_primary_mode() with a
closure that returns the Result), so the readiness sequence aborts on
send/send-failure and the mode switch is externally visible; ensure
SchedulerMode send failures are converted into the same error type you return so
callers can handle them.

---

Outside diff comments:
In `@magicblock-api/src/magic_validator.rs`:
- Around line 879-901: The code currently only enqueues SchedulerMode::Primary
via mode_tx.send(...) which doesn't guarantee CoordinationMode::current() has
switched before calling chainlink.enable_primary(); change this to perform a
synchronous coordination-mode transition or wait for an explicit acknowledgment:
either (a) invoke the coordination-mode switch synchronously (so
CoordinationMode::current() reflects Primary before calling
chainlink.enable_primary()), or (b) extend the scheduler handshake to
return/await an ack after processing SchedulerMode::Primary (e.g., send a
request that the scheduler replies to) and only call chainlink.enable_primary()
after receiving that ack; update the logic around mode_tx.send,
SchedulerMode::Primary, and chainlink.enable_primary to ensure the coordination
mode is visible before enabling primary-only behavior.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 43480cb3-ed7b-4cd8-8ece-e9efe4b990f5

📥 Commits

Reviewing files that changed from the base of the PR and between 25df83c and fc9b68b.

📒 Files selected for processing (10)

magicblock-aperture/src/requests/http/mod.rs
magicblock-aperture/src/requests/http/send_transaction.rs
magicblock-aperture/src/requests/http/simulate_transaction.rs
magicblock-api/src/magic_validator.rs
magicblock-chainlink/src/chainlink/mod.rs
magicblock-chainlink/tests/10_non_primary_gating.rs
magicblock-chainlink/tests/utils/test_context.rs
magicblock-replicator/src/error.rs
magicblock-replicator/src/service/context.rs
magicblock-replicator/src/service/mod.rs

coderabbitai · 2026-05-20T06:33:45Z

+        run_primary_readiness_sequence(
+            || self.scheduler.wait_for_idle(),
+            || self.chainlink.reset_accounts_bank().map_err(Into::into),
+            || self.enter_primary_mode(),
+            || self.chainlink.enable_primary().map_err(Into::into),
+        )


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Promotion/demotion still race on stale coordination-mode visibility.

These helpers continue after enter_primary_mode() / enter_replica_mode(), but those methods only enqueue SchedulerMode on mode_tx and currently drop send failures. Since the new HTTP and chainlink gates consult CoordinationMode::current(), the node can keep behaving in the old mode—or proceed after the receiver is gone—while reset/enable/disable and consumer setup continue. Make the mode switch fallible and externally visible before running the rest of the sequence.

Also applies to: 264-268

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@magicblock-replicator/src/service/context.rs` around lines 244 - 249, The promotion/demotion path currently enqueues SchedulerMode inside enter_primary_mode()/enter_replica_mode() but silently drops send failures, so the rest of run_primary_readiness_sequence()/run_replica_readiness_sequence() can proceed while CoordinationMode::current() is stale or the receiver is gone; change enter_primary_mode and enter_replica_mode to return Result<(), Error> (or a suitable error type) and have them propagate the failure of mode_tx.send(...) instead of ignoring it, then update the calls in run_primary_readiness_sequence and the analogous replica sequence to invoke those methods through the fallible closures you already pass (e.g. replace || self.enter_primary_mode() with a closure that returns the Result), so the readiness sequence aborts on send/send-failure and the mode switch is externally visible; ensure SchedulerMode send failures are converted into the same error type you return so callers can handle them.

* master: fix: CommitFinalize post-commit actions (#1220)

coderabbitai

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@magicblock-api/src/magic_validator.rs`:
- Around line 133-135: When calling enable_primary().await before
send_scheduler_primary().await, ensure you roll back the primary enable if
send_scheduler_primary() fails: after a failed send_scheduler_primary(), call
the complementary disable_primary().await (or the appropriate shutdown/disable
function corresponding to enable_primary) to revert the partial startup, log any
rollback errors, and then return the original send error; apply the same
rollback pattern to the other occurrence around lines 912-933 where
enable_primary/send_scheduler_primary are paired.

In `@magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs`:
- Around line 308-313: The fail_all_pending_clones helper currently calls
.expect() on the pending_clones mutex which can panic during shutdown; change it
to handle a poisoned mutex instead—either change fail_all_pending_clones to
return Result<(), E> (propagate an error) or recover the lock with
lock().unwrap_or_else(|poisoned| poisoned.into_inner()) and log the poisoning
before proceeding; update uses of pending_clones in fail_all_pending_clones and
any callers accordingly and ensure you reference the pending_clones Mutex and
the fail_all_pending_clones function name when making the change.

In `@magicblock-chainlink/src/chainlink/mod.rs`:
- Around line 610-620: The method runtime_fetch_cloner currently acquires
self.runtime.lock().await before checking lifecycle_state, which can block; add
a fast-path check using
ChainlinkLifecycleState::from_u8(self.lifecycle_state.load(Ordering::SeqCst)) !=
ChainlinkLifecycleState::Enabled to return None immediately if not Enabled, then
after acquiring the runtime guard re-check the lifecycle_state to avoid a race
handing out a FetchCloner from runtime.fetch_cloner.clone(); update
runtime_fetch_cloner to perform (1) an early lifecycle check, (2) lock
self.runtime, and (3) a second lifecycle check before returning the cloned
FetchCloner.

In `@magicblock-chainlink/src/remote_account_provider/mod.rs`:
- Around line 871-884: The teardown code currently calls .expect("... lock
poisoned") on the removed_account_rx and fetching_accounts mutexes (via
removed_account_rx.lock() and fetching_accounts.lock()), which can panic;
replace these expect() calls with handling that recovers from a poisoned mutex
or propagates the error: use match or .lock().map_err(...) and either call
e.into_inner() to retrieve the inner value and continue cleanup, or
return/propagate the error from the teardown function (or log and continue) so
the process does not abort; update the blocks referencing removed_account_rx and
fetching_accounts to use this non-panicking behavior.
- Around line 824-899: Shutdown does not stop the fire-and-forget
spawn_deferred_pubsub_clients task, allowing late connections to reattach
clients via SubMuxClient::add_client; fix it by having
spawn_deferred_pubsub_clients return and store a JoinHandle (e.g.,
deferred_pubsub_task_handle) on the RemoteAccountProvider struct and/or make
that task check the existing shutdown_token before attaching new clients
(SubMuxClient::add_client), then in RemoteAccountProvider::shutdown() cancel the
shutdown_token (already present) and also take/timeout/abort and await the
deferred_pubsub_task_handle similar to account_updates_task_handle and
active_subscriptions_task_handle so the background task cannot resubscribe after
shutdown.

In `@magicblock-chainlink/src/submux/mod.rs`:
- Around line 948-964: The shutdown() function currently uses .expect() on three
mutexes (program_subs, dedup_cache, debounce_states) which can panic if
poisoned; change these to non-panicking handling: replace each
.lock().expect("...") call with code that matches on the Result from .lock(),
and either (a) recover the poisoned mutex by calling e.into_inner() and proceed
(optionally logging the poison), or (b) convert the poison into a
RemoteAccountProviderResult error and return early—choose one consistent
approach; ensure
AccountSubscriptionTask::Shutdown.process(self.clients_snapshot()).await and
shutdown_token.cancel() remain unaffected and include contextual logging when
recovering or returning an error so shutdown() does not abort due to a poisoned
mutex.

In `@magicblock-chainlink/tests/10_non_primary_gating.rs`:
- Around line 144-150: The loop currently iterates modes but never applies them
— change the test to actually configure the startup/coordination mode per
iteration by either adding a parameter to new_endpoint_chainlink(mode) or by
calling a setter on the returned chainlink (e.g., invoke a method like
chainlink.set_mode(mode) or configure the coordination/startup before running
assertions); ensure you use the mode string from the loop when
creating/configuring the endpoint so that each iteration exercises Replica,
ReplicaOnly, and StandBy behavior prior to calling
assert_no_runtime_or_remote_work and assert_public_calls_do_not_start_runtime.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3b4dbb6b-a8d1-4ece-b4ad-4f096f6c9596

📥 Commits

Reviewing files that changed from the base of the PR and between fc9b68b and 7946cc0.

📒 Files selected for processing (19)

magicblock-api/src/magic_validator.rs
magicblock-chainlink/src/chainlink/errors.rs
magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs
magicblock-chainlink/src/chainlink/fetch_cloner/tests.rs
magicblock-chainlink/src/chainlink/mod.rs
magicblock-chainlink/src/remote_account_provider/chain_laser_actor/actor.rs
magicblock-chainlink/src/remote_account_provider/chain_laser_actor/stream_manager.rs
magicblock-chainlink/src/remote_account_provider/chain_pubsub_actor.rs
magicblock-chainlink/src/remote_account_provider/chain_pubsub_client.rs
magicblock-chainlink/src/remote_account_provider/chain_updates_client.rs
magicblock-chainlink/src/remote_account_provider/endpoint.rs
magicblock-chainlink/src/remote_account_provider/mod.rs
magicblock-chainlink/src/remote_account_provider/tests.rs
magicblock-chainlink/src/submux/mod.rs
magicblock-chainlink/tests/09_waiter_reconciliation_race.rs
magicblock-chainlink/tests/10_non_primary_gating.rs
magicblock-chainlink/tests/11_lifecycle_shutdown.rs
magicblock-chainlink/tests/utils/test_context.rs
magicblock-replicator/src/service/context.rs

coderabbitai · 2026-05-20T11:36:20Z

+    enable_primary().await?;
+    send_scheduler_primary().await?;
+    Ok(())


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add rollback if scheduler-primary publish fails after enable_primary().

Current ordering can return Err with Chainlink already enabled, leaving a partial startup state (runtime active, scheduler not in primary mode).

Suggested rollback pattern

async fn run_standalone_primary_start_sequence< ResetBank, EnablePrimary, EnablePrimaryFuture, + DisablePrimary, + DisablePrimaryFuture, SendSchedulerPrimary, SendSchedulerPrimaryFuture, >( should_reset_accounts_bank: bool, reset_bank: ResetBank, enable_primary: EnablePrimary, + disable_primary: DisablePrimary, send_scheduler_primary: SendSchedulerPrimary, ) -> ApiResult<()> where ResetBank: FnOnce() -> ApiResult<()>, EnablePrimary: FnOnce() -> EnablePrimaryFuture, EnablePrimaryFuture: Future<Output = ApiResult<()>>, + DisablePrimary: FnOnce() -> DisablePrimaryFuture, + DisablePrimaryFuture: Future<Output = ApiResult<()>>, SendSchedulerPrimary: FnOnce() -> SendSchedulerPrimaryFuture, SendSchedulerPrimaryFuture: Future<Output = ApiResult<()>>, { if should_reset_accounts_bank { reset_bank()?; } enable_primary().await?; - send_scheduler_primary().await?; + if let Err(err) = send_scheduler_primary().await { + let _ = disable_primary().await; // best-effort rollback + return Err(err); + } Ok(()) }

run_standalone_primary_start_sequence( !self.config.accountsdb.reset, || { let step_start = Instant::now(); self.chainlink.reset_accounts_bank()?; log_timing("startup", "reset_accounts_bank", step_start); Ok(()) }, || async { self.chainlink.enable_primary().await.map_err(Into::into) }, + || async { + self.chainlink.disable().await.map_err(Into::into) + }, || async { self.mode_tx.send(SchedulerMode::Primary).await.map_err( |e| { ApiError::FailedToSendModeSwitch(format!( "Failed to send primary mode to scheduler: {e}" )) }, ) }, )

Also applies to: 912-933

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@magicblock-api/src/magic_validator.rs` around lines 133 - 135, When calling enable_primary().await before send_scheduler_primary().await, ensure you roll back the primary enable if send_scheduler_primary() fails: after a failed send_scheduler_primary(), call the complementary disable_primary().await (or the appropriate shutdown/disable function corresponding to enable_primary) to revert the partial startup, log any rollback errors, and then return the original send error; apply the same rollback pattern to the other occurrence around lines 912-933 where enable_primary/send_scheduler_primary are paired.

coderabbitai · 2026-05-20T11:36:20Z

+    fn fail_all_pending_clones(&self) {
+        let pending = {
+            let mut map = self
+                .pending_clones
+                .lock()
+                .expect("pending_clones mutex poisoned");


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t panic in the shutdown path.

Line 313 can crash validator teardown if pending_clones was poisoned. This helper should return an error or explicitly recover the inner state instead of aborting the process. As per coding guidelines, Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs` around lines 308 - 313, The fail_all_pending_clones helper currently calls .expect() on the pending_clones mutex which can panic during shutdown; change it to handle a poisoned mutex instead—either change fail_all_pending_clones to return Result<(), E> (propagate an error) or recover the lock with lock().unwrap_or_else(|poisoned| poisoned.into_inner()) and log the poisoning before proceeding; update uses of pending_clones in fail_all_pending_clones and any callers accordingly and ensure you reference the pending_clones Mutex and the fail_all_pending_clones function name when making the change.

coderabbitai · 2026-05-20T11:36:20Z

+    async fn runtime_fetch_cloner(
+        &self,
+    ) -> Option<Arc<FetchCloner<T, U, V, C>>> {
+        let runtime = self.runtime.lock().await;
+        if ChainlinkLifecycleState::from_u8(
+            self.lifecycle_state.load(Ordering::SeqCst),
+        ) != ChainlinkLifecycleState::Enabled
+        {
+            return None;
+        }
+        runtime.as_ref().map(|runtime| runtime.fetch_cloner.clone())


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

# Check the file exists and get basic info wc -l magicblock-chainlink/src/chainlink/mod.rs

Repository: magicblock-labs/magicblock-validator

Length of output: 126

🏁 Script executed:

# View the runtime_fetch_cloner method (around lines 610-620) sed -n '600,630p' magicblock-chainlink/src/chainlink/mod.rs

Repository: magicblock-labs/magicblock-validator

Length of output: 1066

🏁 Script executed:

# View the enable_primary method (around lines 697-775) sed -n '690,780p' magicblock-chainlink/src/chainlink/mod.rs

Repository: magicblock-labs/magicblock-validator

Length of output: 3058

🏁 Script executed:

# Search for any .unwrap() or .expect() calls in the file rg -n '\.(unwrap|expect)\(' magicblock-chainlink/src/chainlink/mod.rs

Repository: magicblock-labs/magicblock-validator

Length of output: 62

Check lifecycle state before acquiring the runtime mutex to avoid blocking on state transitions.

runtime_fetch_cloner() blocks waiting for self.runtime.lock() before checking whether the lifecycle is Enabled. Since enable_primary() holds that same lock during the async create_runtime_from_build_config() call, any callers of fetch_accounts() or ensure_accounts() will stall in the Starting state instead of immediately returning None. Add a fast-path check of lifecycle_state before locking, then re-check after acquiring the guard to prevent a concurrent demotion from handing out a FetchCloner.

Suggested fix

async fn runtime_fetch_cloner( &self, ) -> Option<Arc<FetchCloner<T, U, V, C>>> { + if ChainlinkLifecycleState::from_u8( + self.lifecycle_state.load(Ordering::SeqCst), + ) != ChainlinkLifecycleState::Enabled + { + return None; + } let runtime = self.runtime.lock().await; if ChainlinkLifecycleState::from_u8( self.lifecycle_state.load(Ordering::SeqCst), ) != ChainlinkLifecycleState::Enabled { return None; } runtime.as_ref().map(|runtime| runtime.fetch_cloner.clone()) }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@magicblock-chainlink/src/chainlink/mod.rs` around lines 610 - 620, The method runtime_fetch_cloner currently acquires self.runtime.lock().await before checking lifecycle_state, which can block; add a fast-path check using ChainlinkLifecycleState::from_u8(self.lifecycle_state.load(Ordering::SeqCst)) != ChainlinkLifecycleState::Enabled to return None immediately if not Enabled, then after acquiring the runtime guard re-check the lifecycle_state to avoid a race handing out a FetchCloner from runtime.fetch_cloner.clone(); update runtime_fetch_cloner to perform (1) an early lifecycle check, (2) lock self.runtime, and (3) a second lifecycle check before returning the cloned FetchCloner.

coderabbitai · 2026-05-20T11:36:20Z

+    pub async fn shutdown(&self) -> RemoteAccountProviderResult<()> {
+        self.shutdown_token.cancel();
+
+        let shutdown_result = match self.pubsub_client.shutdown().await {
+            Ok(()) => Ok(()),
+            Err(err) => {
+                warn!(error = %err, "Remote account pubsub shutdown failed");
+                Err(err)
+            }
+        };
+
+        if let Some(mut handle) =
+            self.account_updates_task_handle.lock().await.take()
+        {
+            match time::timeout(Duration::from_secs(2), &mut handle).await {
+                Ok(Ok(())) => {}
+                Ok(Err(err)) => {
+                    warn!(error = %err, "Account updates task failed during shutdown");
+                }
+                Err(_) => {
+                    warn!(
+                        "Account updates task did not stop in time; aborting"
+                    );
+                    handle.abort();
+                    let _ = handle.await;
+                }
+            }
+        }
+
+        if let Some(mut handle) =
+            self.active_subscriptions_task_handle.lock().await.take()
+        {
+            match time::timeout(Duration::from_secs(2), &mut handle).await {
+                Ok(Ok(())) => {}
+                Ok(Err(err)) => {
+                    warn!(error = %err, "Active subscriptions task failed during shutdown");
+                }
+                Err(_) => {
+                    warn!(
+                        "Active subscriptions task did not stop in time; aborting"
+                    );
+                    handle.abort();
+                    let _ = handle.await;
+                }
+            }
+        }
+
+        if let Some(mut removed_account_rx) = self
+            .removed_account_rx
+            .lock()
+            .expect("removed_account_rx lock poisoned")
+            .take()
+        {
+            while removed_account_rx.try_recv().is_ok() {}
+        }
+
+        {
+            let mut fetching = self
+                .fetching_accounts
+                .lock()
+                .expect("fetching_accounts lock poisoned");
+            for (_, state) in fetching.drain() {
+                for waiter in state.waiters {
+                    let _ = waiter.send(Err(
+                        RemoteAccountProviderError::AccountSubscriptionsTaskFailed(
+                            "remote account provider shut down".to_string(),
+                        ),
+                    ));
+                }
+            }
+        }
+
+        self.subscription_ownership.lock().await.clear();
+        self.subscription_key_locks.lock().await.clear();
+
+        shutdown_result


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Shutdown still leaves deferred pubsub attachment alive.

spawn_deferred_pubsub_clients() is fire-and-forget, and this shutdown path never cancels or joins it. A connection that resolves after Line 825 can still attach a new client and resubscribe accounts/programs through SubMuxClient::add_client(), recreating remote side effects after the provider was disabled.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@magicblock-chainlink/src/remote_account_provider/mod.rs` around lines 824 - 899, Shutdown does not stop the fire-and-forget spawn_deferred_pubsub_clients task, allowing late connections to reattach clients via SubMuxClient::add_client; fix it by having spawn_deferred_pubsub_clients return and store a JoinHandle (e.g., deferred_pubsub_task_handle) on the RemoteAccountProvider struct and/or make that task check the existing shutdown_token before attaching new clients (SubMuxClient::add_client), then in RemoteAccountProvider::shutdown() cancel the shutdown_token (already present) and also take/timeout/abort and await the deferred_pubsub_task_handle similar to account_updates_task_handle and active_subscriptions_task_handle so the background task cannot resubscribe after shutdown.

coderabbitai · 2026-05-20T11:36:20Z

+        if let Some(mut removed_account_rx) = self
+            .removed_account_rx
+            .lock()
+            .expect("removed_account_rx lock poisoned")
+            .take()
+        {
+            while removed_account_rx.try_recv().is_ok() {}
+        }
+
+        {
+            let mut fetching = self
+                .fetching_accounts
+                .lock()
+                .expect("fetching_accounts lock poisoned");


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid expect() during provider teardown.

Lines 874 and 884 can turn a best-effort shutdown into a process panic if either mutex was poisoned. Propagate the failure or explicitly recover the inner state instead. As per coding guidelines, Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@magicblock-chainlink/src/remote_account_provider/mod.rs` around lines 871 - 884, The teardown code currently calls .expect("... lock poisoned") on the removed_account_rx and fetching_accounts mutexes (via removed_account_rx.lock() and fetching_accounts.lock()), which can panic; replace these expect() calls with handling that recovers from a poisoned mutex or propagates the error: use match or .lock().map_err(...) and either call e.into_inner() to retrieve the inner value and continue cleanup, or return/propagate the error from the teardown function (or log and continue) so the process does not abort; update the blocks referencing removed_account_rx and fetching_accounts to use this non-panicking behavior.

coderabbitai · 2026-05-20T11:36:20Z

+    pub async fn shutdown(&self) -> RemoteAccountProviderResult<()> {
+        self.shutdown_token.cancel();
+        let result = AccountSubscriptionTask::Shutdown
+            .process(self.clients_snapshot())
+            .await;
+        self.program_subs
+            .lock()
+            .expect("program_subs lock poisoned")
+            .clear();
+        self.dedup_cache
+            .lock()
+            .expect("dedup_cache lock poisoned")
+            .clear();
+        self.debounce_states
+            .lock()
+            .expect("debounce_states lock poisoned")
+            .clear();


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t let mux shutdown panic on poisoned state.

Lines 955, 959, and 963 can abort shutdown instead of completing it. Return an error or explicitly recover the poisoned mutex state here. As per coding guidelines, Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@magicblock-chainlink/src/submux/mod.rs` around lines 948 - 964, The shutdown() function currently uses .expect() on three mutexes (program_subs, dedup_cache, debounce_states) which can panic if poisoned; change these to non-panicking handling: replace each .lock().expect("...") call with code that matches on the Result from .lock(), and either (a) recover the poisoned mutex by calling e.into_inner() and proceed (optionally logging the poison), or (b) convert the poison into a RemoteAccountProviderResult error and return early—choose one consistent approach; ensure AccountSubscriptionTask::Shutdown.process(self.clients_snapshot()).await and shutdown_token.cancel() remain unaffected and include contextual logging when recovering or returning an error so shutdown() does not abort due to a poisoned mutex.

coderabbitai · 2026-05-20T11:36:20Z

+    for mode in ["Replica", "ReplicaOnly", "StandBy"] {
+        let (chainlink, bank) = new_endpoint_chainlink().await;
+        assert_no_runtime_or_remote_work(&chainlink).await;
+        assert_public_calls_do_not_start_runtime(&chainlink, &bank).await;
+        assert_no_runtime_or_remote_work(&chainlink).await;
+        drop((mode, chainlink));
+    }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Mode loop does not actually test different startup modes.

mode is never applied (only dropped), so this runs the same scenario three times and does not validate Replica/ReplicaOnly/StandBy-specific behavior. Please set the coordination/startup mode per iteration (or parameterize new_endpoint_chainlink with mode) before assertions.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@magicblock-chainlink/tests/10_non_primary_gating.rs` around lines 144 - 150, The loop currently iterates modes but never applies them — change the test to actually configure the startup/coordination mode per iteration by either adding a parameter to new_endpoint_chainlink(mode) or by calling a setter on the returned chainlink (e.g., invoke a method like chainlink.set_mode(mode) or configure the coordination/startup before running assertions); ensure you use the mode string from the loop when creating/configuring the endpoint so that each iteration exercises Replica, ReplicaOnly, and StandBy behavior prior to calling assert_no_runtime_or_remote_work and assert_public_calls_do_not_start_runtime.

thlorenz added 11 commits May 19, 2026 17:40

fix: gate non-primary RPC writes

72bf945

fix: gate chainlink remote sync by mode

4b71387

fix: reset bank only before primary

05eeac0

fix: demote before standby setup

8c41d84

feat: add chainlink lifecycle hooks

8412815

test: cover replica RPC side effects

9b3c745

test: cover chainlink replica gating

9d9f7dc

test: add replication lifecycle ordering coverage

7df33b5

docs: clarify chainlink lifecycle cleanup

59ea8d0

fix: reject replica transaction simulations

fc9b68b

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

thlorenz added 17 commits May 20, 2026 15:07

chore: simplify step implementations in replicator

5e4dd23

Merge branch 'master' into thlorenz/replica-aware-chainlink

b4102fd

* master: fix: CommitFinalize post-commit actions (#1220)

feat: add chainlink lifecycle runtime state

286fee4

feat: add chainlink runtime introspection APIs

d170473

feat: add remote account provider shutdown

59987ed

feat: expose deterministic pubsub shutdown

8ef9a13

feat: add fetch cloner shutdown

6b6d810

feat(chainlink): disable runtime asynchronously

f3d5ebc

feat(chainlink): store runtime build config

0eb21d3

feat(chainlink): enable primary runtime startup

75cf215

feat(chainlink): defer endpoint runtime construction

69a28af

feat(api): publish standalone primary after chainlink readiness

3832a29

feat(replicator): reorder chainlink role transitions

1bf8b77

test(chainlink): cover lifecycle shutdown transitions

5e16b20

test(chainlink): cover lower-layer shutdown

ac87aef

test(lifecycle): cover promotion failure ordering

9cd44d2

test(chainlink): cover non-primary remote resources

2832242

thlorenz added 2 commits May 20, 2026 18:39

refactor(chainlink): remove defensive remote sync gates

15a3aff

test(chainlink): remove obsolete non-primary gate coverage

7946cc0

coderabbitai Bot reviewed May 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: gate chainlink by validator mode#1227

feat: gate chainlink by validator mode#1227
thlorenz wants to merge 30 commits into
masterfrom
thlorenz/replica-aware-chainlink

thlorenz commented May 20, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 20, 2026 •

edited

Loading

Walkthrough

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 20, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 20, 2026

Uh oh!

coderabbitai Bot May 20, 2026

Uh oh!

coderabbitai Bot May 20, 2026

Uh oh!

coderabbitai Bot May 20, 2026

Uh oh!

coderabbitai Bot May 20, 2026

Uh oh!

coderabbitai Bot May 20, 2026

Uh oh!

coderabbitai Bot May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

thlorenz commented May 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Details

Follow-up: full Chainlink teardown

Steps when Switching Modes

Switching to Primary Mode

Switching from Primary Mode

magicblock-aperture

magicblock-chainlink

magicblock-api

magicblock-replicator

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

thlorenz commented May 20, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 20, 2026 •

edited

Loading