Skip to content

feat: gate chainlink by validator mode#1227

Draft
thlorenz wants to merge 30 commits into
masterfrom
thlorenz/replica-aware-chainlink
Draft

feat: gate chainlink by validator mode#1227
thlorenz wants to merge 30 commits into
masterfrom
thlorenz/replica-aware-chainlink

Conversation

@thlorenz
Copy link
Copy Markdown
Collaborator

@thlorenz thlorenz commented May 20, 2026

Summary

Make Chainlink and RPC behavior coordination-mode aware so replica/standby validators stay local-only until they become primary.

  • Reject non-primary transaction RPCs before transaction preparation, signature-cache mutation, account ensuring, or scheduler simulation.
  • Keep account reads available in replica mode without triggering remote Chainlink work.
  • Move account-bank reset and Chainlink enablement to primary-readiness transitions.
  • Switch out of primary mode before standby setup during demotion.
  • Add coverage for RPC side-effect prevention, Chainlink non-primary gating, and replication lifecycle ordering.

PARTIALLY ADDRESSES: #1203

Details

This PR tightens the boundary between primary and non-primary validator behavior. Remote Chainlink sync is now allowed only while the validator is in primary mode; non-primary callers either no-op or read local state. That prevents standby/replica nodes from fetching, cloning, subscribing, evicting, or accepting writes before promotion.

Follow-up: full Chainlink teardown

One caveat is that this PR gates Chainlink side effects in non-primary modes, but it does not
fully tear down
the already-created Chainlink provider runtime when demoting. Similarly, if a
validator starts in a non-primary mode, Chainlink provider construction does still happen
during
validator initialization: WebSocket/gRPC provider objects, subscription bookkeeping, and
related background machinery may exist and receive updates even though those updates are not
allowed to mutate local state or trigger remote-sync side effects.

Fully shutting down and recreating the Chainlink runtime on mode transitions — including
avoiding provider startup entirely until promotion to primary — is a larger lifecycle refactor.
It needs explicit ownership for provider tasks/streams, bounded shutdown, fresh startup before
publishing primary mode, and additional lifecycle tests. To keep this PR manageable and focused, that work will be addressed in a follow-up PR.

Steps when Switching Modes

Switching to Primary Mode

For standalone startup, the validator performs account-bank cleanup before publishing primary mode. For replicated validators, the same cleanup runs during promotion after replica replay work is idle.

Important ordering note: this PR does not wait for every Chainlink account/program/clock subscription to be fully established before publishing Primary or accepting primary traffic. Provider construction happens earlier during validator initialization, while actual remote sync/subscription side effects remain gated until the validator is primary.

  1. During validator initialization, Chainlink builds the remote account provider from configured endpoints, attaches WebSocket or gRPC/Laserstream pubsub clients, wires the FetchCloner, creates the removed-account subscription channel, and prepares program/account subscription management. These objects may exist before primary mode, but their side-effecting paths are mode-gated.
  2. During promotion, wait for the scheduler to become idle so no replica/standby work is still being processed.
  3. Reset the account bank as primary-readiness cleanup. This removes non-delegated, non-blacklisted accounts from the local bank while preserving delegated, undelegating, ephemeral, and protected accounts.
  4. Switch the scheduler/global coordination state to Primary. At this point primary traffic can be accepted.
  5. Call chainlink.enable_primary(). The current implementation keeps this lightweight: it marks/logs the lifecycle boundary and relies on the CoordinationMode::Primary gate to allow remote Chainlink work.
  6. Once primary mode is visible, Chainlink remote paths are allowed again. Account ensure/fetch can use the configured remote account provider, clone missing state, subscribe to monitored accounts, set up program subscriptions, process clock/account updates, reconcile active account subscriptions, and submit eviction/undelegation-related work when those paths are requested.
  7. In other words, this PR guarantees that Chainlink side effects cannot happen before primary mode; it does not add a blocking warm-up phase that pre-establishes all subscriptions before primary mode is published.

Switching from Primary Mode

When a primary demotes into standby/replica behavior, the order is intentionally reversed around externally visible behavior: stop being primary first, then perform standby setup.

  1. Switch the scheduler/global coordination state to Replica immediately at the start of standby transition.
  2. Call chainlink.disable() to mark/log the non-primary lifecycle boundary.
  3. From that point, Chainlink side-effecting entry points are gated: account ensure returns no-op results, fetches read local AccountsDb, undelegation subscription setup is skipped, and removed-account eviction notifications are ignored instead of submitting eviction transactions.
  4. Transaction RPCs (sendTransaction and simulateTransaction) reject while non-primary, so they cannot trigger transaction preparation, account ensure, signature-cache mutation, or scheduler simulation.
  5. Account read RPCs remain local-only and do not invoke remote Chainlink ensure.
  6. Standby setup then creates the replica consumer and lock watcher. If that setup blocks or retries, the validator is already non-primary and cannot keep accepting primary-only Chainlink/RPC work.
  7. Existing provider objects, WebSocket/gRPC clients, program subscriptions, clock subscriptions, and subscription bookkeeping are not fully torn down by this PR; instead, their mutation/remote-sync effects are blocked by the coordination-mode gates until the validator becomes primary again.

magicblock-aperture

RPC dispatch now checks the current coordination mode. sendTransaction is rejected while non-primary before it can affect the transaction cache, and simulateTransaction is rejected before transaction preparation, account ensuring, or scheduler simulation. Account reads remain usable against local state. Tests cover non-primary transaction RPC rejection and local account reads.

magicblock-chainlink

Chainlink now has a central remote-sync gate based on CoordinationMode, plus lightweight lifecycle methods for primary enablement and disablement. Public side-effecting APIs skip remote work in replica mode, and tests verify ensure/fetch/undelegation behavior remains side-effect free while non-primary.

magicblock-api

Standalone validators now reset the account bank only as part of primary readiness, immediately before entering primary mode and enabling Chainlink lifecycle behavior. Replicated validators skip startup reset and defer cleanup until promotion.

magicblock-replicator

Promotion ordering is documented and covered: wait for scheduler idle, reset account bank, enter primary mode, then enable Chainlink lifecycle behavior. Demotion now switches to replica mode before standby setup can block or retry, preventing stale primary behavior from continuing during transition. Replica-only behavior is covered to ensure it cannot reach the primary path.

Summary by CodeRabbit

  • Bug Fixes

    • Write-like RPCs (send/simulate transactions) are rejected outside primary mode; replica reads return local account data.
  • Refactor

    • Startup/role-transition ordering and lifecycle sequencing for primary/replica modes improved.
  • New Features

    • Graceful shutdown and lifecycle controls added for remote sync and subscription components.
  • Tests

    • Expanded tests covering non-primary gating, lifecycle enable/disable, and shutdown behavior.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 20, 2026

📝 Walkthrough

Walkthrough

Adds coordination-mode guards to HTTP handlers to reject write-like RPCs on replicas and short-circuits account/transaction ensure paths in replica mode. Refactors Chainlink into a runtime-backed lifecycle model with enable_primary/disable and runtime_fetch_cloner gating, and implements cooperative shutdown for FetchCloner and RemoteAccountProvider (including SubMuxClient and pubsub mocks). Reorders validator/replicator startup to run ordered primary/standby readiness sequences. Extensive tests and test-helper updates validate gating, lifecycle toggles, shutdowns, and sequencing.

Suggested reviewers

  • GabrielePicco
  • snawaz
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch thlorenz/replica-aware-chainlink

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
magicblock-api/src/magic_validator.rs (1)

879-901: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't treat mode_tx.send(...) as a completed Primary transition.

This only queues SchedulerMode::Primary. The new HTTP/chainlink gates read CoordinationMode::current(), so startup can continue past enable_primary() while the old mode is still visible. Please make the coordination-mode switch synchronous here, or wait for an acknowledgment before enabling primary-only behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@magicblock-api/src/magic_validator.rs` around lines 879 - 901, The code
currently only enqueues SchedulerMode::Primary via mode_tx.send(...) which
doesn't guarantee CoordinationMode::current() has switched before calling
chainlink.enable_primary(); change this to perform a synchronous
coordination-mode transition or wait for an explicit acknowledgment: either (a)
invoke the coordination-mode switch synchronously (so
CoordinationMode::current() reflects Primary before calling
chainlink.enable_primary()), or (b) extend the scheduler handshake to
return/await an ack after processing SchedulerMode::Primary (e.g., send a
request that the scheduler replies to) and only call chainlink.enable_primary()
after receiving that ack; update the logic around mode_tx.send,
SchedulerMode::Primary, and chainlink.enable_primary to ensure the coordination
mode is visible before enabling primary-only behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@magicblock-replicator/src/service/context.rs`:
- Around line 244-249: The promotion/demotion path currently enqueues
SchedulerMode inside enter_primary_mode()/enter_replica_mode() but silently
drops send failures, so the rest of
run_primary_readiness_sequence()/run_replica_readiness_sequence() can proceed
while CoordinationMode::current() is stale or the receiver is gone; change
enter_primary_mode and enter_replica_mode to return Result<(), Error> (or a
suitable error type) and have them propagate the failure of mode_tx.send(...)
instead of ignoring it, then update the calls in run_primary_readiness_sequence
and the analogous replica sequence to invoke those methods through the fallible
closures you already pass (e.g. replace || self.enter_primary_mode() with a
closure that returns the Result), so the readiness sequence aborts on
send/send-failure and the mode switch is externally visible; ensure
SchedulerMode send failures are converted into the same error type you return so
callers can handle them.

---

Outside diff comments:
In `@magicblock-api/src/magic_validator.rs`:
- Around line 879-901: The code currently only enqueues SchedulerMode::Primary
via mode_tx.send(...) which doesn't guarantee CoordinationMode::current() has
switched before calling chainlink.enable_primary(); change this to perform a
synchronous coordination-mode transition or wait for an explicit acknowledgment:
either (a) invoke the coordination-mode switch synchronously (so
CoordinationMode::current() reflects Primary before calling
chainlink.enable_primary()), or (b) extend the scheduler handshake to
return/await an ack after processing SchedulerMode::Primary (e.g., send a
request that the scheduler replies to) and only call chainlink.enable_primary()
after receiving that ack; update the logic around mode_tx.send,
SchedulerMode::Primary, and chainlink.enable_primary to ensure the coordination
mode is visible before enabling primary-only behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 43480cb3-ed7b-4cd8-8ece-e9efe4b990f5

📥 Commits

Reviewing files that changed from the base of the PR and between 25df83c and fc9b68b.

📒 Files selected for processing (10)
  • magicblock-aperture/src/requests/http/mod.rs
  • magicblock-aperture/src/requests/http/send_transaction.rs
  • magicblock-aperture/src/requests/http/simulate_transaction.rs
  • magicblock-api/src/magic_validator.rs
  • magicblock-chainlink/src/chainlink/mod.rs
  • magicblock-chainlink/tests/10_non_primary_gating.rs
  • magicblock-chainlink/tests/utils/test_context.rs
  • magicblock-replicator/src/error.rs
  • magicblock-replicator/src/service/context.rs
  • magicblock-replicator/src/service/mod.rs

Comment on lines +244 to +249
run_primary_readiness_sequence(
|| self.scheduler.wait_for_idle(),
|| self.chainlink.reset_accounts_bank().map_err(Into::into),
|| self.enter_primary_mode(),
|| self.chainlink.enable_primary().map_err(Into::into),
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Promotion/demotion still race on stale coordination-mode visibility.

These helpers continue after enter_primary_mode() / enter_replica_mode(), but those methods only enqueue SchedulerMode on mode_tx and currently drop send failures. Since the new HTTP and chainlink gates consult CoordinationMode::current(), the node can keep behaving in the old mode—or proceed after the receiver is gone—while reset/enable/disable and consumer setup continue. Make the mode switch fallible and externally visible before running the rest of the sequence.

Also applies to: 264-268

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@magicblock-replicator/src/service/context.rs` around lines 244 - 249, The
promotion/demotion path currently enqueues SchedulerMode inside
enter_primary_mode()/enter_replica_mode() but silently drops send failures, so
the rest of run_primary_readiness_sequence()/run_replica_readiness_sequence()
can proceed while CoordinationMode::current() is stale or the receiver is gone;
change enter_primary_mode and enter_replica_mode to return Result<(), Error> (or
a suitable error type) and have them propagate the failure of mode_tx.send(...)
instead of ignoring it, then update the calls in run_primary_readiness_sequence
and the analogous replica sequence to invoke those methods through the fallible
closures you already pass (e.g. replace || self.enter_primary_mode() with a
closure that returns the Result), so the readiness sequence aborts on
send/send-failure and the mode switch is externally visible; ensure
SchedulerMode send failures are converted into the same error type you return so
callers can handle them.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@magicblock-api/src/magic_validator.rs`:
- Around line 133-135: When calling enable_primary().await before
send_scheduler_primary().await, ensure you roll back the primary enable if
send_scheduler_primary() fails: after a failed send_scheduler_primary(), call
the complementary disable_primary().await (or the appropriate shutdown/disable
function corresponding to enable_primary) to revert the partial startup, log any
rollback errors, and then return the original send error; apply the same
rollback pattern to the other occurrence around lines 912-933 where
enable_primary/send_scheduler_primary are paired.

In `@magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs`:
- Around line 308-313: The fail_all_pending_clones helper currently calls
.expect() on the pending_clones mutex which can panic during shutdown; change it
to handle a poisoned mutex instead—either change fail_all_pending_clones to
return Result<(), E> (propagate an error) or recover the lock with
lock().unwrap_or_else(|poisoned| poisoned.into_inner()) and log the poisoning
before proceeding; update uses of pending_clones in fail_all_pending_clones and
any callers accordingly and ensure you reference the pending_clones Mutex and
the fail_all_pending_clones function name when making the change.

In `@magicblock-chainlink/src/chainlink/mod.rs`:
- Around line 610-620: The method runtime_fetch_cloner currently acquires
self.runtime.lock().await before checking lifecycle_state, which can block; add
a fast-path check using
ChainlinkLifecycleState::from_u8(self.lifecycle_state.load(Ordering::SeqCst)) !=
ChainlinkLifecycleState::Enabled to return None immediately if not Enabled, then
after acquiring the runtime guard re-check the lifecycle_state to avoid a race
handing out a FetchCloner from runtime.fetch_cloner.clone(); update
runtime_fetch_cloner to perform (1) an early lifecycle check, (2) lock
self.runtime, and (3) a second lifecycle check before returning the cloned
FetchCloner.

In `@magicblock-chainlink/src/remote_account_provider/mod.rs`:
- Around line 871-884: The teardown code currently calls .expect("... lock
poisoned") on the removed_account_rx and fetching_accounts mutexes (via
removed_account_rx.lock() and fetching_accounts.lock()), which can panic;
replace these expect() calls with handling that recovers from a poisoned mutex
or propagates the error: use match or .lock().map_err(...) and either call
e.into_inner() to retrieve the inner value and continue cleanup, or
return/propagate the error from the teardown function (or log and continue) so
the process does not abort; update the blocks referencing removed_account_rx and
fetching_accounts to use this non-panicking behavior.
- Around line 824-899: Shutdown does not stop the fire-and-forget
spawn_deferred_pubsub_clients task, allowing late connections to reattach
clients via SubMuxClient::add_client; fix it by having
spawn_deferred_pubsub_clients return and store a JoinHandle (e.g.,
deferred_pubsub_task_handle) on the RemoteAccountProvider struct and/or make
that task check the existing shutdown_token before attaching new clients
(SubMuxClient::add_client), then in RemoteAccountProvider::shutdown() cancel the
shutdown_token (already present) and also take/timeout/abort and await the
deferred_pubsub_task_handle similar to account_updates_task_handle and
active_subscriptions_task_handle so the background task cannot resubscribe after
shutdown.

In `@magicblock-chainlink/src/submux/mod.rs`:
- Around line 948-964: The shutdown() function currently uses .expect() on three
mutexes (program_subs, dedup_cache, debounce_states) which can panic if
poisoned; change these to non-panicking handling: replace each
.lock().expect("...") call with code that matches on the Result from .lock(),
and either (a) recover the poisoned mutex by calling e.into_inner() and proceed
(optionally logging the poison), or (b) convert the poison into a
RemoteAccountProviderResult error and return early—choose one consistent
approach; ensure
AccountSubscriptionTask::Shutdown.process(self.clients_snapshot()).await and
shutdown_token.cancel() remain unaffected and include contextual logging when
recovering or returning an error so shutdown() does not abort due to a poisoned
mutex.

In `@magicblock-chainlink/tests/10_non_primary_gating.rs`:
- Around line 144-150: The loop currently iterates modes but never applies them
— change the test to actually configure the startup/coordination mode per
iteration by either adding a parameter to new_endpoint_chainlink(mode) or by
calling a setter on the returned chainlink (e.g., invoke a method like
chainlink.set_mode(mode) or configure the coordination/startup before running
assertions); ensure you use the mode string from the loop when
creating/configuring the endpoint so that each iteration exercises Replica,
ReplicaOnly, and StandBy behavior prior to calling
assert_no_runtime_or_remote_work and assert_public_calls_do_not_start_runtime.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3b4dbb6b-a8d1-4ece-b4ad-4f096f6c9596

📥 Commits

Reviewing files that changed from the base of the PR and between fc9b68b and 7946cc0.

📒 Files selected for processing (19)
  • magicblock-api/src/magic_validator.rs
  • magicblock-chainlink/src/chainlink/errors.rs
  • magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs
  • magicblock-chainlink/src/chainlink/fetch_cloner/tests.rs
  • magicblock-chainlink/src/chainlink/mod.rs
  • magicblock-chainlink/src/remote_account_provider/chain_laser_actor/actor.rs
  • magicblock-chainlink/src/remote_account_provider/chain_laser_actor/stream_manager.rs
  • magicblock-chainlink/src/remote_account_provider/chain_pubsub_actor.rs
  • magicblock-chainlink/src/remote_account_provider/chain_pubsub_client.rs
  • magicblock-chainlink/src/remote_account_provider/chain_updates_client.rs
  • magicblock-chainlink/src/remote_account_provider/endpoint.rs
  • magicblock-chainlink/src/remote_account_provider/mod.rs
  • magicblock-chainlink/src/remote_account_provider/tests.rs
  • magicblock-chainlink/src/submux/mod.rs
  • magicblock-chainlink/tests/09_waiter_reconciliation_race.rs
  • magicblock-chainlink/tests/10_non_primary_gating.rs
  • magicblock-chainlink/tests/11_lifecycle_shutdown.rs
  • magicblock-chainlink/tests/utils/test_context.rs
  • magicblock-replicator/src/service/context.rs

Comment on lines +133 to +135
enable_primary().await?;
send_scheduler_primary().await?;
Ok(())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add rollback if scheduler-primary publish fails after enable_primary().

Current ordering can return Err with Chainlink already enabled, leaving a partial startup state (runtime active, scheduler not in primary mode).

Suggested rollback pattern
 async fn run_standalone_primary_start_sequence<
     ResetBank,
     EnablePrimary,
     EnablePrimaryFuture,
+    DisablePrimary,
+    DisablePrimaryFuture,
     SendSchedulerPrimary,
     SendSchedulerPrimaryFuture,
 >(
     should_reset_accounts_bank: bool,
     reset_bank: ResetBank,
     enable_primary: EnablePrimary,
+    disable_primary: DisablePrimary,
     send_scheduler_primary: SendSchedulerPrimary,
 ) -> ApiResult<()>
 where
     ResetBank: FnOnce() -> ApiResult<()>,
     EnablePrimary: FnOnce() -> EnablePrimaryFuture,
     EnablePrimaryFuture: Future<Output = ApiResult<()>>,
+    DisablePrimary: FnOnce() -> DisablePrimaryFuture,
+    DisablePrimaryFuture: Future<Output = ApiResult<()>>,
     SendSchedulerPrimary: FnOnce() -> SendSchedulerPrimaryFuture,
     SendSchedulerPrimaryFuture: Future<Output = ApiResult<()>>,
 {
     if should_reset_accounts_bank {
         reset_bank()?;
     }
     enable_primary().await?;
-    send_scheduler_primary().await?;
+    if let Err(err) = send_scheduler_primary().await {
+        let _ = disable_primary().await; // best-effort rollback
+        return Err(err);
+    }
     Ok(())
 }
             run_standalone_primary_start_sequence(
                 !self.config.accountsdb.reset,
                 || {
                     let step_start = Instant::now();
                     self.chainlink.reset_accounts_bank()?;
                     log_timing("startup", "reset_accounts_bank", step_start);
                     Ok(())
                 },
                 || async {
                     self.chainlink.enable_primary().await.map_err(Into::into)
                 },
+                || async {
+                    self.chainlink.disable().await.map_err(Into::into)
+                },
                 || async {
                     self.mode_tx.send(SchedulerMode::Primary).await.map_err(
                         |e| {
                             ApiError::FailedToSendModeSwitch(format!(
                                 "Failed to send primary mode to scheduler: {e}"
                             ))
                         },
                     )
                 },
             )

Also applies to: 912-933

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@magicblock-api/src/magic_validator.rs` around lines 133 - 135, When calling
enable_primary().await before send_scheduler_primary().await, ensure you roll
back the primary enable if send_scheduler_primary() fails: after a failed
send_scheduler_primary(), call the complementary disable_primary().await (or the
appropriate shutdown/disable function corresponding to enable_primary) to revert
the partial startup, log any rollback errors, and then return the original send
error; apply the same rollback pattern to the other occurrence around lines
912-933 where enable_primary/send_scheduler_primary are paired.

Comment on lines +308 to +313
fn fail_all_pending_clones(&self) {
let pending = {
let mut map = self
.pending_clones
.lock()
.expect("pending_clones mutex poisoned");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t panic in the shutdown path.

Line 313 can crash validator teardown if pending_clones was poisoned. This helper should return an error or explicitly recover the inner state instead of aborting the process. As per coding guidelines, Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs` around lines 308 -
313, The fail_all_pending_clones helper currently calls .expect() on the
pending_clones mutex which can panic during shutdown; change it to handle a
poisoned mutex instead—either change fail_all_pending_clones to return
Result<(), E> (propagate an error) or recover the lock with
lock().unwrap_or_else(|poisoned| poisoned.into_inner()) and log the poisoning
before proceeding; update uses of pending_clones in fail_all_pending_clones and
any callers accordingly and ensure you reference the pending_clones Mutex and
the fail_all_pending_clones function name when making the change.

Comment on lines +610 to +620
async fn runtime_fetch_cloner(
&self,
) -> Option<Arc<FetchCloner<T, U, V, C>>> {
let runtime = self.runtime.lock().await;
if ChainlinkLifecycleState::from_u8(
self.lifecycle_state.load(Ordering::SeqCst),
) != ChainlinkLifecycleState::Enabled
{
return None;
}
runtime.as_ref().map(|runtime| runtime.fetch_cloner.clone())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

# Check the file exists and get basic info
wc -l magicblock-chainlink/src/chainlink/mod.rs

Repository: magicblock-labs/magicblock-validator

Length of output: 126


🏁 Script executed:

# View the runtime_fetch_cloner method (around lines 610-620)
sed -n '600,630p' magicblock-chainlink/src/chainlink/mod.rs

Repository: magicblock-labs/magicblock-validator

Length of output: 1066


🏁 Script executed:

# View the enable_primary method (around lines 697-775)
sed -n '690,780p' magicblock-chainlink/src/chainlink/mod.rs

Repository: magicblock-labs/magicblock-validator

Length of output: 3058


🏁 Script executed:

# Search for any .unwrap() or .expect() calls in the file
rg -n '\.(unwrap|expect)\(' magicblock-chainlink/src/chainlink/mod.rs

Repository: magicblock-labs/magicblock-validator

Length of output: 62


Check lifecycle state before acquiring the runtime mutex to avoid blocking on state transitions.

runtime_fetch_cloner() blocks waiting for self.runtime.lock() before checking whether the lifecycle is Enabled. Since enable_primary() holds that same lock during the async create_runtime_from_build_config() call, any callers of fetch_accounts() or ensure_accounts() will stall in the Starting state instead of immediately returning None. Add a fast-path check of lifecycle_state before locking, then re-check after acquiring the guard to prevent a concurrent demotion from handing out a FetchCloner.

Suggested fix
 async fn runtime_fetch_cloner(
     &self,
 ) -> Option<Arc<FetchCloner<T, U, V, C>>> {
+    if ChainlinkLifecycleState::from_u8(
+        self.lifecycle_state.load(Ordering::SeqCst),
+    ) != ChainlinkLifecycleState::Enabled
+    {
+        return None;
+    }
     let runtime = self.runtime.lock().await;
     if ChainlinkLifecycleState::from_u8(
         self.lifecycle_state.load(Ordering::SeqCst),
     ) != ChainlinkLifecycleState::Enabled
     {
         return None;
     }
     runtime.as_ref().map(|runtime| runtime.fetch_cloner.clone())
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@magicblock-chainlink/src/chainlink/mod.rs` around lines 610 - 620, The method
runtime_fetch_cloner currently acquires self.runtime.lock().await before
checking lifecycle_state, which can block; add a fast-path check using
ChainlinkLifecycleState::from_u8(self.lifecycle_state.load(Ordering::SeqCst)) !=
ChainlinkLifecycleState::Enabled to return None immediately if not Enabled, then
after acquiring the runtime guard re-check the lifecycle_state to avoid a race
handing out a FetchCloner from runtime.fetch_cloner.clone(); update
runtime_fetch_cloner to perform (1) an early lifecycle check, (2) lock
self.runtime, and (3) a second lifecycle check before returning the cloned
FetchCloner.

Comment on lines +824 to +899
pub async fn shutdown(&self) -> RemoteAccountProviderResult<()> {
self.shutdown_token.cancel();

let shutdown_result = match self.pubsub_client.shutdown().await {
Ok(()) => Ok(()),
Err(err) => {
warn!(error = %err, "Remote account pubsub shutdown failed");
Err(err)
}
};

if let Some(mut handle) =
self.account_updates_task_handle.lock().await.take()
{
match time::timeout(Duration::from_secs(2), &mut handle).await {
Ok(Ok(())) => {}
Ok(Err(err)) => {
warn!(error = %err, "Account updates task failed during shutdown");
}
Err(_) => {
warn!(
"Account updates task did not stop in time; aborting"
);
handle.abort();
let _ = handle.await;
}
}
}

if let Some(mut handle) =
self.active_subscriptions_task_handle.lock().await.take()
{
match time::timeout(Duration::from_secs(2), &mut handle).await {
Ok(Ok(())) => {}
Ok(Err(err)) => {
warn!(error = %err, "Active subscriptions task failed during shutdown");
}
Err(_) => {
warn!(
"Active subscriptions task did not stop in time; aborting"
);
handle.abort();
let _ = handle.await;
}
}
}

if let Some(mut removed_account_rx) = self
.removed_account_rx
.lock()
.expect("removed_account_rx lock poisoned")
.take()
{
while removed_account_rx.try_recv().is_ok() {}
}

{
let mut fetching = self
.fetching_accounts
.lock()
.expect("fetching_accounts lock poisoned");
for (_, state) in fetching.drain() {
for waiter in state.waiters {
let _ = waiter.send(Err(
RemoteAccountProviderError::AccountSubscriptionsTaskFailed(
"remote account provider shut down".to_string(),
),
));
}
}
}

self.subscription_ownership.lock().await.clear();
self.subscription_key_locks.lock().await.clear();

shutdown_result
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Shutdown still leaves deferred pubsub attachment alive.

spawn_deferred_pubsub_clients() is fire-and-forget, and this shutdown path never cancels or joins it. A connection that resolves after Line 825 can still attach a new client and resubscribe accounts/programs through SubMuxClient::add_client(), recreating remote side effects after the provider was disabled.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@magicblock-chainlink/src/remote_account_provider/mod.rs` around lines 824 -
899, Shutdown does not stop the fire-and-forget spawn_deferred_pubsub_clients
task, allowing late connections to reattach clients via
SubMuxClient::add_client; fix it by having spawn_deferred_pubsub_clients return
and store a JoinHandle (e.g., deferred_pubsub_task_handle) on the
RemoteAccountProvider struct and/or make that task check the existing
shutdown_token before attaching new clients (SubMuxClient::add_client), then in
RemoteAccountProvider::shutdown() cancel the shutdown_token (already present)
and also take/timeout/abort and await the deferred_pubsub_task_handle similar to
account_updates_task_handle and active_subscriptions_task_handle so the
background task cannot resubscribe after shutdown.

Comment on lines +871 to +884
if let Some(mut removed_account_rx) = self
.removed_account_rx
.lock()
.expect("removed_account_rx lock poisoned")
.take()
{
while removed_account_rx.try_recv().is_ok() {}
}

{
let mut fetching = self
.fetching_accounts
.lock()
.expect("fetching_accounts lock poisoned");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid expect() during provider teardown.

Lines 874 and 884 can turn a best-effort shutdown into a process panic if either mutex was poisoned. Propagate the failure or explicitly recover the inner state instead. As per coding guidelines, Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@magicblock-chainlink/src/remote_account_provider/mod.rs` around lines 871 -
884, The teardown code currently calls .expect("... lock poisoned") on the
removed_account_rx and fetching_accounts mutexes (via removed_account_rx.lock()
and fetching_accounts.lock()), which can panic; replace these expect() calls
with handling that recovers from a poisoned mutex or propagates the error: use
match or .lock().map_err(...) and either call e.into_inner() to retrieve the
inner value and continue cleanup, or return/propagate the error from the
teardown function (or log and continue) so the process does not abort; update
the blocks referencing removed_account_rx and fetching_accounts to use this
non-panicking behavior.

Comment on lines +948 to +964
pub async fn shutdown(&self) -> RemoteAccountProviderResult<()> {
self.shutdown_token.cancel();
let result = AccountSubscriptionTask::Shutdown
.process(self.clients_snapshot())
.await;
self.program_subs
.lock()
.expect("program_subs lock poisoned")
.clear();
self.dedup_cache
.lock()
.expect("dedup_cache lock poisoned")
.clear();
self.debounce_states
.lock()
.expect("debounce_states lock poisoned")
.clear();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t let mux shutdown panic on poisoned state.

Lines 955, 959, and 963 can abort shutdown instead of completing it. Return an error or explicitly recover the poisoned mutex state here. As per coding guidelines, Treat any usage of .unwrap() or .expect() in production Rust code as a MAJOR issue.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@magicblock-chainlink/src/submux/mod.rs` around lines 948 - 964, The
shutdown() function currently uses .expect() on three mutexes (program_subs,
dedup_cache, debounce_states) which can panic if poisoned; change these to
non-panicking handling: replace each .lock().expect("...") call with code that
matches on the Result from .lock(), and either (a) recover the poisoned mutex by
calling e.into_inner() and proceed (optionally logging the poison), or (b)
convert the poison into a RemoteAccountProviderResult error and return
early—choose one consistent approach; ensure
AccountSubscriptionTask::Shutdown.process(self.clients_snapshot()).await and
shutdown_token.cancel() remain unaffected and include contextual logging when
recovering or returning an error so shutdown() does not abort due to a poisoned
mutex.

Comment on lines +144 to +150
for mode in ["Replica", "ReplicaOnly", "StandBy"] {
let (chainlink, bank) = new_endpoint_chainlink().await;
assert_no_runtime_or_remote_work(&chainlink).await;
assert_public_calls_do_not_start_runtime(&chainlink, &bank).await;
assert_no_runtime_or_remote_work(&chainlink).await;
drop((mode, chainlink));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Mode loop does not actually test different startup modes.

mode is never applied (only dropped), so this runs the same scenario three times and does not validate Replica/ReplicaOnly/StandBy-specific behavior. Please set the coordination/startup mode per iteration (or parameterize new_endpoint_chainlink with mode) before assertions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@magicblock-chainlink/tests/10_non_primary_gating.rs` around lines 144 - 150,
The loop currently iterates modes but never applies them — change the test to
actually configure the startup/coordination mode per iteration by either adding
a parameter to new_endpoint_chainlink(mode) or by calling a setter on the
returned chainlink (e.g., invoke a method like chainlink.set_mode(mode) or
configure the coordination/startup before running assertions); ensure you use
the mode string from the loop when creating/configuring the endpoint so that
each iteration exercises Replica, ReplicaOnly, and StandBy behavior prior to
calling assert_no_runtime_or_remote_work and
assert_public_calls_do_not_start_runtime.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant