Commit f9fdcf9

Add 2PC implementation plan with corrected protocol
Documents the full pipelined 2PC protocol for coordinator and participant, including the persistence barrier, serializable isolation (participant holds MutTxId across all calls in a coordinator transaction), two-phase participant response (immediate result + deferred PREPARED after durability), abort paths, commitlog format, and replay semantics. Identifies the open problem: MutTxId is !Send but must be held across multiple HTTP requests on the participant side.
1 parent 1448c55 commit f9fdcf9

4 files changed

Lines changed: 216 additions & 86 deletions

Lines changed: 105 additions & 0 deletions
# 2PC Implementation Plan (Pipelined)

## Context

The TPC-C benchmark on branch `origin/phoebe/tpcc/reducer-return-value` (public submodule) uses non-atomic HTTP calls for cross-database operations. We need 2PC so distributed transactions either commit on both databases or neither. Pipelined 2PC is chosen because it avoids blocking on persistence during lock-holding, and the codebase already separates in-memory commit from durability.

## Protocol (Corrected)

### Participant happy path:

1. Receive CALL from coordinator (reducer name + args)
2. Execute reducer (write lock held)
3. Return result to coordinator (write lock still held, transaction still open)
4. Possibly receive more CALLs from coordinator (same transaction, same write lock)
5. Receive END_CALLS from coordinator ("no more reducer calls in this transaction")
6. Commit in-memory (release write lock)
7. Send PREPARE to durability worker
8. **Barrier up** -- no more durability requests go through
9. In background: wait for PREPARE to be durable
10. Once durable: send PREPARED to coordinator
11. Wait for COMMIT or ABORT from coordinator
12. Receive COMMIT
13. Send COMMIT to durability worker
14. **Barrier down** -- flush buffered requests
### Coordinator happy path:

1. Execute reducer, calling participant reducers along the way (participants hold write locks, return results, but don't commit)
2. Reducer succeeds
3. Send END_CALLS to all participants (they can now commit in-memory)
4. Commit coordinator in-memory (release write lock)
5. Send PREPARE to durability worker
6. **Barrier up** -- no more durability requests go through
7. Wait for coordinator's own PREPARE to be durable
8. Wait for all participants to report PREPARED
9. Send COMMIT to all participants
10. Send COMMIT to durability worker
11. **Barrier down** -- flush buffered requests
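The participant's side of the two happy paths can be sketched as a small state machine. This is an illustrative model only -- `Msg`, `ParticipantState`, and `step` are hypothetical names, not types from the codebase:

```rust
// Events as seen by the participant: Call/EndCalls/Commit/Abort arrive from the
// coordinator; Prepared is the participant's own "PREPARE became durable" event.
#[derive(Debug, PartialEq)]
enum Msg { Call, EndCalls, Prepared, Commit, Abort }

#[derive(Debug, PartialEq, Clone, Copy)]
enum ParticipantState {
    /// Write lock held, transaction open; more CALLs may arrive (steps 1-4).
    Executing,
    /// In-memory commit done, PREPARE sent to durability, barrier up (steps 5-9).
    Preparing,
    /// PREPARED sent; waiting for the coordinator's decision (steps 10-11).
    AwaitingDecision,
    /// COMMIT applied, barrier down (steps 12-14).
    Done,
    /// Rolled back.
    Aborted,
}

fn step(state: ParticipantState, msg: &Msg) -> ParticipantState {
    use ParticipantState::*;
    match (state, msg) {
        (Executing, Msg::Call) => Executing,
        (Executing, Msg::EndCalls) => Preparing,
        (Preparing, Msg::Prepared) => AwaitingDecision,
        (AwaitingDecision, Msg::Commit) => Done,
        (Executing, Msg::Abort) | (Preparing, Msg::Abort) | (AwaitingDecision, Msg::Abort) => Aborted,
        (s, m) => panic!("protocol violation: {m:?} in state {s:?}"),
    }
}

fn main() {
    let events = [Msg::Call, Msg::Call, Msg::EndCalls, Msg::Prepared, Msg::Commit];
    let end = events.iter().fold(ParticipantState::Executing, |s, e| step(s, e));
    assert_eq!(end, ParticipantState::Done);
    println!("happy path ok");
}
```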
### Key correctness properties:

- **Serializable isolation**: Participant holds the write lock from CALL through END_CALLS. Multiple CALLs from the same coordinator transaction execute within the same MutTxId on the participant, so the second call sees the first call's writes.
- **Persistence barrier**: After PREPARE is sent to durability (steps 7/8 on the participant, steps 5/6 on the coordinator), no speculative transactions can reach the durability worker until COMMIT or ABORT. Anything sent to the durability worker can eventually become persistent, so the barrier is required.
- **Two responses from participant**: the immediate result (step 3) and the later PREPARED notification (step 10). The coordinator collects both: results during reducer execution, PREPARED notifications before deciding COMMIT.
- **Pipelining benefit**: Locks are held only during reducer execution (steps 1-6), not during persistence (steps 7-14). The persistence and 2PC handshake happen after locks are released on both sides.
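The persistence-barrier property above can be illustrated by a standalone sketch, simplified from the real `PersistenceBarrier` in `relational_db.rs`: requests are plain strings here, and all locking is a single `Mutex`:

```rust
use std::sync::Mutex;

#[derive(Debug, PartialEq, Clone, Copy)]
enum BarrierState { Inactive, Armed, Active }

struct Barrier {
    inner: Mutex<(BarrierState, Vec<String>)>,
}

impl Barrier {
    fn new() -> Self {
        Barrier { inner: Mutex::new((BarrierState::Inactive, Vec::new())) }
    }

    /// Arm before committing, while the write lock is still held.
    fn arm(&self) {
        let mut g = self.inner.lock().unwrap();
        assert_eq!(g.0, BarrierState::Inactive);
        g.0 = BarrierState::Armed;
        g.1.clear();
    }

    /// Returns Some(req) if the request should reach the durability worker,
    /// None if it was buffered behind the barrier.
    fn filter(&self, req: String) -> Option<String> {
        let mut g = self.inner.lock().unwrap();
        match g.0 {
            BarrierState::Inactive => Some(req),
            // The first request after arming is the PREPARE: let it through,
            // then start buffering everything else.
            BarrierState::Armed => { g.0 = BarrierState::Active; Some(req) }
            BarrierState::Active => { g.1.push(req); None }
        }
    }

    /// On COMMIT, flush the returned requests; on ABORT, discard them.
    fn deactivate(&self) -> Vec<String> {
        let mut g = self.inner.lock().unwrap();
        g.0 = BarrierState::Inactive;
        std::mem::take(&mut g.1)
    }
}

fn main() {
    let b = Barrier::new();
    b.arm();
    assert_eq!(b.filter("PREPARE".to_string()), Some("PREPARE".to_string()));
    assert_eq!(b.filter("speculative-tx".to_string()), None); // buffered
    let flushed = b.deactivate();                             // COMMIT path
    assert_eq!(flushed, vec!["speculative-tx".to_string()]);
    println!("barrier ok");
}
```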
### Abort paths:

**Coordinator's reducer fails (step 2):**
- Send ABORT to all participants (they still hold write locks)
- Participants roll back their MutTxId (release write lock, no changes)
- No PREPARE was sent, so no barrier is needed

**Participant's reducer fails (step 2):**
- Participant returns an error to the coordinator
- Coordinator's reducer fails (propagates the error)
- Coordinator sends ABORT to all other participants that succeeded
- Those participants roll back their MutTxId

**Coordinator's PREPARE persists but a participant's PREPARE fails to persist:**
- Participant cannot send PREPARED
- Coordinator times out waiting for PREPARED
- Coordinator sends ABORT to all participants
- Coordinator inverts its own in-memory state and discards buffered durability requests

**Crash during protocol:**
- See proposal §8 for recovery rules
### Open problem: MutTxId is !Send

The participant holds a MutTxId across multiple HTTP requests (CALL, more CALLs, END_CALLS). MutTxId is !Send (it holds a SharedWriteGuard). Options:

1. **Dedicated blocking thread per participant transaction**: spawn_blocking holds the MutTxId and communicates via channels. HTTP handlers send messages; the blocking thread processes them.
2. **Session-based protocol**: The participant creates a session on the first CALL and routes subsequent CALLs and END_CALLS to the same thread/task that holds the MutTxId.
3. **Batch all calls**: The coordinator sends all reducer calls + args in a single request. The participant executes them all, returns all results, then commits. A single HTTP round-trip, no cross-request MutTxId holding.

Option 3 is simplest but prevents the coordinator from making decisions between calls. Option 1 is the most general. TBD.
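Option 1 can be sketched with std threads and channels (the real code would use `tokio::task::spawn_blocking`). The point is that the !Send transaction state is created and dropped on one dedicated thread, and request handlers only ever touch it through messages. `Tx` is an illustrative stand-in for MutTxId, not a real type:

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for the !Send MutTxId (in reality it holds a lock guard).
struct Tx { writes: Vec<String> }

enum Req {
    Call { reducer: String, reply: mpsc::Sender<String> },
    EndCalls { reply: mpsc::Sender<usize> },
}

fn spawn_participant_tx() -> mpsc::Sender<Req> {
    let (tx, rx) = mpsc::channel::<Req>();
    thread::spawn(move || {
        // The transaction lives entirely on this thread; it never crosses a
        // thread boundary, so !Send is not a problem.
        let mut open_tx = Tx { writes: Vec::new() };
        while let Ok(req) = rx.recv() {
            match req {
                Req::Call { reducer, reply } => {
                    // Execute the reducer inside the open transaction.
                    open_tx.writes.push(reducer.clone());
                    let _ = reply.send(format!("result of {reducer}"));
                }
                Req::EndCalls { reply } => {
                    // Commit in-memory here; report how many calls ran.
                    let _ = reply.send(open_tx.writes.len());
                    break;
                }
            }
        }
    });
    tx
}

fn main() {
    let handle = spawn_participant_tx();
    let (r1, rx1) = mpsc::channel();
    handle.send(Req::Call { reducer: "new_order".into(), reply: r1 }).unwrap();
    assert_eq!(rx1.recv().unwrap(), "result of new_order");

    let (r2, rx2) = mpsc::channel();
    handle.send(Req::EndCalls { reply: r2 }).unwrap();
    assert_eq!(rx2.recv().unwrap(), 1);
    println!("participant tx thread ok");
}
```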
## Commitlog format

- PREPARE record: includes all row changes (inserts/deletes)
- COMMIT record: follows PREPARE, marks the transaction as committed
- ABORT record: follows PREPARE, marks the transaction as aborted
- No other records can appear between PREPARE and COMMIT/ABORT in the durable log (the persistence barrier enforces this)
## Replay semantics

On replay, when encountering a PREPARE:

- Do not apply it to the datastore
- Read the next record:
  - COMMIT: apply the PREPARE's changes
  - ABORT: skip the PREPARE
  - No next record (crash): the transaction is still in progress; wait for the coordinator, or time out and abort
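The replay rule above can be sketched as a loop over records. `Record` and its variants are illustrative names (the real commitlog types live elsewhere); an ordinary single-database `Tx` record is assumed alongside the 2PC records, and changes are plain strings:

```rust
#[derive(Debug)]
enum Record {
    Tx(Vec<&'static str>),      // ordinary transaction: apply immediately
    Prepare(Vec<&'static str>), // 2PC: hold until the outcome record
    Commit,
    Abort,
}

/// Replays the log; returns the applied changes and whether a PREPARE is
/// still pending (i.e. the log ended between PREPARE and its outcome).
fn replay(log: &[Record]) -> (Vec<&'static str>, bool) {
    let mut applied = Vec::new();
    let mut pending: Option<Vec<&'static str>> = None;
    for rec in log {
        match rec {
            Record::Tx(changes) => applied.extend(changes.iter().copied()),
            Record::Prepare(changes) => {
                // Do not apply yet; the next record decides the outcome.
                assert!(pending.is_none(), "PREPARE while a PREPARE is pending");
                pending = Some(changes.clone());
            }
            Record::Commit => applied.extend(pending.take().expect("COMMIT without PREPARE")),
            Record::Abort => { pending.take().expect("ABORT without PREPARE"); }
        }
    }
    let in_doubt = pending.is_some();
    (applied, in_doubt)
}

fn main() {
    let log = [
        Record::Tx(vec!["a"]),
        Record::Prepare(vec!["b"]),
        Record::Commit,
        Record::Prepare(vec!["c"]),
        Record::Abort,
        Record::Prepare(vec!["d"]), // crash: no outcome record follows
    ];
    let (applied, in_doubt) = replay(&log);
    assert_eq!(applied, vec!["a", "b"]);
    assert!(in_doubt); // must ask the coordinator, or time out and abort
    println!("replay ok");
}
```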
## Key files

- `crates/core/src/db/relational_db.rs` -- PersistenceBarrier, arm/deactivate, send_or_buffer_durability
- `crates/core/src/host/prepared_tx.rs` -- PreparedTxInfo, PreparedTransactions registry
- `crates/core/src/host/module_host.rs` -- prepare_reducer, commit_prepared, abort_prepared
- `crates/core/src/host/wasm_common/module_host_actor.rs` -- coordinator post-commit coordination
- `crates/core/src/host/instance_env.rs` -- call_reducer_on_db_2pc, prepared_participants tracking
- `crates/core/src/host/wasmtime/wasm_instance_env.rs` -- WASM host function
- `crates/client-api/src/routes/database.rs` -- HTTP endpoints
- `crates/bindings-sys/src/lib.rs` -- FFI
- `crates/bindings/src/remote_reducer.rs` -- safe wrapper

crates/core/src/db/relational_db.rs

Lines changed: 73 additions & 65 deletions
@@ -96,12 +96,24 @@ pub struct PersistenceBarrier {
     inner: std::sync::Mutex<PersistenceBarrierInner>,
 }

+#[derive(Default, PartialEq, Eq, Debug, Clone, Copy)]
+enum BarrierState {
+    /// No 2PC in progress. Durability requests go through normally.
+    #[default]
+    Inactive,
+    /// A 2PC is about to commit. The NEXT durability request is the PREPARE
+    /// and should go through to the worker. After that request, the barrier
+    /// transitions to Active automatically.
+    Armed,
+    /// A 2PC PREPARE has been sent to durability. All subsequent durability
+    /// requests are buffered until the barrier is deactivated (COMMIT or ABORT).
+    Active,
+}
+
 #[derive(Default)]
 struct PersistenceBarrierInner {
-    /// If Some, a PREPARE is pending at this offset. All durability requests
-    /// are buffered until the barrier is lifted.
-    active_prepare: Option<TxOffset>,
-    /// Buffered durability requests that arrived while the barrier was active.
+    state: BarrierState,
+    /// Buffered durability requests that arrived while the barrier was Active.
     buffered: Vec<BufferedDurabilityRequest>,
 }

@@ -110,48 +122,64 @@ impl PersistenceBarrier {
         Self::default()
     }

-    /// Activate the barrier for a PREPARE at the given offset.
-    pub fn activate(&self, prepare_offset: TxOffset) {
+    /// Arm the barrier. The next durability request will go through (it's the
+    /// PREPARE), and then the barrier transitions to Active, buffering all
+    /// subsequent requests.
+    ///
+    /// This must be called BEFORE the transaction commits, while the write lock
+    /// is still held. This ensures no other transaction can send a durability
+    /// request between the PREPARE and the barrier activation.
+    pub fn arm(&self) {
         let mut inner = self.inner.lock().unwrap();
-        assert!(
-            inner.active_prepare.is_none(),
-            "persistence barrier already active at offset {:?}, cannot activate for {prepare_offset}",
-            inner.active_prepare,
+        assert_eq!(
+            inner.state,
+            BarrierState::Inactive,
+            "persistence barrier must be Inactive to arm, but is {:?}",
+            inner.state,
         );
-        inner.active_prepare = Some(prepare_offset);
+        inner.state = BarrierState::Armed;
         inner.buffered.clear();
     }

-    /// If the barrier is active, buffer the durability request and return None.
-    /// If the barrier is not active, return the arguments back unchanged.
-    pub fn try_buffer(
+    /// Called by `send_or_buffer_durability` for every durability request.
+    ///
+    /// Returns `Some(reducer_context)` if the request should be sent to the
+    /// durability worker (barrier is Inactive, or barrier is Armed and this is
+    /// the PREPARE). Returns `None` if the request was buffered (barrier is Active).
+    pub fn filter_durability_request(
         &self,
         reducer_context: Option<ReducerContext>,
         tx_data: &Arc<TxData>,
     ) -> Option<Option<ReducerContext>> {
         let mut inner = self.inner.lock().unwrap();
-        if inner.active_prepare.is_some() {
-            inner.buffered.push(BufferedDurabilityRequest {
-                reducer_context,
-                tx_data: tx_data.clone(),
-            });
-            None // buffered
-        } else {
-            Some(reducer_context) // not buffered, return back
+        match inner.state {
+            BarrierState::Inactive => {
+                // No barrier. Let it through.
+                Some(reducer_context)
+            }
+            BarrierState::Armed => {
+                // This is the PREPARE request. Let it through, then go Active.
+                inner.state = BarrierState::Active;
+                Some(reducer_context)
+            }
+            BarrierState::Active => {
+                // Buffer this request.
+                inner.buffered.push(BufferedDurabilityRequest {
+                    reducer_context,
+                    tx_data: tx_data.clone(),
+                });
+                None
+            }
         }
     }

     /// Deactivate the barrier and return the buffered requests.
+    /// Called on COMMIT (to flush them) or ABORT (to discard them).
     pub fn deactivate(&self) -> Vec<BufferedDurabilityRequest> {
         let mut inner = self.inner.lock().unwrap();
-        inner.active_prepare = None;
+        inner.state = BarrierState::Inactive;
         std::mem::take(&mut inner.buffered)
     }
-
-    /// Check if the barrier is currently active.
-    pub fn is_active(&self) -> bool {
-        self.inner.lock().unwrap().active_prepare.is_some()
-    }
 }

 /// We've added a module version field to the system tables, but we don't yet
@@ -924,52 +952,32 @@ impl RelationalDB {

     /// Send a durability request, or buffer it if the persistence barrier is active.
     fn send_or_buffer_durability(&self, reducer_context: Option<ReducerContext>, tx_data: &Arc<TxData>) {
-        match self.persistence_barrier.try_buffer(reducer_context, tx_data) {
-            None => {
-                // Buffered behind the persistence barrier; will be flushed on COMMIT
-                // or discarded on ABORT.
-            }
+        match self.persistence_barrier.filter_durability_request(reducer_context, tx_data) {
             Some(reducer_context) => {
-                // Not buffered (barrier not active). Send to durability worker.
+                // Either barrier is Inactive (normal path) or Armed (this is the PREPARE).
+                // Send to durability worker.
                 if let Some(durability) = &self.durability {
                     durability.request_durability(reducer_context, tx_data);
                 }
             }
+            None => {
+                // Buffered behind the persistence barrier (Active state).
+            }
         }
     }

-    /// Commit a transaction as a 2PC PREPARE: commit in-memory, send to
-    /// durability worker, and activate the persistence barrier.
+    /// Arm the persistence barrier for a 2PC PREPARE.
     ///
-    /// Returns the TxOffset and TxData. The caller should then wait for the
-    /// PREPARE to become durable (via `durable_tx_offset().wait_for(offset)`)
-    /// before sending PREPARED to the coordinator.
-    #[tracing::instrument(level = "trace", skip_all)]
-    pub fn commit_tx_prepare(
-        &self,
-        tx: MutTx,
-    ) -> Result<Option<(TxOffset, Arc<TxData>, TxMetrics, Option<ReducerName>)>, DBError> {
-        log::trace!("COMMIT MUT TX (2PC PREPARE)");
-
-        let reducer_context = tx.ctx.reducer_context().cloned();
-        let Some((tx_offset, tx_data, tx_metrics, reducer)) = self.inner.commit_mut_tx(tx)? else {
-            return Ok(None);
-        };
-
-        self.maybe_do_snapshot(&tx_data);
-
-        let tx_data = Arc::new(tx_data);
-
-        // Send the PREPARE to durability (bypassing the barrier, since this IS the prepare).
-        if let Some(durability) = &self.durability {
-            durability.request_durability(reducer_context.clone(), &tx_data);
-        }
-
-        // Activate the persistence barrier AFTER sending the PREPARE.
-        // All subsequent durability requests will be buffered.
-        self.persistence_barrier.activate(tx_offset);
-
-        Ok(Some((tx_offset, tx_data, tx_metrics, reducer)))
+    /// Call this BEFORE committing the transaction (while the write lock is
+    /// still held). The next durability request (the PREPARE) will go through
+    /// to the worker normally. After that, all subsequent durability requests
+    /// are buffered until `finalize_prepare_commit()` or `finalize_prepare_abort()`.
+    ///
+    /// This ensures no speculative transaction can reach the durability worker
+    /// between the PREPARE and the COMMIT/ABORT decision, even though the
+    /// write lock is released by `commit_tx_downgrade`.
+    pub fn arm_persistence_barrier(&self) {
+        self.persistence_barrier.arm();
     }

     /// Finalize a 2PC transaction as COMMIT.

crates/core/src/host/module_host.rs

Lines changed: 8 additions & 18 deletions
@@ -1810,11 +1810,6 @@ impl ModuleHost {
             let _ = durable_offset.wait_for(current + 1).await;
         }

-        // PREPARE is now durable. Deactivate the barrier and flush all
-        // buffered speculative transactions to the durability worker.
-        // Subsequent transactions can persist normally until the next PREPARE.
-        self.relational_db().finalize_prepare_commit();
-
         Ok((prepare_id, result, return_value))
     } else {
         // Reducer failed -- no prepare_id since nothing to commit/abort.
@@ -1824,30 +1819,25 @@

     /// Finalize a prepared transaction as COMMIT.
     ///
-    /// The persistence barrier was already deactivated (and buffered requests
-    /// flushed) when the PREPARE became durable in `prepare_reducer`. This
-    /// method just removes the prepared tx from the registry.
-    ///
-    /// TODO: Write a COMMIT record to the commitlog so replay knows to apply
-    /// the PREPARE.
+    /// Deactivates the persistence barrier and flushes all buffered durability
+    /// requests to the durability worker.
     pub fn commit_prepared(&self, prepare_id: &str) -> Result<(), String> {
-        self.prepared_txs
+        let _info = self.prepared_txs
             .remove(prepare_id)
             .ok_or_else(|| format!("no such prepared transaction: {prepare_id}"))?;
+        self.relational_db().finalize_prepare_commit();
         Ok(())
     }

     /// Abort a prepared transaction.
     ///
-    /// Inverts the PREPARE's in-memory changes and writes an ABORT record
-    /// so replay knows to skip the PREPARE.
-    ///
-    /// TODO: Actually invert in-memory state and write ABORT to commitlog.
+    /// Deactivates the persistence barrier, discards all buffered durability
+    /// requests, and inverts the PREPARE's in-memory changes.
     pub fn abort_prepared(&self, prepare_id: &str) -> Result<(), String> {
-        let _info = self.prepared_txs
+        let info = self.prepared_txs
             .remove(prepare_id)
             .ok_or_else(|| format!("no such prepared transaction: {prepare_id}"))?;
-        log::warn!("2PC abort for {prepare_id}: in-memory inversion not yet implemented");
+        self.relational_db().finalize_prepare_abort(&info.tx_data);
         Ok(())
     }
