Skip to content

fix(ci): wait for container removal before docker compose up#104

Merged
samtuckerdavis merged 2 commits intomainfrom
fix/prod-deploy-container-removal-race
Apr 22, 2026
Merged

fix(ci): wait for container removal before docker compose up#104
samtuckerdavis merged 2 commits intomainfrom
fix/prod-deploy-container-removal-race

Conversation

@samtuckerdavis
Copy link
Copy Markdown
Contributor

@samtuckerdavis samtuckerdavis commented Apr 22, 2026

Summary

  • Fixes the production Deploy workflow, which has failed 8+ consecutive times on main with a Docker container-name conflict.
  • Adds a wait loop after docker compose down / docker rm -f so docker compose up -d doesn't race with an in-flight container removal.

Root cause

Every recent failure (runs 24698290423, 24698091488, 24698090290, ...) dies with one of:

Container cryptpad-cryptpad-1  Error response from daemon: removal of container <id> is already in progress
Container cryptpad-cryptpad-1  Error response from daemon: Conflict. The container name "/cryptpad-cryptpad-1" is already in use ...

Both docker compose down --timeout 30 and docker rm -f return to the shell as soon as the daemon accepts the request; actual removal of the filesystem/network resources can still be in progress. The immediately-following docker compose up -d then collides with its own in-flight teardown. The existing || true and 2>/dev/null || true masked the cleanup errors but not the subsequent up failure.

Fix

After the existing down + rm -f, poll docker ps -a for up to 60s until the container name is actually gone, then run up. If the container is still present after the timeout, abort the deploy with a clear error instead of silently racing.

  • No change to deploy target (still 64.227.144.36 / /opt/cryptpad).
  • No change to secrets or auth (still DEPLOY_SSH_KEY).
  • No checks disabled; the new path adds a stricter fail-closed guard.

Note on verification

This workflow only runs on push to main, so it will not run against this PR. The fix will take effect on merge. Behavior under the failure mode is exercised by the new wait loop on the first deploy after merge. If the stuck container persists beyond 60s, the deploy will now fail fast with a clear error surfacing the daemon state (docker ps -a --filter name=cryptpad-cryptpad-1), rather than a confusing conflict from up.

Test plan

  • Merge to main and observe the next production deploy run.
  • Confirm the "Waiting for cryptpad-cryptpad-1 to be removed" lines appear if there's a stuck container, and that up succeeds after the wait.
  • If the timeout triggers instead, investigate the Docker daemon state on the droplet (not a regression — this surfaces pre-existing breakage more clearly than before).

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

Summary by CodeRabbit

  • Chores
    • Improved deployment reliability by adding a verification step that waits for and confirms container removal before bringing services back up; deployment aborts with diagnostics if removal fails.
    • Added configuration to exclude vendored third-party and special-purpose files from repository scanning to reduce false positives.

The production deploy has failed 8+ times in a row with either
"removal of container ... is already in progress" or
"container name /cryptpad-cryptpad-1 is already in use".

Root cause: `docker compose down` and `docker rm -f` both return
before the Docker daemon finishes the async container removal.
The immediately-following `docker compose up -d` then races with
that in-flight removal and fails.

Fix: after the down/rm step, poll `docker ps -a` for up to 60s
until the container name is actually gone, then run `up`. If the
container is still present after the timeout, abort the deploy
with a clear error instead of masking the problem.

No change to deploy target, secrets, or auth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

AI Code Review\n\nReview unavailable.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 22, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 09d23330-fed8-40f0-8762-345056277068

📥 Commits

Reviewing files that changed from the base of the PR and between fb67604 and a99a48a.

📒 Files selected for processing (1)
  • .wokeignore
✅ Files skipped from review due to trivial changes (1)
  • .wokeignore

📝 Walkthrough

Walkthrough

Added a polling/validation step to the SSH deploy workflow that waits up to 60 seconds for cryptpad-cryptpad-1 to be removed before running docker compose up -d. Also added a .wokeignore file listing scanner ignore patterns for vendored and upstream files.

Changes

Cohort / File(s) Summary
CI/CD Deployment Validation
.github/workflows/deploy.yml
Inserted a post-removal wait-and-check loop that polls docker ps -a for cryptpad-cryptpad-1 (30 attempts, 2s sleep). If the container remains after timeout, logs matching docker ps output and exits 1; otherwise continues to docker compose up -d.
Static Analysis Ignore Rules
.wokeignore
Added scanner ignore patterns excluding vendored third-party directories under www/lib/, specific scripts, upstream CryptPad source paths (lib/, scripts/tests/, src/common/), CHANGELOG.md, and .claude/.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and accurately summarizes the main fix: adding a wait-and-check mechanism for container removal before running docker compose up, which is the core change in the deploy workflow.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/prod-deploy-container-removal-race

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/deploy.yml:
- Around line 48-58: The current loop and guard use a piped command like `docker
ps -a --format '{{.Names}}' | grep -Fxq 'cryptpad-cryptpad-1'` which can treat a
failing `docker ps` as "not found"; change both uses to first run `docker ps -a
--format '{{.Names}}'` and capture its exit status and output into a variable,
immediately fail the script if `docker ps` returns non-zero, and only then run
`grep -Fxq 'cryptpad-cryptpad-1'` against the captured output; update the
waiting loop condition (the `while`/`break` logic) and the final guard `if` that
checks presence of `cryptpad-cryptpad-1` to use the captured output variable and
the docker exit check so daemon/query failures are detected and the job aborts
rather than being misinterpreted as "container removed".
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: eadbc37c-e03e-4ecd-a79d-c5265237061c

📥 Commits

Reviewing files that changed from the base of the PR and between 8554cf7 and fb67604.

📒 Files selected for processing (1)
  • .github/workflows/deploy.yml

Comment on lines +48 to +58
if ! docker ps -a --format '{{.Names}}' | grep -Fxq 'cryptpad-cryptpad-1'; then
break
fi
echo "[$(date)] Waiting for cryptpad-cryptpad-1 to be removed ($i/30)..."
sleep 2
done
if docker ps -a --format '{{.Names}}' | grep -Fxq 'cryptpad-cryptpad-1'; then
echo "[$(date)] ERROR: cryptpad-cryptpad-1 still present after 60s; aborting deploy." >&2
docker ps -a --filter name=cryptpad-cryptpad-1 >&2
exit 1
fi
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Reproducing conditional behavior similar to Line 48/54:"
mock_docker_ps() { return 125; } # simulate daemon/query error

if ! mock_docker_ps | grep -Fxq 'cryptpad-cryptpad-1'; then
  echo "Observed: condition evaluates as absent even though query failed (buggy behavior)."
else
  echo "Unexpected."
fi

Repository: Open-Paws/cryptpad-project-management

Length of output: 219


🏁 Script executed:

fd -type f -name "deploy.yml" | head -5

Repository: Open-Paws/cryptpad-project-management

Length of output: 251


🏁 Script executed:

fd -t f "deploy.yml"

Repository: Open-Paws/cryptpad-project-management

Length of output: 63


🏁 Script executed:

git ls-files ".github/workflows/" | head -20

Repository: Open-Paws/cryptpad-project-management

Length of output: 314


🏁 Script executed:

wc -l .github/workflows/deploy.yml

Repository: Open-Paws/cryptpad-project-management

Length of output: 112


🏁 Script executed:

cat -n .github/workflows/deploy.yml

Repository: Open-Paws/cryptpad-project-management

Length of output: 2623


docker ps failures can be misclassified as "container removed".

At Line 48 and Line 54, the pipeline can pass on daemon/query failure (not just true absence), which may allow docker compose up to run despite an unhealthy Docker state. If docker ps fails, grep receives empty input and returns non-match, which ! converts to success in the loop condition or masks the failure in the guard condition.

Make the docker ps call explicit and fail immediately if it errors by capturing its output separately before grepping:

Proposed fix
+            container_name='cryptpad-cryptpad-1'
             for i in $(seq 1 30); do
-              if ! docker ps -a --format '{{.Names}}' | grep -Fxq 'cryptpad-cryptpad-1'; then
+              if ! names="$(docker ps -a --format '{{.Names}}')"; then
+                echo "[$(date)] ERROR: failed to query Docker daemon while waiting for ${container_name} removal." >&2
+                exit 1
+              fi
+              if ! grep -Fxq "${container_name}" <<<"${names}"; then
                 break
               fi
-              echo "[$(date)] Waiting for cryptpad-cryptpad-1 to be removed ($i/30)..."
+              echo "[$(date)] Waiting for ${container_name} to be removed ($i/30)..."
               sleep 2
             done
-            if docker ps -a --format '{{.Names}}' | grep -Fxq 'cryptpad-cryptpad-1'; then
-              echo "[$(date)] ERROR: cryptpad-cryptpad-1 still present after 60s; aborting deploy." >&2
-              docker ps -a --filter name=cryptpad-cryptpad-1 >&2
+            if ! names="$(docker ps -a --format '{{.Names}}')"; then
+              echo "[$(date)] ERROR: failed to query Docker daemon before final container presence check." >&2
+              exit 1
+            fi
+            if grep -Fxq "${container_name}" <<<"${names}"; then
+              echo "[$(date)] ERROR: ${container_name} still present after 60s; aborting deploy." >&2
+              docker ps -a --filter "name=^/${container_name}$" >&2
               exit 1
             fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/deploy.yml around lines 48 - 58, The current loop and
guard use a piped command like `docker ps -a --format '{{.Names}}' | grep -Fxq
'cryptpad-cryptpad-1'` which can treat a failing `docker ps` as "not found";
change both uses to first run `docker ps -a --format '{{.Names}}'` and capture
its exit status and output into a variable, immediately fail the script if
`docker ps` returns non-zero, and only then run `grep -Fxq
'cryptpad-cryptpad-1'` against the captured output; update the waiting loop
condition (the `while`/`break` logic) and the final guard `if` that checks
presence of `cryptpad-cryptpad-1` to use the captured output variable and the
docker exit check so daemon/query failures are detected and the job aborts
rather than being misinterpreted as "container removed".

The `check` workflow runs woke on the whole tree and flags findings in
vendored third-party libraries (mermaid, pdf.js, asciidoctor, etc.) and
upstream CryptPad source files. Those are not owned in this fork and
rewording them would drift from upstream. Add a `.wokeignore` that the
action honours to scope the scan to code actually authored here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

AI Code Review\n\nReview unavailable.

@samtuckerdavis samtuckerdavis merged commit b449606 into main Apr 22, 2026
7 checks passed
@samtuckerdavis samtuckerdavis deleted the fix/prod-deploy-container-removal-race branch April 22, 2026 16:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant