fix(ci): wait for container removal before docker compose up#104
fix(ci): wait for container removal before docker compose up#104samtuckerdavis merged 2 commits intomainfrom
Conversation
The production deploy has failed 8+ times in a row with either "removal of container ... is already in progress" or "container name /cryptpad-cryptpad-1 is already in use". Root cause: `docker compose down` and `docker rm -f` both return before the Docker daemon finishes the async container removal. The immediately-following `docker compose up -d` then races with that in-flight removal and fails. Fix: after the down/rm step, poll `docker ps -a` for up to 60s until the container name is actually gone, then run `up`. If the container is still present after the timeout, abort the deploy with a clear error instead of masking the problem. No change to deploy target, secrets, or auth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AI Code Review\n\nReview unavailable. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughAdded a polling/validation step to the SSH deploy workflow that waits up to 60 seconds for Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/deploy.yml:
- Around line 48-58: The current loop and guard use a piped command like `docker
ps -a --format '{{.Names}}' | grep -Fxq 'cryptpad-cryptpad-1'` which can treat a
failing `docker ps` as "not found"; change both uses to first run `docker ps -a
--format '{{.Names}}'` and capture its exit status and output into a variable,
immediately fail the script if `docker ps` returns non-zero, and only then run
`grep -Fxq 'cryptpad-cryptpad-1'` against the captured output; update the
waiting loop condition (the `while`/`break` logic) and the final guard `if` that
checks presence of `cryptpad-cryptpad-1` to use the captured output variable and
the docker exit check so daemon/query failures are detected and the job aborts
rather than being misinterpreted as "container removed".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: eadbc37c-e03e-4ecd-a79d-c5265237061c
📒 Files selected for processing (1)
.github/workflows/deploy.yml
| if ! docker ps -a --format '{{.Names}}' | grep -Fxq 'cryptpad-cryptpad-1'; then | ||
| break | ||
| fi | ||
| echo "[$(date)] Waiting for cryptpad-cryptpad-1 to be removed ($i/30)..." | ||
| sleep 2 | ||
| done | ||
| if docker ps -a --format '{{.Names}}' | grep -Fxq 'cryptpad-cryptpad-1'; then | ||
| echo "[$(date)] ERROR: cryptpad-cryptpad-1 still present after 60s; aborting deploy." >&2 | ||
| docker ps -a --filter name=cryptpad-cryptpad-1 >&2 | ||
| exit 1 | ||
| fi |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Reproducing conditional behavior similar to Line 48/54:"
mock_docker_ps() { return 125; } # simulate daemon/query error
if ! mock_docker_ps | grep -Fxq 'cryptpad-cryptpad-1'; then
echo "Observed: condition evaluates as absent even though query failed (buggy behavior)."
else
echo "Unexpected."
fiRepository: Open-Paws/cryptpad-project-management
Length of output: 219
🏁 Script executed:
fd -type f -name "deploy.yml" | head -5Repository: Open-Paws/cryptpad-project-management
Length of output: 251
🏁 Script executed:
fd -t f "deploy.yml"Repository: Open-Paws/cryptpad-project-management
Length of output: 63
🏁 Script executed:
git ls-files ".github/workflows/" | head -20Repository: Open-Paws/cryptpad-project-management
Length of output: 314
🏁 Script executed:
wc -l .github/workflows/deploy.ymlRepository: Open-Paws/cryptpad-project-management
Length of output: 112
🏁 Script executed:
cat -n .github/workflows/deploy.ymlRepository: Open-Paws/cryptpad-project-management
Length of output: 2623
docker ps failures can be misclassified as "container removed".
At Line 48 and Line 54, the pipeline can pass on daemon/query failure (not just true absence), which may allow docker compose up to run despite an unhealthy Docker state. If docker ps fails, grep receives empty input and returns non-match, which ! converts to success in the loop condition or masks the failure in the guard condition.
Make the docker ps call explicit and fail immediately if it errors by capturing its output separately before grepping:
Proposed fix
+ container_name='cryptpad-cryptpad-1'
for i in $(seq 1 30); do
- if ! docker ps -a --format '{{.Names}}' | grep -Fxq 'cryptpad-cryptpad-1'; then
+ if ! names="$(docker ps -a --format '{{.Names}}')"; then
+ echo "[$(date)] ERROR: failed to query Docker daemon while waiting for ${container_name} removal." >&2
+ exit 1
+ fi
+ if ! grep -Fxq "${container_name}" <<<"${names}"; then
break
fi
- echo "[$(date)] Waiting for cryptpad-cryptpad-1 to be removed ($i/30)..."
+ echo "[$(date)] Waiting for ${container_name} to be removed ($i/30)..."
sleep 2
done
- if docker ps -a --format '{{.Names}}' | grep -Fxq 'cryptpad-cryptpad-1'; then
- echo "[$(date)] ERROR: cryptpad-cryptpad-1 still present after 60s; aborting deploy." >&2
- docker ps -a --filter name=cryptpad-cryptpad-1 >&2
+ if ! names="$(docker ps -a --format '{{.Names}}')"; then
+ echo "[$(date)] ERROR: failed to query Docker daemon before final container presence check." >&2
+ exit 1
+ fi
+ if grep -Fxq "${container_name}" <<<"${names}"; then
+ echo "[$(date)] ERROR: ${container_name} still present after 60s; aborting deploy." >&2
+ docker ps -a --filter "name=^/${container_name}$" >&2
exit 1
fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/deploy.yml around lines 48 - 58, The current loop and
guard use a piped command like `docker ps -a --format '{{.Names}}' | grep -Fxq
'cryptpad-cryptpad-1'` which can treat a failing `docker ps` as "not found";
change both uses to first run `docker ps -a --format '{{.Names}}'` and capture
its exit status and output into a variable, immediately fail the script if
`docker ps` returns non-zero, and only then run `grep -Fxq
'cryptpad-cryptpad-1'` against the captured output; update the waiting loop
condition (the `while`/`break` logic) and the final guard `if` that checks
presence of `cryptpad-cryptpad-1` to use the captured output variable and the
docker exit check so daemon/query failures are detected and the job aborts
rather than being misinterpreted as "container removed".
The `check` workflow runs woke on the whole tree and flags findings in vendored third-party libraries (mermaid, pdf.js, asciidoctor, etc.) and upstream CryptPad source files. Those are not owned in this fork and rewording them would drift from upstream. Add a `.wokeignore` that the action honours to scope the scan to code actually authored here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AI Code Review\n\nReview unavailable. |
Summary
mainwith a Docker container-name conflict.docker compose down/docker rm -fsodocker compose up -ddoesn't race with an in-flight container removal.Root cause
Every recent failure (runs 24698290423, 24698091488, 24698090290, ...) dies with one of:
Both
docker compose down --timeout 30anddocker rm -freturn to the shell as soon as the daemon accepts the request; actual removal of the filesystem/network resources can still be in progress. The immediately-followingdocker compose up -dthen collides with its own in-flight teardown. The existing|| trueand2>/dev/null || truemasked the cleanup errors but not the subsequentupfailure.Fix
After the existing
down+rm -f, polldocker ps -afor up to 60s until the container name is actually gone, then runup. If the container is still present after the timeout, abort the deploy with a clear error instead of silently racing.64.227.144.36//opt/cryptpad).DEPLOY_SSH_KEY).Note on verification
This workflow only runs on
pushtomain, so it will not run against this PR. The fix will take effect on merge. Behavior under the failure mode is exercised by the new wait loop on the first deploy after merge. If the stuck container persists beyond 60s, the deploy will now fail fast with a clear error surfacing the daemon state (docker ps -a --filter name=cryptpad-cryptpad-1), rather than a confusing conflict fromup.Test plan
mainand observe the next production deploy run.upsucceeds after the wait.Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com
Summary by CodeRabbit