claude: add generic launch-devnet skill #21024

Merged
taratorio merged 3 commits into main from worktree-generic-devnet-launch-skill
May 8, 2026

@taratorio
Member

Summary

Replaces the devnet-specific launch-bal-devnet-3 Claude skill with two reusable skills:

  • launch-devnet — generic launcher for any ethpandaops devnet. Takes only a landing-page URL (e.g. https://bal-devnet-3.ethpandaops.io) and discovers everything else at runtime: chain id and fork schedule from el/genesis.json, CL fork epochs from cl/config.yaml, EL/CL bootnodes and client image tags from the inventory API, public RPC/beacon/checkpoint URLs from the host convention. Auto-detects port conflicts and bumps the offset, generates start-erigon.sh / start-cl.sh / stop.sh / clean.sh, starts erigon → waits for jwt.hex → starts the CL, then monitors EL head vs network head, peer counts, and CL sync status.
  • bal-devnet-ab-test — slim wrapper that reuses launch-devnet for the primary instance and adds a second instance with IGNORE_BAL=true on a +400 port offset for head-to-head throughput comparison (gas/s, repeat%, abort, invalid).
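
The port-conflict handling described above could look roughly like the following sketch. The base ports, the step of 100, the retry limit, and all function names are assumptions made for illustration; the PR text only says that conflicts are detected and the offset is bumped.

```bash
#!/usr/bin/env bash
# Illustrative only: base ports, the step of 100, the retry limit, and all
# function names are assumptions for this sketch, not values from the skill.
BASE_PORTS="8545 8551 30303"

# Returns 0 if something is listening on TCP port $1.
# Caveat: if lsof is not installed, every port is reported as free.
port_in_use() {
  lsof -nP -iTCP:"$1" -sTCP:LISTEN >/dev/null 2>&1
}

# Returns 0 if every base port shifted by offset $1 is free.
offset_is_free() {
  local offset=$1 p
  for p in $BASE_PORTS; do
    if port_in_use $((p + offset)); then
      return 1
    fi
  done
  return 0
}

# Prints the first free offset, starting at $1 (default 100) and bumping by
# 100 each time, giving up after 10 attempts.
pick_offset() {
  local offset=${1:-100} tries=0
  while [ "$tries" -lt 10 ]; do
    if offset_is_free "$offset"; then
      echo "$offset"
      return 0
    fi
    offset=$((offset + 100))
    tries=$((tries + 1))
  done
  echo "no free port offset found" >&2
  return 1
}
```

Calling `pick_offset 100` prints the offset to bake into the generated scripts. Keeping `port_in_use` as a separate function makes it easy to swap in a different probe on platforms where `lsof` is unavailable.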

The old launch-bal-devnet-3 skill is deleted; nothing else in the repo referenced it.

Key design point: failure investigation

launch-devnet includes an explicit "finding the absolute truth" section. The default assumption is not that erigon is wrong — on a multi-client devnet, erigon may be spec-correct while another client is buggy, the spec itself may be ambiguous (clients split into factions), or the network/genesis may be broken. The skill instructs the agent to:

  1. Cross-check any divergence against ≥2 non-erigon ELs from the inventory.
  2. Drill down to a specific block/slot/account/storage diff.
  3. Treat the EIP text as authoritative over what other clients do.
  4. Surface findings to the user with concrete data (specific block, account, EIP quote) rather than "please advise" — only after the diff is reproducible across a restart and at least one independent client supports the alternative result.

A "common false-positive signals" list (optimistic head, first-newPayload timeouts, transient peers: 0) keeps the agent from escalating noise.
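
The cross-check rule can be expressed as a small decision function. This is a minimal sketch, assuming the block hashes for one height have already been fetched from erigon and from the reference ELs (for example with `eth_getBlockByNumber` against each RPC endpoint); the function name and the output labels are hypothetical.

```bash
#!/usr/bin/env bash
# classify_divergence OURS REF1 [REF2 ...]
# OURS is erigon's block hash at some height; REFn are the hashes reported by
# non-erigon ELs for the same height. Name and labels are hypothetical.
classify_divergence() {
  local ours=$1; shift
  # Fewer than two independent references: not enough evidence either way.
  if [ "$#" -lt 2 ]; then
    echo "inconclusive"
    return 0
  fi
  local first=$1 ref matches_ours=0 matches_first=0
  for ref in "$@"; do
    if [ "$ref" = "$ours" ]; then matches_ours=$((matches_ours + 1)); fi
    if [ "$ref" = "$first" ]; then matches_first=$((matches_first + 1)); fi
  done
  if [ "$matches_ours" -eq "$#" ]; then
    echo "consistent"        # every reference agrees with erigon
  elif [ "$matches_ours" -eq 0 ] && [ "$matches_first" -eq "$#" ]; then
    echo "erigon-differs"    # references agree with each other, not with erigon
  else
    echo "clients-split"     # references disagree among themselves
  fi
}
```

Neither `erigon-differs` nor `clients-split` says which side is right. Per the list above, that is decided by drilling down to a concrete diff and reading the EIP text.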

Files

+ .claude/skills/launch-devnet/SKILL.md          # new — generic launcher
+ .claude/skills/bal-devnet-ab-test/SKILL.md     # new — BAL A/B testing wrapper
- .claude/skills/launch-bal-devnet-3/SKILL.md    # removed — superseded

Test plan

  • Invoke /launch-devnet https://bal-devnet-3.ethpandaops.io and confirm it discovers chain id 7098917910, the Amsterdam fork timestamp, and ≥10 EL/CL bootnodes from the inventory.
  • Confirm port-conflict detection bumps the offset when +100 ports are already bound.
  • Confirm erigon syncs past genesis on bal-devnet-3 with the generated scripts (no hardcoded values).
  • Invoke /launch-devnet against a different devnet (e.g. fusaka-devnet-N) and confirm the same flow works without code changes.
  • Invoke /bal-devnet-ab-test after a successful /launch-devnet run and confirm Instance B starts on +400 ports with IGNORE_BAL=true exported.
  • Trigger a synthetic state-root divergence and confirm the skill cross-checks against ≥2 non-erigon ELs before reporting a root cause.

Contributor

Copilot AI left a comment

Pull request overview

This PR replaces the devnet-specific Claude skill for bal-devnet-3 with a reusable, generic launch-devnet skill and a thin bal-devnet-ab-test wrapper to run BAL vs no-BAL throughput comparisons on BAL devnets.

Changes:

  • Added launch-devnet skill that discovers devnet configuration at runtime (genesis/config/inventory), generates start/stop/clean scripts, and provides monitoring + cross-client failure investigation guidance.
  • Added bal-devnet-ab-test skill that reuses launch-devnet and spins up a second instance with IGNORE_BAL=true on a separate port offset.
  • Removed the legacy launch-bal-devnet-3 devnet-specific skill.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| .claude/skills/launch-devnet/SKILL.md | New generic devnet launcher skill with runtime discovery, script generation, monitoring, and investigation workflow. |
| .claude/skills/bal-devnet-ab-test/SKILL.md | New wrapper skill for running a second “no-BAL” instance to compare throughput metrics. |
| .claude/skills/launch-bal-devnet-3/SKILL.md | Removed devnet-specific launcher skill (superseded by the generic launcher + wrapper). |

Comment thread .claude/skills/launch-devnet/SKILL.md Outdated
Comment on lines +43 to +50
mkdir -p $WORKDIR/testnet-config
CFG=https://config.${DEVNET}.ethpandaops.io
curl -fsSL ${CFG}/el/genesis.json -o $WORKDIR/genesis.json
curl -fsSL ${CFG}/cl/config.yaml -o $WORKDIR/testnet-config/config.yaml
curl -fsSL ${CFG}/cl/genesis.ssz -o $WORKDIR/testnet-config/genesis.ssz
curl -fsSL ${CFG}/api/v1/nodes/inventory -o $WORKDIR/inventory.json
echo 0 > $WORKDIR/testnet-config/deploy_block.txt
echo 0 > $WORKDIR/testnet-config/deposit_contract_block.txt
Comment thread .claude/skills/launch-devnet/SKILL.md Outdated

```bash
for p in <each chosen port>; do
  ss -tlnp 2>/dev/null | awk -v P=":$p" '$4 ~ P {print P}'
```
Comment thread .claude/skills/launch-devnet/SKILL.md Outdated

```bash
JWT=$WORKDIR/erigon-data/jwt.hex
test -f "$JWT" # MUST exist; start-erigon.sh creates it
```
Comment thread .claude/skills/launch-devnet/SKILL.md Outdated
--execution-jwt=/jwt.hex \
--boot-nodes="<comma-joined ENRs>" \
--port=<CL P2P> --quic-port=<CL QUIC> \
--http --http-port=<CL HTTP> --http-address=0.0.0.0 \
Comment thread .claude/skills/launch-devnet/SKILL.md Outdated
Comment on lines +216 to +223
`stop.sh`:
- `docker stop <DEVNET>-cl` (and `docker rm`)
- `pkill -f "datadir.*$WORKDIR/erigon-data"`

`clean.sh`:
- runs `stop.sh`
- removes `erigon-data/{chaindata,snapshots,txpool,nodes,temp}` and `cl-data/*`
- re-runs `erigon init`
Comment on lines +92 to +96
docker stop "<devnet>-nobal-cl" 2>/dev/null
docker rm "<devnet>-nobal-cl" 2>/dev/null
pkill -f "datadir.*${WORKDIR}-nobal/erigon-data"
# Optional — wipe disk
rm -rf "${WORKDIR}-nobal"
Member

@yperbasis yperbasis left a comment

Significant issues

  1. ss -tlnp is Linux-only. Both launch-devnet/SKILL.md (Step 4) and bal-devnet-ab-test/SKILL.md use ss for port-conflict detection. The user's primary platform is darwin, where ss
    doesn't exist (which ss → not found). The sibling erigon-ephemeral skill correctly uses lsof -nP -iTCP:<port> -sTCP:LISTEN. CLAUDE.md asks for cross-platform shell. The old skill had
    the same bug (it's only mentioned in troubleshooting), so this isn't a regression, but since you're rewriting the file anyway, this is the moment to fix it. Suggested:
        for p in <ports>; do
          lsof -nP -iTCP:$p -sTCP:LISTEN 2>/dev/null | grep -q LISTEN && echo "$p in use"
        done
    Same for the verification step that says ss -tlnp | grep <port>.
  2. bal-devnet-ab-test Step "Set up Instance B" tells the agent to "Pick a second port offset that doesn't collide with Instance A. If Instance A uses +100, Instance B should use +400"
    — but it never tells the agent how to learn what offset A used. launch-devnet records the chosen offset in devnet-info.txt, but the wrapper doesn't say "read it from there." Easy
    fix: add OFFSET_A=$(grep '^port_offset:' "$WORKDIR/devnet-info.txt" | awk '{print $2}') (or however you decide to format the field) and derive B's offset from that.
  3. The jwt.hex/authrpc poll is only a comment. Step 8:
        cd $WORKDIR && nohup bash start-erigon.sh > erigon-console.log 2>&1 &
        # Poll until both conditions are true (timeout ~60s):
        #   - $WORKDIR/erigon-data/jwt.hex exists
        #   - the authrpc port is bound by the erigon PID
        cd $WORKDIR && nohup bash start-cl.sh > cl-console.log 2>&1 &
    An agent following this verbatim will fire both lines back-to-back and the CL will fail JWT auth on the first newPayload. Make it an actual loop or call it out as a separate step
    the agent must execute, not a comment.
  4. docker pull always runs. The old skill checked docker image inspect first and only pulled on miss. Pull is idempotent so it's fast on cache hit, but pull against an unauthenticated
    network with intermittent connectivity can stall. Probably fine, but the previous behavior was nicer.
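
Issue 3 asks for the poll to be real code. A minimal sketch of such a loop follows, covering only the jwt.hex condition; checking that the authrpc port is bound by the erigon PID is left out here. The function name and the one-second poll interval are assumptions.

```bash
#!/usr/bin/env bash
# wait_for_file PATH [TIMEOUT_SECONDS]
# Polls once per second until PATH exists and is non-empty.
# Returns 0 on success and 1 on timeout. Name and interval are assumptions.
wait_for_file() {
  local path=$1 timeout=${2:-60} waited=0
  while [ ! -s "$path" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for $path" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Intended placement between the two launch commands quoted above:
#   cd "$WORKDIR" && nohup bash start-erigon.sh > erigon-console.log 2>&1 &
#   wait_for_file "$WORKDIR/erigon-data/jwt.hex" 60 || exit 1
#   cd "$WORKDIR" && nohup bash start-cl.sh > cl-console.log 2>&1 &
```

The `|| exit 1` is the abort guard: if erigon never writes the secret, the CL is not started at all instead of failing JWT auth later.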

Nits

  1. <DEVNET> vs <devnet> placeholder casing inconsistency. Step 7 uses <DEVNET>-cl (uppercase placeholder), the A/B skill uses <devnet>-nobal-cl (lowercase). Step 1 explicitly says
    DEVNET=bal-devnet-3 (lowercase value). Pick one; they're the same variable. <devnet> matches reality.
  2. Step 7 sells "any CL client" but the actual recipe is lighthouse-only. The "for non-lighthouse CLs, look up the client's CLI flags" hedge is honest, but the description says "a CL
    client" rather than "Lighthouse (with hooks for other CLs)". Either narrow the description or add a one-line table mapping client → command name for the common ones.
  3. rm -rf cl-data/* in clean.sh loses dotfiles. Use rm -rf cl-data && mkdir cl-data for cleaner intent.
  4. Checkpoint-sync URL is assumed to exist (https://checkpoint-sync.${DEVNET}.ethpandaops.io) — fresh devnets sometimes don't have one yet. A curl -fsI probe before baking it into the
    script would let the skill fall back to genesis sync gracefully. Not a blocker — lighthouse will just emit warnings if the URL 404s.
  5. python3 -c '...' for JSON parsing in Step 9 — jq is already required for inventory parsing in Step 2. Pick one for consistency.
  6. devnet-info.txt would be more useful as a key/value file (or JSON) than free-form prose, since the A/B wrapper needs to read fields from it programmatically (see issue #2).
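
Nit 3 is easy to reproduce in a scratch directory. The sketch below uses throwaway paths and made-up file names purely to show the difference between the two cleanup forms.

```bash
#!/usr/bin/env bash
# Scratch-directory demonstration; file names here are made up.
demo=$(mktemp -d)
mkdir "$demo/cl-data"
touch "$demo/cl-data/beacon.db" "$demo/cl-data/.lock"

# With default shell options the glob does not expand to names that start
# with a dot, so .lock is never passed to rm.
rm -rf "$demo/cl-data"/*
survivors_after_glob=$(ls -A "$demo/cl-data")

# Removing and recreating the directory leaves nothing behind.
rm -rf "$demo/cl-data" && mkdir "$demo/cl-data"
survivors_after_recreate=$(ls -A "$demo/cl-data")

echo "after glob:     '$survivors_after_glob'"
echo "after recreate: '$survivors_after_recreate'"
rm -rf "$demo"
```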

Suggested commit before merge

Fix issues 1–3 (cross-platform port check, A/B offset discovery, jwt.hex poll as a real step). Everything else can land as a follow-up or be ignored.

- Switch port-conflict detection from ss (Linux-only) to lsof (cross-platform)
  and check both TCP and UDP families. Matches the pattern in erigon-ephemeral.
- Quote $WORKDIR/$CFG in download/init steps so paths with spaces work.
- Make Step 8's jwt.hex+authrpc check a real polling loop with timeouts and
  abort guards instead of a comment, so the CL doesn't race ahead and fail
  JWT auth on the first newPayload.
- Promote test -f "$JWT" to a hard abort guard ([ -f ] || { echo; exit 1; }).
- Bind Lighthouse beacon API to 127.0.0.1 by default; document widening to
  0.0.0.0 only when remote access is needed.
- Capture erigon's PID at start time and kill it by PID in stop.sh instead
  of pkill -f, which can match unrelated processes when $WORKDIR contains
  regex metacharacters. Skill now requires start-erigon.sh to end with `exec`
  so the captured PID is the erigon PID.
- Replace `rm -rf cl-data/*` with `rm -rf cl-data && mkdir cl-data` so
  dotfiles don't survive the glob.
- Probe the checkpoint-sync URL before baking it into start-cl.sh; fall back
  to genesis sync if the endpoint isn't provisioned yet.
- Skip docker pull if the image is already cached.
- Switch monitoring from python3 to jq for consistency with the rest of the
  skill, which already requires jq.
- Rewrite devnet-info.txt as a key/value file so wrappers like
  bal-devnet-ab-test can read fields (notably port_offset) programmatically
  instead of guessing.
- bal-devnet-ab-test: read OFFSET_A from devnet-info.txt and derive
  OFFSET_B = OFFSET_A + 300 instead of assuming A used +100.
- bal-devnet-ab-test: add sanity guards (NOBAL_DIR ends with -nobal, marker
  file present) before rm -rf, and stop Instance B by PID.
- Fix <DEVNET>/<devnet> placeholder casing inconsistency by switching to
  the actual ${DEVNET} bash expansion.
- Clarify that the Step 7 recipe is Lighthouse-specific and add a small
  client→subcommand mapping table for non-lighthouse CLs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@taratorio
Member Author

Thanks for the careful review @yperbasis @copilot-pull-request-reviewer. I went through every comment and addressed all of them in f189e46. None warranted pushback: all were either correct (e.g. ss is Linux-only and does not exist on darwin), exposed a real silent failure (test -f as a no-op guard, jwt.hex poll-as-comment), or were cheap consistency wins.

yperbasis — significant issues

  1. ss -tlnp → lsof in both skills, checking TCP listeners and UDP separately. Matches the cross-platform pattern already in erigon-ephemeral/SKILL.md.
  2. A/B offset discovery. devnet-info.txt is now a key: value file (one of which is port_offset). The wrapper reads it via awk and derives OFFSET_B = OFFSET_A + 300 instead of guessing.
  3. jwt.hex / authrpc poll is now a real loop with a 60s timeout and abort guards — the same loop is reused in the A/B wrapper for Instance B.
  4. docker pull is gated behind docker image inspect, so cached images skip the network call.
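
The offset discovery in item 2 might be sketched as below. Only the port_offset key and the `key: value` shape come from this thread; the function names are hypothetical.

```bash
#!/usr/bin/env bash
# read_info FILE KEY
# Prints the value of the first "KEY: value" line in FILE. Matching on the
# literal "KEY: " prefix keeps values that contain colons (e.g. URLs) intact.
read_info() {
  awk -v key="$2" 'index($0, key ": ") == 1 { print substr($0, length(key) + 3); exit }' "$1"
}

# derive_offset_b FILE
# Prints Instance A's port offset plus 300. Fails if the field is missing or
# is not a plain non-negative integer.
derive_offset_b() {
  local offset_a
  offset_a=$(read_info "$1" port_offset)
  case "$offset_a" in
    ''|*[!0-9]*)
      echo "port_offset missing or invalid in $1" >&2
      return 1
      ;;
  esac
  echo $((offset_a + 300))
}
```

With `port_offset: 100` in the file, `derive_offset_b "$WORKDIR/devnet-info.txt"` yields 400, which matches the +400 offset mentioned in the PR summary.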

yperbasis — nits

  1. Placeholder casing — switched <DEVNET> references to ${DEVNET} (actual bash expansion) for consistency between the two skills.
  2. Step 7 now explicitly flags the recipe as Lighthouse-specific and includes a client→subcommand mapping table for lodestar / prysm / teku / nimbus.
  3. clean.sh uses rm -rf <dir> && mkdir <dir> so dotfiles don't survive the wipe.
  4. Step 7 probes checkpoint-sync.<devnet>.ethpandaops.io with curl -fsI and omits the --checkpoint-sync-url flag (lighthouse falls back to genesis sync) if the endpoint isn't provisioned yet.
  5. Monitoring switched from python3 to jq (already required for inventory parsing).
  6. Folded into significant issue #2: devnet-info.txt is now key/value, parseable by other skills.
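
The checkpoint-sync probe from item 4 could be sketched like this. The function names and the 5-second timeout are assumptions; the URL convention, `curl -fsI`, and the `--checkpoint-sync-url` flag are taken from the thread.

```bash
#!/usr/bin/env bash
# probe_url URL
# Returns 0 if a HEAD request succeeds; curl -f turns HTTP errors (>= 400)
# into a non-zero exit. The 5-second cap is an assumption.
probe_url() {
  curl -fsI --max-time 5 "$1" >/dev/null 2>&1
}

# checkpoint_flag DEVNET
# Prints the Lighthouse flag when the endpoint answers and prints nothing
# otherwise, so the caller falls back to genesis sync.
checkpoint_flag() {
  local url="https://checkpoint-sync.$1.ethpandaops.io"
  if probe_url "$url"; then
    echo "--checkpoint-sync-url=$url"
  fi
}
```

A generated start-cl.sh could then splice `$(checkpoint_flag "$DEVNET")` into the Lighthouse command line; an empty result simply adds no flag.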

Copilot — inline comments

  • $WORKDIR / $CFG quoted in Step 1 and Step 5 (so paths with spaces work).
  • Port detection is exact now via lsof -iTCP:$p -sTCP:LISTEN (no awk substring matching) and covers UDP.
  • test -f "$JWT" is a hard abort: [ -f "$JWT" ] || { echo "JWT missing"; exit 1; }.
  • Lighthouse --http-address defaults to 127.0.0.1; the recipe documents widening to 0.0.0.0 only when remote access is needed.
  • pkill -f replaced with a PID-file approach in both skills. start-erigon.sh ends with exec ./build/bin/erigon … so $! captured by the parent is the erigon PID, and stop.sh kills that exact PID.
  • A/B cleanup gates rm -rf "${WORKDIR}-nobal" on three sanity checks: WORKDIR non-empty, NOBAL_DIR ends with -nobal, and the directory carries an ignore_bal: true marker that the A/B setup writes. If any check fails, the script refuses to wipe and prints a message.
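
The three sanity checks in the last bullet might look like the following sketch. The function name and the marker file name (devnet-info.txt inside the no-BAL directory) are assumptions; the checks themselves are the ones listed above.

```bash
#!/usr/bin/env bash
# safe_wipe_nobal WORKDIR NOBAL_DIR
# Removes NOBAL_DIR only if all three checks pass. The marker file name is an
# assumption for this sketch.
safe_wipe_nobal() {
  local workdir=$1 nobal_dir=$2
  # Check 1: WORKDIR must be non-empty.
  if [ -z "$workdir" ]; then
    echo "refusing to wipe: WORKDIR is empty" >&2
    return 1
  fi
  # Check 2: the target path must end with -nobal.
  case "$nobal_dir" in
    *-nobal) ;;
    *)
      echo "refusing to wipe: $nobal_dir does not end with -nobal" >&2
      return 1
      ;;
  esac
  # Check 3: the directory must carry the marker written by the A/B setup.
  if ! grep -qx 'ignore_bal: true' "$nobal_dir/devnet-info.txt" 2>/dev/null; then
    echo "refusing to wipe: no ignore_bal marker in $nobal_dir" >&2
    return 1
  fi
  rm -rf "$nobal_dir"
}
```

Typical use would be `safe_wipe_nobal "$WORKDIR" "${WORKDIR}-nobal"`. Each failed check prints its own reason, so a refusal is easy to diagnose.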

@taratorio taratorio added this pull request to the merge queue May 8, 2026
Merged via the queue into main with commit 074f0cd May 8, 2026
66 of 70 checks passed
@taratorio taratorio deleted the worktree-generic-devnet-launch-skill branch May 8, 2026 10:59