
Appliance workflow scripts, not core y-cluster features #27

Open

solsson wants to merge 16 commits into main from appliance-workflows


Conversation

@solsson (Contributor) commented May 12, 2026

To get bearable dev loops I had to keep hosting-specific scripts in this repo.

Uses #19 and #21 to set up actual clusters, given an external "init" (the script that applies workloads) and "verify" (a smoke test run before export).

Yolean k8s-qa and others added 16 commits May 12, 2026 13:57
Brings every scripts/ change from agents/appliance-export-import
to upstream main as a single bump. The Go and testdata side of
this work landed in #19 (appliance-primitives); this commit is
the operator-facing bash that drives the released binary
through the appliance lifecycle, plus the .env-style config
the scripts source.

What's new vs main:

  - appliance-build-hetzner.sh / appliance-build-virtualbox.sh:
    interactive build flows producing a .qcow2 + a VirtualBox-
    importable .ova respectively, both via the released
    y-cluster binary's prepare-export and export subcommands.
  - appliance-publish-hetzner.sh: pushes a built appliance to
    Hetzner Object Storage for handoff.
  - appliance-qemu-to-gcp.sh: end-to-end qemu -> GCP custom
    image flow (export --format=gcp-tar -> gsutil cp -> compute
    images create) with persistent /data/yolean disk preserved
    across redeploys, plus a teardown subcommand.
  - gcp-bootstrap-credentials.sh: one-shot bootstrap for the
    service account / project / key file the GCP flow needs.
  - e2e-appliance-export-import.sh: local qemu -> qemu round-
    trip exercising the full prepare-export / export / import
    cycle without any cloud cred dependency.
  - e2e-appliance-hetzner.{sh,pkr.hcl}: Packer-based snapshot
    flow; lays the snapshot down once, spins fresh servers on
    top to verify boot.
  - e2e-appliance-qemu-to-gcp.sh: non-interactive driver of
    appliance-qemu-to-gcp.sh end to end, including teardown.
  - .env.example + .gitignore: documents every overridable
    knob (GCP_PROJECT, GCP_KEY, H_S3_ENV_FILE, ENV_FILE) with
    a generic example path; .env stays out of git.

Configuration: required values are operator-supplied via env
vars (no built-in defaults). Each script derives REPO_ROOT
from BASH_SOURCE and sources $REPO_ROOT/.env via `set -o
allexport` when present, so the .env path works regardless of
CWD (including `cd /tmp && bash /path/to/script`). Missing
required values fail fast with a clear "set $VAR in .env or
shell env" message.

Scope: scripts/ + repo-root .env plumbing. The Go side is
already on main via #19. Both `go build ./...` and `go test
./...` are unchanged-clean on this branch -- the scripts add
no go.mod or testdata edits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Yolean dev / setup scripts that smoke-test the gateway expect a
host-side port that reaches guest 80. Today the qemu-side host port
forward defaults to 39080 in both the Go e2e helper and the bash
appliance-build scripts, so any consumer that hardcodes
http://localhost:80 has to remember the offset.

This host (and most modern Linux distros) ships
net.ipv4.ip_unprivileged_port_start=80, so qemu's user-mode
hostfwd inherits the ability to bind port 80 without root. Default
APP_HTTP_PORT and the e2e port-forward helper to 80 in lockstep:

  - e2e/qemu_test.go: e2eUniqueForwards now takes both apiPort
    and httpPort; every test passes its own pair (28443 / 28444 /
    ... vs 26443 / 26444 / ...) keyed off the apiPort so concurrent
    test runs on the same host don't collide. Each test always gets
    a guest-80 forward, matching what the appliance-build scripts
    install.
  - scripts/appliance-{qemu-to-gcp,build-hetzner,build-virtualbox}.sh
    + scripts/e2e-appliance-{export-import,qemu-to-gcp}.sh: the
    APP_HTTP_PORT default flips from 39080 to 80, with YHELP /
    inline curl examples updated to match. Override via env
    (APP_HTTP_PORT=39080 ./scripts/...) on hosts that keep port 80
    privileged.
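
For reference, the sysctl check behind that assumption; a hedged
sketch (script path illustrative):

    # Ports >= this value can be bound without root; 80 means the
    # guest-80 hostfwd works unprivileged.
    sysctl net.ipv4.ip_unprivileged_port_start

    # Hosts that keep port 80 privileged override per invocation:
    APP_HTTP_PORT=39080 ./scripts/appliance-build-virtualbox.sh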

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Appliance e2e / build flows install workloads, build a seed
tarball, prepare-export, and re-boot from the prepared disk -- the
cumulative footprint pushes the 20G default disk into pressure on
the kubelet's image-gc thresholds, which surfaces as flaky pod
evictions mid-test or mid-build.

Bump to 40G everywhere a 20G default sat:

  - e2e/qemu_test.go: e2eQEMURuntime overrides DiskSize to 40G so
    every qemu e2e test boots with the larger disk by default.
  - scripts/appliance-{qemu-to-gcp,build-hetzner,build-virtualbox}.sh
    + scripts/e2e-appliance-{export-import,qemu-to-gcp}.sh: the
    generated y-cluster-provision.yaml now sets diskSize: "40G".
  - scripts/appliance-qemu-to-gcp.sh: --boot-disk-size on
    `gcloud compute instances create` flips from 20GB to 40GB so
    the GCE VM doesn't reject the 40G custom image with "Requested
    disk size cannot be smaller than the image size".

qcow2 is sparse, so the host-disk footprint only grows with actual
usage; the larger virtual size is a no-cost ceiling. The GCE side
similarly uses a thin-provisioned persistent disk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The appliance-build / e2e scripts each carried a defaults block:

    APP_HTTP_PORT="${APP_HTTP_PORT:-80}"
    APP_API_PORT="${APP_API_PORT:-39443}"
    APP_SSH_PORT="${APP_SSH_PORT:-2229}"

then interpolated those into the heredoc'd y-cluster-provision.yaml.
These values restate y-cluster's own defaulting concept
(80/6443/2222 in pkg/provision/config); even the two that
DIFFERED (39443 vs 6443; 2229 vs 2222), chosen for collision
avoidance against an operator's regular y-cluster, were quiet
duplicates of that same concept.

Replace the heredoc with a brace block that emits each port field
ONLY when the env var is set. Net behaviour:

  - No env override   -> minimal YAML; y-cluster fills 2222 +
                         {6443:6443, 80:80, 443:443}.
  - APP_HTTP_PORT=N   -> only the host:N -> guest:80 entry lands;
                         API/SSH still y-cluster-default.
  - Multiple set      -> all set entries land; requireHostAPIPort
                         validates that a guest:6443 entry exists.
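
A minimal sketch of the brace-block shape, with illustrative YAML
keys (the real field names live in y-cluster's provision schema):

    # Emit port fields only when the operator set the env var; when
    # the whole block is absent, y-cluster applies its own defaults.
    {
      if [[ -n "${APP_SSH_PORT:-}" ]]; then
        echo "sshPort: ${APP_SSH_PORT}"
      fi
      if [[ -n "${APP_API_PORT:-}${APP_HTTP_PORT:-}" ]]; then
        echo "portForwards:"
        if [[ -n "${APP_API_PORT:-}" ]]; then echo "  - ${APP_API_PORT}:6443"; fi
        if [[ -n "${APP_HTTP_PORT:-}" ]]; then echo "  - ${APP_HTTP_PORT}:80"; fi
      fi
    } >> y-cluster-provision.yaml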

Display refs (banner curl examples, ssh commands, smoketest
probes) use ${APP_*_PORT:-NN} inline so the printed URL/SSH
command shows the right port whether overridden or default.
YHELP entries reworded from "(default: 80)" to
"(y-cluster default: 80)" so the operator sees who owns the
default.

IMP_HTTP_PORT / IMP_SSH_PORT in e2e-appliance-export-import.sh
left as-is (test-only; the import-side qemu is started directly,
no y-cluster CLI involvement, so y-cluster's defaults don't
apply).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Symmetric with APP_HTTP_PORT / APP_API_PORT: a new
APP_HTTPS_PORT env var lets operators override the host port
forwarded to guest 443. Unset means "let y-cluster apply its
default" -- the YAML still omits the field when no port var is
set, which matches the behaviour for the other ports.

Without this, an operator who overrode any one of {HTTP, API}
silently lost 443 forwarding: the YAML's portForwards block
became canonical and didn't include 443, whereas y-cluster's
[6443:6443, 80:80, 443:443] default applied only when the bash
emitted no portForwards at all.

The host:guest match keeps standard ports inside the appliance
unchanged; the host-side ip_unprivileged_port_start sysctl on
modern Linux distros allows binding 443 without root the same
way 80 already does.

YHELP entries updated to surface the new knob.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a post-deploy step that offers to stand up a GCP regional
External Application Load Balancer in front of the appliance VM
with a self-signed cert covering operator-supplied FQDNs.
Idempotent (describe-then-create) so re-runs converge; teardown
integrated into the existing teardown subcommand.

Why a self-signed cert and a prompt-not-default

The cert-manager → upload-real-cert path is the eventual
production shape, but for the dev loop a self-signed cert lets
the operator verify the LB stack + HTTPRoute hostname matching
without DNS / CA dependencies. The prompt (rather than a silent
default) exists because the LB is a billing meter (forwarding rule
billed roughly hourly, plus a reserved IP) that the operator should
deliberately accept; we don't want a forgotten ASSUME_YES run to
silently provision one.

Operator UX

  - Default: prompts after the HTTP probe with a one-paragraph
    explainer (cost, self-signed cert, HTTPRoute prerequisite),
    accepts comma-separated FQDNs, empty skips.
  - TLS_DOMAINS env var preset: skips the prompt and runs.
  - ASSUME_YES alone: skips silently (unattended e2e shouldn't
    surprise-bill).
  - Final banner prints the LB IP + a single /etc/hosts line
    covering all FQDNs, marks the cert SELF-SIGNED, points at
    the gcloud commands to swap in a real cert later.

Resources, all named ${NAME}-tls-*

  proxy-only subnet (reuses any ACTIVE one in the region;
                     creates per-build only when none exists)
  static regional IP
  SSL cert (uploaded, self-signed)
  HTTP health check on /q/envoy/echo
  zonal NEG with the VM as endpoint
  backend service (EXTERNAL_MANAGED) + add-backend
  URL map (default-service points at the backend)
  target HTTPS proxy
  forwarding rule on :443
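
Each resource follows the describe-then-create shape; a hedged
sketch for the static IP (GCP_REGION is an assumed variable name;
resource names per the ${NAME}-tls-* convention above):

    # Idempotent: re-runs converge instead of erroring "already exists"
    if ! gcloud compute addresses describe "${NAME}-tls-ip" \
        --region="$GCP_REGION" >/dev/null 2>&1; then
      gcloud compute addresses create "${NAME}-tls-ip" \
        --region="$GCP_REGION"
    fi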

Teardown

do_tls_teardown is invoked from the existing do_teardown so a
plain `appliance-qemu-to-gcp.sh teardown` cleans up the LB
stack alongside the VM/image/object/disk. Order forces the
forwarding rule first (stops the meter), then proxy / url-map /
backend / NEG / health-check / cert / IP. Subnet last and only
when it's the per-build one (we never delete a reused regional
subnet). Each delete is idempotent: missing resources are not
errors. The `Will DELETE:` inventory now lists `${NAME}-tls-*`
when a forwarding rule of that shape exists.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two related fixes for the GCP appliance smoke flow:

1. do_tls_frontend now creates a :80 forwarding rule that 301s
   to :443. Previously the function set up only a :443 listener,
   so any `curl http://<lb-ip>/...` against the LB IP hung at TCP
   connect (no listener on 80). Hangs from `curl ... http://...`
   were diagnosed against the live ext-app01-* LB stack which
   has the same shape.

   Mechanism: GCP regional EXTERNAL_MANAGED URL maps can either
   forward (defaultService) or redirect (defaultUrlRedirect),
   not both, so the redirect needs its own URL map. The chain:

       :80 fwd -> tls-http-proxy -> tls-redirect URL map (301 to https)
       :443 fwd -> tls-proxy      -> tls-urlmap (existing, ->backend)

   `gcloud compute url-maps create` has no flag for default-
   redirect, hence the `url-maps import` from a heredoc.
   Hostname-agnostic on both ports: every request, any Host:,
   either redirects (on :80) or forwards to the VM (on :443).
   The VM's envoy-gateway is the only Host-aware hop.

   do_tls_teardown grew matching delete calls in dependency order
   (forwarding rules -> proxies -> URL maps) so re-runs converge
   cleanly.
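
   A hedged sketch of that import (the YAML schema is GCP's UrlMap
   resource; GCP_REGION is an assumed variable name):

    # gcloud has no create-time flag for a default redirect, so
    # import the UrlMap resource as YAML
    gcloud compute url-maps import "${NAME}-tls-redirect" \
        --region="$GCP_REGION" --source=- <<EOF
    name: ${NAME}-tls-redirect
    defaultUrlRedirect:
      httpsRedirect: true
      redirectResponseCode: MOVED_PERMANENTLY_DEFAULT
    EOF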

2. The post-deploy probe at the end of the GCP stage now
   enumerates HTTPRoute + GRPCRoute hostnames via SSH +
   `sudo k3s kubectl ... -o jsonpath` and probes each FQDN
   through `--resolve <fqdn>:80:$PUBLIC_IP`. Replaces the
   single-path `/q/envoy/echo` probe -- which only verified
   "envoy answers anything", not "every advertised route is
   reachable end-to-end".

   Reachability == any HTTP status code (2xx/3xx/4xx/5xx),
   not 200: a route that legitimately answers 302 / 401 / 404 is
   still proof the firewall + klipper-lb + envoy-gateway chain
   is working. Only `000` (timeout / refused) counts as
   unreachable. On any unreachable route the script logs a
   warning with diagnostic suggestions (firewall source-ranges
   narrowed, backend Service not Ready, workload still rolling
   out) and continues -- info-level surfacing today, gating /
   strict mode is a deliberately deferred follow-up.

Falls back to the old `/q/envoy/echo` probe when the cluster has
no Gateway-bound routes (a workload that hasn't applied yet).
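
A hedged sketch of the per-route probe (SSH_TARGET and the exact
jsonpath are illustrative; PUBLIC_IP is named above):

    # On the VM: collect every HTTPRoute/GRPCRoute hostname
    remote='sudo k3s kubectl get httproute,grpcroute -A -o jsonpath="{range .items[*]}{.spec.hostnames[*]} {end}"'
    fqdns=$(ssh "$SSH_TARGET" "$remote" | tr " " "\n" | sed "/^$/d" | sort -u)

    for fqdn in $fqdns; do
      # 000 = no TCP/HTTP answer; any real status counts as reachable
      code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
        --resolve "$fqdn:80:$PUBLIC_IP" "http://$fqdn/" || true)
      if [[ "$code" == "000" ]]; then
        echo "WARN: $fqdn unreachable via $PUBLIC_IP"
      else
        echo "OK:   $fqdn -> HTTP $code"
      fi
    done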

Verified end-to-end against the live appliance: 4 routes
enumerated (dev.yolean.net, ext-app01.yolean.se, keycloak-admin,
keycloak-admin.ext-app01.yolean.se), all returned HTTP 302 on
the first attempt. The redirect chain itself is intentionally
NOT exercised against ext-app01-* in this commit (would require
mutating an in-use LB the operator owns); it lands on the next
do_tls_frontend run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…own-side PRESERVED message

State preservation across appliance redeploys is the overarching
design goal of the data-seed mechanism (commit f69addf +
APPLIANCE_MAINTENANCE.md). What was missing on the operator-
facing side: the QA-flow build script silently reused the
persistent disk on every redeploy, masking the seed-skip from
build-time-only operators who expected each run to validate the
seed end-to-end. Conversely, the production "preserve customer
state across upgrades" intent was never written down where it
mattered (the operator only saw a generic banner at deploy time,
not after teardown when the disk-keep decision is most actionable).

Changes:

  - Build-flow `--reuse-disk=true|false` with an interactive
    prompt (default Y -- preserve, matching the design goal).
    On `--reuse-disk=false` the script delete-and-recreates the
    persistent disk so the next boot's data-seed unit lands the
    OS image's seed cleanly. Non-TTY callers MUST pass the flag
    explicitly; ASSUME_YES + missing flag fails fast rather than
    silently picking a default for an irreversible decision.

  - Teardown `--keep-disk=true|false`. Default behavior is
    unchanged (keep). Legacy `--delete-data-disk` continues to
    work as `--keep-disk=false` with a one-line deprecation
    notice, so any existing automation isn't broken.

  - Decoupled the new disk decisions from the existing
    `confirm()` helper (which consults ASSUME_YES). The new
    `prompt_yes_default()` helper requires a TTY or an
    explicit flag and never falls back to ASSUME_YES (see the
    sketch after this list). The umbrella ASSUME_YES still
    covers the existing 'Proceed?' + TLS-LB prompts.

  - Moved the "Persistent data disk PRESERVED" message from
    the build-success banner to the END of teardown when the
    disk was kept. That's the moment the operator's mental
    model needs the reminder ('what survived?' + 'how do I
    delete it later?'). The build success block keeps a brief
    one-line pointer to teardown's message instead of carrying
    the full paragraph.
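
A hedged sketch of the helper's contract (the real implementation
may differ); callers branch via e.g.
`if prompt_yes_default "Reuse persistent disk?" "$REUSE_DISK"; then`:

    # Returns 0 for yes, 1 for no; requires a TTY or an explicit
    # true/false flag value -- never consults ASSUME_YES.
    prompt_yes_default() {
      local question=$1 flag_value=${2:-}
      case "$flag_value" in
        true)  return 0 ;;
        false) return 1 ;;
      esac
      if [[ ! -t 0 ]]; then
        echo "ERROR: no TTY; pass the flag explicitly for: $question" >&2
        exit 1
      fi
      local answer
      read -r -p "$question [Y/n] " answer
      [[ -z "$answer" || "$answer" =~ ^[Yy]$ ]]
    }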

Verified end-to-end against yo-sre-appliance-qa over the past
two days: --reuse-disk=false correctly recreates the disk and
the data-seed unit extracts the image's seed onto it; the
recreated disk + grastate.dat workaround round-tripped
mariadb's keycloak.REALM rows through prepare-export -> seed
-> fresh-disk -> boot, with `keycloak/auth/realms/ext-bfv01`
returning 200 from the resulting cluster.

Two follow-up fixes are lined up but not in this commit (kept in
the working tree for a separate commit): a `return 0` belt at the end
of do_tls_teardown so its trailing `[[ -n "$subnet" ]] && ...`
doesn't leak a non-zero exit and abort the caller before the
new PRESERVED block fires; and the revert of the route-
enumeration block that this same teardown-issue debugging
surfaced as post-import SSH+kubectl scope-creep.
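
For context, the `set -e` interaction behind the first fix, as a
minimal illustration (the delete helper name is hypothetical):

    do_tls_teardown() {
      # ... resource deletes ...
      # `[[ ... ]] && cmd` does not itself trip set -e, but when
      # $subnet is empty the test's status 1 becomes the function's
      # exit status, and the CALL SITE then fails under set -e ...
      [[ -n "$subnet" ]] && delete_per_build_subnet "$subnet"
      return 0  # ... so the belt keeps the caller from aborting
    }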

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The build VM occasionally OOMs during the heavier customer
workloads applied at PROMPT 1 (mariadb + kafka + envoy +
the bundled controllers all in 4GB is tight). 8GB matches
the y-cluster default for stand-alone provisions, but the
qemu-to-gcp script was overriding it down to 4GB to preserve
headroom on the host; headroom is fine on the build host, so
lift the override.

The y-cluster default itself is unchanged (8192 in
config.QEMUConfig.applyDefaults), so other provisioner
flows (multipass, docker, plain qemu) are not affected.
Disk size stays at 40GB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #20 changed prepare-export to require the cluster RUNNING:
its live phase clears the per-deploy dns-hint-ip annotation
and snapshots reconciled Gateway state into <cacheDir>/<name>-
gateway-state.json (both need the apiserver up). prepare-export
then stops the VM itself before its offline (virt-customize)
phase.

The plan called for dropping `y-cluster stop` from the script
ahead of prepare-export, but the script edit never landed. The
result: every run of appliance-qemu-to-gcp.sh would stop the
cluster, then crash with "VM not running; start the cluster
first" when prepare-export ran against the stopped VM.

Drop the explicit stop call. Update the docstring stage list
to reflect that prepare-export does its own stop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drops the parallel-list footgun: today the operator declares
hostnames in HTTPRoute manifests AND in TLS_DOMAINS, and drift
between the two means the LB cert covers hostnames the cluster
doesn't serve, or vice versa.

Setting TLS_DOMAINS=auto now resolves the FQDN list by calling
`y-cluster gateway hostnames --csv` against the just-provisioned
cluster, immediately after PROMPT 1 confirmation. The cluster's
reconciled HTTPRoute / GRPCRoute hostnames become the LB cert
SAN list -- one source of truth.

Resolution runs BEFORE prepare-export because by the TLS LB
stage (after prepare-export + GCP deploy) the local apiserver
is gone. Other TLS_DOMAINS values (literal CSV / empty /
prompt) are still handled at the LB stage as before.

Empty result aborts with an explicit error (operator asked for
auto, none found = something wrong with the cluster state).
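
A hedged sketch of the resolution step (the subcommand and flag are
named above; surrounding names illustrative):

    if [[ "${TLS_DOMAINS:-}" == "auto" ]]; then
      TLS_DOMAINS="$(y-cluster gateway hostnames --csv)"
      if [[ -z "$TLS_DOMAINS" ]]; then
        echo "ERROR: TLS_DOMAINS=auto but no route hostnames found" >&2
        exit 1
      fi
      echo "Resolved TLS_DOMAINS=$TLS_DOMAINS from the live cluster"
    fi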

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…hooks

The unattended flow already had ASSUME_YES + TLS_DOMAINS=auto
landed, but no work-doing hook in PROMPT 1's hands-on window.
Result: a build with ASSUME_YES=1 reached prepare-export with
only the y-cluster echo HTTPRoute applied; TLS_DOMAINS=auto
then aborted because the cluster had no non-wildcard hostnames
to derive from.

Add the two hooks documented in
specs/y-cluster/FEATURE_APPLIANCE_AUTOMATED_FLOW.md:

- APPLIANCE_SEED_CMD runs after echo install, before PROMPT 1.
  Customer workloads applied here populate /data/yolean for the
  data-seed extraction AND give TLS_DOMAINS=auto real hostnames.
- APPLIANCE_VERIFY_CMD runs at the end, after the GCP deploy
  + optional TLS LB. Receives the LB IP / VM IP / domains via
  the Y_CLUSTER_CURRENT_* surface so a remote probe can curl
  --resolve through the deployed VM without /etc/hosts.

Both fire via `bash -c "$cmd"` so the operator-supplied string
can pipe / chain / cd freely. Both export a single, consistent
Y_CLUSTER_CURRENT_* env surface (via the new current_env
helper) -- a verify script `printenv | grep ^Y_CLUSTER_CURRENT_`
sees the full surface either way; vars not yet known at the
seed hook (REMOTE_VM_IP, etc.) are exported as empty strings.

Non-zero exit aborts under set -e. Local cluster / VM / LB
stay up for inspection.
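
A hedged sketch of how the hooks plausibly fire (current_env,
APPLIANCE_SEED_CMD and the Y_CLUSTER_CURRENT_ prefix are named
above; the individual variable names here are assumptions):

    current_env() {
      # Vars not yet known at this stage export as empty strings
      export Y_CLUSTER_CURRENT_LB_IP="${LB_IP:-}"
      export Y_CLUSTER_CURRENT_REMOTE_VM_IP="${REMOTE_VM_IP:-}"
      export Y_CLUSTER_CURRENT_DOMAINS="${TLS_DOMAINS:-}"
    }

    if [[ -n "${APPLIANCE_SEED_CMD:-}" ]]; then
      current_env
      bash -c "$APPLIANCE_SEED_CMD"  # non-zero aborts under set -e
    fi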

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e stack)

Observed an appliance build that ran fine for ~2h at 91-93% memory
on a 4 GiB e2-medium, then died at 100% CPU / 3807 MiB used: ssh
banner exchange timed out, :443 + :6443 went REFUSED while :80
kept LISTEN with the userspace too starved to respond. Classic OOM
spiral. The full appliance stack (k3s + containerd + keycloak +
envoy gateway + envoy proxy + mysql + kafka) sits within ~300 MiB
of the 4 GiB ceiling at idle; any workload spike pushes it over.

e2-standard-2 (2 vCPU / 8 GiB) gives the stack the headroom it
needs. GCE machine types bundle CPU + memory, so there's no
separate memory override -- that's spelled out in both the help
text and the default-assignment comment so the next operator
reading either spot sees why we don't surface a GCP_MEMORY knob.
GCP_MACHINE_TYPE stays as the escape hatch for highmem / larger
shapes.

The previous commit added "e2-medium's" and "there's" inside the
single-quoted YHELP block. Single quotes in bash cannot contain
single quotes, so the apostrophes terminated the string mid-block;
the resumed unquoted "4 GiB OOMs ..." got parsed as a command,
and any consumer that sourced or executed the help block saw
"line 76: 4: command not found".

Reworded to avoid the apostrophes entirely. bash -n parses the
file clean and --help renders the section as intended.
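
The failure mode in miniature (strings illustrative):

    # Broken: the apostrophe ends the single-quoted string early --
    #   YHELP='... e2-medium's 4 GiB OOMs ...'
    # bash resumes parsing at "s 4 GiB ..." as live shell input.
    # Safe: reword (as in this commit) or double-quote the block:
    YHELP="memory: a 4 GiB e2-medium OOMs under the full stack"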
Both file-path-via-env-var inputs surfaced the same foot-gun: a
malformed value passed the existence check but only failed deep
inside the tool we shelled out to, with a less helpful message:

  - GCP_KEY pointing at a truncated / wrong-format JSON
    (e.g. a re-exported key that lost its private_key during
    a copy-paste) only erred at `gcloud auth
    activate-service-account`, by which point the operator
    has already proven the file exists. Now `jq -e` checks
    that the four fields GCP requires for a service-account
    auth are populated -- type=service_account, project_id,
    client_email, private_key -- and errors with the missing
    field names so the operator knows what to fix.

  - H_S3_REGION accepted any string and only surfaced "could
    not resolve host" when the upload URL hit a non-existent
    endpoint hostname. The help text already documents the
    valid set (fsn1, hel1, nbg1); now the script enforces
    it at config-load time with a message naming the valid
    values.

Both checks fire BEFORE any cloud-side state change. Adds no
new dependency: jq is already required by the broader
appliance flow.
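
A hedged sketch of the two checks (error texts illustrative; the
real GCP_KEY check also names the missing fields individually):

    # GCP_KEY: the four fields service-account auth needs, populated
    jq -e '.type == "service_account"
           and (.project_id   | length > 0)
           and (.client_email | length > 0)
           and (.private_key  | length > 0)' "$GCP_KEY" >/dev/null \
      || { echo "ERROR: $GCP_KEY is not a usable service-account key" >&2; exit 1; }

    # H_S3_REGION: enforce the documented set at config-load time
    case "$H_S3_REGION" in
      fsn1|hel1|nbg1) ;;
      *) echo "ERROR: H_S3_REGION='$H_S3_REGION' (valid: fsn1 hel1 nbg1)" >&2
         exit 1 ;;
    esac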