
Appliance workflow scripts, not core y-cluster features #27

Open

solsson wants to merge 16 commits into main from appliance-workflows


Conversation

@solsson (Contributor) commented May 12, 2026

To get bearable dev loops I had to keep hosting-specific scripts in this repo.

Uses #19 and #21 to set up actual clusters, given an external "init" (the script that applies workloads) and "verify" (a smoke test run before export).

Yolean k8s-qa and others added 16 commits May 12, 2026 13:57
Brings every scripts/ change from agents/appliance-export-import
to upstream main as a single bump. The Go and testdata side of
this work landed in #19 (appliance-primitives); this commit is
the operator-facing bash that drives the released binary
through the appliance lifecycle, plus the .env-style config
the scripts source.

What's new vs main:

  - appliance-build-hetzner.sh / appliance-build-virtualbox.sh:
    interactive build flows producing a .qcow2 + a VirtualBox-
    importable .ova respectively, both via the released
    y-cluster binary's prepare-export and export subcommands.
  - appliance-publish-hetzner.sh: pushes a built appliance to
    Hetzner Object Storage for handoff.
  - appliance-qemu-to-gcp.sh: end-to-end qemu -> GCP custom
    image flow (export --format=gcp-tar -> gsutil cp -> compute
    images create) with persistent /data/yolean disk preserved
    across redeploys, plus a teardown subcommand.
  - gcp-bootstrap-credentials.sh: one-shot bootstrap for the
    service account / project / key file the GCP flow needs.
  - e2e-appliance-export-import.sh: local qemu -> qemu round-
    trip exercising the full prepare-export / export / import
    cycle without any cloud cred dependency.
  - e2e-appliance-hetzner.{sh,pkr.hcl}: Packer-based snapshot
    flow; lays the snapshot down once, spins fresh servers on
    top to verify boot.
  - e2e-appliance-qemu-to-gcp.sh: non-interactive driver of
    appliance-qemu-to-gcp.sh end to end, including teardown.
  - .env.example + .gitignore: documents every overridable
    knob (GCP_PROJECT, GCP_KEY, H_S3_ENV_FILE, ENV_FILE) with
    a generic example path; .env stays out of git.

Configuration: required values are operator-supplied via env
vars (no built-in defaults). Each script derives REPO_ROOT
from BASH_SOURCE and sources $REPO_ROOT/.env via `set -o
allexport` when present, so the .env path works regardless of
CWD (including `cd /tmp && bash /path/to/script`). Missing
required values fail fast with a clear "set $VAR in .env or
shell env" message.

Scope: scripts/ + repo-root .env plumbing. The Go side is
already on main via #19. Both `go build ./...` and `go test
./...` are unchanged-clean on this branch -- the scripts add
no go.mod or testdata edits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Yolean dev / setup scripts that smoke-test the gateway expect a
host-side port that reaches guest 80. Today the qemu-side host port
forward defaults to 39080 in both the Go e2e helper and the bash
appliance-build scripts, so any consumer that hardcodes
http://localhost:80 has to remember the offset.

This host (and most modern Linux distros) ships
net.ipv4.ip_unprivileged_port_start=80, so qemu's user-mode
hostfwd inherits the ability to bind port 80 without root. Default
APP_HTTP_PORT and the e2e port-forward helper to 80 in lockstep:

  - e2e/qemu_test.go: e2eUniqueForwards now takes both apiPort
    and httpPort; every test passes its own pair (28443 / 28444 /
    ... vs 26443 / 26444 / ...) keyed off the apiPort so concurrent
    test runs on the same host don't collide. Each test always gets
    a guest-80 forward, matching what the appliance-build scripts
    install.
  - scripts/appliance-{qemu-to-gcp,build-hetzner,build-virtualbox}.sh
    + scripts/e2e-appliance-{export-import,qemu-to-gcp}.sh: the
    APP_HTTP_PORT default flips from 39080 to 80, with YHELP /
    inline curl examples updated to match. Override via env
    (APP_HTTP_PORT=39080 ./scripts/...) on hosts that keep port 80
    privileged.
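
For reference, the sysctl check behind that assumption; a hedged
sketch (script path illustrative):

    # Ports >= this value can be bound without root; 80 means the
    # guest-80 hostfwd works unprivileged.
    sysctl net.ipv4.ip_unprivileged_port_start

    # Hosts that keep port 80 privileged override per invocation:
    APP_HTTP_PORT=39080 ./scripts/appliance-build-virtualbox.sh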

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Appliance e2e / build flows install workloads, build a seed
tarball, prepare-export, and re-boot from the prepared disk -- the
cumulative footprint pushes the 20G default disk into pressure on
the kubelet's image-gc thresholds, which surfaces as flaky pod
evictions mid-test or mid-build.

Bump to 40G everywhere a 20G default sat:

  - e2e/qemu_test.go: e2eQEMURuntime overrides DiskSize to 40G so
    every qemu e2e test boots with the larger disk by default.
  - scripts/appliance-{qemu-to-gcp,build-hetzner,build-virtualbox}.sh
    + scripts/e2e-appliance-{export-import,qemu-to-gcp}.sh: the
    generated y-cluster-provision.yaml now sets diskSize: "40G".
  - scripts/appliance-qemu-to-gcp.sh: --boot-disk-size on
    `gcloud compute instances create` flips from 20GB to 40GB so
    the GCE VM doesn't reject the 40G custom image with "Requested
    disk size cannot be smaller than the image size".

qcow2 is sparse, so the host-disk footprint only grows with actual
usage; the larger virtual size is a no-cost ceiling. The GCE side
similarly uses a thin-provisioned persistent disk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The appliance-build / e2e scripts each carried a defaults block:

    APP_HTTP_PORT="${APP_HTTP_PORT:-80}"
    APP_API_PORT="${APP_API_PORT:-39443}"
    APP_SSH_PORT="${APP_SSH_PORT:-2229}"

then interpolated those into the heredoc'd y-cluster-provision.yaml.
These values restate y-cluster's own defaulting concept
(80/6443/2222 in pkg/provision/config); even the two that
DIFFERED (39443 vs 6443; 2229 vs 2222), chosen for collision
avoidance against an operator's regular y-cluster, were quiet
duplicates of that same concept.

Replace the heredoc with a brace block that emits each port field
ONLY when the env var is set. Net behaviour:

  - No env override   -> minimal YAML; y-cluster fills 2222 +
                         {6443:6443, 80:80, 443:443}.
  - APP_HTTP_PORT=N   -> only the host:N -> guest:80 entry lands;
                         API/SSH still y-cluster-default.
  - Multiple set      -> all set entries land; requireHostAPIPort
                         validates that a guest:6443 entry exists.
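
A minimal sketch of the brace-block shape, with illustrative YAML
keys (the real field names live in y-cluster's provision schema):

    # Emit port fields only when the operator set the env var; when
    # the whole block is absent, y-cluster applies its own defaults.
    {
      if [[ -n "${APP_SSH_PORT:-}" ]]; then
        echo "sshPort: ${APP_SSH_PORT}"
      fi
      if [[ -n "${APP_API_PORT:-}${APP_HTTP_PORT:-}" ]]; then
        echo "portForwards:"
        if [[ -n "${APP_API_PORT:-}" ]]; then echo "  - ${APP_API_PORT}:6443"; fi
        if [[ -n "${APP_HTTP_PORT:-}" ]]; then echo "  - ${APP_HTTP_PORT}:80"; fi
      fi
    } >> y-cluster-provision.yaml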

Display refs (banner curl examples, ssh commands, smoketest
probes) use ${APP_*_PORT:-NN} inline so the printed URL/SSH
command shows the right port whether overridden or default.
YHELP entries reworded from "(default: 80)" to
"(y-cluster default: 80)" so the operator sees who owns the
default.

IMP_HTTP_PORT / IMP_SSH_PORT in e2e-appliance-export-import.sh
left as-is (test-only; the import-side qemu is started directly,
no y-cluster CLI involvement, so y-cluster's defaults don't
apply).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Symmetric with APP_HTTP_PORT / APP_API_PORT: a new
APP_HTTPS_PORT env var lets operators override the host port
forwarded to guest 443. Unset means "let y-cluster apply its
default" -- the YAML still omits the field when no port var is
set, which matches the behaviour for the other ports.

Without this, an operator who overrode any one of {HTTP, API}
silently lost 443 forwarding: the YAML's portForwards block
became canonical and didn't include 443, whereas y-cluster's
[6443:6443, 80:80, 443:443] default applied only when the bash
emitted no portForwards at all.

The host:guest match keeps standard ports inside the appliance
unchanged; the host-side ip_unprivileged_port_start sysctl on
modern Linux distros allows binding 443 without root the same
way 80 already does.

YHELP entries updated to surface the new knob.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a post-deploy step that offers to stand up a GCP regional
External Application Load Balancer in front of the appliance VM
with a self-signed cert covering operator-supplied FQDNs.
Idempotent (describe-then-create) so re-runs converge; teardown
integrated into the existing teardown subcommand.

Why a self-signed cert and a prompt-not-default

The cert-manager → upload-real-cert path is the eventual
production shape, but for the dev loop a self-signed cert lets
the operator verify the LB stack + HTTPRoute hostname matching
without DNS / CA dependencies. The prompt (rather than a silent
default) exists because the LB is a billing meter (forwarding rule
billed roughly hourly, plus a reserved IP) that the operator should
deliberately accept; we don't want a forgotten ASSUME_YES run to
silently provision one.

Operator UX

  - Default: prompts after the HTTP probe with a one-paragraph
    explainer (cost, self-signed cert, HTTPRoute prerequisite),
    accepts comma-separated FQDNs, empty skips.
  - TLS_DOMAINS env var preset: skips the prompt and runs.
  - ASSUME_YES alone: skips silently (unattended e2e shouldn't
    surprise-bill).
  - Final banner prints the LB IP + a single /etc/hosts line
    covering all FQDNs, marks the cert SELF-SIGNED, points at
    the gcloud commands to swap in a real cert later.

Resources, all named ${NAME}-tls-*

  proxy-only subnet (reuses any ACTIVE one in the region;
                     creates per-build only when none exists)
  static regional IP
  SSL cert (uploaded, self-signed)
  HTTP health check on /q/envoy/echo
  zonal NEG with the VM as endpoint
  backend service (EXTERNAL_MANAGED) + add-backend
  URL map (default-service points at the backend)
  target HTTPS proxy
  forwarding rule on :443
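
Each resource follows the describe-then-create shape; a hedged
sketch for the static IP (GCP_REGION is an assumed variable name;
resource names per the ${NAME}-tls-* convention above):

    # Idempotent: re-runs converge instead of erroring "already exists"
    if ! gcloud compute addresses describe "${NAME}-tls-ip" \
        --region="$GCP_REGION" >/dev/null 2>&1; then
      gcloud compute addresses create "${NAME}-tls-ip" \
        --region="$GCP_REGION"
    fi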

Teardown

do_tls_teardown is invoked from the existing do_teardown so a
plain `appliance-qemu-to-gcp.sh teardown` cleans up the LB
stack alongside the VM/image/object/disk. Order forces the
forwarding rule first (stops the meter), then proxy / url-map /
backend / NEG / health-check / cert / IP. Subnet last and only
when it's the per-build one (we never delete a reused regional
subnet). Each delete is idempotent: missing resources are not
errors. The `Will DELETE:` inventory now lists `${NAME}-tls-*`
when a forwarding rule of that shape exists.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two related fixes for the GCP appliance smoke flow:

1. do_tls_frontend now creates a :80 forwarding rule that 301s
   to :443. Previously the function set up only a :443 listener,
   so any `curl http://<lb-ip>/...` against the LB IP hung at TCP
   connect (no listener on 80). Hangs from `curl ... http://...`
   were diagnosed against the live ext-app01-* LB stack which
   has the same shape.

   Mechanism: GCP regional EXTERNAL_MANAGED URL maps can either
   forward (defaultService) or redirect (defaultUrlRedirect),
   not both, so the redirect needs its own URL map. The chain:

       :80 fwd -> tls-http-proxy -> tls-redirect URL map (301 to https)
       :443 fwd -> tls-proxy      -> tls-urlmap (existing, ->backend)

   `gcloud compute url-maps create` has no flag for default-
   redirect, hence the `url-maps import` from a heredoc.
   Hostname-agnostic on both ports: every request, any Host:,
   either redirects (on :80) or forwards to the VM (on :443).
   The VM's envoy-gateway is the only Host-aware hop.

   do_tls_teardown grew matching delete calls in dependency order
   (forwarding rules -> proxies -> URL maps) so re-runs converge
   cleanly.
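
   A hedged sketch of that import (the YAML schema is GCP's UrlMap
   resource; GCP_REGION is an assumed variable name):

    # gcloud has no create-time flag for a default redirect, so
    # import the UrlMap resource as YAML
    gcloud compute url-maps import "${NAME}-tls-redirect" \
        --region="$GCP_REGION" --source=- <<EOF
    name: ${NAME}-tls-redirect
    defaultUrlRedirect:
      httpsRedirect: true
      redirectResponseCode: MOVED_PERMANENTLY_DEFAULT
    EOF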

2. The post-deploy probe at the end of the GCP stage now
   enumerates HTTPRoute + GRPCRoute hostnames via SSH +
   `sudo k3s kubectl ... -o jsonpath` and probes each FQDN
   through `--resolve <fqdn>:80:$PUBLIC_IP`. Replaces the
   single-path `/q/envoy/echo` probe -- which only verified
   "envoy answers anything", not "every advertised route is
   reachable end-to-end".

   Reachability == any HTTP status code (2xx/3xx/4xx/5xx),
   not 200: a route that legitimately answers 302 / 401 / 404 is
   still proof the firewall + klipper-lb + envoy-gateway chain
   is working. Only `000` (timeout / refused) counts as
   unreachable. On any unreachable route the script logs a
   warning with diagnostic suggestions (firewall source-ranges
   narrowed, backend Service not Ready, workload still rolling
   out) and continues -- info-level surfacing today, gating /
   strict mode is a deliberately deferred follow-up.

Falls back to the old `/q/envoy/echo` probe when the cluster has
no Gateway-bound routes (a workload that hasn't applied yet).
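
A hedged sketch of the per-route probe (SSH_TARGET and the exact
jsonpath are illustrative; PUBLIC_IP is named above):

    # On the VM: collect every HTTPRoute/GRPCRoute hostname
    remote='sudo k3s kubectl get httproute,grpcroute -A -o jsonpath="{range .items[*]}{.spec.hostnames[*]} {end}"'
    fqdns=$(ssh "$SSH_TARGET" "$remote" | tr " " "\n" | sed "/^$/d" | sort -u)

    for fqdn in $fqdns; do
      # 000 = no TCP/HTTP answer; any real status counts as reachable
      code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
        --resolve "$fqdn:80:$PUBLIC_IP" "http://$fqdn/" || true)
      if [[ "$code" == "000" ]]; then
        echo "WARN: $fqdn unreachable via $PUBLIC_IP"
      else
        echo "OK:   $fqdn -> HTTP $code"
      fi
    done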

Verified end-to-end against the live appliance: 4 routes
enumerated (dev.yolean.net, ext-app01.yolean.se, keycloak-admin,
keycloak-admin.ext-app01.yolean.se), all returned HTTP 302 on
the first attempt. The redirect chain itself is intentionally
NOT exercised against ext-app01-* in this commit (would require
mutating an in-use LB the operator owns); it lands on the next
do_tls_frontend run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…own-side PRESERVED message

State preservation across appliance redeploys is the overarching
design goal of the data-seed mechanism (commit f69addf +
APPLIANCE_MAINTENANCE.md). What was missing on the operator-
facing side: the QA-flow build script silently reused the
persistent disk on every redeploy, masking the seed-skip from
build-time-only operators who expected each run to validate the
seed end-to-end. Conversely, the production "preserve customer
state across upgrades" intent was never written down where it
mattered (the operator only saw a generic banner at deploy time,
not after teardown when the disk-keep decision is most actionable).

Changes:

  - Build-flow `--reuse-disk=true|false` with an interactive
    prompt (default Y -- preserve, matching the design goal).
    On `--reuse-disk=false` the script delete-and-recreates the
    persistent disk so the next boot's data-seed unit lands the
    OS image's seed cleanly. Non-TTY callers MUST pass the flag
    explicitly; ASSUME_YES + missing flag fails fast rather than
    silently picking a default for an irreversible decision.

  - Teardown `--keep-disk=true|false`. Default behavior is
    unchanged (keep). Legacy `--delete-data-disk` continues to
    work as `--keep-disk=false` with a one-line deprecation
    notice, so any existing automation isn't broken.

  - Decoupled the new disk decisions from the existing
    `confirm()` helper (which consults ASSUME_YES). The new
    `prompt_yes_default()` helper requires a TTY or an
    explicit flag and never falls back to ASSUME_YES (see the
    sketch after this list). The umbrella ASSUME_YES still
    covers the existing 'Proceed?' + TLS-LB prompts.

  - Moved the "Persistent data disk PRESERVED" message from
    the build-success banner to the END of teardown when the
    disk was kept. That's the moment the operator's mental
    model needs the reminder ('what survived?' + 'how do I
    delete it later?'). The build success block keeps a brief
    one-line pointer to teardown's message instead of carrying
    the full paragraph.
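
A hedged sketch of the helper's contract (the real implementation
may differ); callers branch via e.g.
`if prompt_yes_default "Reuse persistent disk?" "$REUSE_DISK"; then`:

    # Returns 0 for yes, 1 for no; requires a TTY or an explicit
    # true/false flag value -- never consults ASSUME_YES.
    prompt_yes_default() {
      local question=$1 flag_value=${2:-}
      case "$flag_value" in
        true)  return 0 ;;
        false) return 1 ;;
      esac
      if [[ ! -t 0 ]]; then
        echo "ERROR: no TTY; pass the flag explicitly for: $question" >&2
        exit 1
      fi
      local answer
      read -r -p "$question [Y/n] " answer
      [[ -z "$answer" || "$answer" =~ ^[Yy]$ ]]
    }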

Verified end-to-end against yo-sre-appliance-qa over the past
two days: --reuse-disk=false correctly recreates the disk and
the data-seed unit extracts the image's seed onto it; the
recreated disk + grastate.dat workaround round-tripped
mariadb's keycloak.REALM rows through prepare-export -> seed
-> fresh-disk -> boot, with `keycloak/auth/realms/ext-bfv01`
returning 200 from the resulting cluster.

Two follow-up fixes are lined up but not in this commit (kept in
the working tree for a separate commit): a `return 0` belt at the end
of do_tls_teardown so its trailing `[[ -n "$subnet" ]] && ...`
doesn't leak a non-zero exit and abort the caller before the
new PRESERVED block fires; and the revert of the route-
enumeration block that this same teardown-issue debugging
surfaced as post-import SSH+kubectl scope-creep.
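
For context, the `set -e` interaction behind the first fix, as a
minimal illustration (the delete helper name is hypothetical):

    do_tls_teardown() {
      # ... resource deletes ...
      # `[[ ... ]] && cmd` does not itself trip set -e, but when
      # $subnet is empty the test's status 1 becomes the function's
      # exit status, and the CALL SITE then fails under set -e ...
      [[ -n "$subnet" ]] && delete_per_build_subnet "$subnet"
      return 0  # ... so the belt keeps the caller from aborting
    }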

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The build VM occasionally OOMs during the heavier customer
workloads applied at PROMPT 1 (mariadb + kafka + envoy +
the bundled controllers all in 4GB is tight). 8GB matches
the y-cluster default for stand-alone provisions, but the
qemu-to-gcp script was overriding it down to 4GB to preserve
headroom on the host; headroom is fine on the build host, so
lift the override.

The y-cluster default itself is unchanged (8192 in
config.QEMUConfig.applyDefaults), so other provisioner
flows (multipass, docker, plain qemu) are not affected.
Disk size stays at 40GB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #20 changed prepare-export to require the cluster RUNNING:
its live phase clears the per-deploy dns-hint-ip annotation
and snapshots reconciled Gateway state into <cacheDir>/<name>-
gateway-state.json (both need the apiserver up). prepare-export
then stops the VM itself before its offline (virt-customize)
phase.

The plan called for dropping `y-cluster stop` from the script
ahead of prepare-export, but the script edit never landed. The
result: every run of appliance-qemu-to-gcp.sh would stop the
cluster, then crash with "VM not running; start the cluster
first" when prepare-export ran against the stopped VM.

Drop the explicit stop call. Update the docstring stage list
to reflect that prepare-export does its own stop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drops the parallel-list footgun: today the operator declares
hostnames in HTTPRoute manifests AND in TLS_DOMAINS, and drift
between the two means the LB cert covers hostnames the cluster
doesn't serve, or vice versa.

Setting TLS_DOMAINS=auto now resolves the FQDN list by calling
`y-cluster gateway hostnames --csv` against the just-provisioned
cluster, immediately after PROMPT 1 confirmation. The cluster's
reconciled HTTPRoute / GRPCRoute hostnames become the LB cert
SAN list -- one source of truth.

Resolution runs BEFORE prepare-export because by the TLS LB
stage (after prepare-export + GCP deploy) the local apiserver
is gone. Other TLS_DOMAINS values (literal CSV / empty /
prompt) are still handled at the LB stage as before.

Empty result aborts with an explicit error (operator asked for
auto, none found = something wrong with the cluster state).
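
A hedged sketch of the resolution step (the subcommand and flag are
named above; surrounding names illustrative):

    if [[ "${TLS_DOMAINS:-}" == "auto" ]]; then
      TLS_DOMAINS="$(y-cluster gateway hostnames --csv)"
      if [[ -z "$TLS_DOMAINS" ]]; then
        echo "ERROR: TLS_DOMAINS=auto but no route hostnames found" >&2
        exit 1
      fi
      echo "Resolved TLS_DOMAINS=$TLS_DOMAINS from the live cluster"
    fi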

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…hooks

The unattended flow already had ASSUME_YES + TLS_DOMAINS=auto
landed, but no work-doing hook in PROMPT 1's hands-on window.
Result: a build with ASSUME_YES=1 reached prepare-export with
only the y-cluster echo HTTPRoute applied; TLS_DOMAINS=auto
then aborted because the cluster had no non-wildcard hostnames
to derive from.

Add the two hooks documented in
specs/y-cluster/FEATURE_APPLIANCE_AUTOMATED_FLOW.md:

- APPLIANCE_SEED_CMD runs after echo install, before PROMPT 1.
  Customer workloads applied here populate /data/yolean for the
  data-seed extraction AND give TLS_DOMAINS=auto real hostnames.
- APPLIANCE_VERIFY_CMD runs at the end, after the GCP deploy
  + optional TLS LB. Receives the LB IP / VM IP / domains via
  the Y_CLUSTER_CURRENT_* surface so a remote probe can curl
  --resolve through the deployed VM without /etc/hosts.

Both fire via `bash -c "$cmd"` so the operator-supplied string
can pipe / chain / cd freely. Both export a single, consistent
Y_CLUSTER_CURRENT_* env surface (via the new current_env
helper) -- a verify script `printenv | grep ^Y_CLUSTER_CURRENT_`
sees the full surface either way; vars not yet known at the
seed hook (REMOTE_VM_IP, etc.) are exported as empty strings.

Non-zero exit aborts under set -e. Local cluster / VM / LB
stay up for inspection.
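
A hedged sketch of how the hooks plausibly fire (current_env,
APPLIANCE_SEED_CMD and the Y_CLUSTER_CURRENT_ prefix are named
above; the individual variable names here are assumptions):

    current_env() {
      # Vars not yet known at this stage export as empty strings
      export Y_CLUSTER_CURRENT_LB_IP="${LB_IP:-}"
      export Y_CLUSTER_CURRENT_REMOTE_VM_IP="${REMOTE_VM_IP:-}"
      export Y_CLUSTER_CURRENT_DOMAINS="${TLS_DOMAINS:-}"
    }

    if [[ -n "${APPLIANCE_SEED_CMD:-}" ]]; then
      current_env
      bash -c "$APPLIANCE_SEED_CMD"  # non-zero aborts under set -e
    fi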

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e stack)

Observed an appliance build that ran fine for ~2h at 91-93% memory
on a 4 GiB e2-medium, then died at 100% CPU / 3807 MiB used: ssh
banner exchange timed out, :443 + :6443 went REFUSED while :80
kept LISTEN with the userspace too starved to respond. Classic OOM
spiral. The full appliance stack (k3s + containerd + keycloak +
envoy gateway + envoy proxy + mysql + kafka) sits within ~300 MiB
of the 4 GiB ceiling at idle; any workload spike pushes it over.

e2-standard-2 (2 vCPU / 8 GiB) gives the stack the headroom it
needs. GCE machine types bundle CPU + memory, so there's no
separate memory override -- that's spelled out in both the help
text and the default-assignment comment so the next operator
reading either spot sees why we don't surface a GCP_MEMORY knob.
GCP_MACHINE_TYPE stays as the escape hatch for highmem / larger
shapes.

The previous commit added "e2-medium's" and "there's" inside the
single-quoted YHELP block. Single quotes in bash cannot contain
single quotes, so the apostrophes terminated the string mid-block;
the resumed unquoted "4 GiB OOMs ..." got parsed as a command,
and any consumer that sourced or executed the help block saw
"line 76: 4: command not found".

Reworded to avoid the apostrophes entirely. bash -n parses the
file clean and --help renders the section as intended.
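
The failure mode in miniature (strings illustrative):

    # Broken: the apostrophe ends the single-quoted string early --
    #   YHELP='... e2-medium's 4 GiB OOMs ...'
    # bash resumes parsing at "s 4 GiB ..." as live shell input.
    # Safe: reword (as in this commit) or double-quote the block:
    YHELP="memory: a 4 GiB e2-medium OOMs under the full stack"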
Both file-path-via-env-var inputs surfaced the same foot-gun: a
malformed value passed the existence check but only failed deep
inside the tool we shelled out to, with a less helpful message:

  - GCP_KEY pointing at a truncated / wrong-format JSON
    (e.g. a re-exported key that lost its private_key during
    a copy-paste) only erred at `gcloud auth
    activate-service-account`, by which point the operator
    has already proven the file exists. Now `jq -e` checks
    that the four fields GCP requires for a service-account
    auth are populated -- type=service_account, project_id,
    client_email, private_key -- and errors with the missing
    field names so the operator knows what to fix.

  - H_S3_REGION accepted any string and only surfaced "could
    not resolve host" when the upload URL hit a non-existent
    endpoint hostname. The help text already documents the
    valid set (fsn1, hel1, nbg1); now the script enforces
    it at config-load time with a message naming the valid
    values.

Both checks fire BEFORE any cloud-side state change. Adds no
new dependency: jq is already required by the broader
appliance flow.
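
A hedged sketch of the two checks (error texts illustrative; the
real GCP_KEY check also names the missing fields individually):

    # GCP_KEY: the four fields service-account auth needs, populated
    jq -e '.type == "service_account"
           and (.project_id   | length > 0)
           and (.client_email | length > 0)
           and (.private_key  | length > 0)' "$GCP_KEY" >/dev/null \
      || { echo "ERROR: $GCP_KEY is not a usable service-account key" >&2; exit 1; }

    # H_S3_REGION: enforce the documented set at config-load time
    case "$H_S3_REGION" in
      fsn1|hel1|nbg1) ;;
      *) echo "ERROR: H_S3_REGION='$H_S3_REGION' (valid: fsn1 hel1 nbg1)" >&2
         exit 1 ;;
    esac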