
Add experimental Content-Addressable Storage (CAS) backups#1367

Open
filimonov wants to merge 190 commits into Altinity:master from filimonov:feat/cas-backups

Conversation

@filimonov

Add experimental Content-Addressable Storage (CAS) backups

⚠️ Experimental. Layout may change incompatibly before stable. See docs/cas-design.md and docs/cas-operator-runbook.md.

Summary

Adds seven new commands — cas-{upload,download,restore,delete,verify,prune,status} — that store backups in a content-addressable layout where every file is keyed by the CityHash128 already in ClickHouse's per-part checksums.txt. Identical content is stored once and referenced from any number of backups, across mutations, days, and tables. There is no RequiredBackup chain — every backup is independently restorable; deleting one never affects another. Storage grows with new content, not with backup count.

The feature lives side-by-side with v1 (upload/download/restore) under a separate root prefix in the bucket. v1 commands stay byte-for-byte unchanged.
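The key derivation is the heart of the scheme: a blob's bucket key is a pure function of its CityHash128 digest, so identical bytes land at identical keys no matter which backup, part, or table they came from. A minimal sketch of that idea (the exact fan-out layout is an assumption for illustration; see docs/cas-design.md for the real one):

```go
package main

import (
	"fmt"
	"path"
)

// blobKey derives a bucket key from the CityHash128 hex digest that
// ClickHouse already records in each part's checksums.txt. Layout is
// illustrative: the first two hex characters become a fan-out shard
// directory, the remainder is the object name.
func blobKey(rootPrefix, clusterID, hashHex string) string {
	shard, rest := hashHex[:2], hashHex[2:]
	return path.Join(rootPrefix, clusterID, "blob", shard, rest)
}

func main() {
	// The same content always maps to the same key, which is what
	// makes dedup across mutations, days, and tables automatic.
	fmt.Println(blobKey("cas", "my-prod-cluster", "00112233445566778899aabbccddeeff"))
}
```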

Why

Three operational pain points motivated this:

  1. Mutations re-upload everything. A single ALTER TABLE … UPDATE rewrites one column file and renames the part. Today's incremental dedup is name-based, so the new part looks unrelated and 99% of identical bytes get re-transferred. Content-addressing dedups them.
  2. Incremental chains are fragile. v1 incremental backups depend on a base; restore unrolls the chain. Operators avoid the chain by taking expensive full backups. CAS removes the dependency: each backup is independent.
  3. Storage scales with backup count. With v1, every retained snapshot pays close to full price. With CAS, keeping 30 daily snapshots of a slowly-changing dataset costs roughly the same as keeping one.

How to enable

cas:
  enabled: true
  cluster_id: my-prod-cluster   # required, unique per source cluster

Then:

clickhouse-backup create my_backup
clickhouse-backup cas-upload my_backup
clickhouse-backup cas-restore my_backup

All seven commands also work over the existing daemon REST API (POST /backup/cas-upload/{name} etc.; cas-* verbs in /backup/actions).

What's new

Operator-facing:

  • 7 new CLI commands, 7 new REST endpoints
  • New cas: config block with enabled, cluster_id, root_prefix, inline_threshold, grace_blob, abandon_threshold, wait_for_prune, allow_unsafe_markers, skip_conditional_put_probe, allow_unsafe_object_disk_skip
  • cas-prune mark-and-sweep GC; cas-verify for HEAD+size integrity checks
  • cas-status daemon endpoint for monitoring
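A fuller `cas:` stanza with every knob listed above might look like the following. Field meanings are inferred from the descriptions in this PR and all values other than `enabled` and `cluster_id` are illustrative, not documented defaults; consult docs/cas-design.md for the authoritative semantics:

```yaml
cas:
  enabled: true
  cluster_id: my-prod-cluster        # required; unique per source cluster
  root_prefix: cas                   # top-level namespace in the bucket
  inline_threshold: 65536            # illustrative: files at or below this go into per-table archives
  grace_blob: 24h                    # illustrative: orphan blobs younger than this survive prune
  abandon_threshold: 4h              # illustrative: inprogress markers older than this count as abandoned
  wait_for_prune: 15m                # illustrative: how long commands wait politely on prune.marker
  allow_unsafe_markers: false        # FTP-only opt-in for best-effort markers
  skip_conditional_put_probe: false  # bypass the If-None-Match startup probe
  allow_unsafe_object_disk_skip: false
```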

Implementation:

  • pkg/cas/ — core engine (~6 KLOC + ~6 KLOC tests)
  • pkg/checksumstxt/ — new parser for ClickHouse's part-file checksums format (versions 2/3/4)
  • pkg/storage/ — new PutFileAbsoluteIfAbsent / PutFileIfAbsent on all 6 backends (S3, Azure, GCS, COS, SFTP, FTP) for atomic markers
  • One-shot startup probe verifies S3-compatible backends honor If-None-Match: * (refuses MinIO < 2024-11)

Breaking changes

See ChangeLog.md vNEXT section. Top three:

  1. Don't downgrade with CAS data in the bucket. A pre-CAS binary will treat the cas/ namespace as a broken v1 backup and silently delete it on the next clean_remote_broken or BackupsToKeepRemote run. Recovery procedure in the operator runbook.
  2. pkg/storage.RemoteStorage interface gains two methods (PutFileAbsoluteIfAbsent, PutFileIfAbsent). Third-party implementations need to add them.
  3. pkg/storage.BackupDestination.BackupList signature gains a skipPrefixes []string parameter.

A v1 backup literally named "cas" (matching the default root_prefix) will be filtered after upgrade; rename before upgrading.

How to review (suggested reading order)

The branch has many small commits because it landed in 8 phases plus 6 review waves. The cleanest way through it is by package, not by commit. Suggested order:

1. Read the design first

  • docs/cas-design.md — full design (~700 lines). §1–§3 = motivation/scope, §6 = the protocol, §7 = what's reused, §8 = risk register, §9 = the v2 backlog of known deferrals.
  • docs/cas-operator-runbook.md — what an operator sees: first-deployment walkthrough, recovery procedures, REST endpoints.

2. Read the data model

  • pkg/checksumstxt/ (smallest; standalone) — parser for the ClickHouse on-disk format. The hash CAS uses comes from here. format.md is the reference spec.
  • pkg/cas/types.go + pkg/cas/blobpath.go + pkg/cas/paths.go — blob-key derivation and the on-bucket layout (where everything lives under cas/<cluster>/).
  • pkg/cas/config.go — the cas: config block, validation rules, the SkipPrefixes() contract that protects CAS from v1 retention.

3. Storage primitives

  • pkg/storage/structs.go — interface-level: PutFileAbsoluteIfAbsent, ErrConditionalPutNotSupported, the BackupList signature change.
  • pkg/storage/{s3,gcs,azblob,cos,sftp,ftp}.go (and tests) — per-backend conditional-create implementations. Read S3's IfNoneMatch: "*" first (canonical), then SFTP's O_EXCL, then FTP (the refuse-by-default one with the allow_unsafe_markers opt-in best-effort fallback).
  • pkg/cas/casstorage/backend_storage.go — the thin adapter that wraps pkg/storage.BackupDestination as the narrower cas.Backend interface. This is the import-cycle breaker; everything in pkg/cas/ talks to cas.Backend, never to pkg/storage directly.

4. The hot path — upload

  • pkg/cas/upload.go — the 13-step upload protocol. Each step is commented with its number and intent. Steps 5 (write inprogress marker) → 11 (pre-commit re-checks) → 13 (commit metadata.json) are the load-bearing sequence; everything between is parallel blob/archive uploads. Note the single deferred cleanup at step 5 (added in wave-6 review, Altinity#3): on any error path including panic, the marker is released via a detached context.
  • pkg/cas/coldlist.go + pkg/cas/cache.go — the parallel 256-shard LIST that seeds the existence set so the same blob isn't re-uploaded.
  • pkg/cas/archive.go — per-table tar.zstd builder for the inline files.
  • pkg/cas/markers.go + pkg/cas/probe.go + pkg/cas/wait.go — atomic mutual exclusion, conditional-put probe, polite waiting on prune-marker.

5. Download and restore

  • pkg/cas/download.go — materializes a backup into a v1-shaped local directory. Note the staging-dir + atomic-rename pattern (added in an earlier review wave) — partial downloads never leave a "looks valid" directory at the final path.
  • pkg/cas/restore.go — thin wrapper: cas-download into a staging dir, then hand off to v1's b.Restore. The handoff metadata sets BackupMetadata.CAS.Handoff = true so v1's cross-mode guards distinguish "raw CAS backup" (refuse) from "CAS handoff layout" (allow + skip object-disk fetch).

6. Lifecycle: delete and prune

  • pkg/cas/delete.go — ordered delete (metadata.json first, then subtree). Writes its own inprogress marker tagged Tool: cas-delete to lock out concurrent same-name uploads on other hosts.
  • pkg/cas/prune.go + pkg/cas/sweep.go + pkg/cas/markset.go — mark-and-sweep GC. External-mergesort mark set spilled to disk; container/heap-based shard iterator for the sweep phase. Holds an exclusive prune.marker; explicit --unlock escape hatch when stranded.
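At its core the GC is classic mark-and-sweep: mark every blob key referenced by any live backup, then delete blobs outside that set. An in-memory caricature (the real engine spills the mark set to disk via external merge-sort and iterates shards with container/heap; a map suffices to show the invariant):

```go
package main

import "fmt"

// pruneSweep deletes blobs not referenced by any backup and returns
// the deleted keys. Illustrative only: grace periods, markers, and
// the disk-spilled mark set are omitted.
func pruneSweep(blobs map[string][]byte, backups map[string][]string) (deleted []string) {
	marked := map[string]bool{}
	for _, refs := range backups { // mark phase: union of all references
		for _, k := range refs {
			marked[k] = true
		}
	}
	for k := range blobs { // sweep phase: delete unreferenced blobs
		if !marked[k] {
			delete(blobs, k)
			deleted = append(deleted, k)
		}
	}
	return deleted
}

func main() {
	blobs := map[string][]byte{"aa/1": nil, "bb/2": nil, "cc/3": nil}
	backups := map[string][]string{"monday": {"aa/1"}, "tuesday": {"aa/1", "bb/2"}}
	fmt.Println(len(pruneSweep(blobs, backups))) // cc/3 is the only orphan
}
```

Because a blob shared by any backup stays marked, deleting one backup never breaks another — the property the PR summary leads with.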

7. Verify, list, status

  • pkg/cas/verify.go — HEAD + size check (no content re-hash; that's a v2 deferred item).
  • pkg/cas/list.go — backup catalog walker; surfaces CAS backups in clickhouse-backup list remote and the REST list endpoint.
  • pkg/cas/status.go — bucket health summary (LIST-only, cheap).

8. The integration glue

  • pkg/backup/cas_methods.go — Backuper.CASUpload/CASDownload/CASRestore/CASDelete/CASVerify/CASPrune/CASStatus. Object-disk preflight (fail-closed on disk-query failure), pidlock around cas-download phase, probe-singleton injection for daemon mode.
  • cmd/clickhouse-backup/cas_commands.go — CLI flag definitions and action wiring.
  • pkg/server/cas_handlers.go + pkg/server/actions_cas.go — REST handlers (mostly async with the existing status.Current pattern; cas-status is sync).
  • pkg/server/server.go — route registration; new casProbeState field on APIServer so the conditional-put probe fires once per daemon lifetime, not once per request.
  • pkg/backup/list.go — CollectRemoteCASBackups for the merged /backup/list response.

9. Tests

  • pkg/cas/internal/fakedst/ — in-memory cas.Backend fake with hooks for error injection.
  • pkg/cas/internal/testfixtures/ — synthetic local-backup builder (parts with deterministic hashes).
  • test/integration/cas_*.go — testcontainer-driven end-to-end coverage across MinIO, Azurite, fake-gcs-server, OpenSSH-server SFTP, and proftpd FTP. The full CAS integration suite passes 19 PASS / 2 SKIP / 0 FAIL.

Test plan

Running on this branch:

  • go test ./pkg/cas/... ./pkg/storage/... ./pkg/backup/... ./pkg/server/... -race -count=1 — ✅
  • go vet ./... — ✅
  • go test ./... whole tree, -short — ✅
  • RUN_PARALLEL=1 go test -tags=integration ./test/integration/ -run TestCAS -timeout 90m — ✅ (19 PASS, 2 SKIP)

CI in .github/workflows/build.yaml will run the integration suite across the ClickHouse-version matrix on this PR.

What's deferred

The docs/cas-design.md §9 backlog tracks ~30 known follow-up items. None block v1 of CAS. The largest categories:

  • §9.1 Major features (deferred): hash-verification on download (cas-verify --deep), object-disk parts, convergent encryption, cas-fsck, parallel cas-verify, distributed locking via S3 conditional create (beyond the same-name race we already close), local-disk / NFS target.
  • §9.2 Performance / scalability: prune mark-phase parallelism (already partial; further gains possible), SweepOrphans spill-to-disk, streaming archive upload via io.Pipe, heap-merge for shardIter, replace ColdList with per-blob PutFileIfAbsent (architectural alternative).
  • §9.3 Operability: --log-format=json for prune, populate BlobsTotal/OrphansHeldByGrace (already partial), per-blob progress logging.
  • §9.4 Correctness defenses: pkg/pidlock TOCTOU (whole-tool, not CAS-specific), probe-key cleanup on prune, S3 IfNoneMatch startup probe (already shipped — listed for clarity).
  • §9.5 Tests: tighten GCS/COS/FTP not-found tests to call real production code (mirror functions in test file today).
  • §9.6 UX polish: cas-delete --force, wait_for_prune defaults docs, etc.

Known limitations (operator-facing)

In docs/cas-operator-runbook.md "Known limitations (v1)":

  • --tables patterns are filepath.Match glob, not regex
  • Object-disk tables refused (S3/Azure/HDFS/encrypted-over-S3)
  • Multi-host concurrent upload to the same backup name unsupported (use unique names per writer)
  • Hash verification on download is HEAD + size only
  • No per-blob resumable uploads
  • FTP requires opt-in allow_unsafe_markers for best-effort markers
  • MinIO < RELEASE.2024-11-07 rejected by conditional-put probe
  • No cross-cluster blob sharing

Phasing of the work in this branch

For maintainers who do want to see how the work landed: 8 implementation phases + 6 external review-feedback waves. Commit groups roughly correspond to:

  • Phases 1–2: MVP upload+restore, then prune
  • Phase 3: planner correctness (projections, encoded-vs-decoded names, empty tables)
  • Phase 4: atomic markers across all 6 backends
  • Phase 5: per-backend integration smoke tests
  • Phase 6: P1 defects from external review wave 1 (marker-leak, dry-run-unlock, zero-ModTime classification, etc.)
  • Phase 7 (cleanup round): ColdList TOCTOU re-validation, PruneReport counters, cfg.Validate() defensive guards
  • Phase 8: wait_for_prune knob + REST API wiring
  • Wave 5 + 6: external-review feedback rounds covering atomic-download, fail-closed object-disk preflight, encrypted-over-S3 detection, conditional-put probe per-process key, blob-upload byte verification, name-collision guards, async cas-delete, streaming verify, dry-run candidate API, and many smaller hardenings

git log --oneline master..HEAD shows the full sequence; commit subjects are tagged with the relevant fix-id.

filimonov and others added 30 commits May 7, 2026 16:06
Two plans derived from docs/cas-design.md:

- Phase 1 + 1.5: cas-upload, cas-download, cas-restore, cas-delete,
  cas-verify, cas-status, plus v1<->CAS isolation guards. Independently
  shippable (backup/restore works without GC).
- Phase 2: cas-prune mark-and-sweep with deferred marker release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Promote the mature checksums.txt parser from docs/checksumstxt/ to
pkg/checksumstxt/ so production CAS packages can import it. All 11
tests pass; go build ./... and go vet ./... clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ion/multi-block)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… layout

This is the spec the cas-phase1 plan implements. format.md is already tracked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add cas.Config with defaults, Validate(), and ClusterPrefix(). Wire into
pkg/config.Config (field, DefaultConfig, ValidateConfig). Move
backend_storage.go to pkg/cas/casstorage to break the import cycle that
would have resulted from pkg/config importing pkg/cas importing pkg/storage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hardens path-key construction against operator misconfiguration that could
produce traversal-flavored object keys (review feedback on Task 6).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add sentinel errors (errors.go) and ValidateBackup (validate.go) which
loads and verifies CAS metadata.json — enforcing name syntax, layout
version bounds, inline_threshold range, and cluster_id ownership.
Include validate_test.go covering happy path and all 7 failure modes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hardens validateName beyond the regex; cited as Minor in Task 10 review.
Add DetectObjectDiskTables + IsObjectDiskType to pkg/cas, plus TableInfo
and DiskInfo value types in types.go. Using local structs avoids the
pkg/cas ↔ pkg/clickhouse import cycle. 6 unit tests cover longest-prefix
matching, sibling-prefix false-positives, deduplication, and empty inputs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implement Upload(): parses each part's checksums.txt, classifies
files into inline (≤ threshold, packed into per-(disk,db,table)
tar.zstd) and blob (> threshold, content-addressed by Hash128),
cold-lists existing blobs to dedup, enforces marker discipline
with pre-commit re-checks, then commits the root metadata.json
last.

Adds pkg/cas/internal/testfixtures for synthesizing local-backup
trees with valid v2 checksums.txt so tests can drive Upload
end-to-end against the in-memory fakedst backend.
Download(ctx, b, cfg, name, opts) reads cas/<cluster>/metadata/<name>/
metadata.json via ValidateBackup, fetches per-table TableMetadata JSONs,
applies TableFilter / Partitions / SchemaOnly filters, downloads and
extracts per-(disk, db, table) tar.zstd archives into the local shadow
tree, then walks each part's on-disk checksums.txt and fetches every
file with size > InlineThreshold from cas/blob/<aa>/<rest>.

Local layout produced is exactly what v1 restore consumes when
DataFormat="directory":
  <localDir>/<name>/metadata.json
  <localDir>/<name>/metadata/<enc_db>/<enc_table>.json
  <localDir>/<name>/shadow/<enc_db>/<enc_table>/<disk>/<part>/<file>

Path containment is enforced for both tar extraction (via the existing
ExtractArchive) and blob writes (re-asserted before O_TRUNC). Filenames
parsed from checksums.txt are validated (reject "..", NUL, leading "/",
nested paths unless they match the projection layout <name>.proj/<file>).

Disk-space pre-flight uses syscall.Statfs (already a project dep) with
a 1.1x safety margin; failures are best-effort and skipped if the
syscall is unavailable.

Tests (8): RoundTripBytes, RefusesV1Backup, RefusesUnsupportedLayoutVersion,
TableFilter, SchemaOnly, RejectsTraversalFilename, RejectsTarTraversal,
PartitionFilter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add pkg/cas/restore.go: a thin orchestrator that runs cas.Download and
hands off to a caller-supplied V1RestoreFunc (the CLI binding will wire
it to pkg/backup.Backuper.Restore in Task 19). The callback indirection
keeps pkg/cas free of any dependency on pkg/backup.

Patch the v1 entry points in pkg/backup to refuse CAS-shaped backups,
returning cas.ErrCASBackup so callers can errors.Is the sentinel:
- Restore: after BackupMetadata is unmarshalled (line ~138).
- Download: after remoteBackup is found (line ~129).
- RemoveBackupRemote: inside the BackupList loop (line ~344).

In Restore, also bypass the two downloadObjectDiskParts calls (lines
~2168, ~2212) when backupMetadata.CAS != nil. CAS backups never carry
object-disk parts (cas-upload preflight rejects object-disk tables),
but the existing v1 detector inspects live ClickHouse disk types
rather than backup metadata, so it has to be told explicitly.

V1 unit tests for the new guards require full Backuper plumbing
(ClickHouse client + storage backend + initDisksPathsAndBackupDestination);
deferred to integration coverage in Task 21. CAS-side wrapper tests
(restore_test.go) cover happy path, callback error propagation,
ignore-dependencies rejection, nil-callback error, and Download error
propagation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Delete removes the metadata subtree with metadata.json deleted first for
atomic catalog removal; stale inprogress-marker detection, prune-in-progress
guard, and best-effort stale-marker cleanup all covered by 6 tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add Verify() which loads backup metadata, streams per-table tar.zstd
archives to extract checksums.txt entries, accumulates expected (path,
size) pairs for every above-threshold blob, then HEADs each blob in
parallel to detect missing blobs and size mismatches. Writes failures
as human-readable lines or line-delimited JSON (opts.JSON=true) to the
provided io.Writer; returns ErrVerifyFailures when any failures exist.
Add ErrVerifyFailures sentinel to errors.go.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implement Status(ctx, b, cfg) -> *StatusReport and PrintStatus(r, w) in
pkg/cas/status.go. LIST-only (no GETs): walks metadata/, blob/, inprogress/
and stats prune.marker. Classifies in-progress markers as fresh or abandoned
relative to cfg.AbandonThreshold. Four tests covering empty bucket, post-upload
counts, prune marker detection, and fresh/abandoned classification.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add six urfave/cli v1 subcommands (cas-upload, cas-download, cas-restore,
cas-delete, cas-verify, cas-status) as thin shims over pkg/cas. Each command
opens a *storage.BackupDestination via the existing Backuper init path,
adapts it through pkg/cas/casstorage.NewStorageBackend, and translates
pkg/cas results / errors to CLI exit codes.

cas-restore wires a V1RestoreFunc closure that delegates back to
Backuper.Restore on the local directory cas-download just materialized;
--ignore-dependencies is accepted but rejected by cas.Restore because CAS
backups have no dependency chain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review feedback on Task 19: CASUpload/CASDownload/CASRestore acquire a
pidlock; CASDelete should too, matching v1 'delete' (pkg/backup/delete.go:99).
A concurrent cas-upload + cas-delete on the same name now mutex correctly.
Add test/integration/cas_test.go covering Task 21 of the CAS Phase 1 plan:

- TestCASRoundtrip: end-to-end create → cas-upload → cas-status → drop
  database → cas-restore → checksum-style row count + sum verification
  → cas-delete → cas-status (gone).
- TestCASCrossModeGuards: §6.2.2 isolation. Verifies that v1 download/
  delete-remote refuse a CAS backup, that cas-download/cas-delete refuse
  a v1 backup, and that same-mode operations succeed.
- TestCASVerify: cas-verify happy path; stretch case removes a single
  blob from MinIO and asserts the next cas-verify exits non-zero with a
  "missing" diagnostic. Stretch is best-effort: skipped (with a warning)
  if the bucket layout does not match the assumed minio path.

Implementation notes:
- casBootstrap writes a CAS-flavored config to /tmp/config-cas.yml inside
  the clickhouse-backup container (configs/ is read-only mounted). The
  config is the stock config-s3.yml plus a cas: stanza with a per-test
  cluster_id so concurrent envPool slots cannot trample each other.
- Each test has its own cluster_id (roundtrip / guards / verify).
- Tests do NOT actually run as part of this commit; verified only via
  go vet -tags=integration ./test/integration/... and go test -c
  -tags=integration. Real run is deferred to manual / CI execution per Task 22.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-remote integration

BackupList in pkg/storage now accepts skipPrefixes and ignores any top-level
entry matching them. cas.Config.SkipPrefixes() returns ["cas/"] (or the
configured root_prefix) when CAS is enabled, else nil. Every BackupList
caller in pkg/backup passes b.cfg.CAS.SkipPrefixes(); CleanRemoteBroken
and RemoveOldBackupsRemote inherit the exclusion via GetRemoteBackups /
the new arg, so v1 retention no longer mistakes cas/<cluster>/... for a
broken backup directory.

cas.ListRemoteCAS walks cas/<cluster>/metadata/<bk>/metadata.json,
parses each metadata.json (degrading to a "broken" description on read
or parse failure rather than dropping the entry), and returns
CASListEntry rows tagged "[CAS]" sorted newest-first.

CollectRemoteBackups in pkg/backup/list.go appends those entries to the
v1 list-remote output when CAS is enabled, so operators see CAS backups
alongside v1 backups in `clickhouse-backup list remote`. Errors from the
CAS side are logged and swallowed — informational listing must not
break on a CAS-side failure that the v1 path just survived.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r README

Three fixes from the cross-cutting final review:

B1. cas-restore was broken: cas.Download wrote a local metadata.json with
    bm.CAS != nil, which the v1 Restore handoff at pkg/backup/restore.go:141
    refuses by design. Strip bm.CAS in the local copy so the on-disk layout
    is indistinguishable from a v1 directory-format backup; the cross-mode
    guard remains effective for direct v1 invocation. The object-disk skip
    for CAS still works because the pre-flight refuses such tables and
    iterating zero parts is a no-op.

UPLOAD STATS. UploadResult now exposes the breakdown operators care about:
total backup content (TotalFiles/TotalBytes), how it was placed
(InlineFiles/InlineBytes vs UniqueBlobs/BlobBytesTotal), and what crossed
the wire on this run (BlobsUploaded/BytesUploaded vs BlobsReused/BytesReused
+ ArchiveBytes). cas-upload prints a multi-line summary highlighting the
content-addressed dedup savings.

README. Rewrote the CAS section for end users: leads with the value-prop
(smart deduplicating backups, smaller uploads than incremental, no chain
dependency, mutation-friendly), no design-doc reference. Added a quick-
start config + command sequence. Updated upload --help to recommend
cas-upload without pointing at internal docs.

Includes the recovered Task 18 commit (BackupList prefix exclusion +
cross-mode list integration) which had been dangling — cherry-picked back
onto cas-phase1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o uploaded TableMetadata

Without these fields the v1 restore handoff cannot recreate tables on a
fresh host (CREATE TABLE statement is empty). Read the per-table JSON
that 'clickhouse-backup create' already wrote to disk and merge into the
uploaded TableMetadata. Add a download-side regression test that the
schema fields survive the round-trip.

This is the load-bearing correctness fix from the cas-phase1 follow-up plan.
filimonov and others added 30 commits May 8, 2026 18:22
Convert httpCASDeleteHandler from synchronous execution to the same
async pattern used by all other CAS handlers (upload, download, restore,
verify, prune). The sync path caused 502/504 from reverse-proxy timeouts
(nginx default 60s) while the delete continued server-side.

Changes:
- Wrap b.CASDelete() in go func with status.Current.StartWithOperationId
  and errorCallback/successCallback, matching the upload handler pattern.
- Return newAsyncAck("cas-delete", name, operationId) immediately (HTTP 200).
- Parse optional callback query parameter (parity with other handlers).
- Delete casDeleteHTTPStatus helper (sync-mode workaround, no longer used)
  and its dedicated TestCasDeleteHTTPStatus unit test.
- Replace TestCASDeleteHandler_LockedWhenBusy with an additional
  TestCASDeleteHandler_AsyncAck that verifies the 200 acknowledged shape.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng (F10)

Surface dry-run candidate blobs to API callers and CLI output so operators
can see exactly what would be deleted without running a live prune.

Changes:
- Add DryRunCandidates []OrphanCandidate to PruneReport (json tagged,
  omitempty; only populated when DryRun=true).
- Add JSON struct tags to all PruneReport fields and OrphanCandidate fields
  so the report serialises cleanly through the REST API response path.
- Accumulate cands into rep.DryRunCandidates in the dry-run branch of
  Prune() (step 11), after the existing structured log.Info() lines.
- Update PrintPruneReport to print a "Would delete:" section listing each
  candidate with its key, human-readable size, and UTC modification time
  when DryRunCandidates is non-empty.
- Fix TestUpload_RefusesIfInprogressMarkerPresent assertion to match the
  updated error message format from upload.go (uses Tool field from marker
  rather than hardcoded "cas-upload"); check for "is in progress for"
  which is stable across both the old and new message formats.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… test)

Wave-D1 scoped review surfaced two minor issues:

- PruneReport.DryRunCandidates was assigned the live cands slice
  returned by SweepOrphans; if the caller mutates cands after the
  function returns, the report silently reflects mutations. Use
  append-copy to break the alias.

- TestUpload_RefusesIfInprogressMarkerPresent asserted only on the
  generic 'is in progress for' substring, which would pass for any
  tool name. Tighten to assert tool=cas-upload, host=host-other,
  and in-progress all three appear in the diagnostic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two integration regressions surfaced by the wave-5 sweep:

1. TestCASAPIRoundtrip asserted cas-delete returns a sync 'success'
   response. F13 (commit ed4f24c) made the handler async, so the
   response is now an async-ack with operation_id. Updated the test
   to use the existing casAPIPostAndCaptureOpID + casAPIWaitForOperation
   pattern, matching every other CAS verb.

2. TestCASUploadRefusesConcurrent injected an inprogress marker with
   tool='test' and asserted the error mentions 'another cas-upload
   is in progress'. N2 (commit f6c3a70) made the diagnostic use the
   marker's Tool field dynamically. Changed the injected marker's
   tool to 'cas-upload' so the test exercises the realistic
   upload-vs-upload conflict.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…5, F28)

Wave-5 summary previously claimed several items were already in §9
when in fact they weren't. Adding now:

§9.4: pkg/pidlock TOCTOU (whole-tool, not CAS-specific) and
       probe-key sweep on prune (F28).
§9.4.x: FTP.AllowUnsafeMarkers struct-field exposure (refactor
       preference, F21).
§9.5: TestPrune_FailClosedOnNilCASMetadata (F23),
       TestBackupList_SkipsV1BackupNamedSameasCASPrefix (F24),
       TestListRemoteCAS_WalkError (F25).

Removed three §9.4 entries that have actually shipped:
  - S3 IfNoneMatch startup probe (Phase 8 B1)
  - RemoteStorage interface compatibility note (in §6.12 + changelog)
  - Real-production error-classification tests (Phase 7 C2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ity#13, Altinity#18)

Altinity#5: wrap marker body reads in io.LimitReader(64KiB) so a corrupt or
oversized object cannot exhaust heap; JSON unmarshal will surface the
truncation as a parse error.

Altinity#11: SFTP PutFileAbsoluteIfAbsent now propagates Close errors on the
success path; if Close fails the partial file is removed so the slot
stays available for the next caller.

Altinity#13: cas-delete success message now reads "metadata removed" and adds a
hint that blob storage is reclaimed by the next cas-prune run.

Altinity#18: replace insertion-sort in sortExpectedBlobs with sort.Slice; O(n
log n) rather than O(n²) for large blob lists.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d guard (Altinity#17, Altinity#19, Altinity#20)

Altinity#17: isCASBackupRemote returns false immediately when cfg.Enabled is
false, avoiding a storage probe when CAS is not configured.

Altinity#19: register a deferred cleanup for the stale upload marker in the
ipOK&&mdOK branch, using a detached 30 s context so the cleanup runs
even when the parent ctx is already cancelled.  The now-redundant
explicit step-6 removal is dropped.

Altinity#20: transient read failure on the inprogress marker now returns an
error immediately ("refusing") rather than falling through to the
stale-marker path, which could otherwise delete live content.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The function had no callers — only collectRefsFromArchive is used.
Remove it entirely to eliminate dead code and the confusion it created
when readers tried to understand the two near-identical functions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rker refusal (W6-A gaps)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All three marker-cleanup defers (prune.go, delete.go, upload.go) used the
operation's ctx. When /backup/kill calls status.Cancel(), the ctx is
cancelled before the function returns, causing the deferred DeleteFile to
receive an already-cancelled context and fail — stranding the marker.

Fix: create a fresh context.WithTimeout(context.Background(), 30s) inside
each defer so cleanup always completes regardless of the operation ctx state.

- pkg/cas/prune.go: prune.marker cleanup defer
- pkg/cas/delete.go: cas-delete inprogress marker cleanup defer (!ipOK path)
  (the ipOK&&mdOK stale-upload-marker defer was already fixed in Wave 6-A)

Tests:
- TestPrune_CancelledContextStillReleasesMarker: pre-cancel ctx, verify
  prune.marker is absent after Prune returns with error

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…xplicit cleanup sites (Altinity#3)

Before: 10 hand-written `_ = DeleteInProgressMarker(ctx, b, cp, name)` calls
guarded every error path in Upload. A panic in any step stranded the marker
permanently. The cleanup also used the operation ctx, so a /backup/kill
cancellation before the function returned could cause the cleanup to fail.

After: a single deferred cleanup registered immediately after the marker is
written handles all error paths (including panics). It uses a fresh
context.WithTimeout(context.Background(), 30s) to ensure cleanup succeeds
even when the operation ctx is cancelled.

The success path (step 13) uses a `committed bool` sentinel to skip the
defer's redundant delete: committed=true is set after committing metadata.json
but before the explicit best-effort delete, so a panic during the delete
does not trigger a second attempt.

Removed explicit cleanup sites (10 error-path calls):
1. step 6 planUpload error
2. step 7 ColdList error
3. step 8 uploadMissingBlobs error
4. step 9 uploadPartArchives error
5. step 10 uploadTableJSONs error
6. step 11a prune-marker stat error
7. step 11a prune-marker exists
8. step 11b own-marker stat error
9. step 11c revalidateColdList error
10. step 12 put metadata.json error

Tests:
- TestUpload_ErrorPathCleansInprogressMarker: inject archive PUT failure
  (step 9), verify marker absent after Upload returns error
- TestUpload_CancelledContextStillReleasesMarker: pre-cancel ctx, inject
  Walk error, verify marker absent despite cancelled operation ctx

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
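The committed-sentinel shape described above, reduced to a minimal sketch. Function and variable names are illustrative; the real Upload has 13 steps and talks to remote storage, simulated here with a package-level bool.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var markerPresent bool

func writeMarker()                     { markerPresent = true }
func deleteMarker(ctx context.Context) { markerPresent = false; _ = ctx }

// upload registers a single deferred cleanup right after the marker is
// written; it covers every error path (including panics). The committed
// sentinel lets the success path skip the defer's redundant delete.
func upload(fail bool) (err error) {
	writeMarker()
	committed := false
	defer func() {
		if committed {
			return // success path already deleted the marker explicitly
		}
		cleanCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()
		deleteMarker(cleanCtx)
	}()

	if fail {
		// Any of the former 10 hand-written cleanup sites becomes a
		// plain return; the defer handles the marker.
		return errors.New("step 9: archive PUT failed")
	}

	// ... commit metadata.json ...
	committed = true // set before the explicit delete, so a panic during
	// the delete does not trigger a second attempt from the defer
	cleanCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	deleteMarker(cleanCtx)
	return nil
}

func main() {
	_ = upload(true)
	fmt.Println("marker after failed upload:", markerPresent)
	_ = upload(false)
	fmt.Println("marker after successful upload:", markerPresent)
}
```

Both paths leave the marker absent, which is the invariant the tests lock.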
- upload.go step-13 success-path explicit DeleteInProgressMarker now uses
  detached cleanCtx, not the operation ctx (consistency with the deferred
  cleanup; survives caller cancellation immediately after commit)

- delete_test.go: replace the broken TestDelete_CancelledContextStillReleasesMarker
  (the pre-cancelled ctx returned from waitForPrune before any defer was
  registered — wrong scenario) with a doc comment explaining why the
  delete-side cancellation invariant is verified by parity with the upload
  and prune tests, all three using the identical defer-with-cleanCtx pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
)

Replace the io.ReadAll + bytes.NewReader pattern in buildVerifySet with
direct streaming: archRC (the io.ReadCloser returned by GetFile) is passed
directly to extractBlobsFromArchive, which already accepts io.Reader and
internally wraps it with zstd.NewReader. This mirrors the identical shape
used by prune.go::collectRefsFromArchive.

Memory impact: per-table archives are no longer buffered in heap; only the
small checksums.txt entries inside each archive are read into memory (via
the existing io.ReadAll in extractBlobsFromArchive). Large archives with
many parts no longer spike heap proportionally to archive size.

Existing TestCASVerify tests cover the streaming refactor end-to-end.
No dedicated large-archive test added (synthesizing 64 MiB test data would
add non-trivial test infrastructure; streaming refactor correctness is
verified by the existing round-trip tests).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add pkg/storage/general_test.go with TestBackupList_SkipPrefixesFiltering.
The test builds a minimal fakeRemoteStorage stub (implements RemoteStorage
by supplying Walk + no-op stubs for all other methods) and wraps it in a
BackupDestination to exercise BackupList directly without any real backend.

Three cases:
  1. skipPrefixes=["cas/"] — "cas/" filtered, "casematch" NOT filtered
     (validates HasPrefix-without-trailing-slash semantics).
  2. skipPrefixes=nil — all four entries pass through.
  3. skipPrefixes=[""] — empty string prefix skips nothing (defensive).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
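The prefix-filtering semantics the test locks can be sketched as follows; `skipByPrefix` is an illustrative stand-in for the filter inside BackupList.

```go
package main

import (
	"fmt"
	"strings"
)

// skipByPrefix mirrors the semantics the three test cases lock:
// "cas/" skips keys under cas/ but not "casematch" (the trailing slash
// makes the match directory-like), and an empty prefix skips nothing.
func skipByPrefix(name string, skipPrefixes []string) bool {
	for _, p := range skipPrefixes {
		if p != "" && strings.HasPrefix(name, p) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(skipByPrefix("cas/blobs/ab", []string{"cas/"})) // true
	fmt.Println(skipByPrefix("casematch", []string{"cas/"}))    // false
	fmt.Println(skipByPrefix("casematch", nil))                 // false
	fmt.Println(skipByPrefix("anything", []string{""}))         // false
}
```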
…to ERROR (Altinity#9)

Sub-change A: pkg/storage/general.go — promote the BackupList skip-prefix
log entry from log.Warn to log.Error. Operators reviewing logs will see this
loudly (e.g. if a v1 backup is accidentally named "cas") instead of
dismissing it as informational noise.

Sub-change B: add NameCollidesWithCASPrefix to pkg/cas/validate.go and call
it from both the CAS Upload path (pkg/cas/upload.go) and the v1 Upload path
(pkg/backup/upload.go::validateUploadParams). Rejects backup names that
exactly match a CAS skip-prefix segment (e.g. "cas" when root_prefix="cas/")
at upload time, before any I/O, with a descriptive error message.

Without this guard an operator could successfully create a v1 backup named
"cas", which would then be silently invisible to BackupList (and therefore
to v1 retention) once CAS is enabled — a hard-to-diagnose data loss scenario.

Tests:
  - TestUpload_RejectsNameCollidingWithCASPrefix in pkg/cas/upload_test.go
  - v1 path covered by integration (unit setup requires real BackupDestination)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
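A sketch of the exact-match guard. The real function lives in pkg/cas/validate.go; this signature and the TrimSuffix shape are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// nameCollidesWithCASPrefix rejects a backup name only when it equals a
// skip-prefix segment ("cas" for root_prefix "cas/"), not when it merely
// shares a prefix ("casematch" stays allowed).
func nameCollidesWithCASPrefix(name, rootPrefix string) bool {
	return name == strings.TrimSuffix(rootPrefix, "/")
}

func main() {
	fmt.Println(nameCollidesWithCASPrefix("cas", "cas/"))       // true  -> reject at upload time
	fmt.Println(nameCollidesWithCASPrefix("casematch", "cas/")) // false -> allow
}
```

Exact match rather than HasPrefix is the boundary the 'casematch' test case exists to protect.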
…covery (Altinity#10)

Implements the operator escape hatch for a cas-upload in-progress marker left
behind by SIGKILL, OOM, or network partition. Without this, operators had to
delete the marker object manually via the storage console.

Changes:
  - pkg/cas/errors.go: add ErrNoInProgressMarker sentinel.
  - pkg/cas/unlock.go: new UnlockInProgress function — stats marker, logs
    Tool/Host/StartedAt, deletes, returns ErrNoInProgressMarker when absent.
  - pkg/backup/cas_methods.go: CASUpload gains unlock bool parameter; when
    true, calls UnlockInProgress and exits without uploading. Refuses
    --unlock + --dry-run and --unlock + --skip-object-disks.
  - cmd/clickhouse-backup/cas_commands.go: add --unlock BoolFlag to cas-upload.
  - pkg/server/actions_cas.go, cas_handlers.go: pass unlock=false at API
    call sites (unlock is a CLI-only escape hatch, not an API operation).

Tests:
  - TestUpload_UnlockRemovesInprogressMarker: pre-place marker → unlock →
    assert marker absent and no upload artifacts written.
  - TestUpload_UnlockRefusesWhenNoMarker: no marker → unlock → assert
    ErrNoInProgressMarker.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ltinity#15)

Mirrors the pattern in Upload and Prune: each of the four CAS operations now
validates the config at function entry and wraps the error with a verb-specific
prefix ("cas: delete: invalid config: …" etc.).

Without this guard, callers that construct a Config and call these functions
without invoking Validate() first would silently operate with zero-duration
fields (GraceBlobDuration/AbandonThresholdDuration return 0 until Validate()
runs). Status and Verify don't use those durations today, but having the guard
everywhere makes the contract uniform and future-proof.

Test: TestConfig_DurationsZeroWithoutValidate in pkg/cas/config_test.go locks
the documented contract that duration accessors return 0 before Validate().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t per request (Altinity#8)

Extract CASProbeState (probeOnce/probeErr/bannerOnce) from Backuper into a
shared struct. APIServer holds a singleton CASProbeState created at startup;
every per-request NewBackuper call receives it via the WithCASProbeState opt,
so the conditional-put probe and unsafe-marker WARN banner are guaranteed to
fire at most once per server lifetime instead of once per REST call. CLI paths
create a fresh CASProbeState per NewBackuper (one-shot process, unchanged
semantics). Three new unit tests lock the shared-state contract.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
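The at-most-once semantics rest on sync.Once. A minimal sketch of the shared-state shape (field names follow the commit text; probe internals are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// CASProbeState holds the once-per-owner state: the server creates one at
// startup and hands it to every per-request Backuper; CLI paths create a
// fresh one per process.
type CASProbeState struct {
	probeOnce  sync.Once
	probeErr   error
	bannerOnce sync.Once
}

// Probe runs the conditional-put probe at most once and returns the
// cached result on every later call.
func (s *CASProbeState) Probe(probe func() error) error {
	s.probeOnce.Do(func() { s.probeErr = probe() })
	return s.probeErr
}

func main() {
	shared := &CASProbeState{}
	calls := 0
	for i := 0; i < 3; i++ { // three REST requests sharing one state
		_ = shared.Probe(func() error { calls++; return nil })
	}
	fmt.Println("probe calls across 3 requests:", calls)
}
```

Because sync.Once must not be copied, the struct is passed by pointer, which also matches the singleton-on-APIServer design.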
…flag-combo guards

- Name-collision test: add t.Run case for 'casematch' (a name that
  prefix-matches 'cas' but is not equal to it) — locks the exact-match
  boundary so a future regression to HasPrefix-style matching is caught.

- Unlock flag-combo test: cover the --unlock+--dry-run and
  --unlock+--skip-object-disks refusals in cas-upload as a table-driven
  test on Backuper.CASUpload directly. The simple early-return guards
  in cas_methods.go are now regression-protected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
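The early-return guards and their table-driven coverage, sketched with an illustrative free function in place of Backuper.CASUpload:

```go
package main

import (
	"errors"
	"fmt"
)

// casUpload sketches the two flag-combo refusals; the real method takes
// many more parameters.
func casUpload(unlock, dryRun, skipObjectDisks bool) error {
	if unlock && dryRun {
		return errors.New("cas-upload: --unlock cannot be combined with --dry-run")
	}
	if unlock && skipObjectDisks {
		return errors.New("cas-upload: --unlock cannot be combined with --skip-object-disks")
	}
	return nil
}

func main() {
	// Table-driven coverage of the refused and allowed combinations.
	cases := []struct {
		unlock, dryRun, skipOD, wantErr bool
	}{
		{true, true, false, true},
		{true, false, true, true},
		{true, false, false, false},
	}
	for _, c := range cases {
		err := casUpload(c.unlock, c.dryRun, c.skipOD)
		fmt.Printf("unlock=%v dryRun=%v skipOD=%v refused=%v\n",
			c.unlock, c.dryRun, c.skipOD, err != nil)
	}
}
```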
…e changes (Altinity#1, Altinity#7)

Add BREAKING CHANGES subsection at the top of vNEXT covering:
- pre-CAS binary downgrade hazard (silent deletion of cas/ namespace)
- RemoteStorage interface: two new required methods (PutFileAbsoluteIfAbsent, PutFileIfAbsent)
- BackupDestination.BackupList fourth skipPrefixes parameter
- v1 backup literally named "cas" silently filtered after upgrade

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…us via /backup/actions

§9.2: semaphore ctx-cancellation gap + --parallelism flag for PruneOptions
§9.4.x: atomicSwapDir misleading name + markerTool single-write contract
§9.5: casstorage.Walk absolute-key reconstruction contract test + root_prefix mid-deployment auto-detect
runbook: note that cas-status via POST /backup/actions returns only an ack;
  the structured report is available only via GET /backup/cas-status

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three snapshot scenarios in cli.py compare full output to recorded
fixtures:

1. default_config — clickhouse-backup print-config now emits a 'cas:'
   block with 10 fields (enabled, cluster_id, root_prefix,
   inline_threshold, grace_blob, abandon_threshold, wait_for_prune,
   allow_unsafe_markers, skip_conditional_put_probe,
   allow_unsafe_object_disk_skip).

2. cli_usage — clickhouse-backup with no args now lists 7 cas-*
   commands (cas-upload, cas-download, cas-restore, cas-delete,
   cas-verify, cas-status, cas-prune) between 'server' and 'help'.

3. help_flag — clickhouse-backup --help shows the same expanded
   command list.

This was caught by the Testflows CI matrix at PR Altinity#1367. The
underlying Go integration suite already exercises CAS extensively;
testflows is the BDD-style end-to-end harness that was missing the
fixture refresh.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CI matrix exposed three v1 integration tests breaking on the CAS
branch:

- TestServerAPI listFormat regex did not include 'kind':"v1". The
  CAS work added a 'kind' field to /backup/list JSON entries so v1
  and CAS rows can be distinguished in a merged response. Update the
  expected regex in serverAPI_test.go.

- env.Cleanup() removed disk_s3 between tests but never touched the
  CAS namespace under <bucket>/backup/cluster/<shard>/cas/<cluster_id>/.
  v1 retention/clean-broken explicitly skips that prefix (by design,
  see SkipPrefixes in pkg/cas/config.go), so it persists across env-
  pool reuse and surfaces as a non-empty bucket in checkObjectStorageIsEmpty
  for the next non-CAS test on the same slot. Sweep every per-backend
  CAS path in Cleanup().

- TestCASUploadSkipObjectDisks t.Skip()'d after creating cas_skipod_db
  + local backup cas_skipod_bk, leaving them in place. The next v1
  test that did 'clickhouse-backup create' iterated ALL tables and
  failed CopyObject on the dangling object-disk stub. Move cleanup
  into t.Cleanup() so it runs on the skip path too.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Skip CAS tests on ClickHouse < 21.0 (lack of min_rows_for_wide_part /
  repeat() / system.disks columns).
- Skip TestCASRoundtripWithProjection on CH < 23.0 (system.projections
  added in 23.x).
- Drop "kind":"v1" from /backup/list output and rely on omitempty so legacy
  CH integration tables (CH < 21.1, no input_format_skip_unknown_fields)
  keep parsing the JSON. CAS rows still carry "kind":"cas".
- Update TestServerAPI listFormat regex to match the no-kind v1 layout.
- Cleanup(): also rmdir empty backup/ parent dirs after wiping cas/
  subtree on MinIO and fake-gcs-server, so checkObjectStorageIsEmpty
  doesn't trip on stale empty parents.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous fix only handled some failure modes from the CI run. Honest
audit of the remaining failures across all 12 jobs surfaced three more
bugs:

1. Data race in TestCASUploadSkipObjectDisks: t.Cleanup runs AFTER the
   test function returns (and after `defer env.Cleanup` has already
   closed env.ch and returned the env to the pool). The next test
   acquiring that slot races on the shared env. Fixed by switching from
   t.Cleanup to a plain defer (LIFO order: runs before env.Cleanup).

2. Mid-flight CAS test failures (e.g. system.projections on CH < 23,
   t.Skip paths) leak databases (cas_proj_db, cas_skipod_db, ...) and
   local backups (cas_prune_smoke_bk, ...) to the next test on the
   same env-pool slot, breaking unrelated tests:
   - TestTablePatterns / TestCheckSystemPartsColumns SHOW CREATE every
     database, including the leaked cas_*_db ones (and choke on
     cas_skipod_db.remote object-disk table).
   - TestServerAPI counts local backups via /metrics, gets the wrong
     number when CAS local backups are still around.

   Fixed by adding a CAS-specific backstop to env.Cleanup: for any
   TestCAS* test, wipe /var/lib/clickhouse/backup/* and DROP all
   leaked cas_* databases at end-of-test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previous fix used a 23.0 version cutoff, but system.projections was
actually added in ClickHouse 24.4. Replace the version check with a
runtime probe of system.tables so the test correctly skips on every
ClickHouse version that lacks the table, regardless of point-release
backports.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
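The probe-over-version-check idea can be sketched with the query runner abstracted as a function. The real test wires an actual ClickHouse connection here; the SQL shape is an assumption.

```go
package main

import "fmt"

// hasSystemProjections asks system.tables whether system.projections
// exists, instead of comparing version strings. This stays correct even
// when a feature is backported to a point release the cutoff would miss.
func hasSystemProjections(queryCount func(sql string) (int, error)) (bool, error) {
	n, err := queryCount(
		"SELECT count() FROM system.tables WHERE database = 'system' AND name = 'projections'")
	if err != nil {
		return false, err
	}
	return n > 0, nil
}

func main() {
	// Fake runner simulating a modern server (table present).
	modern := func(string) (int, error) { return 1, nil }
	ok, _ := hasSystemProjections(modern)
	fmt.Println("run projection test:", ok)
}
```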
Comment and skip message only — the runtime probe via system.tables is
unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous selective cleanup ('rm cas/' + 'find -empty -delete' +
'rmdir backup') wasn't fully clearing state on MinIO — TestS3 still
saw a leftover backup/ directory after TestCASMutationDedup ran on
the same env-pool slot.

CAS tests don't write v1-shaped state to the same bucket (only
cas/<cluster>/...), so for TestCAS* tests just rm -rf the entire
backup/ tree on each backend. Non-CAS tests keep the targeted
cas-only cleanup since their own test logic still uses backup/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The single failed check was 'Testflows (1.26, 22.3)' on
'/clickhouse backup/other engines/materializedpostgresql/I create
MaterializedPostgreSQL': SELECT against a MaterializedPostgreSQL
table hung past the 600s ExpectTimeoutError. This is upstream issue
ClickHouse#32902 territory and pre-existing — no testflows source
changes between master (last green at 92db680) and HEAD on this
branch's testflows tree. Empty commit to trigger a re-run; not a
fix because there's nothing branch-side to fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TestShadowCleanup{,OnFailure} run 'clickhouse-backup clean' which wipes
/var/lib/clickhouse/shadow/. ClickHouse then re-creates an empty
increment.txt as a side effect, and the NEXT test on the same env-pool
slot that runs FREEZE (TestSkipDisk, TestCustomKopia, ...) fails with:

  code: 32, message: File /var/lib/clickhouse/shadow/increment.txt is
  empty. You must fill it manually with appropriate value.

The flake hit different downstream tests on different CI runs (TestSkipDisk
on 22.8, TestCustomKopia on 21.3) — same root cause, intermittent because
it depends on whether ClickHouse has time to recreate the file before the
next env acquire. Pre-existing on master, not CAS-introduced.

In env.Cleanup, scrub the 0-byte increment.txt after TestShadowCleanup{,OnFailure}
so ClickHouse re-creates it cleanly on the next FREEZE.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>