Skip to content

libpcp: recover from corrupted archives in multi-archive contexts#2586

Open
kurik wants to merge 1 commit into
performancecopilot:mainfrom
kurik:corrupted-archive-recovery
Open

libpcp: recover from corrupted archives in multi-archive contexts#2586
kurik wants to merge 1 commit into
performancecopilot:mainfrom
kurik:corrupted-archive-recovery

Conversation

@kurik
Copy link
Copy Markdown
Contributor

@kurik kurik commented May 7, 2026

When reading multi-archive PCP logs (e.g. Cockpit metrics history), a corrupted volume no longer terminates the entire replay. Instead, corrupted archives are flagged and skipped so that data from healthy volumes can still be read.

  • Add 'corrupted' field to __pmMultiLogCtl to track damaged archives
  • Skip corrupt/unreadable archives during multi-archive context init with a warning rather than failing outright
  • Rebuild shared __pmLogCtl when lost due to failed archive opens
  • On PM_ERR_LOGREC in __pmLogRead, advance to the next (or previous) archive instead of returning an error in multi-archive mode
  • Skip known-corrupted archives in LogChangeToNextArchive() and LogChangeToPreviousArchive()
  • Update QA tests 722 and 1671 for new recovery behaviour

Resolves: RHEL-169855

Summary by CodeRabbit

  • Bug Fixes
    • Multi-archive operations now skip corrupted or unreadable archives, remove bad entries, and continue with remaining archives; shared log control is rebuilt when needed. Archive-corruption reports have been downgraded to warnings so processing can proceed; initialization fails only if no archives succeed.
  • Tests
    • QA improvements: stderr warnings are post-processed for readability during replay, and an archive-list test expectation was changed to expect failure.

Review Change Stack

@kurik kurik requested review from kmcdonell and natoscott May 7, 2026 08:54
@kurik
Copy link
Copy Markdown
Contributor Author

kurik commented May 7, 2026

@natoscott , @kmcdonell May I ask you for a review, please?

Comment thread src/libpcp/src/logutil.c Outdated
"__pmLogRead: corrupted archive \"%s\" vol %d, attempting to skip",
acp->ac_log->name, acp->ac_curvol);
if (acp->ac_cur_log >= 0 && acp->ac_cur_log < acp->ac_num_logs)
acp->ac_log_list[acp->ac_cur_log]->corrupted = 1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kurik I wonder if it would be simpler (and safer) to remove the corrupt archive from the list, instead of keeping it around flagging it as bad? If we keep it around there's always a chance it might be used somewhere accidentally when it should not be, whereas if we drop it completely from the list we cannot really go wrong can we?

@kmcdonell
Copy link
Copy Markdown
Member

[babble from slack]
@kurik More or less orthogonal to the fixes for libpcp, it may be worth (I'm not sure how, but particularly for RH Support) publicising that pmlogcheck -r or -R if you're brave, repairs the most common forms of archive corruption (the last record in one of the files is incomplete). Since I took my shell script hack and turned it into C code in pmlogcheck I've had 100% success in repairing corrupted archives and preserving as much data as is possible. This is something that happens infrequently, but regularly, in the QA Farm and allows pmlogger-daily to recover from corrupted archives (something that cannot be done in libpcp and something where we don't want to skip the corrupted archive if there is useful information up to the point of corruption). On reflection, I wonder if the pmlogcheck logic should be somehow incorporated into libpcp (some PCP_FOO_OPTION?) and this makes it not-so-orthogonal, so I'll add this comment to the PR so the thought bubble does not get lost.

If they are seeing corruption that is not of the "truncation" form I'd be very interested to know the details of the corruption.

@kurik kurik force-pushed the corrupted-archive-recovery branch from ebae07d to 989ad25 Compare May 19, 2026 06:57
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 7f23a9fc-ec42-4533-adf7-4eca74551ad6

📥 Commits

Reviewing files that changed from the base of the PR and between ad73707 and 9e214bd.

⛔ Files ignored due to path filters (2)
  • qa/1671.out is excluded by !**/*.out
  • qa/722.out is excluded by !**/*.out
📒 Files selected for processing (5)
  • qa/1671
  • qa/722
  • src/libpcp/src/context.c
  • src/libpcp/src/interp.c
  • src/libpcp/src/logutil.c
🚧 Files skipped from review as they are similar to previous changes (5)
  • qa/722
  • src/libpcp/src/logutil.c
  • src/libpcp/src/interp.c
  • src/libpcp/src/context.c
  • qa/1671

📝 Walkthrough

Walkthrough

Initialization now skips unreadable archives with warnings; reads remove corrupted archives and retry using alternate archives; fetch operations downgrade corruption errors to warnings when multiple archives exist. QA tests normalize warning output and update an expectation to reflect the new behavior.

Changes

Multi-archive corruption recovery

Layer / File(s) Summary
Multi-archive context initialization with resilience
src/libpcp/src/context.c
initarchive() logs a warning and skips individual archive open failures in multi-archive mode instead of aborting; after processing all names it fails only if zero archives succeeded, and it rebuilds the shared log control if it was lost.
Corrupted archive removal and retry mechanism
src/libpcp/src/logutil.c
__pmLogRead_ctx() detects corrupted archive reads in multi-archive contexts and removes the failed archive from the context list (freeing metadata and control structs), updates bookkeeping (ac_num_logs, ac_cur_log, offsets/volumes), and retries by switching to the same-position next archive (forward) or previous archive/volume (backward) using a new read_retry label.
Fetch error handling for multi-archive contexts
src/libpcp/src/interp.c
__pmLogFetchInterp() downgrades corrupted-archive diagnostics to warnings and only returns the error when a context has one or fewer logs, allowing execution to continue past corruption when multiple archives exist.
QA test infrastructure and expectations
qa/1671, qa/722
QA tests add _filter_warnings() to normalize timestamps and PIDs in stderr output and replace cat $tmp.err with filtered output after each replay; the archive-list scenario expectation is updated from pass to fail.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hop through logs both old and torn,
Skipping bits that time has worn.
Warnings masked, the test beds cheer,
Archives pruned, the path is clear—
Resilience nibbling on the thorn.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main objective of the changeset: enabling recovery from corrupted archives in multi-archive PCP contexts rather than failing completely.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/libpcp/src/context.c`:
- Around line 947-957: The current code in context.c that calls
__pmFindOrOpenArchive(ctxp, current, multi_arch) treats every negative sts as a
recoverable "skip and continue" when multi_arch is true; instead, only treat
truly corrupt/unreadable errors as recoverable and propagate
semantic/incompatibility errors. Change the multi_arch branch to inspect sts and
only call pmNotifyErr(... "skipping corrupt/unreadable archive ...") and
continue when sts matches the specific recoverable error codes (e.g., corrupt or
I/O related codes returned by __pmFindOrOpenArchive); for all other negative sts
(label/host/schema mismatches or other semantic incompatibilities) log an error
and return sts (or otherwise fail the open) so the caller sees the failure.
Update the block around __pmFindOrOpenArchive, sts, multi_arch, current, end,
pmNotifyErr and pmErrStr to implement this conditional handling and propagate
non-recoverable errors instead of silently skipping them.

In `@src/libpcp/src/logutil.c`:
- Around line 1994-2003: The retry path inside the PM_MODE_FORW branch fails to
refresh the archive file offset after swapping archives: when
__pmLogChangeArchive(ctxp, corrupt_idx) succeeds we update lcp and f but leave
acp->ac_offset pointing to the removed archive; modify the branch that sets lcp
= acp->ac_log and f = acp->ac_mfp to also reset acp->ac_offset =
__pmLogLabelSize(lcp) (matching the other archive-switch paths) before setting
acp->ac_vol = acp->ac_curvol and goto again so subsequent seeks/read recovery
use the correct offset for the new file.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: a19c6ec9-8e44-41f7-8a44-15f2e38fe877

📥 Commits

Reviewing files that changed from the base of the PR and between 1f4bcc3 and 989ad25.

⛔ Files ignored due to path filters (2)
  • qa/1671.out is excluded by !**/*.out
  • qa/722.out is excluded by !**/*.out
📒 Files selected for processing (5)
  • qa/1671
  • qa/722
  • src/libpcp/src/context.c
  • src/libpcp/src/interp.c
  • src/libpcp/src/logutil.c

Comment thread src/libpcp/src/context.c
Comment thread src/libpcp/src/logutil.c
When reading multi-archive PCP logs a corrupted volume no longer
terminates the entire replay. Instead, corrupted archives are removed
from ac_log_list, so that data from healthy volumes can still be read.

Tests qa/1671 and qa/722 are modified accordingly to reflect this change.
@kurik kurik force-pushed the corrupted-archive-recovery branch from ad73707 to 9e214bd Compare May 19, 2026 09:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants