Skip to content

[Test] Add test runner option --retain-on-failure to retain tests on failure#7376

Merged
gmarciani merged 1 commit into
aws:developfrom
gmarciani:wip/mgiacomo/3160/test-retain-on-failure-0504-1
May 18, 2026
Merged

[Test] Add test runner option --retain-on-failure to retain tests on failure#7376
gmarciani merged 1 commit into
aws:developfrom
gmarciani:wip/mgiacomo/3160/test-retain-on-failure-0504-1

Conversation

@gmarciani
Copy link
Copy Markdown
Contributor

@gmarciani gmarciani commented May 5, 2026

Description of changes

Add test runner option --retain-on-failure [N] to retain N tests on failure.
This test feature is useful to capture intermittent failures in a controlled way and retain the broken environments that are affected to facilitate investigations.

A recent example where it was necessary was the intermittent dcv session launcher failure, but it can also useful to capture the intermittent NFS port binding failure.

When the option is active, the test generates two files in the test workspace, that are used to both synchronize processes on the retention decision and also to facilitate the discovery of retained resources for the technician that is invetsigating the failure:

  • .retain_on_failure_tests: contains the list of failed tests for which resources have been retained.
  • .retained_stacks: contains the list of stack ARNs that have been retained.

NOTES

  1. This PR requires a corresponding change in Jenkins to make it work so it will not cleanup the resources that are supposed to be retained.

Tests

Executed 4 parallel instances of test_dcv_configuration with injected failures in all of them in the following scenarios:

  • w/o --retain-on-failure: failures are reported as usual, but no test stack is retained.
  • w/ --retain-on-failure 1: failures are reported as usual, cluster/vpc/iam stack retained for one single test failure.
  • w/ --retain-on-failure 2: failures are reported as usual, cluster/vpc/iam stack retained for two test failures.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@gmarciani gmarciani added skip-changelog-update Disables the check that enforces changelog updates in PRs 3.x Test labels May 5, 2026
@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/test-retain-on-failure-0504-1 branch 5 times, most recently from cdbbf46 to 8ea3a03 Compare May 7, 2026 01:07
@gmarciani gmarciani marked this pull request as ready for review May 7, 2026 13:23
@gmarciani gmarciani requested review from a team as code owners May 7, 2026 13:23
@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/test-retain-on-failure-0504-1 branch from 8ea3a03 to 45ab771 Compare May 7, 2026 15:32
logging.info("Registered stack %s for retention", stack_arn)


def _check_or_register_retain(request, test_id):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand you you divided resources and stacks to handle the xdist workers and session level scope.

Can you retain this naming convention in the functions names you are using? It will be easier for the reader to understand the scope.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on changes to improve readability, but I am not getting the specific change you are asking :-)

for bucket in created_buckets:
if request.config.getoption("no_delete"):
logging.info(f"Not deleting S3 bucket {bucket[0]}")
elif request.config.getoption("retain_on_failure") and is_region_retained(request, bucket[1]):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Non-blocking] There's quite a bit of code duplication in this conftest file — not introduced by this PR, but worth cleaning up in a follow-up.

return False


def _get_retain_counter_path(request):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gmarciani gmarciani enabled auto-merge (rebase) May 14, 2026 07:37
@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/test-retain-on-failure-0504-1 branch from 45ab771 to cce776f Compare May 18, 2026 15:12
@gmarciani gmarciani force-pushed the wip/mgiacomo/3160/test-retain-on-failure-0504-1 branch from cce776f to 1e63c7d Compare May 18, 2026 15:14
@gmarciani gmarciani merged commit 6ea6edf into aws:develop May 18, 2026
24 checks passed
@gmarciani gmarciani deleted the wip/mgiacomo/3160/test-retain-on-failure-0504-1 branch May 18, 2026 16:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3.x skip-changelog-update Disables the check that enforces changelog updates in PRs Test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants