[test] Cherry-pick fixed for DCV, AD, Export-logs and EFA from develop#7395

Merged
himani2411 merged 15 commits into aws:release-3.15 from himani2411:release-3.15
May 14, 2026

himani2411 commented May 14, 2026

Description of changes

  • Cherry-pick integration test fixes from the develop branch

Test AD + EFA (move to hpc6a)

Tag propagation:

Move to me-south-1 test

Export Logs

DCV

Tests

  • ONGOING
{%- import 'common.jinja2' as common with context -%}
{{- common.OSS_COMMERCIAL_X86.append("rocky8") or "" -}}
{{- common.OSS_COMMERCIAL_X86.append("rocky9") or "" -}}
---
test-suites:
  ad_integration:
    test_ad_integration.py::test_ad_integration:
      dimensions:
        - regions: [ "ap-southeast-1" ]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [ {{ LUSTRE_OS_X86_0 }}, {{ LUSTRE_OS_X86_2 }}, {{ LUSTRE_OS_X86_4 }}, {{ LUSTRE_OS_X86_6 }}]
          schedulers: ["slurm"]
  cli_commands:
    test_cli_commands.py::test_slurm_cli_commands:
      dimensions:
        - regions: ["ap-northeast-2"]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ OS_X86_7 }}]
          schedulers: ["slurm"]
  dcv:
    test_dcv.py::test_dcv_configuration:
      dimensions:
        # DCV on GPU enabled instance
        - regions: [{{ g4dn_2xlarge_CAPACITY_RESERVATION_2_INSTANCES_1_HOURS_NOPG_DCV_OS_X86_1 }}]
          instances: ["g4dn.2xlarge"]
          oss: [{{ DCV_OS_X86_1 }}]
          schedulers: ["slurm"]
        # DCV on ARM + GPU
        - regions: [{{ g5g_2xlarge_CAPACITY_RESERVATION_2_INSTANCES_1_HOURS_NOPG_DCV_OS_X86_3 }}]
          instances: ["g5g.2xlarge"]
          oss: [{{ DCV_OS_X86_3 }}]
          schedulers: ["slurm"]
        # DCV in cn regions and non GPU enabled instance
        - regions: ["cn-northwest-1"]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ DCV_OS_X86_2 }}]
          schedulers: ["slurm"]
        # DCV in gov-cloud regions and non GPU enabled instance
        - regions: ["us-gov-west-1"]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ DCV_OS_X86_4 }}]
          schedulers: ["slurm"]
    test_dcv.py::test_dcv_with_remote_access:
      dimensions:
        - regions: ["ap-southeast-2"]
          instances: {{ common.INSTANCES_DEFAULT_X86 }}
          oss: [{{ DCV_OS_X86_1 }}]
          schedulers: ["slurm"]
  tags:
    test_tag_propagation.py::test_tag_propagation:
      dimensions:
      - instances:
        - c5.xlarge
        oss:
        - alinux2023
        regions:
        - us-west-1
        schedulers:
        - awsbatch
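
For reviewers less familiar with the `dimensions` blocks above, here is a minimal Python sketch of how one entry could expand into concrete test cases. It assumes the test framework takes the cartesian product of `regions`, `instances`, `oss` and `schedulers` within each entry; `expand_dimensions` is a hypothetical helper written for illustration, not a function from the repository.

```python
from itertools import product


def expand_dimensions(dimensions):
    """Return one (region, instance, os, scheduler) tuple per combination.

    Assumption: each entry in `dimensions` is expanded independently as a
    cartesian product of its four lists, and the results are concatenated.
    """
    cases = []
    for entry in dimensions:
        cases.extend(
            product(
                entry["regions"],
                entry["instances"],
                entry["oss"],
                entry["schedulers"],
            )
        )
    return cases


# Values copied from the test_tag_propagation entry in the config above.
tag_propagation = [
    {
        "regions": ["us-west-1"],
        "instances": ["c5.xlarge"],
        "oss": ["alinux2023"],
        "schedulers": ["awsbatch"],
    }
]

for case in expand_dimensions(tag_propagation):
    print(case)
```

Under that assumption, an entry with several `oss` values (such as the `ad_integration` one) yields one test case per OS, which is why moving a test to a different region or instance type only requires editing a single list.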

References

  • Link to impacted open issues.
  • Link to related PRs in other packages (i.e. cookbook, node).
  • Link to documentation useful to understand the changes.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Himani Anil Deshpande and others added 13 commits May 13, 2026 14:53
…large on eu-north-1

to reduce the risk of insufficient capacity exceptions.
* We look for RUNNING and PENDING tasks only as there are ~19K historical export tasks with varying statuses.
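
A rough sketch of the filtering idea described in the commit message above: query only the `RUNNING` and `PENDING` export tasks instead of paging through the full history. It assumes a boto3 CloudWatch Logs client is passed in (`describe_export_tasks` accepts a `statusCode` filter and returns `exportTasks` plus an optional `nextToken`); the helper name `list_active_export_tasks` is hypothetical and not taken from the test code.

```python
ACTIVE_STATUSES = ("RUNNING", "PENDING")


def list_active_export_tasks(logs_client):
    """Collect export tasks that are still running or pending.

    `logs_client` is assumed to be a boto3 "logs" client, or any object
    exposing a compatible describe_export_tasks method.
    """
    tasks = []
    for status in ACTIVE_STATUSES:
        kwargs = {"statusCode": status}
        while True:
            response = logs_client.describe_export_tasks(**kwargs)
            tasks.extend(response.get("exportTasks", []))
            token = response.get("nextToken")
            if not token:
                break
            # Follow pagination for this status only.
            kwargs["nextToken"] = token
    return tasks
```

Filtering on the server side keeps the number of API calls proportional to the active tasks rather than to the roughly 19K historical ones.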
…heck

Add a sleep before the ls command inside the switch-user session to allow
the PAM hook time to finish generating the user's SSH key. Without it, ls
may run before key generation completes, causing sporadic "Permission
denied" failures.
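
The commit above adds a fixed sleep. For comparison only, a bounded polling loop is another way to wait for the key to appear; the sketch below is not what the commit does, and `wait_until` and its parameters are hypothetical names introduced for illustration.

```python
import time


def wait_until(check, timeout=30.0, interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Call `check` until it returns True or `timeout` seconds have elapsed.

    Returns True on success and False on timeout. `sleep` and `clock` are
    injectable so the helper can be exercised without real delays.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

In a test, `check` could wrap whatever command verifies that the user's SSH key exists. A fixed sleep is simpler and may be preferable when the check itself has side effects, such as triggering the PAM hook again.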
…figuration` and `text_dcv_remote_access`:

  * debuggability: retrieve, print, and analyze a comprehensive crash report (the stack trace of each crash, not only the crash filename). Also moved from hard to soft assertions so that the final report lists every observed failure.
  * stability: prevent false-positive failures by ignoring harmless GNOME crashes that are unrelated to NVIDIA or DCV. Also fixed failures that occurred when multiple instances of this test run in parallel, by serializing modifications to the SSH known_hosts file.
  * coverage: the test can now detect crashes on all supported OSs, not only Ubuntu.
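
To illustrate the move from hard to soft assertions mentioned above, here is a minimal, dependency-free sketch. The `SoftAssertions` class is hypothetical and written for this example; the actual tests may rely on a library that provides the same behavior.

```python
class SoftAssertions:
    """Collect failed checks and report them together at the end."""

    def __init__(self):
        self.failures = []

    def check(self, condition, message):
        # Record the failure instead of raising immediately.
        if not condition:
            self.failures.append(message)

    def assert_all(self):
        # Raise a single error listing every recorded failure.
        if self.failures:
            raise AssertionError(
                "{} check(s) failed:\n{}".format(len(self.failures), "\n".join(self.failures))
            )
```

A test would call `check` for each condition and `assert_all` once at the end, so a failure in one check does not hide the outcome of the others.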
…OME Tracker Suite (tracker-store),

as they are harmless crashes not related to either ParallelCluster or DCV.
Tolerate the known dcvsessionlauncher SEGV (g_subprocess_send_signal -> on_read_startup_string_ready) on g5g instances.

This crash has an intermittent impact only on the creation of the first dcv session.
When it is impactful, the check on dcv connectivity already fails.
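
The tolerance rules described in the commit messages above could be expressed roughly as follows. This is a sketch under stated assumptions: crash reports are plain strings that contain the process name and stack frames, and the matching rules and function names (`is_tolerated`, `unexpected_crashes`) are hypothetical, not the ones used in `test_dcv.py`.

```python
# Frames named in the commit message for the known dcvsessionlauncher SEGV.
KNOWN_G5G_FRAMES = ("g_subprocess_send_signal", "on_read_startup_string_ready")


def is_tolerated(report, instance_type):
    """Return True if a crash report matches a known harmless crash."""
    lowered = report.lower()

    # Known dcvsessionlauncher SEGV, tolerated on g5g instances only.
    if (
        instance_type.startswith("g5g.")
        and "dcvsessionlauncher" in lowered
        and all(frame in report for frame in KNOWN_G5G_FRAMES)
    ):
        return True

    # GNOME / tracker-store crashes are ignored unless NVIDIA or DCV is involved.
    is_desktop_noise = "gnome" in lowered or "tracker-store" in lowered
    involves_gpu_or_dcv = "nvidia" in lowered or "dcv" in lowered
    return is_desktop_noise and not involves_gpu_or_dcv


def unexpected_crashes(reports, instance_type):
    """Keep only the crash reports that should fail the test."""
    return [r for r in reports if not is_tolerated(r, instance_type)]
```

Scoping the SEGV rule to g5g keeps the same crash visible on other instance types, where it is not known to be harmless.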
himani2411 requested review from a team as code owners May 14, 2026 15:29
himani2411 added the 3.x and skip-changelog-update (disables the check that enforces changelog updates in PRs) labels May 14, 2026
himani2411 changed the title from "Release 3.15" to "[test] Cherry-pick fixed for DCV, AD, Export-logs and EFA from develop" May 14, 2026
gmarciani previously approved these changes May 14, 2026
himani2411 enabled auto-merge (rebase) May 14, 2026 18:41
himani2411 merged commit 0fae450 into aws:release-3.15 May 14, 2026
24 checks passed