CA-426408: Consider chunks in task expiry calculation by changlei-li · Pull Request #7074 · xapi-project/xen-api

changlei-li · 2026-05-15T06:22:52Z

A test of VM with 239 VBDs (1 system disk + 238 iSCSI VBDs) fails because of Timed out while waiting on task xxx. In the fail task:

VM.start (task 61)
  → perform_atomics([..., Parallel("Devices.plug"), ...])
    → perform_atomic(Parallel("Devices.plug"))
      ↓
      Parallel:task=61.atoms=2.(Devices.plug (no qemu))
      │
      │ queue_atomics_and_wait(ops=[atom0, atom1])
      │   expiration(atom0) = atomic_expires_after(Nested_parallel(239))
      │                     = Float.max(3600, ...) = 3600s  ← BUG
      │   event_wait(task=79, timeout=3600s)
      │
      ├─ task 79 (worker): Nested_parallel:task=79.atoms=239.(VBDs.activate_epoch_and_plug RW)
      │    │
      │    │ queue_atomics_and_wait(239 ops, chunks of 10)
      │    ├─ chunk 0:  10 × Serial(VBD_set_active, VBD_epoch_begin, VBD_plug)
      │    ├─ chunk 1:  10 × ...
      │    ├─ ...
      │    └─ chunk 23:  9 × ...
      │    └─ total: 82 minutes
      │
      └─ task 80 (worker): Serial(VIF_set_active, VIF_plug)

The task expiry calculation for parallel and nested_parallel doesn't consider the multi chunks. Just get the max expiry in the parallel task list. In fact the parallel task split to 24 chunks to run.

A test of VM with 239 VBDs (1 system disk + 238 iSCSI VBDs) fails because of `Timed out while waiting on task xxx`. In the fail task: ``` VM.start (task 61) → perform_atomics([..., Parallel("Devices.plug"), ...]) → perform_atomic(Parallel("Devices.plug")) ↓ Parallel:task=61.atoms=2.(Devices.plug (no qemu)) │ │ queue_atomics_and_wait(ops=[atom0, atom1]) │ expiration(atom0) = atomic_expires_after(Nested_parallel(239)) │ = Float.max(3600, ...) = 3600s ← BUG │ event_wait(task=79, timeout=3600s) │ ├─ task 79 (worker): Nested_parallel:task=79.atoms=239.(VBDs.activate_epoch_and_plug RW) │ │ │ │ queue_atomics_and_wait(239 ops, chunks of 10) │ ├─ chunk 0: 10 × Serial(VBD_set_active, VBD_epoch_begin, VBD_plug) │ ├─ chunk 1: 10 × ... │ ├─ ... │ └─ chunk 23: 9 × ... │ └─ total: 82 minutes │ └─ task 80 (worker): Serial(VIF_set_active, VIF_plug) ``` The task expiry calculation for parallel and nested_parallel doesn't consider the multi chunks. Just get the max expiry in the parallel task list. In fact the parallel task split to 24 chunks to run. Signed-off-by: Changlei Li <changlei.li@citrix.com>

changlei-li · 2026-05-15T06:50:53Z

More log

2026-04-18T17:27:30.471958+00:00 gpuuk-43-04d xenopsd-xc: [debug||10 |Async.VM.start R:817db30a35e3|xenops_server] begin_Parallel:task=61.atoms=2.(Devices.plug (no qemu) VM=76c45aa2-26b8-29bb-6ce0-21b74e2fa867)
...
2026-04-18T17:27:30.489186+00:00 gpuuk-43-04d xenopsd-xc: [debug||11 |Parallel:task=61.atoms=2.(Devices.plug (no qemu) VM=76c45aa2-26b8-29bb-6ce0-21b74e2fa867)|xenops_server] begin_Nested_parallel:task=79.atoms=239.(VBDs.activate_epoch_and_plug RW VM=76c45aa2-26b8-29bb-6ce0-21b74e2fa867)
...
2026-04-18T18:27:30.831718+00:00 gpuuk-43-04d xenopsd-xc: [debug||10 |Async.VM.start R:817db30a35e3|xenops_server] end_Parallel:task=61.atoms=2.(Devices.plug (no qemu) VM=76c45aa2-26b8-29bb-6ce0-21b74e2fa867)
2026-04-18T18:27:30.831811+00:00 gpuuk-43-04d xenopsd-xc: [error||10 |Async.VM.start R:817db30a35e3|xenops_server] Timed out while waiting on task 79 (Parallel:task=61.atoms=2.(Devices.plug (no qemu) VM=76c45aa2-26b8-29bb-6ce0-21b74e2fa867))
...
2026-04-18T18:49:52.149702+00:00 gpuuk-43-04d xenopsd-xc: [debug||11 |Parallel:task=61.atoms=2.(Devices.plug (no qemu) VM=76c45aa2-26b8-29bb-6ce0-21b74e2fa867)|xenops_server] end_Nested_parallel:task=79.atoms=239.(VBDs.activate_epoch_and_plug RW VM=76c45aa2-26b8-29bb-6ce0-21b74e2fa867)

last-genius · 2026-05-15T07:13:16Z

I wonder if this PR allows to increase the number of VBDs in the limit test back to 255? I remember testing an increased VBD limit with #6577 , and limit tests failing because of timeouts.

lindig · 2026-05-15T09:20:31Z

Could you summarise how the timeout was calculated before and after the change? I have not understood the difference. Do we want a timeout for the entire set of of plus, or per individual plug? And is the new idea to batch plus into groups, each of which has now a timeout?

minglumlu · 2026-05-15T09:30:31Z

+        (List.length ops + max_parallel_atoms - 1) / max_parallel_atoms
+      in
+      let per_chunk =
+        List.map atomic_expires_after ops |> List.fold_left Float.max 0.


This is the max of the individual ops. It's one of the ops, not a chunk.

For example, if I understand correctly, it could be something like:

let expires_of = List.map atomic_expires_after in let max_of = List.fold_left Float.max 0. in Xenops_utils.chunks max_parallel_atoms ops |> List.map (fun ops -> ops |> expires_of |> max_of) |> List.fold_left ( +. ) 0.

You are right. This is more accurate.

Is it correct to say the timeout calculation follows the structure of the operation? For parallel ops take the maximum, for sequential ops take the sum - and this is evaluated bottom up over this tree of operations.

I understand the parallel ops are split as chunks. The real parallel happens within each chunk. Chunks are serialized. Like there are 100 threads in parallel from programmer's point of view, but actually only 10 CPU cores under the hood.

changlei-li · 2026-05-15T09:55:02Z

Could you summarise how the timeout was calculated before and after the change? I have not understood the difference. Do we want a timeout for the entire set of of plus, or per individual plug? And is the new idea to batch plus into groups, each of which has now a timeout?

Hi @lindig. In the original code
For task61, it has atom0: Nested_parallel(239 tasks). To calculate the timeout, it go though the Nested_parallel 239 tasks. Since every task has 3 Serial ops. So every task 60 min. Then it get the max timeout of all the task list, that is, 60 min as task61 timeout.

task79 is atom0 of task61. When it runs, it splits to 24 chunks, and executes in serial. In one chunks, 10 tasks executes in parallel. In every chunk, it calculates timeout, (will also be 60 min), so task 79 can complete(finally takes 82 min in the test), but task 61 can't wait so long.

In the view of task61, it thinks all Nested_parallel(239 tasks) execute in parallel. But that is wrong. It splits to 24 chunks in serial.

psafont

The code previously expected all the tasks to happen in parallel, instead of being serialized in chunks, where tasks are parallelized.

The new approach is more lax than necessary when calculating the timeout. This is because uses the timeout for the longest atom for each chunk, instead of taking the longest of each chunk, and then adding it. This might cause timeouts to trigger much later if a task gets hung, but it's difficult to know if it's going to be a big issue or not.

changlei-li · 2026-05-15T10:48:03Z

The code previously expected all the tasks to happen in parallel, instead of being serialized in chunks, where tasks are parallelized.

The new approach is more lax than necessary when calculating the timeout. This is because uses the timeout for the longest atom for each chunk, instead of taking the longest of each chunk, and then adding it. This might cause timeouts to trigger much later if a task gets hung, but it's difficult to know if it's going to be a big issue or not.

@psafont I will change to the calculation of Ming's comment, i.e. taking the longest of each chunk, and then adding it you mentioned and have a test.

Signed-off-by: Changlei Li <changlei.li@citrix.com>

changlei-li · 2026-05-20T04:56:55Z

test:
New timeout for 24 chunks is 86400. No timeout error now.

2026-05-20T00:52:23.772418+00:00 genuk-21-09d xenopsd-xc: [debug||34 |Async.VM.start R:8491a7289fd6|xenops_server] begin_Parallel:task=64.atoms=2.(Devices.plug (no qemu) VM=839491b4-3069-4e26-e609-2e6718f0d3a5)
2026-05-20T00:52:23.779293+00:00 genuk-21-09d xenopsd-xc: [debug||34 |Async.VM.start R:8491a7289fd6|xenops_server] Task 90 expires_after=86400.000000
2026-05-20T00:52:23.782596+00:00 genuk-21-09d xenopsd-xc: [debug||43 |Parallel:task=64.atoms=2.(Devices.plug (no qemu) VM=839491b4-3069-4e26-e609-2e6718f0d3a5)|xenops_server] begin_Nested_parallel:task=90.atoms=239.(VBDs.activate_epoch_and_plug RW VM=839491b4-3069-4e26-e609-2e6718f0d3a5)
...
2026-05-20T03:23:22.279141+00:00 genuk-21-09d xenopsd-xc: [debug||43 |Parallel:task=64.atoms=2.(Devices.plug (no qemu) VM=839491b4-3069-4e26-e609-2e6718f0d3a5)|xenops_server] end_Nested_parallel:task=90.atoms=239.(VBDs.activate_epoch_and_plug RW VM=839491b4-3069-4e26-e609-2e6718f0d3a5)
2026-05-20T03:23:22.282805+00:00 genuk-21-09d xenopsd-xc: [debug||34 |Async.VM.start R:8491a7289fd6|xenops_server] end_Parallel:task=64.atoms=2.(Devices.plug (no qemu) VM=839491b4-3069-4e26-e609-2e6718f0d3a5)

changlei-li requested a review from psafont May 15, 2026 06:22

minglumlu reviewed May 15, 2026

View reviewed changes

psafont approved these changes May 15, 2026

View reviewed changes

Change to accumulate max timout of every chunk

c7741bc

Signed-off-by: Changlei Li <changlei.li@citrix.com>

psafont approved these changes May 19, 2026

View reviewed changes

minglumlu approved these changes May 20, 2026

View reviewed changes

changlei-li added this pull request to the merge queue May 20, 2026

Merged via the queue into xapi-project:master with commit 54a5bce May 20, 2026
16 checks passed

changlei-li deleted the private/changleli/calculate-task-expiry branch May 20, 2026 05:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CA-426408: Consider chunks in task expiry calculation#7074

CA-426408: Consider chunks in task expiry calculation#7074
changlei-li merged 2 commits into
xapi-project:masterfrom
changlei-li:private/changleli/calculate-task-expiry

changlei-li commented May 15, 2026

Uh oh!

changlei-li commented May 15, 2026

Uh oh!

last-genius commented May 15, 2026

Uh oh!

lindig commented May 15, 2026

Uh oh!

minglumlu May 15, 2026

Uh oh!

minglumlu May 15, 2026 •

edited

Loading

Uh oh!

changlei-li May 15, 2026

Uh oh!

lindig May 15, 2026

Uh oh!

minglumlu May 15, 2026 •

edited

Loading

Uh oh!

changlei-li commented May 15, 2026

Uh oh!

psafont left a comment

Uh oh!

changlei-li commented May 15, 2026

Uh oh!

changlei-li commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

changlei-li commented May 15, 2026

Uh oh!

changlei-li commented May 15, 2026

Uh oh!

last-genius commented May 15, 2026

Uh oh!

lindig commented May 15, 2026

Uh oh!

minglumlu May 15, 2026

Choose a reason for hiding this comment

Uh oh!

minglumlu May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

changlei-li May 15, 2026

Choose a reason for hiding this comment

Uh oh!

lindig May 15, 2026

Choose a reason for hiding this comment

Uh oh!

minglumlu May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

changlei-li commented May 15, 2026

Uh oh!

psafont left a comment

Choose a reason for hiding this comment

Uh oh!

changlei-li commented May 15, 2026

Uh oh!

changlei-li commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

minglumlu May 15, 2026 •

edited

Loading

minglumlu May 15, 2026 •

edited

Loading