
Cache dataTasks() result in NodeFileScanTask to avoid repeated list creation #4146

Merged
czy006 merged 3 commits into apache:master from j1wonpark:fix/cache-data-tasks
Apr 15, 2026

Conversation

@j1wonpark
Contributor

@j1wonpark j1wonpark commented Mar 27, 2026

Why are the changes needed?

NodeFileScanTask.dataTasks() creates a new ArrayList via Stream.concat().collect(toList()) on every call, even though the underlying tasks no longer change once the task has been populated. Because the result is identical between calls, caching it avoids redundant list creation.

Brief change log

  • Added lazy caching to NodeFileScanTask.dataTasks() with a cachedDataTasks field
  • Wrapped the cached result with Collections.unmodifiableList() to prevent external mutation
  • Added cache invalidation (cachedDataTasks = null) in addFile() to ensure correctness when tasks are added after construction
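
The three changes above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: `CachedTaskHolder` is a hypothetical stand-in for NodeFileScanTask, tasks are plain strings, the `addFile(String, boolean)` signature is invented for the example, and the base-then-insert ordering is an assumption. Only the names `cachedDataTasks`, `baseTasks`, `insertTasks`, `addFile()` and `dataTasks()` come from this pull request.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for NodeFileScanTask; real tasks are not Strings.
class CachedTaskHolder {
    private final List<String> baseTasks = new ArrayList<>();
    private final List<String> insertTasks = new ArrayList<>();

    // Lazily built result; null means "not built yet" or "invalidated".
    private List<String> cachedDataTasks;

    // Assumed signature: the flag picks which underlying list receives the task.
    void addFile(String task, boolean isBase) {
        if (isBase) {
            baseTasks.add(task);
        } else {
            insertTasks.add(task);
        }
        // Any mutation makes the cached list stale, so drop it.
        cachedDataTasks = null;
    }

    List<String> dataTasks() {
        if (cachedDataTasks == null) {
            // Build once, then wrap so callers cannot mutate the shared list.
            cachedDataTasks = Collections.unmodifiableList(
                Stream.concat(baseTasks.stream(), insertTasks.stream())
                    .collect(Collectors.toList()));
        }
        return cachedDataTasks;
    }
}
```

With this shape, repeated calls to `dataTasks()` return the same list instance until `addFile()` is called again, and attempts to modify the returned list throw `UnsupportedOperationException`. The sketch is not synchronized, so it assumes the object is populated and read from a single thread or is otherwise safely published.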

How was this patch tested?

  • Add some test cases that check the changes thoroughly including negative and positive cases if possible

  • Add screenshots for manual tests if appropriate

  • Run test locally before making a pull request

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

…list creation

dataTasks() was creating a new List via Stream.concat().collect(toList())
on every call. Since the result is immutable after construction (or after
addFile() calls complete), cache it lazily and invalidate the cache in
addFile() when baseTasks or insertTasks are mutated.

Signed-off-by: Jiwon Park <jpark92@outlook.kr>
@j1wonpark j1wonpark force-pushed the fix/cache-data-tasks branch from 70582c3 to bcd212c Compare March 28, 2026 06:55
@czy006 czy006 merged commit dcab5b5 into apache:master Apr 15, 2026
6 checks passed
@j1wonpark j1wonpark deleted the fix/cache-data-tasks branch April 17, 2026 07:20