
Spark: Add compaction only benchmark - rewrite data files #16219

Open
varun-lakhyani wants to merge 3 commits into apache:main from varun-lakhyani:compaction-benchmark

Conversation

@varun-lakhyani (Contributor) commented May 5, 2026

None of the existing benchmarks solely measure compaction (rewrite data files) time.
The closest reference is spark/v4.1/spark/src/jmh/java/org/apache/iceberg/spark/action/IcebergSortCompactionBenchmark.java, but it still includes several other actions that add noise when trying to benchmark compaction performance itself.

Created a base class, action/IcebergCompactionBenchmark.java, which currently serves action/IcebergDataCompactionBenchmark.java and will also serve action/IcebergSortCompactionBenchmark.java once this PR is merged.

Currently, benchmarking is done using local FS without latency injection.
To benchmark under more realistic storage latency, there are two possible approaches:

  • Latency injection
  • Switching to S3FileIO
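For the latency-injection approach, one possibility (not part of this PR; the class name and delay model here are hypothetical) is to wrap the input streams the benchmark reads from in a decorator that sleeps before the first byte is served, approximating object-store time-to-first-byte. A minimal, Iceberg-independent sketch:

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: injects a fixed delay before the first read of a
// stream, simulating the time-to-first-byte of an object store. A FileIO
// decorator could wrap the streams it hands out in this class.
class FirstByteLatencyInputStream extends FilterInputStream {
  private final long delayMillis;
  private boolean delayed = false;

  FirstByteLatencyInputStream(InputStream in, long delayMillis) {
    super(in);
    this.delayMillis = delayMillis;
  }

  // Sleep once, on the first read only; later reads pass straight through.
  private void injectOnce() throws IOException {
    if (!delayed) {
      delayed = true;
      try {
        Thread.sleep(delayMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IOException("interrupted while injecting latency", e);
      }
    }
  }

  @Override
  public int read() throws IOException {
    injectOnce();
    return super.read();
  }

  @Override
  public int read(byte[] b, int off, int len) throws IOException {
    injectOnce();
    return super.read(b, off, len);
  }
}
```

Whether latency is better injected once per open or on every read is a design choice; a first-byte delay is closest to a sequential S3 GET.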

For switching to S3FileIO, two functions from the base class action/IcebergCompactionBenchmark.java can be overridden:
Example (extraCatalogProperties) :

@Override
protected Map<String, String> extraCatalogProperties() {
  return Map.of(
      "catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog",
      "uri", "jdbc:sqlite:/tmp/iceberg-compaction-benchmark-catalog.db",
      "io-impl", "org.apache.iceberg.aws.s3.S3FileIO",
      "client.region", "ap-south-1");
}

Example (getCatalogWarehouse) :

@Override
protected String getCatalogWarehouse() {
  return "s3a://location-to-bucket/path-to-destination/";
}

github-actions bot added the spark label May 5, 2026
varun-lakhyani changed the title "Benchmark: Add compaction only benchmark - rewrite data files" to "Spark : Add compaction only benchmark - rewrite data files" May 5, 2026
varun-lakhyani changed the title "Spark : Add compaction only benchmark - rewrite data files" to "Spark: Add compaction only benchmark - rewrite data files" May 10, 2026
@varun-lakhyani (Contributor, Author) commented May 10, 2026

I wanted suggestions on whether latency injection or S3FileIO should be added as part of this PR/codebase so anyone can directly run this benchmark with more realistic latency.

Currently, the most viable path is overriding the two functions mentioned above, or running with the local FS.

I haven't implemented this as part of this PR, since S3FileIO is not used in any existing benchmark in the codebase, but for this benchmark I eventually want compaction timing with real cloud-storage latency instead of only local FS latency.

@varun-lakhyani (Contributor, Author) commented May 11, 2026

The objective is to measure how the number of input files affects rewriteDataFiles latency: total data volume is fixed at 2,000,000 rows, and the number of files those rows are repartitioned into is varied, to measure the effect of cloud object-store latency on compaction as file counts increase.

Machine: MacBook Pro, Apple M4 (10 cores, 16 GB RAM), macOS 26.3.1, OpenJDK 21.0.10
Benchmark: JMH SingleShotTime, 3 warmup + 10 measurement iterations, 1 thread, 1 fork
Data: 2,000,000 rows (fixed, 14.5 MB each), repartitioned into 250 / 500 / 1,000 / 2,000 files
Storage: Amazon S3 (ap-south-1), FileIO: S3FileIO, rewrite strategy: rewrite-all=true
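Because total volume is held fixed, rows per file shrink in proportion to the file count; a quick illustration of the resulting layouts (using the numbers from the setup above):

```java
// Illustrates the benchmark's sweep axis: 2,000,000 rows repartitioned into
// an increasing number of files, so each file holds proportionally fewer rows.
final long totalRows = 2_000_000L;
final int[] fileCounts = {250, 500, 1000, 2000};
for (int files : fileCounts) {
  long rowsPerFile = totalRows / files;
  System.out.println(files + " files -> " + rowsPerFile + " rows/file");
}
```

This is why per-file open latency dominates at the high end of the sweep: the work per file falls while the fixed cost per file stays constant.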

Result graphs are attached:

For S3:
[graph: compaction time vs. number of input files, S3 storage]

For storage as the local filesystem without any injected latency, the maximum is 1.315 s:
[graph: compaction time vs. number of input files, local FS]

@RussellSpitzer (Member) commented:

We discussed this a bit offline, and it would also be good to have some data for the local file system, to show the effect of filesystem latency on the compaction operation.

@varun-lakhyani (Contributor, Author) commented:

@RussellSpitzer Added a new image to the comment above. It now has images with the exact same setup, with storage as S3 and as the local FS.

@RussellSpitzer (Member) commented:

@varun-lakhyani It would probably help a little to show those two on the same graph, but we can save that for the design documents for the improvements themselves.

@varun-lakhyani (Contributor, Author) commented:

Actually, this might be a little difficult to understand with both in the same graph, as the ranges are vastly different: the minimum time on the S3 graph is more than 100 times the maximum local FS time.

Tried to represent it in a single graph (the representation can maybe improve as we go):
[graph: S3 vs. local FS compaction time, log scale]

@RussellSpitzer (Member) commented:

I think that's pretty good :) The log scale kind of shows the dramatic cost of the file opening.

@varun-lakhyani (Contributor, Author) commented:

Yes, the log scale helps visualise it better.
Let me know if any changes are required.

A TODO after this gets merged is to refactor IcebergSortCompactionBenchmark to use IcebergCompactionBenchmark as the base class - I have the changes, but I think those will be better in a separate PR rather than in this one.

@steveloughran (Contributor) left a comment:

it would be interesting to run this against a real (not mock) S3 store, though it'll show up throttling too; there are a lot of underlying system issues to surface... which means there'd be a lot of variation in results, and it'd be hard to explain why something was faster/slower. It could potentially vary by time of day and by the load on your test region.

If you do want to compare performance between runs with different settings, jmh-tabulate is good. I have a fork which generates an HTML page with a hardened fetch of the npm chart.js (signature validation). Given recent supply-chain attacks, downloading anything from an npm repo to run unrestricted on your local system is too dangerous to consider.

https://github.com/steveloughran/jmh-tabulate/tree/hardened

@Fork(1)
@State(Scope.Benchmark)
@BenchmarkMode(Mode.SingleShotTime)
@Timeout(time = 1000, timeUnit = TimeUnit.HOURS)
@steveloughran (Contributor) commented:

that's not a benchmark you want to run on your laptop, is it?

@varun-lakhyani (Contributor, Author) commented:

Ideally it's not supposed to run locally; running on EMR could be the way.
Though I ran this locally for the graphs given above.

@RussellSpitzer (Member) commented:

You could probably just set this to 1 hour :)
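The suggested tweak to the annotation quoted above would look like this (a configuration fragment of the JMH benchmark class, not a standalone file):

```java
// Suggested change: a 1 hour ceiling is plenty even for the S3-backed runs,
// and a hung iteration then fails fast instead of stalling for 1000 hours.
@Timeout(time = 1, timeUnit = TimeUnit.HOURS)
```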

@varun-lakhyani (Contributor, Author) commented May 12, 2026

@steveloughran
Thanks for the review. I will take a look at jmh-tabulate.
I ran it against real S3; the codebase doesn't actually have any benchmark against S3, so I didn't push that code, but I made it easily configurable.
To enable S3 benchmarking, we need to override two functions from the base class IcebergCompactionBenchmark in IcebergDataCompactionBenchmark.

The graphs posted in the comments are against real S3 with the below-mentioned configs:
Storage: Amazon S3 (ap-south-1), FileIO: S3FileIO, rewrite strategy: rewrite-all=true

@RussellSpitzer (Member) commented:

@steveloughran any other follow-ups here? I know @varun-lakhyani is working on a larger doc to motivate the rest of the parallel file-opening work, which we would also love additional reviewers on.
