
Spark: Add compaction only benchmark - rewrite data files #16219

Open
varun-lakhyani wants to merge 3 commits into apache:main from varun-lakhyani:compaction-benchmark

Conversation

@varun-lakhyani (Contributor) commented May 5, 2026

None of the existing benchmarks solely measure compaction (rewrite data files) time.
The closest reference is spark/v4.1/spark/src/jmh/java/org/apache/iceberg/spark/action/IcebergSortCompactionBenchmark.java, but it still includes several other actions that add noise when trying to benchmark compaction performance itself.

Created a base class, action/IcebergCompactionBenchmark.java, which currently serves action/IcebergDataCompactionBenchmark.java and will also serve action/IcebergSortCompactionBenchmark.java once this PR is merged.

Currently, benchmarking is done using local FS without latency injection.
To benchmark under more realistic storage latency, there are two possible approaches:

  • Latency injection
  • Switching to S3FileIO
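For the latency-injection approach, one possibility (not part of this PR; the class name and delay model here are hypothetical) is to wrap the input streams the benchmark reads from in a decorator that sleeps before the first byte is served, approximating object-store time-to-first-byte. A minimal, Iceberg-independent sketch:

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: injects a fixed delay before the first read of a
// stream, simulating the time-to-first-byte of an object store. A FileIO
// decorator could wrap the streams it hands out in this class.
class FirstByteLatencyInputStream extends FilterInputStream {
  private final long delayMillis;
  private boolean delayed = false;

  FirstByteLatencyInputStream(InputStream in, long delayMillis) {
    super(in);
    this.delayMillis = delayMillis;
  }

  // Sleep once, on the first read only; later reads pass straight through.
  private void injectOnce() throws IOException {
    if (!delayed) {
      delayed = true;
      try {
        Thread.sleep(delayMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IOException("interrupted while injecting latency", e);
      }
    }
  }

  @Override
  public int read() throws IOException {
    injectOnce();
    return super.read();
  }

  @Override
  public int read(byte[] b, int off, int len) throws IOException {
    injectOnce();
    return super.read(b, off, len);
  }
}
```

Whether latency is better injected once per open or on every read is a design choice; a first-byte delay is closest to a sequential S3 GET.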

For switching to S3FileIO, two functions from the base class action/IcebergCompactionBenchmark.java can be overridden:
Example (extraCatalogProperties) :

@Override
protected Map<String, String> extraCatalogProperties() {
  return Map.of(
      "catalog-impl", "org.apache.iceberg.jdbc.JdbcCatalog",
      "uri", "jdbc:sqlite:/tmp/iceberg-compaction-benchmark-catalog.db",
      "io-impl", "org.apache.iceberg.aws.s3.S3FileIO",
      "client.region", "ap-south-1");
}

Example (getCatalogWarehouse) :

@Override
protected String getCatalogWarehouse() {
  return "s3a://location-to-bucket/path-to-destination/";
}

github-actions bot added the spark label May 5, 2026
varun-lakhyani changed the title "Benchmark: Add compaction only benchmark - rewrite data files" to "Spark : Add compaction only benchmark - rewrite data files" May 5, 2026
varun-lakhyani changed the title "Spark : Add compaction only benchmark - rewrite data files" to "Spark: Add compaction only benchmark - rewrite data files" May 10, 2026
@varun-lakhyani (Contributor, Author) commented May 10, 2026

I wanted suggestions on whether latency injection or S3FileIO should be added as part of this PR/codebase so anyone can directly run this benchmark with more realistic latency.

Currently, the most viable path is overriding the two functions mentioned above, or running with the local FS.

I haven't implemented this as part of this PR, since S3FileIO is not used in any existing benchmark in the codebase, but for this benchmark I eventually want compaction timing with real cloud-storage latency instead of only local FS latency.

@varun-lakhyani (Contributor, Author) commented May 11, 2026

The objective is to measure how the number of input files affects rewriteDataFiles latency: total data volume is fixed at 2,000,000 rows, and the number of files those rows are repartitioned into is varied, to measure the effect of cloud object-store latency on compaction as file counts increase.

Machine: MacBook Pro, Apple M4 (10 cores, 16 GB RAM), macOS 26.3.1, OpenJDK 21.0.10
Benchmark: JMH SingleShotTime, 3 warmup + 10 measurement iterations, 1 thread, 1 fork
Data: 2,000,000 rows (fixed, 14.5 MB each), repartitioned into 250 / 500 / 1,000 / 2,000 files
Storage: Amazon S3 (ap-south-1), FileIO: S3FileIO, rewrite strategy: rewrite-all=true
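Because total volume is held fixed, rows per file shrink in proportion to the file count; a quick illustration of the resulting layouts (using the numbers from the setup above):

```java
// Illustrates the benchmark's sweep axis: 2,000,000 rows repartitioned into
// an increasing number of files, so each file holds proportionally fewer rows.
final long totalRows = 2_000_000L;
final int[] fileCounts = {250, 500, 1000, 2000};
for (int files : fileCounts) {
  long rowsPerFile = totalRows / files;
  System.out.println(files + " files -> " + rowsPerFile + " rows/file");
}
```

This is why per-file open latency dominates at the high end of the sweep: the work per file falls while the fixed cost per file stays constant.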

Result graphs are attached:

For S3:
[graph: compaction time vs. number of input files, S3 storage]

For storage as the local filesystem without any injected latency, the maximum is 1.315 s:
[graph: compaction time vs. number of input files, local FS]

@RussellSpitzer (Member) commented:

We discussed this a bit offline, and it would also be good to have some data for the local file system, to show the effect of filesystem latency on the compaction operation.

@varun-lakhyani (Contributor, Author) commented:

@RussellSpitzer Added a new image to the comment above. It now has images with the exact same setup, with storage as S3 and as the local FS.

@RussellSpitzer (Member) commented:

@varun-lakhyani It would probably help a little to show those two on the same graph, but we can save that for the design documents for the improvements themselves.

@varun-lakhyani (Contributor, Author) commented:

Actually, this might be a little difficult to understand with both in the same graph, as the ranges are vastly different: the minimum time on the S3 graph is more than 100 times the maximum local FS time.

Tried to represent it in a single graph (the representation can maybe improve as we go):
[graph: S3 vs. local FS compaction time, log scale]

@RussellSpitzer (Member) commented:

I think that's pretty good :) The log scale kind of shows the dramatic cost of the file opening.

@varun-lakhyani (Contributor, Author) commented:

Yes, the log scale helps visualise it better.
Let me know if any changes are required.

A TODO after this gets merged is to refactor IcebergSortCompactionBenchmark to use IcebergCompactionBenchmark as the base class - I have the changes, but I think those will be better in a separate PR rather than in this one.

@steveloughran (Contributor) left a comment:

it would be interesting to run this against a real (not mock) S3 store, though it'll show up throttling too; there are a lot of underlying system issues to surface... which means there'd be a lot of variation in results, and it'd be hard to explain why something was faster/slower. It could potentially vary by time of day and by the load on your test region.

If you do want to compare performance between runs with different settings, jmh-tabulate is good. I have a fork which generates an HTML page with a hardened fetch of the npm chart.js (signature validation). Given recent supply-chain attacks, downloading anything from an npm repo to run unrestricted on your local system is too dangerous to consider.

https://github.com/steveloughran/jmh-tabulate/tree/hardened

@Fork(1)
@State(Scope.Benchmark)
@BenchmarkMode(Mode.SingleShotTime)
@Timeout(time = 1000, timeUnit = TimeUnit.HOURS)
@steveloughran (Contributor) commented:

that's not a benchmark you want to run on your laptop, is it?

@varun-lakhyani (Contributor, Author) commented:

Ideally it's not supposed to run locally; running on EMR could be the way.
Though I ran this locally for the graphs given above.

@RussellSpitzer (Member) commented:

You could probably just set this to 1 hour :)
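The suggested tweak to the annotation quoted above would look like this (a configuration fragment of the JMH benchmark class, not a standalone file):

```java
// Suggested change: a 1 hour ceiling is plenty even for the S3-backed runs,
// and a hung iteration then fails fast instead of stalling for 1000 hours.
@Timeout(time = 1, timeUnit = TimeUnit.HOURS)
```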

@varun-lakhyani (Contributor, Author) commented May 12, 2026

@steveloughran
Thanks for the review. I will take a look at jmh-tabulate.
I ran it against real S3; the codebase doesn't actually have any benchmark against S3, so I didn't push that code, but I made it easily configurable.
To enable S3 benchmarking, we need to override two functions from the base class IcebergCompactionBenchmark in IcebergDataCompactionBenchmark.

The graphs posted in the comments are against real S3 with the below-mentioned configs:
Storage: Amazon S3 (ap-south-1), FileIO: S3FileIO, rewrite strategy: rewrite-all=true

@RussellSpitzer (Member) commented:

@steveloughran any other follow-ups here? I know @varun-lakhyani is working on a larger doc to motivate the rest of the parallel file-opening work, which we would also love additional reviewers on.
