Spark: Add compaction only benchmark - rewrite data files #16219
varun-lakhyani wants to merge 3 commits into
Conversation
I wanted suggestions on whether latency injection is worth pursuing. Currently the most viable path is overriding those 2 functions mentioned above, or running with the local FS. I haven't implemented this as part of this PR.
We discussed this a bit offline, and we talked about how it would also be good to have some data for the local file system, to show the effect of filesystem latency on the compaction operation.
@RussellSpitzer Added a new image in the comment above. It now has images with the exact same setup, just with storage as S3 vs local FS.
@varun-lakhyani It would probably help a little to show those two on the same graph, but we can save that for the design documents for the improvements themselves.
I think that's pretty good :) I think the log scale kind of shows the dramatic cost of the file opening.
Yes, the log scale helps visualise it better. TODO after this gets merged: refactor IcebergSortCompactionBenchmark to use IcebergCompactionBenchmark as a base class. I have the changes, but I think those will be better in a separate PR rather than in this one.
steveloughran left a comment
it would be interesting to run this against a real (not mock) S3 store, though it'll show up throttling too; there are a lot of underlying system issues to surface... which means there'd be a lot of variation in results, and it'd be hard to explain why something was faster/slower. It'd potentially vary by time of day and by load on your test region.
If you do want to compare performance between runs with different settings, JMH tabulate is good. I have a fork which generates an HTML page with a hardened fetch of the NPM chart.js (signature validation). Given recent supply-chain attacks, downloading anything from an npm repo to run unrestricted on your local system is too dangerous to consider.
```java
@Fork(1)
@State(Scope.Benchmark)
@BenchmarkMode(Mode.SingleShotTime)
@Timeout(time = 1000, timeUnit = TimeUnit.HOURS)
```
that's not a benchmark you want to run on your laptop, is it?
Ideally it's not supposed to run locally; running on EMR could be the way.
Though I did run this locally for the graphs above.
You could probably just set this to 1 hour :)
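That suggestion would amount to something like this (a sketch of the annotation change only; the rest of the benchmark class is unchanged):

```java
// Suggested change (sketch): cap the per-invocation timeout at 1 hour
// instead of 1000 hours.
@Timeout(time = 1, timeUnit = TimeUnit.HOURS)
```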
@steveloughran The graphs posted in the comments are against real S3, with the configs mentioned below.
@steveloughran any other follow-ups here? I know @varun-lakhyani is working on a larger doc to motivate the rest of the parallel file opening work, which we would also love additional reviewers on.



None of the existing benchmarks solely measure compaction (rewrite data files) time. The closest reference is spark/v4.1/spark/src/jmh/java/org/apache/iceberg/spark/action/IcebergSortCompactionBenchmark.java, but it still includes several other actions which add noise when trying to benchmark compaction performance itself.

Created a base class action/IcebergCompactionBenchmark.java which currently serves action/IcebergDataCompactionBenchmark.java and will also serve action/IcebergSortCompactionBenchmark.java when this PR gets merged.

Currently, benchmarking is done using the local FS without latency injection. To benchmark under more realistic storage latency, there are two possible approaches. For switching to S3FileIO, two functions from the base class action/IcebergCompactionBenchmark.java can be overridden.

Example (extraCatalogProperties):

Example (getCatalogWarehouse):
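The two overrides might look roughly like this (a sketch, not taken from the PR: the subclass name and bucket are hypothetical; `io-impl` is Iceberg's standard catalog property for the FileIO implementation, and `org.apache.iceberg.aws.s3.S3FileIO` is Iceberg's S3 FileIO class):

```java
import java.util.Map;

// Hypothetical subclass overriding the two hooks named above
// (extraCatalogProperties, getCatalogWarehouse) to move storage to S3.
public class S3CompactionBenchmarkSketch /* extends IcebergCompactionBenchmark */ {

  // Point the catalog's FileIO at S3FileIO instead of the default local FileIO.
  protected Map<String, String> extraCatalogProperties() {
    return Map.of("io-impl", "org.apache.iceberg.aws.s3.S3FileIO");
  }

  // Place the catalog warehouse on S3 rather than the local file system.
  protected String getCatalogWarehouse() {
    return "s3://my-benchmark-bucket/warehouse"; // hypothetical bucket
  }
}
```

With overrides like these, the same benchmark body runs unchanged while every file open/read/write goes through S3, which is what makes the latency comparison in the graphs possible.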