Core: update manifest delete file size after rewrite table action #15470

Open
mbutrovich wants to merge 14 commits into apache:main from mbutrovich:delete_file_size_after_rewrite

Conversation

@mbutrovich
Contributor

@mbutrovich mbutrovich commented Feb 27, 2026

Which issue does this PR close?

Closes #12554.

Rationale for this change

rewriteTablePath rewrites position delete files (updating embedded data file paths), which changes their size. The manifest was written before the delete files were rewritten, so file_size_in_bytes in the manifest reflected the original size. Readers that trust this field (Trino, Impala, Comet, iceberg-rust) fail with errors like "end of stream not reached."
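The failure mode can be shown with a minimal, self-contained sketch in plain Java (not Iceberg or Trino code): a reader that trusts a recorded size and skips the stat()/HEAD call hits end-of-stream early once a path rewrite shrinks the file. All names here are illustrative.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class StaleSizeDemo {
    // Reads exactly expectedSize bytes, like a reader that trusts
    // file_size_in_bytes from the manifest instead of issuing a stat() call.
    static byte[] readTrustingSize(Path file, long expectedSize) throws IOException {
        byte[] buf = new byte[(int) expectedSize];
        int off = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (off < buf.length) {
                int n = in.read(buf, off, buf.length - off);
                if (n < 0) {
                    throw new IOException("end of stream not reached: expected "
                        + expectedSize + " bytes, got " + off);
                }
                off += n;
            }
        }
        return buf;
    }

    // Returns the reader's error message after a path rewrite shrinks the file.
    static String simulateStaleManifestSize() throws IOException {
        Path deleteFile = Files.createTempFile("pos-delete", ".bin");
        byte[] original = "pos-delete for /a/deeply/nested/source/prefix/data.parquet".getBytes();
        Files.write(deleteFile, original);
        long manifestSize = original.length;  // size recorded before the rewrite

        // rewriteTablePath rewrites the embedded data file path; the size changes.
        Files.write(deleteFile, "pos-delete for /target/data.parquet".getBytes());

        try {
            readTrustingSize(deleteFile, manifestSize);
            return "ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }
}
```

The same mismatch occurs in the other direction (rewritten path longer than the original), in which case the reader silently truncates instead of failing.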

What changes are included in this PR?

Moves position delete file rewriting into the manifest-writing Spark task, eliminating the separate rewritePositionDeletes() Spark job. Each manifest-writing task now also rewrites the delete files that its manifest references, measures the actual size via writer.length() (or via getLength() on the staged file when another task got there first), and records it in the manifest entry.

This means:

  • No extra Spark job — down from two jobs to one
  • Parallelism preserved — one Spark task per manifest, same as before
  • Correct file_size_in_bytes at manifest write time, no reconciliation needed

Trade-off: if the same delete file appears in multiple manifests, it gets rewritten redundantly. The staging path is deterministic, so the output is identical; the cost is wasted I/O, not incorrectness. In practice this is rare.
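The flow above can be sketched in plain Java (method and parameter names are illustrative, not the Iceberg API): rewrite into a deterministic staging path with CREATE_NEW, record the measured size on success, and fall back to reading the already-staged file's size when another task won the race.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class RewriteAndMeasure {
    // Rewrites a position delete file into a deterministic staging path and
    // returns the actual post-rewrite size, so file_size_in_bytes can be
    // recorded correctly at manifest-write time. Sketch only.
    static long rewriteAndMeasure(Path source, Path staged,
                                  String sourcePrefix, String targetPrefix) throws IOException {
        String rewritten = Files.readString(source).replace(sourcePrefix, targetPrefix);
        try {
            // CREATE_NEW fails if another task already staged this file.
            Files.writeString(staged, rewritten, StandardOpenOption.CREATE_NEW);
            return Files.size(staged);   // writer.length() analog
        } catch (FileAlreadyExistsException e) {
            // Rewriting is deterministic, so the staged content (and length)
            // matches what this task would have produced.
            return Files.size(staged);   // getLength() analog
        }
    }
}
```

Calling this twice for the same staged path returns the same length both times; only the first call does the write.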

How are these changes tested?

New test testDeleteFileSizeInBytesAfterRewrite creates a table with position deletes using a deeply nested path (so the rewritten path differs in length), runs rewriteTablePath, copies the result, and asserts that file_size_in_bytes in the rewritten manifest matches the actual file size on disk.

AI Usage

I am more familiar with the iceberg-rust codebase, so Claude helped me navigate the code, prototype a design, and draft the PR description (in DataFusion Comet's PR template). Claude also helped me with the API change failures in CI.

@mbutrovich mbutrovich changed the title Core: update delete file size after rewrite table action Core: update manifest delete file size after rewrite table action Feb 27, 2026
Comment thread .palantir/revapi.yml
Comment on lines +407 to +414
"1.10.0":
org.apache.iceberg:iceberg-api:
- code: "java.class.defaultSerializationChanged"
old: "class org.apache.iceberg.encryption.EncryptingFileIO"
new: "class org.apache.iceberg.encryption.EncryptingFileIO"
justification: "New method for Manifest List reading"
org.apache.iceberg:iceberg-core:
- code: "java.class.noLongerInheritsFromClass"
Contributor

Was this moved unintentionally?

Contributor Author

It's just what the ./gradlew revapi generated for me. As best as I can tell, this is what it did:

  1. The existing 1.10.0 block moved earlier in the file (alphabetical sort?)
  2. YAML indent for the - code: entries changed from 4 spaces to 2 spaces
  3. Our 4 new entries added (java.element.noLongerDeprecated, two java.method.numberOfParametersChanged, one java.method.removed for rewriteDeleteManifest)

Contributor

Could you keep only the four new entries and revert the reordering/re-indentation of the existing 1.10.0 block? Manually adding the new entries into the existing block (keeping the original 4-space indent and ordering) would make the diff much smaller and easier to review.

Contributor

I just want to mention that we must avoid breaking the API here. Please follow https://iceberg.apache.org/contribute/#minor-version-deprecations-required.

Comment thread core/src/main/java/org/apache/iceberg/RewriteTablePathUtil.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/RewriteTablePathUtil.java Outdated
@mbutrovich mbutrovich force-pushed the delete_file_size_after_rewrite branch from 277c0e2 to 8922c70 Compare March 10, 2026 21:32
@mbutrovich mbutrovich requested a review from anuragmantri March 12, 2026 15:10
Comment thread .palantir/revapi.yml Outdated
Comment thread core/src/main/java/org/apache/iceberg/RewriteTablePathUtil.java Outdated
@steveloughran
Contributor

hmmm. In #15586 I've been working on skipping the HEAD call within the s3afs and azure abfs client openFile() calls, and had been thinking that the manifest data would be the best source, as it'd save another 100 ms/file on what is often a critical path. We need accurate manifests for that.

// manifest was not updated. Readers that use file_size_in_bytes to elide a stat() call may
// fail.
@TestTemplate
public void testDeleteFileSizeInBytesAfterRewrite() throws Exception {
Contributor

are there equivalents of this for the other operations which add files to the manifests? if so they'd be handy being added for regression testing there too -though not in this PR

Contributor

@steveloughran steveloughran left a comment

reviewed again. I think changes within RewriteTablePathUtil can eliminate the need for a HEAD call with a v3 puffin file.

String stagingPath = stagingPath(file.location(), sourcePrefix, stagingLocation);
OutputFile outputFile = io.newOutputFile(stagingPath);
try {
rewritePositionDeleteFile(
Contributor

If this could actually return the length of the file written (which would need changes in rewriteDVFile() to do the same), there'd be no need for the HEAD call when rewriting puffin files.

It'd probably be needed in v2 rewrites, but here the getLength call should be pushed into rewritePositionDeleteFile(), which would only invoke it on the v2 codepath.
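One way to return the written length without any follow-up filesystem call is a counting wrapper around the output stream, a plain-Java sketch of the writer.length() idea mentioned in the PR description (not Iceberg's actual writer classes):

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Counts bytes as they are written, so the rewrite path can report the file
// length after close() with no HEAD/stat call against the filesystem.
public class CountingOutputStream extends FilterOutputStream {
    private long count;

    public CountingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        count++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        // Delegate directly (FilterOutputStream's default loops byte-by-byte).
        out.write(b, off, len);
        count += len;
    }

    public long length() {
        return count;
    }
}
```

A rewrite method built on this could return length() to its caller, which would then go straight into the manifest entry.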

…ask, eliminating the separate `rewritePositionDeletes()` Spark job. Each manifest-writing task now also rewrites the delete files that manifest references, measures the actual size via `getLength()`, and records it in the manifest entry.
…ustification "rewriteDeleteManifest signature changed to accept PositionDeleteReaderWriter for inline position delete rewriting; rewritePositionDeleteFile now returns file size to avoid getLength HEAD call".
@mbutrovich mbutrovich force-pushed the delete_file_size_after_rewrite branch from 58725c3 to 78107a1 Compare April 15, 2026 18:11
Contributor

@steveloughran steveloughran left a comment

I like the change.

@wypoon
Contributor

wypoon commented Apr 15, 2026

@mbutrovich as you're changing the signature of RewriteTablePathUtil.rewriteDeleteManifest (in core), and it is used by Spark, Flink, and Delta conversion, you need to update all the callers. Unfortunately, this means updating all supported Spark and Flink versions. Normally we make changes to the latest supported Spark or Flink version first when making a Spark or Flink change, and then backport to earlier versions in separate PRs, but in this case, it needs to be done all together.

@mbutrovich
Contributor Author

mbutrovich commented Apr 15, 2026

@mbutrovich as you're changing the signature of RewriteTablePathUtil.rewriteDeleteManifest (in core), and it is used by Spark, Flink, and Delta conversion, you need to update all the callers. Unfortunately, this means updating all supported Spark and Flink versions. Normally we make changes to the latest supported Spark or Flink version first when making a Spark or Flink change, and then backport to earlier versions in separate PRs, but in this case, it needs to be done all together.

Sounds good, thanks @wypoon! @anuragmantri suggested I do the change in one version first to make it easier for reviewers (reduce the diff), get consensus on the approach, and then once folks are good with it I can generate the changes for the other versions. I'm happy to do that at any time.

@wypoon
Contributor

wypoon commented Apr 15, 2026

@mbutrovich it is true that having changes for multiple versions of Spark and Flink clutters up a PR and makes reviewing more difficult. However, you can at least put up the changes for one version of Flink and for Delta conversion, and then later for all the versions.
I haven't made changes to the GitHub actions before, but I wonder whether changing, e.g., https://github.com/apache/iceberg/blob/main/.github/workflows/spark-ci.yml#L80 to run only for 4.1 would affect the CI for this PR. You can try it (changing the workflows as a temporary measure, to be reverted in the final version) and see if the CI succeeds.

@sreejasahithi

sreejasahithi commented May 11, 2026

@mbutrovich
Since this fixes a correctness issue with file_size_in_bytes, will this also be backported to the 1.10.x branch since this branch also has the issue?

@mbutrovich
Contributor Author

@mbutrovich Since this fixes a correctness issue with file_size_in_bytes, will this also be backported to the 1.10.x branch since this branch also has the issue?

This is my first upstream contribution to Iceberg Java, so happy to do so if the community requires. Is there a process to propose that besides the mailing list?

@sreejasahithi

@mbutrovich Since this fixes a correctness issue with file_size_in_bytes, will this also be backported to the 1.10.x branch since this branch also has the issue?

cc @steveloughran @anuragmantri

@mbutrovich
Contributor Author

CC @huaxingao @szehon-ho

@anuragmantri
Contributor

I took another look. The PR looks good to me.

Is there a process to propose that besides the mailing list?

The process involves merging the PR into the main branch and then creating a PR against the 1.10.x branch. I don't believe the 1.10.2 maintenance release is out yet; if we can make it before that, we could include this one in the release. You can also reply to the thread if you'd like, @mbutrovich

https://lists.apache.org/thread/5442q5bddyoss9mw27r0zbnobysfpn0t

// Another task in this Spark job already staged this file. Rewriting is deterministic, so
// its content (and therefore length) match what this task would have produced.
if (e.getCause() instanceof FileAlreadyExistsException) {
return io.newInputFile(stagingPath).getLength();
Contributor

@steveloughran flagged the HDFS in-progress-write issue earlier:

There's actually something bad with hdfs here where the length of in-progress-writes can underreport length...it's only after close() that everything syncs up.

I think the concern applies here too: if Task A is still writing when Task B reaches getLength(), B reads an in-progress length on HDFS, and you get the same class of bug this PR is fixing.
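One hypothetical mitigation (not part of this PR) is to stage via a task-private temp file plus an atomic rename, so the staged path only ever exists fully written and closed and another task's getLength() can never observe an in-progress length. This is a local-filesystem sketch; HDFS rename and visibility semantics differ, and all names here are illustrative.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicStage {
    // Writes content to a task-private temp file, closes it, then atomically
    // renames it into place. Readers of finalPath therefore only ever see a
    // complete file.
    static long stageAtomically(Path finalPath, byte[] content, String taskId) throws IOException {
        Path tmp = finalPath.resolveSibling(finalPath.getFileName() + "." + taskId + ".tmp");
        Files.write(tmp, content);  // fully written and closed before rename
        try {
            Files.move(tmp, finalPath, StandardCopyOption.ATOMIC_MOVE);
        } catch (FileAlreadyExistsException e) {
            // Platform-dependent: some filesystems refuse to replace an
            // existing target. Another task won the race with identical,
            // deterministic content, so just discard our copy.
            Files.delete(tmp);
        }
        return Files.size(finalPath);
    }
}
```

Two tasks racing on the same finalPath both return the complete file's size, never a partial one.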


Development

Successfully merging this pull request may close these issues.

Missing size rewrite in rewrite_table_path for delete file

7 participants