Skip to content

Spark 3.4: Pass FileIO on Spark's read path#16307

Merged
kevinjqliu merged 1 commit into
apache:mainfrom
kevinjqliu:spark-3.4-serializable-fileio
May 13, 2026
Merged

Spark 3.4: Pass FileIO on Spark's read path#16307
kevinjqliu merged 1 commit into
apache:mainfrom
kevinjqliu:spark-3.4-serializable-fileio

Conversation

@kevinjqliu
Copy link
Copy Markdown
Contributor

Backport of #15683 (and length fix #16284) to spark/v3.4.

Introduces SerializableFileIOWithSize to broadcast a table's FileIO to executors alongside the table metadata. Provides a KnownSizeEstimation so Spark skips the expensive SizeEstimator walk during broadcast, and makes close() a no-op on executors so broadcast cleanup does not destroy the driver's FileIO.

Adaptation note

v3.4's BaseReader still used the legacy table.encryption().decrypt(...) path. I switched that one method to fileIO.bulkDecrypt(...) to match v3.5/4.0/4.1, since the broadcast FileIO is now an EncryptingFileIO (combined in the constructor). All other files match the v3.5 patch byte-for-byte (with paths translated).

Validation

  • ./gradlew -DsparkVersions=3.4 :iceberg-spark:iceberg-spark-3.4_2.12:test --tests "*SerializableFileIOWithSize*" (new test, passes)
  • ./gradlew -DsparkVersions=3.4 :iceberg-spark:iceberg-spark-3.4_2.12:test --tests "org.apache.iceberg.spark.source.TestSparkReaderDeletes" (passes)
  • Compile-checked spark-extensions tests

Backport of apache#15683 (and length fix apache#16284) to spark/v3.4.

Note: BaseReader required an adaptation \u2014 v3.4 still used the legacy
table.encryption().decrypt(...) path. Switched it to fileIO.bulkDecrypt(...)
to match v3.5/4.0/4.1, since the broadcast FileIO is now an
EncryptingFileIO (combined in the constructor). All other files match the
v3.5 patch byte-for-byte (with paths translated).
@kevinjqliu
Copy link
Copy Markdown
Contributor Author

thanks for the review! since this is a backport and only for spark 3.4, im comfortable with merging it in as is 😄

@kevinjqliu kevinjqliu merged commit 062e822 into apache:main May 13, 2026
27 checks passed
@kevinjqliu kevinjqliu deleted the spark-3.4-serializable-fileio branch May 13, 2026 02:03
kevinjqliu added a commit to kevinjqliu/iceberg that referenced this pull request May 13, 2026
Backport of apache#15992 to spark/v3.4. Stacked on PR apache#16307 (apache#15683 SerializableFileIOWithSize), which is itself a backport.

Adaptations from the source PR:

- SparkMicroBatchStream.java was replaced wholesale with the v3.5 post-apache#15992 version because v3.4 had structural drift; the refactor extracts the planning logic into the new planner classes and there are no v3.4-only features in this file.

- TestStructuredStreamingRead3.java was likewise replaced with the v3.5 version (which adds parameterized sync/async coverage). The only non-mechanical change is using 'SparkCatalogConfig.SPARK' instead of 'SparkCatalogConfig.SPARK_SESSION', because v3.4 still uses the older enum name.
kevinjqliu added a commit that referenced this pull request May 13, 2026
Backport of #15992 to spark/v3.4. Stacked on PR #16307 (#15683 SerializableFileIOWithSize), which is itself a backport.

Adaptations from the source PR:

- SparkMicroBatchStream.java was replaced wholesale with the v3.5 post-#15992 version because v3.4 had structural drift; the refactor extracts the planning logic into the new planner classes and there are no v3.4-only features in this file.

- TestStructuredStreamingRead3.java was likewise replaced with the v3.5 version (which adds parameterized sync/async coverage). The only non-mechanical change is using 'SparkCatalogConfig.SPARK' instead of 'SparkCatalogConfig.SPARK_SESSION', because v3.4 still uses the older enum name.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants