ice: Add support for partition by and sort columns in create table and insert. #22

Merged
subkanthi merged 33 commits into master from 20-ice-support-partitioning-when-tables-are-created
May 9, 2025

Conversation

@subkanthi
Collaborator

closes: #20

@subkanthi subkanthi linked an issue Apr 24, 2025 that may be closed by this pull request
@subkanthi
Collaborator Author

subkanthi commented Apr 24, 2025

Testing:
java -jar ../../ice/target/ice-0.0.0-SNAPSHOT-shaded.jar create-table flowers.irs_no_copy_partition --schema-from-parquet=file://iris.parquet --partition-by=variety

  partition_spec_raw: |-
    [
      1000: variety: identity(5)
    ]

Multiple columns

java -jar ../../ice/target/ice-0.0.0-SNAPSHOT-shaded.jar create-table flowers.irs_no_copy_partition --schema-from-parquet=file://iris.parquet --partition-by=variety,petal.width

  partition_spec_raw: |-
    [
      1000: variety: identity(5)
      1001: petal.width: identity(4)
    ]
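The comma-separated `--partition-by` value above maps each listed column to an identity transform in the resulting partition spec. A minimal sketch of that argument parsing (class and method names here are hypothetical, not ice's actual code; the real implementation then feeds these columns into an Iceberg `PartitionSpec` builder):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not ice's actual code) of splitting a comma-separated
// --partition-by value into column names, each of which then becomes an
// identity partition field in the Iceberg PartitionSpec.
public class PartitionByParser {

    static List<String> parsePartitionBy(String arg) {
        List<String> columns = new ArrayList<>();
        if (arg == null || arg.isBlank()) {
            return columns; // no partitioning requested
        }
        for (String col : arg.split(",")) {
            String trimmed = col.trim();
            if (!trimmed.isEmpty()) {
                columns.add(trimmed); // e.g. "variety", "petal.width"
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        // mirrors --partition-by=variety,petal.width from the example above
        System.out.println(parsePartitionBy("variety,petal.width")); // [variety, petal.width]
    }
}
```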

@subkanthi subkanthi requested a review from shyiko April 24, 2025 13:48
@subkanthi subkanthi marked this pull request as ready for review April 24, 2025 13:48
@subkanthi subkanthi removed the request for review from shyiko April 24, 2025 14:00
@subkanthi subkanthi marked this pull request as draft April 24, 2025 14:00
@subkanthi
Collaborator Author

adding to README

@subkanthi subkanthi marked this pull request as ready for review April 24, 2025 14:43
@shyiko
Collaborator

shyiko commented Apr 24, 2025

What about ice insert? Does it need to be updated?

@subkanthi subkanthi changed the title Add support for partition by columns in create table. ice: Add support for partition by and sort columns in create table and insert. Apr 25, 2025
.commit();
var updatedSortOrder = table.replaceSortOrder();
for (String column : sortColumns) {
updatedSortOrder.asc(column);
Collaborator

what if order supposed to be desc?
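A minimal sketch of what honoring a per-column desc flag looks like. `IceSortOrder` here is an illustrative stand-in for the sort-order option record the PR later adds; the real code calls Iceberg's `ReplaceSortOrder` `asc()`/`desc()` methods rather than formatting a string:

```java
// Illustrative sketch only: IceSortOrder stands in for the PR's sort-order
// option record; the real code drives Iceberg's ReplaceSortOrder API
// instead of returning a description string.
public class SortDirectionSketch {

    record IceSortOrder(String column, boolean desc, boolean nullFirst) {}

    static String describe(IceSortOrder order) {
        String dir = order.desc() ? "DESC" : "ASC"; // honor desc instead of always asc()
        String nulls = order.nullFirst() ? "NULLS_FIRST" : "NULLS_LAST";
        return order.column() + " " + dir + " " + nulls;
    }
}
```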

@subkanthi
Collaborator Author

subkanthi commented May 2, 2025

Data with partition key.

The parquet file has 3475226 rows:

SELECT count(*)
FROM url('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet', 'parquet')

Query id: 2100837d-987e-4118-b1d4-6551c398bb80

   ┌─count()─┐
1. │ 3475226 │ -- 3.48 million
   └─────────┘

1 row in set. Elapsed: 1.717 sec. 


Spark(Data partitioned)

select * from taxis44 where vendorID=7;
**Time taken: 0.493 seconds, Fetched 1206 row(s)**

Spark(Data not partitioned)

select * from taxis where vendorID=7;

Time taken: 1.78 seconds, Fetched 2412 row(s)

@subkanthi subkanthi marked this pull request as draft May 2, 2025 13:39
@subkanthi subkanthi marked this pull request as ready for review May 5, 2025 21:03
@subkanthi subkanthi requested a review from shyiko May 8, 2025 01:24
Comment thread: examples/docker-compose/README.md (outdated)
The spark-sql shell can now query the tables directly

```
docker exec -it <container_id> bash
```
Collaborator

I'd suggest updating compose file to include container_name: spark so that this part could be simplified to docker exec -it spark ./spark-sql (or docker exec -it spark spark-sql if spark-sql is on the PATH)

Collaborator Author

added.

List<IcePartition> partitions = new ArrayList<>();

if (sortOrderJson != null && !sortOrderJson.isEmpty()) {
ObjectMapper mapper = new ObjectMapper();
Collaborator

Collaborator Author

Added.

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.TableProperties;
import org.apache.iceberg.*;
Collaborator

Collaborator Author

@subkanthi subkanthi May 8, 2025

fixed.

}
long dataFileSizeInBytes;
var dataFile = Strings.replacePrefix(file, "s3a://", "s3://");
long dataFileSizeInBytes = 0;
Collaborator

noCopy branch broken due to dataFileSizeInBytes being always 0

Collaborator

it also looks like we should fail the operation if noCopy is enabled and partitioning/sort-order is defined

Collaborator

this is still unresolved

Collaborator Author

added throw exception if noCopy is enabled and partitioning/sort-order is defined
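The guard discussed here could look roughly like the following. Names are illustrative, not ice's actual code; the real validation lives in ice's insert/create-table path:

```java
import java.util.List;

// Sketch of the guard discussed above (names are illustrative): fail fast
// when noCopy is combined with partitioning or sort-order, since applying
// either layout requires rewriting the data files.
public class NoCopyGuard {

    static void checkNoCopyCompatible(
            boolean noCopy, List<String> partitionColumns, List<String> sortColumns) {
        boolean hasLayoutSpec =
                (partitionColumns != null && !partitionColumns.isEmpty())
                        || (sortColumns != null && !sortColumns.isEmpty());
        if (noCopy && hasLayoutSpec) {
            throw new IllegalArgumentException(
                    "noCopy cannot be combined with partitioning/sort-order: "
                            + "data files would need to be rewritten");
        }
    }
}
```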

Collaborator

@shyiko shyiko May 9, 2025

noCopy branch broken due to dataFileSizeInBytes being always 0

still applies (line 317)

.withFileSizeInBytes(inFile.getLength())
.withPartition(partKey)
.withFormat(FileFormat.PARQUET)
.withRecordCount(records.size())
Collaborator

if I remember correctly recordCount is set implicitly from metrics

Collaborator Author

removed

new DataFiles.Builder(table.spec())
.withPath(dstDataFile)
.withFormat("PARQUET")
.withFileSizeInBytes(inFile.getLength())
Collaborator

extra requests just to get file length can be avoid by reading the value from fileappender, same as we do in https://github.com/Altinity/ice/pull/22/files#diff-efe5f830dfd30841f29c50a9f843a6c295aafb4bfd3d60202bb22f8680272686L390

Collaborator Author

changed to appender.length

Collaborator Author

done

}

// Commit transaction.
txn.commitTransaction();
Collaborator

we should probably merge file appends into the same tx; otherwise we may have multiple TXs per ice insert

Collaborator

bump

Collaborator Author

done

if (!finalOptions.noCommit()) {
// TODO: log
if (atLeastOneFileAppended) {
if (atLeastOneFileAppended.get()) {
Collaborator

it doesn't look like this needs to be an atomic
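A sketch of the simplified flow the reviewer is suggesting (names are illustrative; the `appendOp.appendFile` call is elided): since the flag is only touched from the main thread, a plain boolean suffices, and `AtomicBoolean` would only be needed if worker threads flipped it concurrently.

```java
import java.util.List;

// Illustrative sketch: appendFile(...) is only invoked from the main thread,
// so a plain local boolean is enough to track whether anything was appended;
// AtomicBoolean is only required for cross-thread mutation.
public class AppendTracker {

    static boolean appendAll(List<String> dataFiles) {
        boolean atLeastOneFileAppended = false; // plain flag: single-threaded access
        for (String df : dataFiles) {
            // appendOp.appendFile(df) would go here in the real code
            atLeastOneFileAppended = true;
        }
        return atLeastOneFileAppended;
    }
}
```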

Collaborator Author

changed

appender.add(rec);
}

fileSizeInBytes = appender.length();
Collaborator

@shyiko shyiko May 9, 2025

this is incorrect. length() is guaranteed to return correct value only after close(). see javadocs

Collaborator Author

Are you referring to the FileAppender Java class?
/** Returns the length of this file. */
long length();

Collaborator

Collaborator Author

yes, appender.length() before close() throws an error in Spark. changed.
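The pattern this thread converges on can be sketched with a mock standing in for Iceberg's `FileAppender` (the mock is illustrative, not Iceberg code): write records, `close()`, and only then read `length()` for `withFileSizeInBytes(...)`.

```java
import java.io.ByteArrayOutputStream;
import java.io.Closeable;

// MockAppender is an illustrative stand-in for Iceberg's FileAppender,
// demonstrating the contract discussed above: length() is only reliable
// after close(), because buffered data and file footers are flushed on close.
class MockAppender implements Closeable {

    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private boolean closed = false;
    private long finalLength = -1;

    void add(byte[] record) {
        if (closed) throw new IllegalStateException("appender already closed");
        out.writeBytes(record);
    }

    @Override
    public void close() {
        // a real writer flushes buffers and writes the file footer here;
        // only now is the file size final
        closed = true;
        finalLength = out.size();
    }

    long length() {
        if (!closed) {
            throw new IllegalStateException("length() before close() is undefined");
        }
        return finalLength;
    }
}
```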

spark-iceberg:
image: tabulario/spark-iceberg
container_name: spark-iceberg
build: spark/
Collaborator

there is no spark/ in this PR

Collaborator Author

it's from the iceberg repo; looks like build is ignored if image is defined. removed.

@CommandLine.Option(
names = {"--partition"},
description =
"JSON array of partition specifications: [{\"column\":\"date\",\"transform\":\"year\"}]")
Collaborator

can you please include a list of supported transforms in the description
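The transform list the PR later documents (hour, day, month, year, identity, with identity as the default) could be validated roughly as below; class and method names are hypothetical, not ice's actual code:

```java
import java.util.Set;

// Hypothetical validation of the "transform" field of a --partition entry
// against the list this thread settles on (hour, day, month, year, identity),
// with identity as the default when no transform is given.
public class TransformValidator {

    private static final Set<String> SUPPORTED =
            Set.of("hour", "day", "month", "year", "identity");

    static String resolveTransform(String transform) {
        if (transform == null || transform.isBlank()) {
            return "identity"; // default, per the suggested option description
        }
        String t = transform.toLowerCase();
        if (!SUPPORTED.contains(t)) {
            throw new IllegalArgumentException("unsupported transform: " + transform);
        }
        return t;
    }
}
```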

for (Main.IceSortOrder order : sortOrders) {
SortDirection dir = order.desc() ? SortDirection.DESC : SortDirection.ASC;
NullOrder nullOrd = order.nullFirst() ? NullOrder.NULLS_FIRST : NullOrder.NULLS_LAST;
if (dir == SortDirection.ASC) {
Collaborator

dir var appears to be unnecessary

Collaborator Author

the idea is that dir would default to DESC if it's not passed.

Collaborator

this is just a nit, so keep it if you want but what I meant was: you can remove dir by changing if statement to

if (!order.desc()) {

import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.io.*;
Collaborator

wildcards are back

Collaborator Author

fixed

for (DataFile df : dataFiles) {
atLeastOneFileAppended = true;
appendOp.appendFile(df);
appendOp.appendFile(df); // ✅ Only main thread appends now
Collaborator

✅ - ai assistant artifact?

Collaborator Author

removed

}

// Commit transaction.
txn.commitTransaction();
Collaborator

bump

}

// Commit transaction.
txn.commitTransaction();
Collaborator

noCommit flag ignored

names = {"--partition"},
description =
"JSON array of partition specifications: [{\"column\":\"date\",\"transform\":\"year\"}],"
+ "Supported transforms: hour, day, month, year, identity")
Collaborator

Suggested change
+ "Supported transforms: hour, day, month, year, identity")
+ "Supported transforms: hour, day, month, year, identity (default)")

Collaborator Author

fixed

@CommandLine.Option(
names = {"--partition"},
description =
"JSON array of partition specifications: [{\"column\":\"date\",\"transform\":\"year\"}]")
Collaborator

description is out of sync with create-table same field description

Collaborator Author

fixed

replaceSortOrder.desc(order.column(), nullOrd);
}
}
replaceSortOrder.commit();
Collaborator

just to confirm: there is no way to pass sort-order as part of the createTable request? it looks like if there is an issue with replaceSortOrder.commit(); above, the table will be left with the wrong configuration and no way to update it without performing ice insert (because repeated createTable will throw AlreadyExistsException)

Collaborator Author

I couldn't find a way; this was in the test class for iceberg, but the regular createTable doesn't take sortOrder as a parameter

Snippet of test class

  @Test
  public void testUpdateSortOrder() {
    Schema schema = new Schema(Types.NestedField.required(10, "x", Types.StringType.get()));

    SortOrder order = SortOrder.builderFor(schema).asc("x").build();

    TableMetadata sortedByX =
        TableMetadata.newTableMetadata(
            schema, PartitionSpec.unpartitioned(), order, null, ImmutableMap.of());
    assertThat(sortedByX.sortOrders()).hasSize(1);
    assertThat(sortedByX.sortOrder().orderId()).isEqualTo(1);
    assertThat(sortedByX.sortOrder().fields()).hasSize(1);
    assertThat(sortedByX.sortOrder().fields().get(0).sourceId()).isEqualTo(1);
    assertThat(sortedByX.sortOrder().fields().get(0).direction()).isEqualTo(SortDirection.ASC);
    assertThat(sortedByX.sortOrder().fields().get(0).nullOrder()).isEqualTo(NullOrder.NULLS_FIRST);

    // build an equivalent order with the correct schema
    SortOrder newOrder = SortOrder.builderFor(sortedByX.schema()).asc("x").build();

    TableMetadata alsoSortedByX = sortedByX.replaceSortOrder(newOrder);
    assertThat(sortedByX)
        .as("Should detect current 

  @Override
  public Table createTable(
      TableIdentifier ident,
      Schema schema,
      PartitionSpec spec,
      String location,
      Map<String, String> props) {
    return delegate.createTable(ident, schema, spec, location, props);
  }

  @Override
  public Table createTable(
      TableIdentifier ident, Schema schema, PartitionSpec spec, Map<String, String> props) {
    return delegate.createTable(ident, schema, spec, props);
  }

  @Override
  public Table createTable(TableIdentifier ident, Schema schema, PartitionSpec spec) {
    return delegate.createTable(ident, schema, spec);
  }

  @Override
  public Table createTable(TableIdentifier identifier, Schema schema) {
    return delegate.createTable(identifier, schema);
  }

Collaborator

@shyiko shyiko left a comment

almost there 👍 the only blocker is https://github.com/Altinity/ice/pull/22/files#r2081978804

@subkanthi subkanthi merged commit bac2d6a into master May 9, 2025
1 check passed
shyiko added a commit that referenced this pull request May 12, 2025
Changes:

- ice create-table: Fix createTable not including sortOrder as part of createTable transaction.
- ice insert: Fix create-table & insert generating different/incompatible partition specs.
- ice insert: Fix "java.lang.IllegalArgumentException: Cannot add duplicate partition field" when trying to insert into a table with partition spec already set by create-table (--partition had to be specified for `ice insert` to reproduce).
- ice insert: Fix "software.amazon.awssdk.services.s3.model.S3Exception: Object name contains unsupported characters" when trying to insert --partition data into an existing un-partitioned table (OutputFileFactory was missing tableSpec).
- ice insert: Fix insert mutating table partitioning/ordering specs + write.distribution-mode even when there are no changes;
- ice insert: Fix race condition resulting from multiple threads accessing the same Schema instance.
- ice insert: Fix partitioning activity logging invalid "took Ns" values.
- ice insert: Fix improper use of reuseContainer that could lead to invalid data being written to the catalog.
- ice insert: Fix insert accepting new partitioning/ordering specs without rewriting existing data.
- ice insert: Fix insert not following --data-file-naming-strategy=DEFAULT strategy
- ice create-table/insert: Fix NPE when --partition is specified without "transform".
- examples/docker-compose:spark: Fix invalid/incomplete spark configuration (spark conf was missing client.region, header.authorization, etc.)
- examples/docker-compose:spark: Pin tabulario/spark-iceberg tag to reduce the risk of it breaking in the future.
- examples/docker-compose:spark: Fix invalid `docker exec` command (./spark-sql didn't work).
- examples/docker-compose:spark: Remove the need to manually edit spark config just to try things.
- examples/docker-compose:spark: Explain what docker-compose-spark-iceberg.yaml is for and how to use it + spark-sql.
- examples/docker-compose:spark: Remove redundant/copy&paste parts from docker-compose-spark-iceberg.yaml.
- examples/scratch: Fix examples referencing non-existent options (like --partition-by).

Future work:
- Support --no-copy when partitioning
- Support --data-file-naming-strategy=PRESERVE_ORIGINAL when partitioning
@shyiko shyiko deleted the 20-ice-support-partitioning-when-tables-are-created branch June 3, 2025 18:17

Development

Successfully merging this pull request may close these issues.

ice: Support partitioning when tables are created

2 participants