Merged — 33 commits
Commits (all by subkanthi):

- `55d08e8` Add support for partition by columns in create table. (Apr 24, 2025)
- `1a9ca7f` Updated README with example to create table with multiple partitionin… (Apr 24, 2025)
- `5dfcdf9` Made partition variables final. (Apr 24, 2025)
- `8b5463a` ice: Added support to pass sort-by columns. (Apr 25, 2025)
- `4a98c49` ice: Updated README.md to document insert with sort-key and partition… (Apr 25, 2025)
- `6762798` Added standalone spark-iceberg docker compose (Apr 30, 2025)
- `c07e50b` ice-rest-catalog: Merged changes from master. (Apr 30, 2025)
- `991593b` ice-rest-catalog: Added README instructions to setup spark-iceberg to… (Apr 30, 2025)
- `8898f26` ice: If partition columns are passed by the user, write the data as p… (May 2, 2025)
- `bbc9c10` ice: Removed PartitionWriter (May 2, 2025)
- `68ebd1d` ice: Added logic to sort columns in insert. (May 5, 2025)
- `bc0c075` ice: Merged changes from master. (May 5, 2025)
- `3775b5e` ice: Merged changes from master. (May 5, 2025)
- `b28af69` ice: Merged changes from master and fixed jar conflicts (May 5, 2025)
- `2128640` ice: Fixed fileSizeBytes since spark was throwing an error. (May 5, 2025)
- `2eab476` Fixed typo in variable name (May 5, 2025)
- `174b420` Merge branch 'master' of github.com:Altinity/ice into 20-ice-support-… (May 6, 2025)
- `f7651b6` ice: Replaced SortAscending/Descending with JSON that accepts sort or… (May 7, 2025)
- `5a29eb4` ice: Change partition option to a json that accepts column name and t… (May 7, 2025)
- `9130285` ice: Formatting updates (May 7, 2025)
- `8512ba5` ice: Replaced fileMetrics with footerMetrics. (May 7, 2025)
- `65847f8` ice: Reverted default of copy() when partitioning or sorting is not r… (May 8, 2025)
- `5d25f55` ice: Fixed appendOp used in main thread and the thread pool threads. (May 8, 2025)
- `290dd80` ice: Addressed PR review comments. (May 8, 2025)
- `9b9ed9b` ice: Addressed PR review comments. (May 9, 2025)
- `3329c32` ice: Addressed PR review comments. (May 9, 2025)
- `0706672` ice: Added objectMapper fail settings. (May 9, 2025)
- `fe371ad` ice: Added supported transforms in partition argument. (May 9, 2025)
- `b0abb1a` ice: Fixed formatting errors. (May 9, 2025)
- `827fd73` ice: Added logic to explicitly convert datetime/timestamp to long for… (May 9, 2025)
- `d3e7648` ice: Added logic to explicitly convert datetime/timestamp to long for… (May 9, 2025)
- `701352e` ice: move initialization of dataFileSizeInBytes. (May 9, 2025)
- `1b30ce0` ice: Reverted back removed line of setting dataFileSizeInBytes for s3… (May 9, 2025)
41 changes: 41 additions & 0 deletions examples/docker-compose/README.md
@@ -25,3 +25,44 @@ clickhouse client --query 'select count(*) from ice.`nyc.taxis`;'
1. `docker compose up` fails with `ERROR: Invalid interpolation format for "content" option in config "clickhouse-init": "#!/bin/bash`

Solution: Upgrade docker/docker compose to v2.

#### Spark Iceberg
A spark-iceberg container can be launched using the `docker-compose-spark-iceberg.yml` file.


The default configuration is located at `/opt/spark/conf/spark-defaults.conf`.

For Spark to communicate with `ice-rest-catalog` and `minio`, the following configuration variables in `/opt/spark/conf/spark-defaults.conf` need to be updated:

- `spark.sql.catalog.demo.uri` - ice-rest-catalog URI
- `spark.sql.catalog.demo.s3.endpoint` - minio server URL
- `spark.sql.catalog.demo.s3.access-key` - minio access key
- `spark.sql.catalog.demo.s3.secret-key` - minio secret key


```properties
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.demo org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.type rest
spark.sql.catalog.demo.uri http://localhost:5000
spark.sql.catalog.demo.io-impl org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.demo.warehouse s3://warehouse/wh/
spark.sql.catalog.demo.s3.endpoint http://localhost:9000
spark.sql.catalog.demo.s3.access-key miniouser
spark.sql.catalog.demo.s3.secret-key miniopassword
spark.sql.catalog.demo.s3.path-style-access true
spark.sql.catalog.demo.s3.ssl-enabled false
spark.sql.defaultCatalog demo
spark.eventLog.enabled true
spark.eventLog.dir /home/iceberg/spark-events
spark.history.fs.logDirectory /home/iceberg/spark-events
spark.sql.catalogImplementation in-memory
```
> **Collaborator:** missing instructions on how to actually use any of this
>
> **Collaborator (Author):** Added

The `spark-sql` shell can now query the tables directly:

```shell
docker exec -it spark-iceberg ./spark-sql
```
17 changes: 17 additions & 0 deletions examples/docker-compose/docker-compose-spark-iceberg.yaml
@@ -0,0 +1,17 @@
services:
spark-iceberg:
image: tabulario/spark-iceberg
container_name: spark-iceberg
network_mode: host
volumes:
- ./warehouse:/home/iceberg/warehouse
- ./notebooks:/home/iceberg/notebooks/notebooks
environment:
- AWS_ACCESS_KEY_ID=miniouser
- AWS_SECRET_ACCESS_KEY=miniopassword
- AWS_REGION=minio
ports:
- 8888:8888
- 8080:8080
- 10000:10000
- 10001:10001
22 changes: 22 additions & 0 deletions examples/scratch/README.md
@@ -29,6 +29,21 @@ ice insert flowers.iris -p \

ice insert nyc.taxis -p \
https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet

# Insert rows with a sort key
ice insert --sort-order='[{"column": "VendorID", "desc": true, "nullFirst": true}]' nyc.taxis2 -p https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet

# Insert with a sort key (multiple columns)
ice insert --sort-order='[{"column": "VendorID", "desc": true, "nullFirst": true}, {"column": "Airport_fee", "desc": false, "nullFirst": false}]' nyc.taxis24 -p https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet

# Insert with partition key
ice insert --partition='[{"column": "RatecodeID", "transform": "identity"}]' nyc.taxis20 -p https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet

# Partition by day
ice insert --partition='[{"column": "tpep_pickup_datetime", "transform": "day"}]' nyc.taxis20 -p https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet

# Partition by year
ice insert --partition='[{"column": "tpep_pickup_datetime", "transform": "year"}]' nyc44.taxis111 -p https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet

# warning: each parquet file below is ~500mb. this may take a while
AWS_REGION=us-east-2 ice insert btc.transactions -p --s3-no-sign-request \
@@ -39,6 +54,13 @@ date=2025-01-01/part-00000-33e8d075-2099-409b-a806-68dd17217d39-c000.snappy.parq
# upload file to minio using local-mc,
# then add file to the catalog without making a copy
ice create-table flowers.iris_no_copy --schema-from-parquet=file://iris.parquet

# create table with partitioning columns
ice create-table flowers.irs_no_copy_partition --schema-from-parquet=file://iris.parquet --partition-by=variety,petal.width

# create table with sort columns
ice create-table flowers.irs_no_copy_sort --schema-from-parquet=file://iris.parquet --sort-order='[{"column": "variety", "desc": false}]'

local-mc cp iris.parquet local/bucket1/flowers/iris_no_copy/
ice insert flowers.iris_no_copy --no-copy s3://bucket1/flowers/iris_no_copy/iris.parquet

44 changes: 44 additions & 0 deletions ice/pom.xml
@@ -22,6 +22,45 @@
<artifactId>iceberg-bundled-guava</artifactId>
<version>${iceberg.version}</version>
</dependency>
<dependency>
<groupId>org.apache.iceberg</groupId>
<artifactId>iceberg-data</artifactId>
<version>${iceberg.version}</version>
<exclusions>
<exclusion>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.parquet</groupId>
<artifactId>parquet-avro</artifactId>
</exclusion>
<exclusion>
<groupId>com.github.ben-manes.caffeine</groupId>
<artifactId>caffeine</artifactId>
</exclusion>
<exclusion>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
</exclusion>
<exclusion>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
</exclusion>
<exclusion>
<groupId>commons-codec</groupId>
<artifactId>commons-codec</artifactId>
</exclusion>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</exclusion>
<exclusion>
<groupId>io.airlift</groupId>
<artifactId>aircompressor</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.iceberg</groupId>
<artifactId>iceberg-core</artifactId>
@@ -31,6 +70,7 @@
<groupId>com.github.ben-manes.caffeine</groupId>
<artifactId>caffeine</artifactId>
</exclusion>

<exclusion>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
@@ -43,6 +83,10 @@
<groupId>commons-codec</groupId>
<artifactId>commons-codec</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
</exclusion>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
87 changes: 83 additions & 4 deletions ice/src/main/java/com/altinity/ice/cli/Main.java
@@ -18,8 +18,12 @@
import com.altinity.ice.cli.internal.config.Config;
import com.altinity.ice.internal.picocli.VersionProvider;
import com.altinity.ice.internal.strings.Strings;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;
@@ -114,6 +118,14 @@ void describe(
}
}

public record IceSortOrder(
@JsonProperty("column") String column,
@JsonProperty("desc") boolean desc,
@JsonProperty("nullFirst") boolean nullFirst) {}

public record IcePartition(
@JsonProperty("column") String column, @JsonProperty("transform") String transform) {}
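The two records above are plain Jackson-mapped DTOs. A minimal, self-contained sketch of how the `--sort-order` JSON maps onto them — the `SortOrderJsonDemo` class and its `parse` helper are illustrative names, not part of the PR, and the sketch assumes jackson-databind (already a dependency here) is on the classpath:

```java
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Arrays;
import java.util.List;

public class SortOrderJsonDemo {

  // Standalone copy of the record above so this sketch compiles on its own.
  public record IceSortOrder(
      @JsonProperty("column") String column,
      @JsonProperty("desc") boolean desc,
      @JsonProperty("nullFirst") boolean nullFirst) {}

  public static List<IceSortOrder> parse(String json) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    // Reject unknown keys so a typo such as "nullsFirst" fails fast
    // instead of being silently ignored.
    mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, true);
    return Arrays.asList(mapper.readValue(json, IceSortOrder[].class));
  }

  public static void main(String[] args) throws Exception {
    List<IceSortOrder> orders =
        parse("[{\"column\":\"VendorID\",\"desc\":true,\"nullFirst\":true}]");
    System.out.println(orders.get(0).column() + " desc=" + orders.get(0).desc());
  }
}
```

Enabling `FAIL_ON_UNKNOWN_PROPERTIES` matches the "objectMapper fail settings" added in commit `0706672`: a malformed option surfaces as an immediate error rather than a quietly defaulted field.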

@CommandLine.Command(name = "create-table", description = "Create table.")
void createTable(
@CommandLine.Parameters(
@@ -138,16 +150,46 @@ void createTable(
required = true,
names = "--schema-from-parquet",
description = "/path/to/file.parquet")
String schemaFile)
String schemaFile,
@CommandLine.Option(
names = {"--partition"},
description =
"JSON array of partition specifications: [{\"column\":\"date\",\"transform\":\"year\"}]. "
+ "Supported transforms: hour, day, month, year, identity (default)")
String partitionJson,
@CommandLine.Option(
names = {"--sort-order"},
description =
"JSON array of sort orders: [{\"column\":\"name\",\"desc\":true,\"nullFirst\":true}]")
String sortOrderJson)
throws IOException {
try (RESTCatalog catalog = loadCatalog(this.configFile())) {
List<IceSortOrder> sortOrders = new ArrayList<>();
List<IcePartition> partitions = new ArrayList<>();

if (sortOrderJson != null && !sortOrderJson.isEmpty()) {
ObjectMapper mapper = new ObjectMapper();
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, true);
IceSortOrder[] orders = mapper.readValue(sortOrderJson, IceSortOrder[].class);
sortOrders = Arrays.asList(orders);
}

if (partitionJson != null && !partitionJson.isEmpty()) {
ObjectMapper mapper = new ObjectMapper();
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, true);
IcePartition[] parts = mapper.readValue(partitionJson, IcePartition[].class);
partitions = Arrays.asList(parts);
}

CreateTable.run(
catalog,
TableIdentifier.parse(name),
schemaFile,
location,
createTableIfNotExists,
s3NoSignRequest);
s3NoSignRequest,
partitions,
sortOrders);
}
}

@@ -206,6 +248,17 @@ void insert(
"/path/to/file where to save list of files to retry"
+ " (useful for retrying partially failed insert using `cat ice.retry | ice insert - --retry-list=ice.retry`)")
String retryList,
@CommandLine.Option(
names = {"--partition"},
description =
"JSON array of partition specifications: [{\"column\":\"date\",\"transform\":\"year\"}]. "
+ "Supported transforms: hour, day, month, year, identity (default)")
String partitionJson,
@CommandLine.Option(
names = {"--sort-order"},
description =
"JSON array of sort orders: [{\"column\":\"name\",\"desc\":true,\"nullFirst\":true}]")
String sortOrderJson,
@CommandLine.Option(
names = {"--thread-count"},
description = "Number of threads to use for inserting data",
@@ -224,11 +277,35 @@ void insert(
return;
}
}

List<IceSortOrder> sortOrders = new ArrayList<>();
List<IcePartition> partitions = new ArrayList<>();

if (sortOrderJson != null && !sortOrderJson.isEmpty()) {
ObjectMapper mapper = new ObjectMapper();
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, true);
IceSortOrder[] orders = mapper.readValue(sortOrderJson, IceSortOrder[].class);
sortOrders = Arrays.asList(orders);
}

if (partitionJson != null && !partitionJson.isEmpty()) {
ObjectMapper mapper = new ObjectMapper();
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, true);
IcePartition[] parts = mapper.readValue(partitionJson, IcePartition[].class);
partitions = Arrays.asList(parts);
}

TableIdentifier tableId = TableIdentifier.parse(name);
if (createTableIfNotExists) {
// TODO: newCreateTableTransaction
CreateTable.run(
catalog, tableId, dataFiles[0], null, createTableIfNotExists, s3NoSignRequest);
catalog,
tableId,
dataFiles[0],
null,
createTableIfNotExists,
s3NoSignRequest,
partitions,
sortOrders);
}
Insert.run(
catalog,
@@ -243,6 +320,8 @@
s3NoSignRequest,
s3CopyObject,
retryList,
partitions,
sortOrders,
threadCount < 1 ? Runtime.getRuntime().availableProcessors() : threadCount);
}
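The `createTable` and `insert` commands each repeat the same ObjectMapper setup and null/empty guard for the two JSON options. A small generic helper could consolidate that duplication — a sketch only; the `JsonOptionParser` name is hypothetical and not part of the PR:

```java
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public final class JsonOptionParser {

  // Shared, pre-configured mapper: unknown JSON keys fail fast,
  // mirroring the fail settings used in both commands above.
  private static final ObjectMapper MAPPER =
      new ObjectMapper().configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, true);

  private JsonOptionParser() {}

  /** Returns an empty list when the CLI option was not supplied. */
  public static <T> List<T> parse(String json, Class<T[]> arrayType) throws IOException {
    if (json == null || json.isEmpty()) {
      return List.of();
    }
    return Arrays.asList(MAPPER.readValue(json, arrayType));
  }
}
```

With this helper, both call sites reduce to one line each, e.g. `List<IcePartition> partitions = JsonOptionParser.parse(partitionJson, IcePartition[].class);`.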
}
Expand Down
Loading