[12] Added metrics when writing files to iceberg from parquet. #15

Merged: shyiko merged 7 commits into master from add_metrics on Apr 23, 2025

@subkanthi (Collaborator)

 Received metrics report: CommitReport{tableName=default.nyc.taxis, snapshotId=6106227384613789017, sequenceNumber=1, operation=append, commitMetrics=CommitMetricsResult{totalDuration=TimerResult{timeUnit=NANOSECONDS, totalDuration=PT0.302619377S, count=1}, attempts=CounterResult{unit=COUNT, value=1}, addedDataFiles=CounterResult{unit=COUNT, value=1}, removedDataFiles=null, totalDataFiles=CounterResult{unit=COUNT, value=1}, addedDeleteFiles=null, addedEqualityDeleteFiles=null, addedPositionalDeleteFiles=null, addedDVs=null, removedDeleteFiles=null, removedEqualityDeleteFiles=null, removedPositionalDeleteFiles=null, removedDVs=null, totalDeleteFiles=CounterResult{unit=COUNT, value=0}, addedRecords=CounterResult{unit=COUNT, value=3475226}, removedRecords=null, totalRecords=CounterResult{unit=COUNT, value=3475226}, addedFilesSizeInBytes=CounterResult{unit=BYTES, value=55934912}, removedFilesSizeInBytes=null, totalFilesSizeInBytes=CounterResult{unit=BYTES, value=55934912}, addedPositionalDeletes=null, removedPositionalDeletes=null, totalPositionalDeletes=CounterResult{unit=COUNT, value=0}, addedEqualityDeletes=null, removedEqualityDeletes=null, totalEqualityDeletes=CounterResult{unit=COUNT, value=0}}, metadata={iceberg-version=Apache Iceberg 1.8.1 (commit 9ce0fcf0af7becf25ad9fc996c3bad2afdcfd33d)}}
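For quick checks against a log line like the one above, the non-null counters can be pulled out with a regular expression. This is only a sketch: it assumes the `name=CounterResult{unit=..., value=...}` text layout seen in this report, which is log output rather than a stable format, and the `parse_counters` helper and abbreviated `SAMPLE` string are made up for illustration. Code running inside the JVM would more naturally read the report object directly.

```python
import re

# Matches entries such as "addedRecords=CounterResult{unit=COUNT, value=3475226}".
# Entries printed as "name=null" are intentionally not matched.
COUNTER_PATTERN = re.compile(r"(\w+)=CounterResult\{unit=(\w+), value=(\d+)\}")


def parse_counters(report_line: str) -> dict[str, tuple[str, int]]:
    """Return {counter name: (unit, value)} for every non-null counter in the line."""
    return {
        name: (unit, int(value))
        for name, unit, value in COUNTER_PATTERN.findall(report_line)
    }


# Abbreviated excerpt of the report above (hypothetical shortening, same field layout).
SAMPLE = (
    "CommitReport{tableName=default.nyc.taxis, operation=append, "
    "commitMetrics=CommitMetricsResult{"
    "addedDataFiles=CounterResult{unit=COUNT, value=1}, removedDataFiles=null, "
    "addedRecords=CounterResult{unit=COUNT, value=3475226}, "
    "addedFilesSizeInBytes=CounterResult{unit=BYTES, value=55934912}}}"
)

if __name__ == "__main__":
    for name, (unit, value) in parse_counters(SAMPLE).items():
        print(f"{name}: {value} {unit}")
```

With the counters in a dict, `addedRecords` can be compared directly with the row count that ClickHouse returns for the same Parquet file.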

subkanthi (Collaborator, Author) commented Apr 23, 2025

Testing
Null count

ice

    Column: Airport_fee
      valueCount  = 3475226
      nullCount   = 540149
      lowerBound  = -1.75
      upperBound  = 6.75

ClickHouse

SELECT count(*)
FROM s3('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet', '', '', 'Parquet')
WHERE Airport_fee IS NULL

Query id: 0f136a70-d19e-4b92-8e0c-ef5ef20ddae3

   ┌─count()─┐
1. │  540149 │
   └─────────┘

Value count

SELECT count(*)
FROM s3('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet', '', '', 'Parquet')
WHERE (Airport_fee IS NULL) OR (Airport_fee IS NOT NULL)

Query id: fac5058c-7e80-4ace-8a88-7d929844191d

   ┌─count()─┐
1. │ 3475226 │ -- 3.48 million
   └─────────┘

1 row in set. Elapsed: 0.765 sec. Processed 3.01 million rows, 50.93 MB (3.94 million rows/s., 66.59 MB/s.)
Peak memory usage: 5.10 MiB.

Upper bound/Lower bound

:) SELECT Max(Airport_fee)
FROM s3(
  'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet',
  '', -- No access key
  '', -- No secret key
  'Parquet'
);

SELECT Max(Airport_fee)
FROM s3('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet', '', '', 'Parquet')

Query id: 533faa65-303f-41e6-96f1-930ddf1e3e9d

   ┌─Max(Airport_fee)─┐
1. │             6.75 │
   └──────────────────┘

1 row in set. Elapsed: 0.802 sec. Processed 1.38 million rows, 23.31 MB (1.72 million rows/s., 29.05 MB/s.)
Peak memory usage: 5.60 MiB.
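The checks above rely on how the four per-column numbers relate: valueCount covers every row including nulls (which is why it matches the `IS NULL OR IS NOT NULL` count), nullCount covers only the nulls, and the bounds are the min and max of the non-null values. A minimal sketch of that relationship follows, using a hypothetical `column_metrics` helper and made-up sample values. It is not the code from this PR, and it ignores details such as NaN handling.

```python
from typing import Optional, Sequence


def column_metrics(values: Sequence[Optional[float]]) -> dict:
    """Compute valueCount, nullCount and bounds for one column (None means null)."""
    non_null = [v for v in values if v is not None]
    return {
        # Counts every row, nulls included.
        "valueCount": len(values),
        "nullCount": len(values) - len(non_null),
        # Bounds are undefined (None) when the column holds only nulls.
        "lowerBound": min(non_null) if non_null else None,
        "upperBound": max(non_null) if non_null else None,
    }


if __name__ == "__main__":
    # Made-up sample column, not data from the taxi file.
    print(column_metrics([1.75, None, -1.75, 6.75, None, 0.0]))
```

The Airport_fee figures fit the same pattern: 3475226 values, of which 540149 are null, with an upper bound of 6.75 that matches `Max(Airport_fee)`.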

subkanthi marked this pull request as ready for review April 23, 2025 01:08

shyiko (Collaborator) left a comment

A few nits but looks great otherwise! We are both going to hell now for using StringBuilder to assemble the output :)

# insert data into catalog
ice insert flowers.iris -p \
file://iris.parquet
ice

Review comment (Collaborator):

looks unintentional

*
* @param table
*/
private static void printTableMetrics(Table table, StringBuilder buffer) throws IOException {

Review comment (Collaborator):

nit: this javadoc conveys no extra info beyond method name. I'd just drop it

# start Iceberg REST Catalog server
ice-rest-catalog
or using the jar file.
/ice/examples/scratch$ java -jar ../../ice-rest-catalog/target/ice-rest-catalog-jar-with-dependencies.jar &

Review comment (Collaborator):

this section (of README) is already followed by

TIP: replace ice & ice-rest-catalog above with local-ice & local-ice-rest-catalog respectively to use code in the repo instead of ice & ice-rest-catalog binaries from the PATH.

Review comment (Collaborator):

TIP: if you execute link-local, ice & ice-rest-catalog will point to local-ice & local-ice-rest-catalog when inside the repo. Just make sure you have direnv installed.

shyiko merged commit 1b917e3 into master Apr 23, 2025
1 check passed

shyiko (Collaborator) commented Apr 23, 2025

Merged to save time. I'll take care of the nit ^ in a follow-up commit.

shyiko added a commit that referenced this pull request Apr 23, 2025
shyiko deleted the add_metrics branch June 3, 2025 18:17