Antalya 26.3 Backport of #99521, #100150 - Add Arrow and Parquet format support for UUID data type #1774

Open
mkmkme wants to merge 2 commits into antalya-26.3 from backports/antalya-26.3/99521

@mkmkme (Collaborator) commented May 9, 2026

Note for reviewer

In addition to #99521, this backport includes #100150, a follow-up PR that fixes an issue in #99521. The tests pass locally, apart from those that could not run on the local machine.

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Adds native support for importing and exporting UUID data types in Arrow and Parquet formats. Users can now directly query and transfer UUID data between ClickHouse and other data tools without requiring manual string conversions or workarounds. Logical-type inference is automatic for top-level UUIDs, and nested UUIDs can be declared with an explicit schema hint (ClickHouse#99521 by @ivanmantova).
Exporting UUIDs to Parquet via the Arrow encoder now includes the correct UUID type annotation, eliminating the need to manually cast FixedString(16) data when reading the files back into ClickHouse or other systems (ClickHouse#100150 by @ivanmantova).
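
As a rough illustration of the cast mentioned above: the Parquet UUID logical type annotates a 16-byte fixed-length binary value that holds the UUID in big-endian order. A reader that does not see the annotation is left with 16 raw bytes, which is what shows up as FixedString(16). The sketch below uses only Python's standard `uuid` module; the helper names are hypothetical and this is not ClickHouse code.

```python
import uuid

UUID_WIDTH = 16  # bytes in the fixed-length binary value


def uuid_to_fixed_bytes(value: uuid.UUID) -> bytes:
    """Encode a UUID as 16 big-endian bytes (hypothetical helper)."""
    return value.bytes


def fixed_bytes_to_uuid(raw: bytes) -> uuid.UUID:
    """Decode 16 big-endian bytes back into a UUID (hypothetical helper)."""
    if len(raw) != UUID_WIDTH:
        raise ValueError(f"expected {UUID_WIDTH} bytes, got {len(raw)}")
    return uuid.UUID(bytes=raw)


original = uuid.UUID("61f0c404-5cb3-11e7-907b-a6006ad3dba0")
raw = uuid_to_fixed_bytes(original)   # what an annotation-unaware reader sees
restored = fixed_bytes_to_uuid(raw)   # the manual conversion step

print(raw.hex())   # 61f0c4045cb311e7907ba6006ad3dba0
print(restored)    # 61f0c404-5cb3-11e7-907b-a6006ad3dba0
```

With the UUID annotation written and honored, the second step is no longer something the user has to do by hand.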

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

…rquet-uuid

Add Arrow and Parquet format support for UUID data type
…et-builder

Improve Arrow Parquet writer to include UUID logical type
github-actions (Bot) commented May 9, 2026

Workflow [PR], commit [b7cb314]

@Selfeer (Collaborator) commented May 13, 2026

PR #1774 CI Triage

PR: #1774 - Antalya 26.3 Backport of #99521, #100150
CI report: ci_run_report.html
Date: 2026-05-13

PR Change Scope

This PR is focused on Arrow/Parquet UUID support:

  • src/Processors/Formats/Impl/ArrowColumnToCHColumn.cpp
  • src/Processors/Formats/Impl/CHColumnToArrowColumn.cpp
  • src/Processors/Formats/Impl/Parquet/*
  • new/updated stateless tests around Arrow/Parquet UUID import/export/inference.

The PR makes no changes to the s3/swarms regression framework or to the export-partition/export-part logic.

Summary

| Category | Count | Checks |
|---|---|---|
| PR-caused regression | 0 | - |
| Pre-existing unrelated regression-suite failures | 8 | Regression{Release,Aarch64} / {Parquet, S3Export(part), S3Export(partition), Swarms} |
| Infrastructure/flaky patterns (within swarms) | present | timeout and unstable cluster scenarios |
| Policy/non-test | 1 | DCO action required |

Root Cause Classification

1) Regression* / Parquet / parquet (x86_64 + aarch64) — Unrelated

Observed root signatures:

  • snapshot mismatches such as:
    • Date -> Date32
    • DateTime -> DateTime64(0)

These are longstanding snapshot drift/type-inference differences in parquet regression tests and do not indicate UUID-format handling regressions.

In fact, one of the test failures indicates that this change works correctly and that the regression tests need to be adjusted:

                                              -'u\tNullable(FixedString(16))'
                                              +'u\tNullable(UUID)'
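
One plausible reading of this snapshot diff is that a 16-byte fixed-length column carrying a UUID annotation used to be inferred as a fixed string and is now inferred as a UUID. The sketch below models that decision in a simplified form; the function name, parameters, and the string labels for physical and logical types are assumptions made for illustration and do not reflect the actual ClickHouse reader.

```python
from typing import Optional


def infer_clickhouse_type(physical: str, width: Optional[int],
                          logical: Optional[str], nullable: bool) -> str:
    """Map a simplified column description to a ClickHouse type name."""
    if physical != "FIXED_LEN_BYTE_ARRAY" or width is None:
        raise ValueError(f"unsupported physical type in this sketch: {physical}")
    if logical == "UUID" and width == 16:
        inner = "UUID"
    else:
        # Without a usable annotation, fall back to raw fixed-width bytes.
        inner = f"FixedString({width})"
    return f"Nullable({inner})" if nullable else inner


before = infer_clickhouse_type("FIXED_LEN_BYTE_ARRAY", 16, None, True)
after = infer_clickhouse_type("FIXED_LEN_BYTE_ARRAY", 16, "UUID", True)
print(f"u\t{before}")  # u	Nullable(FixedString(16))
print(f"u\t{after}")   # u	Nullable(UUID)
```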

2) Regression* / S3Export (part) / s3_export_part (x86_64 + aarch64) — Unrelated

Observed root signatures:

  • Code: 44 ... Cannot create table with column of type Dynamic or JSON, because storage S3 doesn't support columns with dynamic structure. (ILLEGAL_COLUMN)
  • failing scenarios are json columns and json columns with hints.

This is an S3 JSON/dynamic-type limitation in export-part tests, not related to Arrow/Parquet UUID support.

3) Regression* / S3Export (partition) / s3_export_partition (x86_64 + aarch64) — Unrelated

Observed root signature:

  • Code: 344 ... Exporting merge tree partition is experimental. Set allow_experimental_export_merge_tree_partition ... (SUPPORT_IS_DISABLED)

This is a feature-flag/configuration issue in the regression environment, unrelated to UUID format conversion code.

4) Regression* / Swarms / swarms (x86_64 + aarch64) — Unrelated

Observed root signatures:

  • ExpectTimeoutError: Timeout 600.000s
  • UNKNOWN_DATABASE in dynamic datalake catalog scenarios
  • nondeterministic join/assertion mismatches in swarm/node-failure tests.

These are flaky/distributed-environment failures and do not map to PR file changes.

5) DCO - Unrelated to runtime behavior

DCO is ACTION_REQUIRED but this is process metadata, not a product regression.

Verdict

The failed checks are not related to PR #1774 code changes.

This PR changes Arrow/Parquet UUID read/write paths and associated stateless UUID tests; failures are in known unstable or independently broken regression suites (parquet snapshot drift, s3_export_* feature/config issues, and swarms instability).

From a CI triage perspective, there is no evidence of a new PR-caused regression.
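
The signature-based reasoning in this triage can be sketched as a small lookup from error markers to root causes. The markers and cause labels are taken from the sections above; the table and function names are hypothetical, and substring matching is a simplification of how a log would really be examined.

```python
# (marker found in the log, root cause assigned in this triage)
SIGNATURES = [
    ("ILLEGAL_COLUMN", "S3 JSON/dynamic-type limitation"),
    ("SUPPORT_IS_DISABLED", "feature-flag/configuration issue"),
    ("ExpectTimeoutError", "flaky/distributed-environment failure"),
    ("UNKNOWN_DATABASE", "flaky/distributed-environment failure"),
]


def classify(log_line: str) -> str:
    """Return the first matching root cause, or 'unclassified'."""
    for marker, cause in SIGNATURES:
        if marker in log_line:
            return cause
    return "unclassified"


print(classify("ExpectTimeoutError: Timeout 600.000s"))
# flaky/distributed-environment failure
```

A line that matches none of the known markers stays "unclassified", which is the case that would need a closer look before being called unrelated.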

@Selfeer (Collaborator) commented May 13, 2026

AI audit note: This review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1774 (Arrow/Parquet UUID import-export support and inference paths):

No confirmed defects in reviewed scope.

Coverage summary:

Scope reviewed: `ArrowColumnToCHColumn` UUID decode/inference and extension unwrapping, `CHColumnToArrowColumn` UUID extension/schema-builder paths, Parquet UUID schema/decoder/writer changes, and added stateless UUID regression tests (top-level, nested, inference, and writer round-trip).
Categories failed: None in reviewed paths.
Categories passed: branch coverage for UUID vs FixedString fallbacks, fail-closed checks on fixed-size width/type mismatch, metadata-driven inference paths (Arrow extension + Parquet logical UUID), endianness conversion symmetry (read/write), nullable handling paths, exception/partial-update safety (column-local mutation only), multithreaded/shared-state checks (no new shared mutable state), and C++ memory/UB/resource classes in changed code.
Assumptions/limits: static reasoning only (no runtime execution in this audit), and conclusions are limited to PR #1774 diff plus directly connected call graph.
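
Two of the audited properties, fail-closed width checks and read/write endianness symmetry, can be illustrated with a minimal sketch. It assumes, purely for illustration, an in-memory layout of two little-endian 64-bit words with the high word first, and a big-endian 16-byte wire format; the function names are hypothetical and none of this is the PR's C++ code.

```python
import struct
import uuid


def wire_to_memory(raw: bytes) -> bytes:
    """Big-endian wire bytes -> assumed in-memory layout (read path)."""
    if len(raw) != 16:
        # Fail closed: refuse anything that is not exactly 16 bytes.
        raise ValueError(f"UUID requires 16 bytes, got {len(raw)}")
    high, low = struct.unpack(">QQ", raw)
    return struct.pack("<QQ", high, low)


def memory_to_wire(mem: bytes) -> bytes:
    """Assumed in-memory layout -> big-endian wire bytes (write path)."""
    if len(mem) != 16:
        raise ValueError(f"UUID requires 16 bytes, got {len(mem)}")
    high, low = struct.unpack("<QQ", mem)
    return struct.pack(">QQ", high, low)


wire = uuid.UUID("61f0c404-5cb3-11e7-907b-a6006ad3dba0").bytes
mem = wire_to_memory(wire)

# Symmetry: writing back what was read yields the original bytes.
round_trip_ok = memory_to_wire(mem) == wire
print(round_trip_ok)  # True
```

If the read and write paths disagreed on byte order, the round trip would yield a different UUID rather than fail, which is why symmetry is worth checking explicitly.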

@Selfeer added the "verified" (Approved for release) label on May 13, 2026
3 participants