
Antalya 26.3 Backport of #102628 - Fix LOGICAL_ERROR crash in Parquet reader for nullable columns with filter #1768

Open

mkmkme wants to merge 1 commit into antalya-26.3 from backports/antalya-26.3/102628

Conversation

@mkmkme (Collaborator) commented May 9, 2026

Fix LOGICAL_ERROR crash in Parquet reader for nullable columns with filter

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix LOGICAL_ERROR crash "Unexpected number of rows in column subchunk" in native Parquet V3 reader when reading nullable columns with a WHERE filter (ClickHouse#102628 by @groeneai).

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)


Fix LOGICAL_ERROR crash in Parquet reader for nullable columns with filter
@github-actions bot commented May 9, 2026

Workflow [PR], commit [6d92950]

@il9ue left a comment

LGTM — seems like a clean backport ✅

Diff matches upstream #102628. Two fixes in Reader.cpp (a condensed sketch follows the list):

  1. Gate use_filter_in_decoder on !column.need_null_map. The fast path processes all rows through processDefLevelsForInnermostColumn and applies the filter at encoded-value indices, which don't line up 1:1 with row indices when nulls are present. Falling back to the standard row-range path is correct.

  2. Inverted memchr sentinel: 0 → 1. The ClickHouse convention is 1 = NULL, so the old check threw whenever any non-null value existed, which is exactly backwards. And with all-NULL filtered rows going into a non-Nullable column at null_as_default=0, it silently dropped the null_map instead of raising CANNOT_INSERT_NULL_IN_ORDINARY_COLUMN, then crashed downstream on the row-count mismatch.
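A condensed sketch of both changes, with names taken from the review text (need_null_map, data_pages, null_map); this is not the literal Reader.cpp code, and the exact guard expression upstream may differ:

```cpp
#include <cstring>

// Hypothetical standalone condensation of the two fixes.
struct ColumnInfo { bool need_null_map; };

bool shouldUseFilterInDecoder(bool has_filter, const ColumnInfo & column,
                              bool data_pages_empty, bool dictionary_encoded)
{
    // Fix 1: never push the row filter into the decoder for nullable
    // columns. Null rows emit no encoded value, so encoded-value indices
    // and row indices diverge and the filter lands on wrong positions.
    return has_filter
        && !column.need_null_map
        && data_pages_empty
        && !dictionary_encoded;
}

bool outputWouldContainNull(const unsigned char * null_map, size_t size)
{
    // Fix 2: the null_map convention is 1 = NULL, 0 = NOT NULL, so the
    // check must search for byte 1 (the old code searched for 0).
    return std::memchr(null_map, 1, size) != nullptr;
}
```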

Test covers all four meaningful cases (nullable output, null_as_default=1, the formerly-crashing path, and a no-nulls control). Correctly gated on input_format_parquet_use_native_reader_v3=1.

CI selection appropriate (Parquet/Iceberg/S3 Export, ASAN kept). Approve once green.

@Selfeer (Collaborator) commented May 14, 2026

Audit: PR #1768 — Fix LOGICAL_ERROR in Parquet reader for nullable columns with filter

PR: #1768
Upstream: ClickHouse#102628
Scope: src/Processors/Formats/Impl/Parquet/Reader.cpp (2 independent changes), 2 new test files


Confirmed Defects (pre-fix, both resolved by this PR)

High — Filter-in-decoder misalignment for nullable columns

Anchor: Reader.cpp / decodePrimitiveColumn, use_filter_in_decoder decision block

Trigger: nullable Parquet column (need_null_map=true) + non-empty WHERE filter + page not
dictionary-encoded + no offset index (data_pages is empty)

Root cause:
When use_filter_in_decoder=true, readRowsInPage is called with the full subgroup end row as
end_row_idx, so processDefLevelsForInnermostColumn populates null_map for all rows in the
subgroup — both rows that pass the filter and rows that do not. At the same time, the decoder
receives the row-indexed filter and applies it at encoded-value positions. Because null rows
produce no encoded value, the two indices diverge: the filter is applied at wrong positions, and
null_map.size() ends up as rows_total while column->size() reflects a mismatched number of
decoded values.

expandDataByMask then throws:

LOGICAL_ERROR: Not enough bytes in mask

or, depending on the exact filter pattern, "Too many bytes in mask". If the filter happens to
produce zero decoded values, the size assertion at the end of decodePrimitiveColumn fires instead:

LOGICAL_ERROR: Unexpected number of rows in column subchunk 0 1

Impact: Any SELECT through the native Parquet V3 reader (input_format_parquet_use_native_reader_v3=1) that reads a nullable column with a WHERE filter crashes with LOGICAL_ERROR.

Fix: !column.need_null_map added to the use_filter_in_decoder guard. Falls back to the
standard row-range iteration path, which correctly reads only rows in passing filter ranges and
produces a null_map sized to rows_pass.
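A minimal numeric illustration of the misalignment, using hypothetical data (a four-row subgroup where the filter selects the two NULL rows); the loop is a rough model of the pre-fix fast path, not the actual decoder code:

```cpp
#include <cstdio>
#include <vector>

int main()
{
    // Hypothetical subgroup: logical rows {10, NULL, 20, NULL}.
    std::vector<int> encoded = {10, 20};                  // null rows emit no encoded value
    std::vector<unsigned char> row_filter = {0, 1, 0, 1}; // WHERE passes rows 1 and 3 (the NULLs)
    std::vector<unsigned char> null_map = {0, 1, 0, 1};   // populated for all 4 rows (1 = NULL)

    // Pre-fix fast path, roughly modeled: the row-indexed filter is
    // applied at encoded-value positions. Only value positions 0 and 1
    // exist, so filter bit 1 wrongly selects encoded[1] == 20 for a row
    // that is actually NULL, and filter bits 2..3 address values that
    // were never encoded.
    std::vector<int> selected;
    for (std::size_t pos = 0; pos < encoded.size(); ++pos)
        if (row_filter[pos])        // wrong index space: pos is a value index
            selected.push_back(encoded[pos]);

    std::size_t rows_pass = 2;      // rows 1 and 3 pass the filter
    std::printf("column size=%zu, rows_pass=%zu, null_map size=%zu\n",
                selected.size(), rows_pass, null_map.size()); // 1 vs 2 vs 4: the mismatch
}
```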


High — Inverted memchr sentinel in null-value check

Anchor: Reader.cpp / decodePrimitiveColumn, null check block (line ~1420 pre-fix)

Trigger: non-nullable output schema + null_as_default=0 + Parquet column that can contain
nulls (need_null_map=true) + WHERE filter

Root cause:
ClickHouse null_map convention is 1 = NULL, 0 = NOT NULL — established in
processDefLevelsForInnermostColumn:

is_null = def[i] != max_def;
out_null_map->push_back(is_null);   // bool: 1 for null, 0 for not-null

and confirmed by the expand(null_map, /*inverted*/ true) call, where inverted=true means
"insert default where mask is 1".

The old check searched for the wrong byte:

if (memchr(null_map.data(), 0, null_map.size()) != nullptr)   // ← searches for NOT-NULL
    throw Exception(... CANNOT_INSERT_NULL ...);

This produces two wrong outcomes:

  1. False positive: if the filtered result contains any non-null row (byte 0 exists in the
    map), the exception is thrown even though there are no actual NULLs in the output — a valid
    read is rejected.

  2. Silent pass → LOGICAL_ERROR: if the filter selects only null rows, the map contains all
    1s and memchr(..., 0, ...) returns nullptr — no exception is thrown, null_map is
    cleared, the column has 0 rows, but rows_pass > 0. The size assertion then fires:

    LOGICAL_ERROR: Unexpected number of rows in column subchunk 0 1
    

Impact: For non-nullable output with null_as_default=0, valid reads throw CANNOT_INSERT_NULL
incorrectly, and reads where all filtered values are NULL crash with LOGICAL_ERROR instead of
the expected CANNOT_INSERT_NULL.

Fix: Sentinel changed from 0 to 1 — the check now correctly detects NULL markers.
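A tiny self-contained demonstration of why the old sentinel silently passes an all-NULL map while the fixed one fires (the map contents are illustrative):

```cpp
#include <cstdio>
#include <cstring>
#include <vector>

int main()
{
    // ClickHouse null_map convention: 1 = NULL, 0 = NOT NULL.
    std::vector<unsigned char> null_map = {1, 1, 1};   // every filtered row is NULL

    // Old (inverted) check: searches for a NOT-NULL byte, finds none,
    // so no exception would be raised despite the output being all NULLs.
    bool old_fires = std::memchr(null_map.data(), 0, null_map.size()) != nullptr;

    // Fixed check: searches for the NULL marker and correctly fires.
    bool new_fires = std::memchr(null_map.data(), 1, null_map.size()) != nullptr;

    std::printf("old check fires: %d, fixed check fires: %d\n", old_fires, new_fires); // 0, 1
}
```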


How the Two Bugs Interact

The LOGICAL_ERROR: Unexpected number of rows in column subchunk from the PR title requires both
bugs together:

  1. Bug 1 puts use_filter_in_decoder=true for a nullable column, causing the decoder to be given
    a filter — but null rows consume no encoded value, so encoded-value index ≠ row index.
  2. With a filter that selects only null rows, the decoder finds no encoded values to write
    (column->size() == 0).
  3. Bug 2's inverted check finds no byte 0 in an all-null map, so no exception is thrown and
    null_map is cleared.
  4. expand() is skipped (null_map is null); the final size assertion fires: 0 ≠ rows_pass.

Each bug is also independently harmful — Bug 1 corrupts values or crashes on any nullable+filtered
read; Bug 2 produces false CANNOT_INSERT_NULL or crashes on all-null filtered reads with
non-nullable output.


Coverage Summary

  • Scope reviewed: decodePrimitiveColumn, readRowsInPage, processDefLevelsForInnermostColumn, expandDataByMask, need_null_map initialization, new test cases
  • Categories failed (pre-fix): filter-in-decoder/nullable alignment, inverted null-marker, downstream expand invariant
  • Categories passed: concurrency (fields are read-only during decode), need_null_map=false path (statistics-driven, unaffected)
  • Limits: static analysis only; offset-index path (data_pages non-empty) is out of scope — the existing data_pages.empty() guard already disables the optimization there

@Selfeer (Collaborator) commented May 14, 2026

PR #1768 CI Triage

PR: #1768 - Antalya 26.3 Backport of #102628 - Fix LOGICAL_ERROR crash in Parquet reader for nullable columns with filter
CI report: ci_run_report.html
Date: 2026-05-14

PR Change Scope

Changed files:

  • src/Processors/Formats/Impl/Parquet/Reader.cpp
  • tests/queries/0_stateless/04099_parquet_nullable_null_check_fix.reference
  • tests/queries/0_stateless/04099_parquet_nullable_null_check_fix.sh

This PR is narrowly scoped to a Parquet reader null-check crash fix and an associated stateless regression test.

Summary

  • PR-caused regression: 0
  • Unrelated regression-suite failures: 8 (Regression {parquet, s3_export_part, s3_export_partition, swarms}, on both x86_64 and aarch64)
  • Unrelated flaky/perf check failure: 1 (Stateless tests (arm_binary, sequential))
  • Non-test policy failure: 1 (DCO)

Evidence from CI Metadata

  • CI report New Fails in PR: Nothing to report.
  • CI report Checks New Fails: one test: 00157_cache_dictionary in Stateless tests (arm_binary, sequential).
  • Failure set mostly matches recurring regression-suite patterns already seen on nearby PRs in this branch.

Root Cause Classification

1) Regression* / Parquet / parquet (x86_64 + aarch64) — Unrelated

Observed signatures:

  • snapshot drift such as Date -> Date32
  • snapshot drift such as DateTime -> DateTime64(0)

These are known snapshot/type-behavior mismatches in parquet regression tests. They are not specific to nullable null-check crash handling in Parquet/Reader.cpp.

2) Regression* / S3Export (partition) / s3_export_partition (x86_64 + aarch64) — Unrelated

Observed signature:

  • Code: 344 ... Exporting merge tree partition is experimental ... allow_experimental_export_merge_tree_partition ... (SUPPORT_IS_DISABLED)

This is a regression environment/config flag issue in export-partition tests, unrelated to Parquet reader code.

3) Regression* / S3Export (part) / s3_export_part (x86_64 + aarch64) — Unrelated

Observed signatures:

  • Code: 44 ... storage S3 doesn't support columns with dynamic structure (ILLEGAL_COLUMN) for JSON columns
  • occasional export consistency mismatch in concurrent alter scenario (part_log vs destination parts)

These failures are in S3 export-part test behavior and do not intersect with the Parquet nullable filter fix.

4) Regression* / Swarms / swarms (x86_64 + aarch64) — Unrelated

Observed signatures include:

  • ExpectTimeoutError: Timeout 600.000s
  • UNKNOWN_DATABASE
  • join/assertion mismatches in swarm join scenarios

This is swarms/distributed instability and does not map to changed Parquet reader logic.

5) Stateless tests (arm_binary, sequential) — Unrelated flaky/perf

CI report marks one new fail:

  • 00157_cache_dictionary

This test is historically flaky/slow (frequent timeout/perf behavior in prior triage runs) and does not align with PR #1768 changes (Parquet/Reader.cpp + 04099 test only).

6) DCO — Unrelated to runtime behavior

DCO is ACTION_REQUIRED, which is policy metadata and not a functional regression.

Verdict

There is no evidence that PR #1768 caused or is related to the observed CI failures.

The failing checks are attributable to existing regression-suite instability/configuration gaps (parquet, s3_export_part, s3_export_partition, swarms) plus one unrelated stateless flaky/perf test (00157_cache_dictionary).

@Selfeer added the verified (Approved for release) label on May 14, 2026