fix(reader): allow predicates on nested-leaf columns #2433

Open
vvarma wants to merge 2 commits into apache:main from wesprint-io:vinayvarma/1-nested-pred


@vvarma vvarma commented May 11, 2026

Which issue does this PR close?

What changes are included in this PR?

Predicates that reference a primitive leaf inside a struct were rejected by PredicateConverter::bound_reference because the Parquet column root for a nested leaf is the surrounding group, not the leaf itself. That rejection is unnecessary: the downstream get_row_filter -> ArrowPredicateFn pipeline already projects each requested leaf via ProjectionMask::leaves, so the leaf is available in the projected RecordBatch, just nested under its parent struct.
Two changes:

  1. Drop the get_column_root(...).is_group() rejection.

  2. Have bound_reference return the Parquet column path (root -> leaf) instead of just the leaf's projected index. project_column now walks that path: a top-level column resolves as before, and a nested-leaf path descends through StructArray children by name.
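
To illustrate the path walk in change 2, here is a minimal std-only sketch. It is a toy model, not the code in this PR: the Column enum, the example_batch helper, and this project_column signature are all hypothetical stand-ins for arrow's RecordBatch and StructArray, chosen so the example has no crate dependencies.

```rust
// Toy stand-in for an Arrow column: either a primitive leaf or a struct
// whose children are looked up by name. (Hypothetical type, not arrow-rs.)
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Int(Vec<i64>),
    Struct(Vec<(String, Column)>),
}

// Resolve a root -> leaf path against the top-level columns of a batch.
// A one-element path is a plain top-level lookup; longer paths descend
// through struct children by name. Returns None if any segment is missing
// or if the path tries to descend into a non-struct column.
fn project_column<'a>(batch: &'a [(String, Column)], path: &[&str]) -> Option<&'a Column> {
    let (root, rest) = path.split_first()?;
    let mut current = batch
        .iter()
        .find(|(name, _)| name.as_str() == *root)
        .map(|(_, col)| col)?;
    for segment in rest {
        match current {
            Column::Struct(children) => {
                current = children
                    .iter()
                    .find(|(name, _)| name.as_str() == *segment)
                    .map(|(_, col)| col)?;
            }
            // A primitive leaf has no children to descend into.
            Column::Int(_) => return None,
        }
    }
    Some(current)
}

// Hypothetical sample data: a top-level `id` column and a struct column
// `nested` with a single primitive child `value`.
fn example_batch() -> Vec<(String, Column)> {
    vec![
        ("id".to_string(), Column::Int(vec![1, 2, 3])),
        (
            "nested".to_string(),
            Column::Struct(vec![("value".to_string(), Column::Int(vec![10, 20, 30]))]),
        ),
    ]
}
```

With this model, project_column(&example_batch(), &["nested", "value"]) resolves to the inner integer column, while the single-segment path ["id"] behaves like the original index-based lookup.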

Are these changes tested?

Tests added:

  1. test_get_row_filter_accepts_predicate_on_nested_leaf covers row-filter construction for a nested leaf path.

  2. test_perform_read_with_nested_leaf_predicate writes a Parquet file with a struct leaf and verifies the full ArrowReader pipeline returns only the row matching nested.value = 20.
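
What the second test asserts can be modeled without Parquet at all: a row filter evaluates the predicate on the projected leaf to get a boolean mask, then keeps only the rows whose mask bit is set. The sketch below assumes hypothetical helper names (eq_mask, apply_mask) and made-up row values; the real test goes through ArrowReader and an actual Parquet file.

```rust
// Build a selection mask: true where the leaf value equals `target`.
// This plays the role of evaluating `nested.value = target` on one batch.
fn eq_mask(leaf: &[i64], target: i64) -> Vec<bool> {
    leaf.iter().map(|v| *v == target).collect()
}

// Keep only the entries of `column` whose corresponding mask bit is set.
fn apply_mask<T: Clone>(column: &[T], mask: &[bool]) -> Vec<T> {
    column
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| value.clone())
        .collect()
}
```

For example, with illustrative leaf values [10, 20, 30] for nested.value, eq_mask(&values, 20) selects only the second row, and apply_mask carries that selection over to any other column of the same length.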

vvarma and others added 2 commits May 11, 2026 11:43
Development

Successfully merging this pull request may close these issues.

Predicates on nested-leaf columns fail
