Update datafusion dependency to latest in preparation for DF54 #1532

Draft
timsaucer wants to merge 12 commits into apache:main from timsaucer:feat/prepare-df-54

timsaucer (Member) commented May 11, 2026

Which issue does this PR close?

Related to #1533

Rationale for this change

We are updating the upstream DataFusion dependency now so that we can shorten the time to our own 54 release once upstream DataFusion 54 is published.

What changes are included in this PR?

Updated all datafusion dependencies to main at commit 3d06bedcc9afbd65781ac1de28741c36140d2cbb

Python test fixes (23 expectations) for upstream behavior changes:

  • median / approx_median / approx_percentile_cont return Float64 (was matching input type).
  • String functions (concat_ws, lower, upper, repeat, reverse, split_part, translate) return StringView for StringView input (was String).
  • overlay appends past end-of-string rather than replacing.
  • arrays_zip / list_zip struct field names changed from c0/c1 to "1"/"2".
  • Filter on mismatched cast types now errors (was 0 matches).

check-upstream audit trivial wins:

  • New DataFrame.alias(name) — wraps the logical plan in a SubqueryAlias for self-joins and qualifier-style references.
  • functions.__all__: add instr and position (both already defined as public defs but missing from __all__).
  • Top-level datafusion.__all__: re-export TableProviderFactory and TableProviderFactoryExportable (previously reachable only via the datafusion.catalog submodule).
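
The effect of adding names to __all__ can be illustrated with a small, standard-library-only sketch. The modules demo_before and demo_after below are hypothetical in-memory stand-ins, not the real datafusion.functions; they only show that a public def missing from __all__ stays reachable by attribute access but is skipped by a star import.

```python
import sys
import types

# Hypothetical stand-in source; NOT the real datafusion.functions module.
SOURCE = """
def lower(): return "lower"
def instr(): return "instr"
def position(): return "position"
"""


def build_module(name, exported):
    """Create an in-memory module whose __all__ lists only `exported`."""
    module = types.ModuleType(name)
    exec(SOURCE, module.__dict__)
    module.__all__ = list(exported)
    sys.modules[name] = module  # register so `from <name> import *` resolves
    return module


def star_import(name):
    """Return the sorted names bound by `from <name> import *`."""
    namespace = {}
    exec(f"from {name} import *", namespace)
    namespace.pop("__builtins__", None)  # added by exec, not by the import
    return sorted(namespace)


# "Before": instr and position are defined but absent from __all__.
build_module("demo_before", ["lower"])
# "After": both names are listed in __all__.
build_module("demo_after", ["lower", "instr", "position"])

before = star_import("demo_before")
after = star_import("demo_after")

# The functions exist in both modules; only star-import visibility differs.
still_reachable = callable(sys.modules["demo_before"].instr)
```

The same mechanism is why tooling that relies on __all__, such as generated documentation, would skip the two functions until they are listed.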

Are there any user-facing changes?

Yes — several behavior changes inherited from upstream DataFusion 54 (warrants api change label):

  • median / approx_median / approx_percentile_cont now return Float64 rather than matching the input type.
  • String functions return StringView when fed StringView input (concat_ws, lower, upper, repeat, reverse, split_part, translate).
  • overlay semantics: passing a start position past the end of a string now appends the replacement, e.g. overlay("!", "--", 2) → "!--" (was "--").
  • arrays_zip / list_zip field names changed: c0/c1 → "1"/"2".
  • Comparing a numeric column against an incompatible string literal in a filter now raises a Cannot cast string error, where previously it silently produced zero matches.
  • New: DataFrame.alias(name); instr and position now appear under from datafusion.functions import *; TableProviderFactory and TableProviderFactoryExportable are now reachable from the top-level datafusion namespace.
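
The overlay example above can be reproduced with a pure-Python model. This is only a sketch: it assumes the new behavior follows the PostgreSQL-style definition (keep the prefix before a 1-based start, insert the replacement, skip count characters, default count to the replacement length). Only the single case quoted in this PR is confirmed by the source; the function below is not the DataFusion implementation.

```python
def overlay(string, replacement, start, count=None):
    """Model of a PostgreSQL-style OVERLAY with a 1-based start position.

    Assumption: `count` defaults to len(replacement), as in PostgreSQL.
    """
    if start < 1:
        raise ValueError("start is 1-based and must be >= 1")
    if count is None:
        count = len(replacement)
    prefix = string[: start - 1]          # characters before the start position
    suffix = string[start - 1 + count:]   # characters after the replaced span
    return prefix + replacement + suffix


# Start position 2 is past the end of the one-character input, so the
# prefix is the whole input and the replacement is appended.
past_end = overlay("!", "--", 2)

# A start position inside the string replaces in place.
in_place = overlay("Txxxxas", "hom", 2, 4)
```

Under this model, a start position beyond the input length leaves an empty suffix, which is why the replacement ends up appended rather than substituted.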

timsaucer and others added 2 commits May 11, 2026 09:03
Bump workspace deps to apache/datafusion@3d06bedc (git pin) in
preparation for the 54.0.0 release. Workspace package version moves
to 54.0.0 to track the upstream major convention.

Compile fixes:
- Drop as_any impls (trait now has Any as supertrait) and use the
  upstream-provided downcast_ref helper on dyn trait objects.
- Reconcile FFI provider From conversions to drop redundant `+ Send`
  on Arc<dyn ...> bounds.
- Cast/TryCast: data_type → field.data_type() (FieldRef rename).
- Stub match arms for new Expr::HigherOrderFunction / Lambda /
  LambdaVariable and ScalarValue::ListView / LargeListView variants;
  proper exposure deferred to PR 3 audit.
- DatasetExec: partition_statistics returns Arc<Statistics>; add
  required apply_expressions trait method.
- Suppress TableFunctionImpl::call deprecation pending call_with_args
  refactor that needs Session plumbing.

User-facing test updates for upstream behavior changes:
- median / approx_median / approx_percentile_cont now return Float64.
- String functions (concat_ws, lower, upper, repeat, reverse,
  split_part, translate) return StringView when given StringView.
- overlay appends past end-of-string rather than replacing the input.
- arrays_zip / list_zip struct field names "c0"/"c1" → "1"/"2".
- Filter on mismatched cast types now errors (was 0 matches).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion to the upstream DataFusion 53 → main bump. The
check-upstream audit (PR 3 of dev/release/upstream-sync.md) surfaced a
small set of trivial wins; this commit ships them.

Trivial wins:
- DataFrame.alias(name) — wraps the logical plan in a SubqueryAlias.
- functions.__all__: add `instr` and `position` (both were defined as
  public defs but missing from `__all__`, so they didn't show up in
  `from datafusion.functions import *` or generated docs).
- top-level `datafusion.__all__`: re-export `TableProviderFactory` and
  `TableProviderFactoryExportable` (previously only reachable via the
  `datafusion.catalog` submodule).

Non-trivial gaps surfaced by the audit (DataFrame.registry,
into_*/task_ctx, SessionContext extensibility surface, distinct-aware
aggregate variants, TableFunctionImpl::call_with_args migration, FFI
Protocol pipeline gaps) are deferred — each warrants its own design
and PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
timsaucer changed the title from "Feat/prepare df 54" to "Update datafusion dependency to latest in preparation for DF54" on May 11, 2026
timsaucer and others added 7 commits May 12, 2026 15:17
Prior example called alias("t") then to_pydict(), which did not show
the qualifier effect. Replace with a self-join that uses col("l.val")
and col("r.val") so the disambiguation behavior is visible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DataFusion 54 introduces Expr::HigherOrderFunction, Expr::Lambda, and
Expr::LambdaVariable. PyExpr::to_variant previously errored on each
with py_unsupported_variant_err. Add PyHigherOrderFunction, PyLambda,
and PyLambdaVariable wrappers, register them in the expr pymodule and
re-export from python/datafusion/expr.py, and dispatch to_variant to
the new wrappers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Map HigherOrderFunction and Lambda to RexType::Call; LambdaVariable to
RexType::Reference. In rex_call_operands return the args for
HigherOrderFunction, the body for Lambda, and self for LambdaVariable
(mirroring Column). In rex_call_operator return the underlying UDF
name for HigherOrderFunction and the literal "lambda" for Lambda.
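
The mapping described in this commit can be sketched as a small Python model. Everything here is hypothetical scaffolding: the dataclass field names, the function name "array_transform", and the None operator for LambdaVariable are assumptions for illustration, not the Rust code in this PR.

```python
from dataclasses import dataclass
from enum import Enum


class RexType(Enum):
    CALL = "Call"
    REFERENCE = "Reference"


@dataclass
class LambdaVariable:
    name: str


@dataclass
class Lambda:
    params: list
    body: object


@dataclass
class HigherOrderFunction:
    func_name: str  # hypothetical field holding the underlying UDF name
    args: list


def rex_type(expr):
    """HigherOrderFunction and Lambda are calls; LambdaVariable is a reference."""
    if isinstance(expr, (HigherOrderFunction, Lambda)):
        return RexType.CALL
    if isinstance(expr, LambdaVariable):
        return RexType.REFERENCE
    raise TypeError(f"unsupported expression: {expr!r}")


def rex_call_operands(expr):
    """Args for a higher-order function, the body for a lambda, self for a variable."""
    if isinstance(expr, HigherOrderFunction):
        return list(expr.args)
    if isinstance(expr, Lambda):
        return [expr.body]
    if isinstance(expr, LambdaVariable):
        return [expr]
    raise TypeError(f"unsupported expression: {expr!r}")


def rex_call_operator(expr):
    """UDF name for a higher-order function, the literal "lambda" for a lambda."""
    if isinstance(expr, HigherOrderFunction):
        return expr.func_name
    if isinstance(expr, Lambda):
        return "lambda"
    return None  # assumption: the commit message does not specify this case


x = LambdaVariable("x")
lam = Lambda(params=["x"], body=x)
hof = HigherOrderFunction(func_name="array_transform", args=[lam])  # name is illustrative
```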

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arrow

These ScalarValue variants all wrap Arc<...Array>, exposing the outer
DataType via Array::data_type(), so we can mirror the existing
ScalarValue::List arm instead of returning PyNotImplementedError. This
makes Expr.types() work for plans that round-trip through SQL or proto
where these scalar variants surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DataFusion 53.0.0 deprecated TableFunctionImpl::call in favor of
call_with_args(args: TableFunctionArgs), which threads a Session
reference alongside the exprs. Implement call_with_args on
PyTableFunction (delegating to the FFI variant's call_with_args, or
ignoring the session for the pure-Python variant which doesn't use it)
and have __call__ build a TableFunctionArgs from the global session.
Drops both #[allow(deprecated)] attributes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…atch.crates-io]

The workspace version was prematurely bumped to 54.0.0 in the
DF53→pre-54 upgrade. Restore it to 53.0.0 until we are actually
ready to cut the 54 release.

The same change had moved every datafusion-* dependency from a
crates.io version constraint to a direct git dep in
[workspace.dependencies]. Switch them back to "version = \"53\"" and
move the git rev overrides into [patch.crates-io] so the published
manifest will be patch-free.
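
As a rough illustration of that layout, a Cargo.toml fragment might look like the following. Only the datafusion crate is shown, and the exact keys used in this repository are an assumption; the commit hash and upstream repository are the ones named in this PR.

```toml
[workspace.dependencies]
# Plain crates.io constraint, as restored by this commit.
datafusion = { version = "53" }

[patch.crates-io]
# Development-time override: resolve the crate from the pinned upstream commit.
datafusion = { git = "https://github.com/apache/datafusion", rev = "3d06bedcc9afbd65781ac1de28741c36140d2cbb" }
```

The other datafusion-* crates would need matching entries in both tables.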

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
timsaucer self-assigned this on May 12, 2026
Multi-partition `collect()` returns batches in execution-scheduling
order, which is non-deterministic and differs between local and CI
runners. Sort by the first value of column 0 (unique per partition in
each affected test) so the expected/actual comparison is stable.
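
A minimal sketch of the idea, using plain Python lists in place of record batches. Each batch is modeled as a list of columns, and the sample data is made up; the actual tests operate on the batches returned by collect().

```python
def sort_batches(batches):
    """Order batches by the first value of their first column.

    Assumes that value is unique per batch, as the commit message states
    for the affected tests.
    """
    return sorted(batches, key=lambda batch: batch[0][0])


# Two hypothetical arrival orders for the same pair of partitions.
# Each batch is [column_0, column_1].
run_a = [[[10, 11], ["c", "d"]], [[1, 2], ["a", "b"]]]
run_b = list(reversed(run_a))

stable_a = sort_batches(run_a)
stable_b = sort_batches(run_b)
```

After sorting, both arrival orders compare equal, so an expected/actual assertion no longer depends on scheduling.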

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>