Skip to content

Commit 1afc4a4

Browse files
RafaelHerrerolizardoluis
authored andcommitted
fix: skip empty metadata in intersect_metadata_for_union to prevent s… (apache#21127)
## Which issue does this PR close? - Closes apache#19049. ## Rationale for this change We're building a SQL engine on top of DataFusion and hit this while running benchmarks. A `UNION ALL` query against Parquet files that carry field metadata (like `PARQUET:field_id` or InfluxDB's `iox::column::type`). When one branch of the union has a NULL literal, `intersect_metadata_for_union` intersects the metadata from the data source with the empty metadata from the NULL — and since intersecting anything with an empty set gives empty, all metadata gets wiped out. Later, when `optimize_projections` prunes columns and `recompute_schema` rebuilds the Union schema, the logical schema has `{}` while the physical schema still has the original metadata from Parquet. This causes: ``` Internal error: Physical input schema should be the same as the one converted from logical input schema. Differences: - field metadata at index 0 [usage_idle]: (physical) {"iox::column::type": "..."} vs (logical) {} ``` As @erratic-pattern and @alamb discussed in the issue, empty metadata from NULL literals isn't saying "this field has no metadata" — it's saying "I don't know." It shouldn't erase metadata from branches that actually have it. I fixed this in `intersect_metadata_for_union` directly rather than patching `optimize_projections` or `recompute_schema`, since that's where the bad intersection happens and it covers all code paths that derive Union schemas. ## What changes are included in this PR? One change to `intersect_metadata_for_union` in `datafusion/expr/src/expr.rs`: branches with empty metadata are skipped during intersection instead of participating. Non-empty branches still intersect normally (conflicting values still get dropped). If every branch is empty, the result is empty — same as before. ## Are these changes tested? Added 7 unit tests for `intersect_metadata_for_union`: - Same metadata across branches — preserved - Conflicting non-empty values — dropped (existing behavior, unchanged) - One branch has metadata, other is empty — metadata preserved (the fix) - Empty branch comes first — still works - All branches empty — empty result - Mix of empty and conflicting non-empty — intersects only the non-empty ones - No inputs — empty result The full end-to-end reproduction needs Parquet files with field metadata as described in the issue. The unit tests cover the intersection logic directly. ## Are there any user-facing changes? No API changes. `UNION ALL` queries combining metadata-carrying sources with NULL literals will stop failing with schema mismatch errors.
1 parent 5746048 commit 1afc4a4

2 files changed

Lines changed: 96 additions & 7 deletions

File tree

datafusion/expr/src/expr.rs

Lines changed: 79 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -510,17 +510,26 @@ pub type SchemaFieldMetadata = std::collections::HashMap<String, String>;
510510
pub fn intersect_metadata_for_union<'a>(
511511
metadatas: impl IntoIterator<Item = &'a SchemaFieldMetadata>,
512512
) -> SchemaFieldMetadata {
513-
let mut metadatas = metadatas.into_iter();
514-
let Some(mut intersected) = metadatas.next().cloned() else {
515-
return Default::default();
516-
};
513+
let mut intersected: Option<SchemaFieldMetadata> = None;
517514

518515
for metadata in metadatas {
519-
// Only keep keys that exist in both with the same value
520-
intersected.retain(|k, v| metadata.get(k) == Some(v));
516+
// Skip empty metadata (e.g. from NULL literals or computed expressions)
517+
// to avoid dropping metadata from branches that have it.
518+
if metadata.is_empty() {
519+
continue;
520+
}
521+
match &mut intersected {
522+
None => {
523+
intersected = Some(metadata.clone());
524+
}
525+
Some(current) => {
526+
// Only keep keys that exist in both with the same value
527+
current.retain(|k, v| metadata.get(k) == Some(v));
528+
}
529+
}
521530
}
522531

523-
intersected
532+
intersected.unwrap_or_default()
524533
}
525534

526535
/// UNNEST expression.
@@ -4113,4 +4122,67 @@ mod test {
41134122
}
41144123
}
41154124
}
4125+
4126+
mod intersect_metadata_tests {
4127+
use super::super::intersect_metadata_for_union;
4128+
use std::collections::HashMap;
4129+
4130+
#[test]
4131+
fn all_branches_same_metadata() {
4132+
let m1 = HashMap::from([("key".into(), "val".into())]);
4133+
let m2 = HashMap::from([("key".into(), "val".into())]);
4134+
let result = intersect_metadata_for_union([&m1, &m2]);
4135+
assert_eq!(result, HashMap::from([("key".into(), "val".into())]));
4136+
}
4137+
4138+
#[test]
4139+
fn conflicting_metadata_dropped() {
4140+
let m1 = HashMap::from([("key".into(), "a".into())]);
4141+
let m2 = HashMap::from([("key".into(), "b".into())]);
4142+
let result = intersect_metadata_for_union([&m1, &m2]);
4143+
assert!(result.is_empty());
4144+
}
4145+
4146+
#[test]
4147+
fn empty_metadata_branch_skipped() {
4148+
let m1 = HashMap::from([("key".into(), "val".into())]);
4149+
let m2 = HashMap::new(); // e.g. NULL literal
4150+
let result = intersect_metadata_for_union([&m1, &m2]);
4151+
assert_eq!(result, HashMap::from([("key".into(), "val".into())]));
4152+
}
4153+
4154+
#[test]
4155+
fn empty_metadata_first_branch_skipped() {
4156+
let m1 = HashMap::new();
4157+
let m2 = HashMap::from([("key".into(), "val".into())]);
4158+
let result = intersect_metadata_for_union([&m1, &m2]);
4159+
assert_eq!(result, HashMap::from([("key".into(), "val".into())]));
4160+
}
4161+
4162+
#[test]
4163+
fn all_branches_empty_metadata() {
4164+
let m1: HashMap<String, String> = HashMap::new();
4165+
let m2: HashMap<String, String> = HashMap::new();
4166+
let result = intersect_metadata_for_union([&m1, &m2]);
4167+
assert!(result.is_empty());
4168+
}
4169+
4170+
#[test]
4171+
fn mixed_empty_and_conflicting() {
4172+
let m1 = HashMap::from([("key".into(), "a".into())]);
4173+
let m2 = HashMap::new();
4174+
let m3 = HashMap::from([("key".into(), "b".into())]);
4175+
let result = intersect_metadata_for_union([&m1, &m2, &m3]);
4176+
// m2 is skipped; m1 and m3 conflict → dropped
4177+
assert!(result.is_empty());
4178+
}
4179+
4180+
#[test]
4181+
fn no_inputs() {
4182+
let result = intersect_metadata_for_union(std::iter::empty::<
4183+
&HashMap<String, String>,
4184+
>());
4185+
assert!(result.is_empty());
4186+
}
4187+
}
41164188
}

datafusion/sqllogictest/test_files/metadata.slt

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -139,6 +139,23 @@ ORDER BY id, name, l_name;
139139
NULL bar NULL
140140
NULL NULL l_bar
141141

142+
# Regression test: UNION ALL + optimize_projections pruning unused columns causes
143+
# metadata loss when one branch has NULL literals (empty metadata) and the other
144+
# has field metadata. The optimizer prunes unused columns, triggering recompute_schema
145+
# which rebuilds the Union schema via intersect_metadata_for_union. Previously, this
146+
# intersection would drop metadata when any branch had empty metadata (from NULL literals).
147+
# See https://github.com/apache/datafusion/issues/19049
148+
query T
149+
SELECT name FROM (
150+
SELECT id, name FROM "table_with_metadata"
151+
UNION ALL
152+
SELECT NULL::int as id, NULL::string as name
153+
) GROUP BY name ORDER BY name;
154+
----
155+
bar
156+
baz
157+
NULL
158+
142159
# Regression test: missing field metadata from left side of the union when right side is chosen
143160
query T
144161
select name from (

0 commit comments

Comments
 (0)