Spec: Add content stats to spec#14234
Conversation
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
e86f035 to
a1f6ad4
Compare
ae11c0a to
919c682
Compare
| | avg_value_count | `int` | 4 | false | The avg value count for variable-length types (string/binary) | | ||
| | max_value_count | `long` | 5 | false | The max value count for variable-length types (string/binary) | |
There was a problem hiding this comment.
Should this be avg_value_length and max_value_length. I don't think this represents a count.
There was a problem hiding this comment.
@rdblue since you brought up those two in the design doc, any thoughts about the exact naming of these two fields?
There was a problem hiding this comment.
Dan is correct. These should be avg_value_size_in_bytes.
We should check with engine people to see if we want to include the max. If it isn't going to be valuable in the short term, we should add it later. Also, I think that we need to be consistent about the types. This is long but the average is int.
There was a problem hiding this comment.
At least for Spark I think avg/max would be useful as we could report those through SupportsReportStatistics to Spark for CBO
There was a problem hiding this comment.
@ebyhr could the avg/max value size be used in Trino to have a better estimate for the data_size statistic?
There was a problem hiding this comment.
Is this size on disk or in memory? Trino CBO (TableScanStatsRule) uses the average size. The max value size is unused as far as I know.
There was a problem hiding this comment.
@nastra, did you confirm whether max is used by Spark?
There was a problem hiding this comment.
@rdblue yes and I think the only place where the max would be used by Spark is in the DSv2 Column Stats: https://github.com/apache/spark/blob/dc57455589a9f160638ab3526ca359eb84f76d67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala#L101, which we could pass via the SupportsReportStatistics API
There was a problem hiding this comment.
Just because it is passed doesn't mean it is used. Can we find out if this is aspirational or whether it is really needed?
1513125 to
d25883c
Compare
|
|
||
| ###### Name assignment for `content_stats` fields | ||
|
|
||
| Each nested stats struct is a **child field** of the root `content_stats` struct. Its **name** is the numerical string of the table column's field id (for example id `103` uses the name `"103"`). |
There was a problem hiding this comment.
I don't think that this requirement is a good idea. We are going to rewrite this when reading to produce a struct that uses the current field name. Names don't matter, so we should just use the current name for readability.
There was a problem hiding this comment.
I updated this to:
Content status must be resovled by ID; field names used for stats structs are informational. The recommended name for each field is the full name of the field in the table schema.
|
@nastra, I opened nastra#139 with a lot of edits to simplify and apply some of the edits I noted. I also cleaned up the relationship to the existing variant and geo bounds, although variant still has some outstanding work (how to shred). I'm going to move on to the section for how to maintain stats in manifest files next. |
9541f23 to
a1c6f09
Compare
a1c6f09 to
4b21971
Compare
|
|
||
| Implementations may produce stats structs for fields that are not in the table schema, if a field ID from the table's column ID space is assigned for the data values (by allocating an ID using `last-column-id`). Implementations are not required to write a stats struct for every table field. | ||
|
|
||
| Fields with stats tracked in `content_stats` change based on updates like schema evolution or metrics configuration. Writers adapt to table changes by writing new manifest files with the implementation's current `content_stats` type. When existing file metadata is written to new manifests, writers must discard old stats, set unknown stats structs to null, and promote lower and upper bounds types to conform to the manifest schema. |
There was a problem hiding this comment.
This currently covers only type promotion. Is it worth stating each case explicitly?
- Add column: a new stats struct field appears in newer manifests; older manifests without the field resolve via the v4 field default of
null. - Drop column: when rewriting manifests with the current schema, the dropped column's stats struct is omitted; older manifests retain it and readers ignore it.
- Rename column: ID-based resolution makes rename invisible on the wire; only the recommended name (the informational struct field name from the resolution rule above) changes when manifests are rewritten with the current schema.
There was a problem hiding this comment.
I thought that drop and add were covered by "discard old stats" (drop) and "set unknown stats structs to null" (add). Maybe those should be more clear?
I also thought that this was clarified by the next paragraph, which states that all you need to do is use the current type to read and do promotion.
There was a problem hiding this comment.
yeah. those can probably be more clear. "discard old stats" wasn't immediately clear to me. I got it after you mentioned it
|
|
||
| | Requirement | Offset | Name | Type | Description | | ||
| |-------------|--------|------|----------|-------------| | ||
| | _required_ | 10 | `x` | `double` | Bounding box westernmost/xmin; [-180..180] | |
There was a problem hiding this comment.
Why do geo_lower offsets start at 10 and not 8?
Two options worth deciding here:
- Compact geo offsets to 8–11 / 12–15 to eliminate the gap.
- Keep the gap and add a note that offsets 8–9 are reserved for future common metrics — so readers do not wonder, and so adding 1 or 2 new metrics does not need to renumber.
Either is fine, but the current state (silent gap) may leave future readers wondering.
There was a problem hiding this comment.
All of the offsets are reserved if they aren't specified here. I just wanted to give the geo bound structs memorable offsets.
| | _optional_ | 6 | `nan_value_count` | `long` | `float`, `double` | Number of NaN values in the column | | ||
| | _optional_ | 7 | `avg_value_size_in_bytes` | `int` | `string`, `binary`, `variant` | Avg value size (uncompressed) in bytes to estimate memory consumption | | ||
|
|
||
| For example, stats for a `required` `int` field named `id` with field-id `2` are stored using: |
There was a problem hiding this comment.
it might be better to move the example toward the end of this section and it would help to spell out the ID assignment for geo more explicitly.
The int and string cases are straightforward, but the geo case has two non-obvious twists: lower_bound/upper_bound are themselves structs, and the inner x/y/z/m IDs are assigned from the parent's range (offsets 10–17), not from a fresh range inside geo_lower. A side-by-side example would make this concrete.
The geo struct fields use offsets 10–17 from base-id, even though geo_lower is logically the value of lower_bound (offset 1). The flattened ID layout is unusual and a worked example (e.g., a geometry field with field-id = 2 producing IDs 10_400 for the stats struct, 10_401 for lower_bound aka geo_lower, and 10_410..10_413 for x/y/z/m) would remove ambiguity.
There was a problem hiding this comment.
This sounds reasonable. Would you prefer to add another geo type example, or move this one? I think it would be better to leave this here and add a geo example since this is showing immediately after documenting offsets how they are used.
There was a problem hiding this comment.
initially I was thinking move this one. But I agree that it is better to leave this here as geo is a different field type. A separate geo example makes more sense.
Co-authored-by: Steven Zhen Wu <stevenz3wu@gmail.com>
Co-authored-by: Steven Zhen Wu <stevenz3wu@gmail.com>
These spec changes are related to the content stats proposal. Note that the root
content_statsfield hasn't been added yet, as its ID and position in the manifest currently depends on the v4 re-structuring at the manifest layer.