Spec: Add content stats to spec by nastra · Pull Request #14234 · apache/iceberg

nastra · 2025-10-02T09:43:34Z

These spec changes are related to the content stats proposal. Note that the root content_stats field hasn't been added yet, as its ID and position in the manifest currently depends on the v4 re-structuring at the manifest layer.

github-actions · 2025-11-02T00:19:02Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

danielcweeks · 2026-02-04T16:06:18Z

+| avg_value_count  | `int`               | 4                                   | false    | The avg value count for variable-length types (string/binary)                                                                                                                  |
+| max_value_count  | `long`              | 5                                   | false    | The max value count for variable-length types (string/binary)                                                                                                                  |


Should this be avg_value_length and max_value_length. I don't think this represents a count.

@rdblue since you brought up those two in the design doc, any thoughts about the exact naming of these two fields?

Dan is correct. These should be avg_value_size_in_bytes.

We should check with engine people to see if we want to include the max. If it isn't going to be valuable in the short term, we should add it later. Also, I think that we need to be consistent about the types. This is long but the average is int.

At least for Spark I think avg/max would be useful as we could report those through SupportsReportStatistics to Spark for CBO

@ebyhr could the avg/max value size be used in Trino to have a better estimate for the data_size statistic?

Is this size on disk or in memory? Trino CBO (TableScanStatsRule) uses the average size. The max value size is unused as far as I know.

@nastra, did you confirm whether max is used by Spark?

@rdblue yes and I think the only place where the max would be used by Spark is in the DSv2 Column Stats: https://github.com/apache/spark/blob/dc57455589a9f160638ab3526ca359eb84f76d67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala#L101, which we could pass via the SupportsReportStatistics API

Just because it is passed doesn't mean it is used. Can we find out if this is aspirational or whether it is really needed?

rdblue · 2026-05-06T21:58:57Z

+
+###### Name assignment for `content_stats` fields
+
+Each nested stats struct is a **child field** of the root `content_stats` struct. Its **name** is the numerical string of the table column's field id (for example id `103` uses the name `"103"`).


I don't think that this requirement is a good idea. We are going to rewrite this when reading to produce a struct that uses the current field name. Names don't matter, so we should just use the current name for readability.

I updated this to:

Content status must be resovled by ID; field names used for stats structs are informational. The recommended name for each field is the full name of the field in the table schema.

rdblue · 2026-05-08T22:30:18Z

@nastra, I opened nastra#139 with a lot of edits to simplify and apply some of the edits I noted. I also cleaned up the relationship to the existing variant and geo bounds, although variant still has some outstanding work (how to shred).

I'm going to move on to the section for how to maintain stats in manifest files next.

stevenzwu · 2026-05-12T21:38:56Z

+
+Implementations may produce stats structs for fields that are not in the table schema, if a field ID from the table's column ID space is assigned for the data values (by allocating an ID using `last-column-id`). Implementations are not required to write a stats struct for every table field.
+
+Fields with stats tracked in `content_stats` change based on updates like schema evolution or metrics configuration. Writers adapt to table changes by writing new manifest files with the implementation's current `content_stats` type. When existing file metadata is written to new manifests, writers must discard old stats, set unknown stats structs to null, and promote lower and upper bounds types to conform to the manifest schema.


This currently covers only type promotion. Is it worth stating each case explicitly?

Add column: a new stats struct field appears in newer manifests; older manifests without the field resolve via the v4 field default of null.

Drop column: when rewriting manifests with the current schema, the dropped column's stats struct is omitted; older manifests retain it and readers ignore it.

Rename column: ID-based resolution makes rename invisible on the wire; only the recommended name (the informational struct field name from the resolution rule above) changes when manifests are rewritten with the current schema.

I thought that drop and add were covered by "discard old stats" (drop) and "set unknown stats structs to null" (add). Maybe those should be more clear?

I also thought that this was clarified by the next paragraph, which states that all you need to do is use the current type to read and do promotion.

yeah. those can probably be more clear. "discard old stats" wasn't immediately clear to me. I got it after you mentioned it

stevenzwu · 2026-05-12T21:55:33Z

+
+| Requirement | Offset | Name | Type     | Description |
+|-------------|--------|------|----------|-------------|
+| _required_  | 10     | `x`  | `double` | Bounding box westernmost/xmin; [-180..180] |


Why do geo_lower offsets start at 10 and not 8?

Two options worth deciding here:

Compact geo offsets to 8–11 / 12–15 to eliminate the gap.

Keep the gap and add a note that offsets 8–9 are reserved for future common metrics — so readers do not wonder, and so adding 1 or 2 new metrics does not need to renumber.

Either is fine, but the current state (silent gap) may leave future readers wondering.

All of the offsets are reserved if they aren't specified here. I just wanted to give the geo bound structs memorable offsets.

I see. this is fine

stevenzwu · 2026-05-12T21:55:56Z

+| _optional_  | 6      | `nan_value_count`         | `long`                    | `float`, `double`                             | Number of NaN values in the column |
+| _optional_  | 7      | `avg_value_size_in_bytes` | `int`                     | `string`, `binary`, `variant`                 | Avg value size (uncompressed) in bytes to estimate memory consumption |
+
+For example, stats for a `required` `int` field named `id` with field-id `2` are stored using:


it might be better to move the example toward the end of this section and it would help to spell out the ID assignment for geo more explicitly.

The int and string cases are straightforward, but the geo case has two non-obvious twists: lower_bound/upper_bound are themselves structs, and the inner x/y/z/m IDs are assigned from the parent's range (offsets 10–17), not from a fresh range inside geo_lower. A side-by-side example would make this concrete.

The geo struct fields use offsets 10–17 from base-id, even though geo_lower is logically the value of lower_bound (offset 1). The flattened ID layout is unusual and a worked example (e.g., a geometry field with field-id = 2 producing IDs 10_400 for the stats struct, 10_401 for lower_bound aka geo_lower, and 10_410..10_413 for x/y/z/m) would remove ambiguity.

This sounds reasonable. Would you prefer to add another geo type example, or move this one? I think it would be better to leave this here and add a geo example since this is showing immediately after documenting offsets how they are used.

initially I was thinking move this one. But I agree that it is better to leave this here as geo is a different field type. A separate geo example makes more sense.

Co-authored-by: Steven Zhen Wu <stevenz3wu@gmail.com>

github-actions Bot added the Specification Issues that may introduce spec changes. label Oct 2, 2025

github-actions Bot added the stale label Nov 2, 2025

nastra added not-stale and removed stale labels Nov 3, 2025

nastra force-pushed the content-stats-spec branch from e86f035 to a1f6ad4 Compare November 24, 2025 14:07

nastra requested a review from danielcweeks November 24, 2025 14:08

nastra force-pushed the content-stats-spec branch 2 times, most recently from ae11c0a to 919c682 Compare November 24, 2025 14:32

nastra requested review from amogh-jahagirdar and singhpk234 January 9, 2026 15:21

nastra commented Jan 22, 2026

View reviewed changes