Skip to content

Spec: Add content stats to spec#14234

Open
nastra wants to merge 11 commits into
apache:mainfrom
nastra:content-stats-spec
Open

Spec: Add content stats to spec#14234
nastra wants to merge 11 commits into
apache:mainfrom
nastra:content-stats-spec

Conversation

@nastra
Copy link
Copy Markdown
Contributor

@nastra nastra commented Oct 2, 2025

These spec changes are related to the content stats proposal. Note that the root content_stats field hasn't been added yet, as its ID and position in the manifest currently depends on the v4 re-structuring at the manifest layer.

@github-actions github-actions Bot added the Specification Issues that may introduce spec changes. label Oct 2, 2025
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Nov 2, 2025

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Nov 2, 2025
@nastra nastra added not-stale and removed stale labels Nov 3, 2025
@nastra nastra force-pushed the content-stats-spec branch from e86f035 to a1f6ad4 Compare November 24, 2025 14:07
@nastra nastra requested a review from danielcweeks November 24, 2025 14:08
@nastra nastra force-pushed the content-stats-spec branch 2 times, most recently from ae11c0a to 919c682 Compare November 24, 2025 14:32
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment on lines +723 to +724
| avg_value_count | `int` | 4 | false | The avg value count for variable-length types (string/binary) |
| max_value_count | `long` | 5 | false | The max value count for variable-length types (string/binary) |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be avg_value_length and max_value_length. I don't think this represents a count.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue since you brought up those two in the design doc, any thoughts about the exact naming of these two fields?

Copy link
Copy Markdown
Contributor

@rdblue rdblue Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dan is correct. These should be avg_value_size_in_bytes.

We should check with engine people to see if we want to include the max. If it isn't going to be valuable in the short term, we should add it later. Also, I think that we need to be consistent about the types. This is long but the average is int.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least for Spark I think avg/max would be useful as we could report those through SupportsReportStatistics to Spark for CBO

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ebyhr could the avg/max value size be used in Trino to have a better estimate for the data_size statistic?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this size on disk or in memory? Trino CBO (TableScanStatsRule) uses the average size. The max value size is unused as far as I know.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nastra, did you confirm whether max is used by Spark?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue yes and I think the only place where the max would be used by Spark is in the DSv2 Column Stats: https://github.com/apache/spark/blob/dc57455589a9f160638ab3526ca359eb84f76d67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala#L101, which we could pass via the SupportsReportStatistics API

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just because it is passed doesn't mean it is used. Can we find out if this is aspirational or whether it is really needed?

Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated

###### Name assignment for `content_stats` fields

Each nested stats struct is a **child field** of the root `content_stats` struct. Its **name** is the numerical string of the table column's field id (for example id `103` uses the name `"103"`).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that this requirement is a good idea. We are going to rewrite this when reading to produce a struct that uses the current field name. Names don't matter, so we should just use the current name for readability.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this to:

Content status must be resovled by ID; field names used for stats structs are informational. The recommended name for each field is the full name of the field in the table schema.

Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
@nastra nastra force-pushed the content-stats-spec branch from ee3f024 to 94c221a Compare May 7, 2026 10:42
@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented May 8, 2026

@nastra, I opened nastra#139 with a lot of edits to simplify and apply some of the edits I noted. I also cleaned up the relationship to the existing variant and geo bounds, although variant still has some outstanding work (how to shred).

I'm going to move on to the section for how to maintain stats in manifest files next.

@nastra nastra force-pushed the content-stats-spec branch from 9541f23 to a1c6f09 Compare May 12, 2026 15:38
@nastra nastra force-pushed the content-stats-spec branch from a1c6f09 to 4b21971 Compare May 12, 2026 15:44
Comment thread format/spec.md
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md

Implementations may produce stats structs for fields that are not in the table schema, if a field ID from the table's column ID space is assigned for the data values (by allocating an ID using `last-column-id`). Implementations are not required to write a stats struct for every table field.

Fields with stats tracked in `content_stats` change based on updates like schema evolution or metrics configuration. Writers adapt to table changes by writing new manifest files with the implementation's current `content_stats` type. When existing file metadata is written to new manifests, writers must discard old stats, set unknown stats structs to null, and promote lower and upper bounds types to conform to the manifest schema.
Copy link
Copy Markdown
Contributor

@stevenzwu stevenzwu May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This currently covers only type promotion. Is it worth stating each case explicitly?

  • Add column: a new stats struct field appears in newer manifests; older manifests without the field resolve via the v4 field default of null.
  • Drop column: when rewriting manifests with the current schema, the dropped column's stats struct is omitted; older manifests retain it and readers ignore it.
  • Rename column: ID-based resolution makes rename invisible on the wire; only the recommended name (the informational struct field name from the resolution rule above) changes when manifests are rewritten with the current schema.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that drop and add were covered by "discard old stats" (drop) and "set unknown stats structs to null" (add). Maybe those should be more clear?

I also thought that this was clarified by the next paragraph, which states that all you need to do is use the current type to read and do promotion.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. those can probably be more clear. "discard old stats" wasn't immediately clear to me. I got it after you mentioned it

Comment thread format/spec.md

| Requirement | Offset | Name | Type | Description |
|-------------|--------|------|----------|-------------|
| _required_ | 10 | `x` | `double` | Bounding box westernmost/xmin; [-180..180] |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do geo_lower offsets start at 10 and not 8?

Two options worth deciding here:

  1. Compact geo offsets to 8–11 / 12–15 to eliminate the gap.
  2. Keep the gap and add a note that offsets 8–9 are reserved for future common metrics — so readers do not wonder, and so adding 1 or 2 new metrics does not need to renumber.

Either is fine, but the current state (silent gap) may leave future readers wondering.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the offsets are reserved if they aren't specified here. I just wanted to give the geo bound structs memorable offsets.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. this is fine

Comment thread format/spec.md
| _optional_ | 6 | `nan_value_count` | `long` | `float`, `double` | Number of NaN values in the column |
| _optional_ | 7 | `avg_value_size_in_bytes` | `int` | `string`, `binary`, `variant` | Avg value size (uncompressed) in bytes to estimate memory consumption |

For example, stats for a `required` `int` field named `id` with field-id `2` are stored using:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be better to move the example toward the end of this section and it would help to spell out the ID assignment for geo more explicitly.

The int and string cases are straightforward, but the geo case has two non-obvious twists: lower_bound/upper_bound are themselves structs, and the inner x/y/z/m IDs are assigned from the parent's range (offsets 10–17), not from a fresh range inside geo_lower. A side-by-side example would make this concrete.

The geo struct fields use offsets 10–17 from base-id, even though geo_lower is logically the value of lower_bound (offset 1). The flattened ID layout is unusual and a worked example (e.g., a geometry field with field-id = 2 producing IDs 10_400 for the stats struct, 10_401 for lower_bound aka geo_lower, and 10_410..10_413 for x/y/z/m) would remove ambiguity.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds reasonable. Would you prefer to add another geo type example, or move this one? I think it would be better to leave this here and add a geo example since this is showing immediately after documenting offsets how they are used.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initially I was thinking move this one. But I agree that it is better to leave this here as geo is a different field type. A separate geo example makes more sense.

rdblue and others added 2 commits May 12, 2026 15:51
Co-authored-by: Steven Zhen Wu <stevenz3wu@gmail.com>
Co-authored-by: Steven Zhen Wu <stevenz3wu@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

not-stale Specification Issues that may introduce spec changes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants