Commit 7c34289

docs: adopt Option 2 provenance columns for concept index (resolves #58 finding 1) (#80)
* docs: adopt Option 2 provenance columns for concept index (issue #58 finding 1)

Resolves the blocker from the issue #58 doc review: the 12-hex ``compile_id`` is not wide enough to serve as the strict-verification integrity token. Under a 48-bit collision (or an out-of-band swap that keeps both tables' short IDs aligned to a stale cache), the documented checks terminate at ``__meta`` and never touch a ``main`` row, so main data from a different compile would pass as "verified."

Option 2 splits the provenance surface into two columns with distinct roles, decided in issue #58#issuecomment-4311684137 and refined in #issuecomment-4311716470:

- ``compile_id``: 12 hex chars, display/debug token only. Used in reports, queue rows, error messages, log lines. **Never** the sole freshness check.
- ``compile_fingerprint``: 64 hex chars, canonical integrity key. Full SHA-256 over ``ontology_fingerprint || binding_fingerprint || compiler_version``. Used for main↔meta pair consistency and runtime verification against cached local fingerprints.

Structural invariant: ``compile_id == compile_fingerprint[:12]``, enforced at the ``_fingerprint.py`` module boundary (short form derived from full form, not the reverse). Makes it impossible to ship a row where the display token does not correspond to the integrity key.

Changes:

- ``entity_resolution_primitives.md`` §4.2: concept-index schema gains ``compile_fingerprint STRING NOT NULL`` column; add a two-role table pinning the column contract; state the invariant.
- §5 verification queries: rewritten to use ``compile_fingerprint`` on all three checks. No short-ID arithmetic on the verification path.
- §10 W2 watchpoint: rewritten under the new contract. Old "48-bit collision" caveat retired because the collision vector no longer exists: strict queries never read ``compile_id``.
- §11 Decisions pinned: Option 2 recorded as a closed decision.
- ``implementation_plan_concept_index_runtime.md`` A1: updated to describe the three exports (``fingerprint_model``, ``compile_fingerprint``, ``compile_id``) and the structural derivation.
- A2, A4, A5: both provenance columns now explicitly listed.
- C4: TTL re-check query rewritten against ``compile_fingerprint``.
- D11, D14: test scope updated to match.
- W2 watchpoint (plan): rewritten to match the RFC.

Downstream sequencing unchanged:

1. ``_fingerprint.py`` extension (PR #71 additive update): adds ``compile_fingerprint()`` as the primitive; ``compile_id()`` is derived. Non-breaking.
2. A2 row builder consumes the two-column contract.
3. A3 emission SQL writes both columns.
4. B1/C4 verifier queries ``compile_fingerprint``.

No ``_fingerprint.py`` changes in this PR; that lands on #71.

* docs: address PR #80 review findings

Three fixes:

1. **Compile vs execute permissions (finding 1).** The plan's BQ-permissions rollout note said ``--emit-concept-index requires bigquery.tables.create`` and called that an "existing compile_graph() requirement." Both parts misstate the contract: ``gm compile`` emits SQL to stdout/``--output`` and never calls BigQuery. Rewritten to separate compile-time (local file access only) from execute-time (the emitted SQL, run via ``bq query`` / console / Airflow, needs ``bigquery.tables.create``). Matches the actual existing contract for ``compile_graph()``.
2. **Exact payload contract (finding 2).** Prior wording in the RFC table and plan A1 description described the payload as ``ontology_fingerprint || binding_fingerprint || compiler_version``, where ``||`` is ambiguous: a literal reading is plain concatenation, which is not what the implementation does. Now pinned as "SHA-256 over the NUL-delimited UTF-8 of the three inputs" in both places, with the explicit instruction to call ``compile_fingerprint()`` rather than reimplement the payload. Paired with the golden-vector test landing on PR #71.
3. **§7 RFC status row (finding 4).** The row for ``_fingerprint.py`` listed only ``fingerprint_model`` and ``compile_id``. Option 2 ships three exports; the row now names all three and states the derivation relationship.

* docs(rfc): remove last compile/execute ambiguity in §1 directions table

Final nit from PR #80 review. The RFC's 'How the three directions compose' table still had 'gm compile → DDL + concept-index SQL published to BQ' on the Direction 1 row, which reads as if the compile step itself publishes to BigQuery. Matches the permissions note in the plan doc: compile emits SQL, operator executes it.
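To make the payload contract and the structural invariant from the message above concrete, here is a minimal Python sketch. It is not the `_fingerprint.py` implementation (that lands on PR #71); the function bodies, the argument order, and the choice of a single NUL byte between inputs are assumptions based only on the wording "SHA-256 over the NUL-delimited UTF-8 of the three inputs". `concat_fingerprint` is a hypothetical helper that shows why a literal reading of `||` as plain concatenation would be ambiguous.

```python
import hashlib


def compile_fingerprint(ontology_fp: str, binding_fp: str, compiler_version: str) -> str:
    """Illustrative stand-in: SHA-256 over the NUL-delimited UTF-8 of the inputs.

    Assumption: exactly one NUL byte separates the inputs, in this order.
    The authoritative layout lives in `_fingerprint.py`.
    """
    payload = b"\x00".join(
        part.encode("utf-8")
        for part in (ontology_fp, binding_fp, compiler_version)
    )
    return hashlib.sha256(payload).hexdigest()


def compile_id(ontology_fp: str, binding_fp: str, compiler_version: str) -> str:
    """Display token, derived from the full form and never the other way around."""
    return compile_fingerprint(ontology_fp, binding_fp, compiler_version)[:12]


def concat_fingerprint(*parts: str) -> str:
    """Hypothetical plain-concatenation variant, shown only to expose the ambiguity."""
    return hashlib.sha256("".join(parts).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    full = compile_fingerprint("ab", "c", "1.0")
    short = compile_id("ab", "c", "1.0")
    print(len(full), len(short), short == full[:12])
    # Plain concatenation cannot tell ("ab", "c") from ("a", "bc").
    print(concat_fingerprint("ab", "c", "1.0") == concat_fingerprint("a", "bc", "1.0"))
    print(compile_fingerprint("ab", "c", "1.0") == compile_fingerprint("a", "bc", "1.0"))
```

Deriving `compile_id` by slicing the full digest is what makes the invariant hold by construction, which is why callers are told to use the exported function rather than rebuild the payload.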
1 parent b1f1d29 commit 7c34289

2 files changed

Lines changed: 60 additions & 26 deletions

docs/entity_resolution_primitives.md

Lines changed: 28 additions & 14 deletions
@@ -50,7 +50,7 @@ Neither package is designed as a turn-time agent SDK. This RFC does not add an i
 
 | Direction | Who calls it | When | What happens |
 | :---- | :---- | :---- | :---- |
-| 1 | Operator / CI | Once per ontology change (build-time) | `gm compile` → DDL \+ concept-index SQL published to BQ. |
+| 1 | Operator / CI | Once per ontology change (build-time) | `gm compile` emits DDL \+ concept-index SQL; operator executes the emitted SQL to publish the tables to BigQuery. |
 | 2 | Batch orchestrator | Scheduled over accumulated traces (post-processing) | `extract_graph` / `extract_biz_nodes` from `bigquery_agent_analytics`; `AI.GENERATE` populates entity / relationship tables. |
 | 3 | Eval / analysis / curation pipeline (this RFC) | On accumulated data, at the pipeline's cadence | Pipeline imports `OntologyRuntime` \+ a resolver from `bigquery_agent_analytics` and calls `.resolve(...)` or `.validate_against_ontology(...)`. Each call is a BQ query against the concept index. |
 
@@ -123,17 +123,27 @@ Emitted when `gm compile --emit-concept-index --concept-index-table <fqn>` is pa
 
 ```sql
 CREATE TABLE `{dataset}.ontology_concept_index` (
-  entity_name STRING NOT NULL,
-  label       STRING NOT NULL,  -- for label_kind='notation', holds notation value
-  label_kind  STRING NOT NULL,  -- 'name'|'pref'|'alt'|'hidden'|'synonym'|'notation'
-  notation    STRING,           -- per-entity display, repeats across rows
-  scheme      STRING,           -- NULL = not in any scheme
-  language    STRING,
-  is_abstract BOOL NOT NULL,
-  compile_id  STRING NOT NULL   -- 12 hex chars; pair-consistency tag
+  entity_name         STRING NOT NULL,
+  label               STRING NOT NULL,  -- for label_kind='notation', holds notation value
+  label_kind          STRING NOT NULL,  -- 'name'|'pref'|'alt'|'hidden'|'synonym'|'notation'
+  notation            STRING,           -- per-entity display, repeats across rows
+  scheme              STRING,           -- NULL = not in any scheme
+  language            STRING,
+  is_abstract         BOOL NOT NULL,
+  compile_id          STRING NOT NULL,  -- 12 hex chars; display/debug token only
+  compile_fingerprint STRING NOT NULL   -- 64 hex chars; canonical integrity key
 );
 ```
 
+**Two provenance columns, one role each.**
+
+| Column | Role | Width | Used by |
+|---|---|---|---|
+| `compile_id` | Display/debug token: human-readable short tag for reports, queue rows, error messages, log lines. **Never the sole freshness check.** | 12 hex chars | Operator UX, dashboards, triage output |
+| `compile_fingerprint` | Canonical integrity key: full SHA-256 over the NUL-delimited UTF-8 of `(ontology_fingerprint, binding_fingerprint, compiler_version)`. Consumers must call `_fingerprint.compile_fingerprint()`; do not reimplement. | 64 hex chars | Strict pair-consistency + runtime verification (§5) |
+
+Structural invariant: `compile_id == compile_fingerprint[:12]`. The short form is always derivable from the full form; never the other way around. Enforced at the `_fingerprint.py` module boundary so a future refactor cannot make them diverge.
+
 **Row multiplicity:** one row per `(entity_name, label, label_kind, language, scheme)` tuple: a concept in 3 schemes × 5 labels \= 15 rows. Resolvers filter by scheme without JOIN.
 
 **Scope rule:** all abstract entities (informational; always included); concrete entities iff bound in the binding being compiled.
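The row-multiplicity rule in the hunk above (3 schemes × 5 labels = 15 rows) can be sketched as a cross product. This is an illustration only: the real row builder is `concept_index.py` (pending A2), and `ConceptIndexRow`, `expand_rows`, the sample labels, and the scheme names are hypothetical. Emitting one row per label with `scheme = None` for an entity that belongs to no scheme is an assumption inferred from the schema comment "NULL = not in any scheme".

```python
from itertools import product
from typing import List, NamedTuple, Optional, Sequence, Tuple


class ConceptIndexRow(NamedTuple):
    """Hypothetical subset of the concept-index columns that form the row key."""
    entity_name: str
    label: str
    label_kind: str
    language: Optional[str]
    scheme: Optional[str]


# (label, label_kind, language)
Label = Tuple[str, str, Optional[str]]


def expand_rows(
    entity_name: str, labels: Sequence[Label], schemes: Sequence[str]
) -> List[ConceptIndexRow]:
    """Emit one row per (label, scheme) pair so resolvers can filter by scheme without a JOIN."""
    # Assumption: an entity in no scheme still gets one row per label, with scheme None.
    scheme_values: List[Optional[str]] = list(schemes) or [None]
    return [
        ConceptIndexRow(entity_name, label, kind, language, scheme)
        for (label, kind, language), scheme in product(labels, scheme_values)
    ]


if __name__ == "__main__":
    sample_labels: List[Label] = [
        ("Heart attack", "pref", "en"),
        ("MI", "alt", "en"),
        ("Myocardial infarction", "alt", "en"),
        ("Infarctus", "pref", "fr"),
        ("I21", "notation", None),
    ]
    rows = expand_rows("MyocardialInfarction", sample_labels, ["scheme_a", "scheme_b", "scheme_c"])
    print(len(rows))
    print(len([r for r in rows if r.scheme == "scheme_a"]))
```

The denormalized layout trades extra rows for single-table lookups: a scheme-scoped resolver only needs a `WHERE scheme = ...` predicate.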
@@ -201,12 +211,15 @@ Docs (`docs/ontology/concept-index.md`, from A8) will carry two or three canonic
 
 **TTL re-check queries (stale cache):**
 
-1. `SELECT DISTINCT compile_id FROM {output_table} LIMIT 2`: asserts exactly one value (pair consistency). More than one → refresh in progress.
-2. `SELECT compile_id, ontology_fingerprint, binding_fingerprint FROM {output_table}__meta LIMIT 1`: asserts all three match cache (full-fingerprint freshness).
+1. `SELECT DISTINCT compile_fingerprint FROM {output_table} LIMIT 2`: asserts exactly one value (pair consistency at full-fingerprint resolution). More than one → refresh in progress.
+2. `SELECT compile_fingerprint, ontology_fingerprint, binding_fingerprint FROM {output_table}__meta LIMIT 1`.
+3. Require: `main.compile_fingerprint == meta.compile_fingerprint` (pair consistency) **and** `meta.ontology_fingerprint == cached.ontology_fingerprint` **and** `meta.binding_fingerprint == cached.binding_fingerprint` (component freshness).
+
+No short-ID arithmetic anywhere on the verification path. `compile_id` never appears in a strict-mode query.
 
 Main/meta disagreement → 2s one-shot retry → persistent \= `ConceptIndexInconsistentPair`. Cache drift \= `ConceptIndexRefreshed` (operator recreates `OntologyRuntime` with updated models).
 
-Why both tables, why full fingerprints: see §9 W2.
+Why full fingerprints on both tables: see §10 W2.
 
 ## **6\. Tie to issue \#57**
 
@@ -225,7 +238,7 @@ Concept-index value is \~80% from SKOS annotations preserved through import (\#5
 
 | File | Change | Status |
 | :---- | :---- | :---- |
-| `_fingerprint.py` | **New internal**: `fingerprint_model`, `compile_id` | **\#71 open** |
+| `_fingerprint.py` | **New internal**: `fingerprint_model`, `compile_fingerprint` (canonical integrity key), `compile_id` (display token, derived as `compile_fingerprint(...)[:12]`) | **\#71 open** |
 | `concept_index.py` | New row builder | Pending A2 |
 | `graph_ddl_compiler.py` | Add `compile_concept_index`. `compile_graph` unchanged | Pending A3–A5 |
 | `cli.py:299` | Add `--emit-concept-index` \+ `--concept-index-table`; no-flag byte-identical | Pending A7 |
@@ -285,7 +298,7 @@ Single developer ≈ 7.5 weeks. Phases 1 \+ 2 parallelizable → \~4 weeks wall-
 | \# | Invariant | Failure mode if broken | Regression test |
 | :---- | :---- | :---- | :---- |
 | W1 | `_fingerprint.py` is the **single** source of canonical serialization; both packages import it | Compiler writes fingerprint X, runtime computes fingerprint Y, strict mode rejects every valid index | `tests/bigquery_ontology/test_fingerprint.py`: round-trip YAML → load → fingerprint; semantic edits change it, whitespace edits don't (landed in \#71) |
-| W2 | TTL re-check reads **both** tables with **full** fingerprints | Meta-only sentinel: refresh-window race → wrong data under "verified." Short-compile-id only: 48-bit collision → same class of failure | Mock race window (meta old, main new with different compile\_id but matching short prefix); assert runtime catches it. Assert single-table sentinel impl fails the test |
+| W2 | Strict verification uses `compile_fingerprint` (full 64-hex) on both tables; short `compile_id` never appears on the verification path | A reducer "optimization" to `SELECT compile_id FROM ...` would reintroduce the 48-bit collision hole under an out-of-band swap. A meta-only sentinel would reintroduce the refresh-window race | Assert strict-mode queries reference `compile_fingerprint` only; assert short-ID reducer fails a reintroduction test. Mock main/meta full-fingerprint mismatch and assert `ConceptIndexInconsistentPair` |
 | W3 | Shadow-swap is **non-self-healing**; compiler errors out and next `gm compile` resumes | Background retry loops mask partial-swap states; operator "pause traffic during shadow refresh" guidance becomes unenforceable | Inject mid-swap `DROP`/`RENAME` failure → `gm compile` errors with clear message; subsequent `gm compile` completes the swap without recompiling |
 
 ### **Deferred (tracked, not blocking)**
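The W2 failure mode in the hunk above can be illustrated with two distinct 64-hex fingerprints that share a 12-hex (48-bit) prefix: a short-ID comparison accepts the pair, while the full comparison rejects it. The sketch below also includes a simple guard of the kind the regression-test column describes. All names (`short_id_check`, `strict_check`, `query_is_strict_safe`) are hypothetical and not part of the project's test suite.

```python
import re

# Matches the bare column name only; "compile_fingerprint" does not contain it.
_COMPILE_ID = re.compile(r"\bcompile_id\b")


def short_id_check(main_fp: str, meta_fp: str) -> bool:
    """The reducer the watchpoint forbids: compares only the first 12 hex chars (48 bits)."""
    return main_fp[:12] == meta_fp[:12]


def strict_check(main_fp: str, meta_fp: str) -> bool:
    """Pair consistency at full 64-hex width."""
    return main_fp == meta_fp


def query_is_strict_safe(sql: str) -> bool:
    """True when a strict-mode query reads compile_fingerprint and never compile_id."""
    return "compile_fingerprint" in sql and _COMPILE_ID.search(sql) is None


if __name__ == "__main__":
    # Two different compiles whose fingerprints happen to share a 12-hex prefix.
    fp_old = "0123456789ab" + "c" * 52
    fp_new = "0123456789ab" + "d" * 52
    print(short_id_check(fp_old, fp_new))  # prefix match hides the mismatch
    print(strict_check(fp_old, fp_new))    # full comparison exposes it
    print(query_is_strict_safe("SELECT DISTINCT compile_fingerprint FROM t LIMIT 2"))
    print(query_is_strict_safe("SELECT compile_id FROM t"))
```

A text-level guard like `query_is_strict_safe` is coarse, but it is cheap to run over every strict-mode query string and fails loudly if a later refactor swaps in the short column.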
@@ -305,6 +318,7 @@ Single developer ≈ 7.5 weeks. Phases 1 \+ 2 parallelizable → \~4 weeks wall-
 - `scheme=` XOR `entity=` in v1. Narrower-closure in v2 only if real callers ask.
 - `contrib/` for reference resolvers; external packages for user-owned domains.
 - Strict verification on by default; `verify_concept_index="off"` is the explicit opt-out.
+- **Option 2 for provenance columns: `compile_fingerprint` is the canonical integrity key; `compile_id` is display-only.** Invariant `compile_id == compile_fingerprint[:12]` enforced at the `_fingerprint.py` module boundary. Short-ID arithmetic is forbidden on the strict verification path.
 
 ## **12\. Future directions — LLM composition (not in v1)**
 