Feat: validate ontology binding against existing BigQuery table schemas (pre-flight) #105

@caohy1988

Description

Goal

Add a pre-flight validator that checks whether the BigQuery tables a binding YAML points at physically exist with the columns and types the binding requires, before the SDK starts extraction or materialization.

This is a different validator from #76. #76 validates ExtractedGraph against ResolvedGraph (extracted output vs. logical spec). This issue validates Binding+target tables against actual BigQuery schemas (logical spec vs. physical reality).

Motivation

Today the SDK supports populating user-pre-defined BigQuery property graphs from BQ AA traces (see CLI ontology-build and the Python composition path). The binding YAML expresses node/edge table topology cleanly via EntityBinding.source + properties[].column and RelationshipBinding.source + from_columns/to_columns.

But the loader does not check whether those tables and columns actually exist with compatible types. From bigquery_ontology/binding_models.py:71–84:

"Type compatibility is not checked here — the physical column type must already match the ontology property type, upstream."

Failure modes today, all surfacing late and with poor error messages:

  • Missing table — surfaces during OntologyMaterializer._batch_load_table when bq_client.get_table(table_ref) raises NotFound. Materialization for that table is then logged as delete_failed or insert_failed. Other tables may proceed; the run reports partial success.
  • Missing column on a node table — INSERT INTO <table> (col1, col2, ...) SELECT ... FROM staging fails with a BigQuery schema error after staging data has already been loaded. Cost is wasted; debugging falls to the user.
  • Type mismatch — surfaces at INSERT-from-staging time as a cast failure. The error names the BQ column, not the ontology property, so users have to map back to the binding by hand.
  • Wrong endpoint key column on an edge table — INSERT may succeed (if the column exists with a compatible type) but produces semantic garbage: edges referencing rows that don't exist on the endpoint table. No error raised; data quality drops silently.

For users with a pre-defined BigQuery property graph (their tables already exist, defined by Terraform / dbt / hand-authored DDL), this is the most common class of authoring mistake — the binding YAML and the physical tables drift out of sync, and the SDK only finds out after extraction has already run.

What this validator checks

The validator's first step is to call resolve(ontology, binding) and operate against the resulting ResolvedGraph. That gives it (a) the resolved per-element source via ResolvedEntity.source / ResolvedRelationship.source (which _qualify_source at resolved_spec.py:141 has already qualified, honoring fully-qualified project.dataset.table overrides over the binding target defaults), and (b) the column→SDK-type metadata via ResolvedProperty. The physical-vs-spec comparison then maps SDK property types to expected BigQuery DDL types via the materializer's _DDL_TYPE_MAP at ontology_materializer.py:125 — that's the same map the SDK uses when it generates DDL itself, so consistency is automatic.

Important: do not check non-nullability on key columns. The SDK's own create_tables() emits plain column definitions with no NOT NULL (ontology_materializer.py:206–213). A required-NULLABLE-mode check would reject tables created by the SDK itself. Treat REQUIRED mode as a --strict opt-in warning, not a hard failure.

Per ResolvedEntity (one per included entity binding)

  1. Table exists. bq_client.get_table(entity.source) resolves. The source is already fully qualified by resolve().
  2. Bound columns exist. Every ResolvedProperty.column on the entity corresponds to an actual column on the table.
  3. Column types are compatible. The BQ schema's column type matches _DDL_TYPE_MAP[property.sdk_type] (ontology_materializer.py:125). Direct match is required; near-match types (e.g., NUMERIC where BQ FLOAT64 was expected) are flagged.
  4. Key columns are not REPEATED. BQ ARRAY-mode columns can't carry a primary key. This is a hard failure; nullability is not.
  5. Strict-mode only: key columns are REQUIRED (non-nullable) in the BQ schema. Off by default; gated behind --strict because the SDK's own DDL doesn't enforce it.
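A minimal sketch of checks 2–5 (check 1, table existence, is the get_table call itself). Everything here is hypothetical shape, not the SDK's API: the Column tuple stands in for BigQuery's SchemaField, and the string codes mirror the FailureCode names proposed below.

```python
from collections import namedtuple

# Minimal stand-in for a BigQuery SchemaField: name, type, and mode
# (NULLABLE / REQUIRED / REPEATED).
Column = namedtuple("Column", "name field_type mode")

def check_entity_schema(bound, keys, schema, strict=False):
    """Run checks 2-5 against an already-fetched table schema.
    `bound` maps bound column name -> expected BQ DDL type; `keys` lists the
    key columns. Returns (failures, warnings) as (code, column) pairs."""
    by_name = {c.name: c for c in schema}
    failures, warnings = [], []
    for col, expected in bound.items():
        c = by_name.get(col)
        if c is None:
            failures.append(("MISSING_COLUMN", col))
        elif c.field_type != expected:
            failures.append(("TYPE_MISMATCH", col))
    for key in keys:
        c = by_name.get(key)
        if c is None:
            continue  # already reported as MISSING_COLUMN above
        if c.mode == "REPEATED":
            failures.append(("UNEXPECTED_REPEATED_MODE", key))  # hard failure
        elif c.mode != "REQUIRED":
            # Nullability is a warning by default, a failure only under --strict.
            (failures if strict else warnings).append(("KEY_COLUMN_NULLABLE", key))
    return failures, warnings
```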

Per ResolvedRelationship (one per included relationship binding)

  1. Edge table exists. bq_client.get_table(relationship.source) resolves.
  2. Endpoint columns exist. from_columns and to_columns exist on the edge table.
  3. Endpoint columns are not REPEATED. Same reason as entity check 4 above.
  4. Endpoint column types match the referenced entity's primary-key column types. Cross-table consistency check that today only surfaces at INSERT time as a cast failure.
  5. Property columns exist with compatible types (same comparison rule as entity checks 2–3).
  6. Strict-mode only: endpoint columns are REQUIRED. Off by default.
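The cross-table endpoint check (items 2–4) can be sketched like this; again hypothetical shapes, with the Column tuple standing in for a SchemaField and the entity's observed key types supplied positionally.

```python
from collections import namedtuple

Column = namedtuple("Column", "name field_type mode")

def check_endpoint_columns(edge_schema, endpoint_columns, entity_key_types):
    """`endpoint_columns` is from_columns or to_columns; `entity_key_types`
    are the observed BQ types of the referenced entity's key columns,
    positionally aligned with the endpoint columns."""
    by_name = {c.name: c for c in edge_schema}
    failures = []
    for col, want in zip(endpoint_columns, entity_key_types):
        c = by_name.get(col)
        if c is None:
            failures.append(("MISSING_COLUMN", col))
        elif c.mode == "REPEATED":
            failures.append(("UNEXPECTED_REPEATED_MODE", col))
        elif c.field_type != want:
            # Today this mismatch only surfaces at INSERT time as a cast failure.
            failures.append(("ENDPOINT_TYPE_MISMATCH", col))
    return failures
```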

Per binding root

  1. Every resolved table reference is accessible. For each unique entity.source and relationship.source, get_table(...) succeeds and the calling identity has at minimum bigquery.tables.get. Failures are reported best-effort with code=INSUFFICIENT_PERMISSIONS or code=MISSING_DATASET so users see actionable IAM errors before extraction starts.

(Replaces the previous root-level "target.project/target.dataset exists" framing — sources can be fully qualified and override the target defaults via _qualify_source, so per-source accessibility is the right unit.)
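The accessibility probe might look like the sketch below. The exception mapping is illustrative: KeyError and PermissionError stand in for the real client's NotFound and Forbidden, and the probe is injected as a callable so it can be faked in tests.

```python
def probe_sources(get_table, sources):
    """Best-effort accessibility probe over the unique resolved table refs.
    `get_table` stands in for bq_client.get_table; KeyError plays the role of
    NotFound and PermissionError of Forbidden."""
    failures = []
    for ref in sorted(set(sources)):  # each unique entity/relationship source once
        try:
            get_table(ref)
        except KeyError:
            failures.append(("MISSING_TABLE", ref))
        except PermissionError:
            failures.append(("INSUFFICIENT_PERMISSIONS", ref))
    return failures
```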

Out of scope for this validator:

Proposed API

from bigquery_agent_analytics.binding_validation import (
    validate_binding_against_bigquery,
    BindingValidationReport,
    BindingValidationFailure,
    BindingValidationWarning,
    FailureCode,
)

report = validate_binding_against_bigquery(
    binding=loaded_binding,
    ontology=loaded_ontology,
    bq_client=bigquery.Client(project="my-project", location="US"),
    strict=False,  # default; pass True to escalate strict-only checks
)

if not report.ok:
    for failure in report.failures:
        print(failure)
for warning in report.warnings:
    print(f"WARN: {warning}")

validate_binding_against_bigquery signature:

  • strict: bool = False — when False (default), strict-only checks (today: KEY_COLUMN_NULLABLE) emit BindingValidationWarning entries instead of failures. When True, the same checks emit BindingValidationFailure entries with the same code. Default is permissive so the validator does not reject tables produced by the SDK's own CREATE TABLE IF NOT EXISTS DDL.
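The escalation rule reduces to a single decision per emitted check, sketched here with an assumed set of strict-only codes:

```python
# Codes that are advisory by default and only become failures under strict=True.
STRICT_ONLY = {"KEY_COLUMN_NULLABLE"}

def classify(code: str, strict: bool) -> str:
    """Strict-only codes downgrade to warnings unless strict=True;
    everything else is always a failure."""
    return "warning" if (code in STRICT_ONLY and not strict) else "failure"
```

Keeping the code constant across both modes means tooling can match on KEY_COLUMN_NULLABLE regardless of whether it arrived as a warning or a failure.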

BindingValidationFailure carries:

  • code: FailureCode — typed enum: MISSING_TABLE, MISSING_COLUMN, TYPE_MISMATCH, ENDPOINT_TYPE_MISMATCH, UNEXPECTED_REPEATED_MODE, MISSING_DATASET, INSUFFICIENT_PERMISSIONS. Strict-only (escalates from warning to failure under strict=True): KEY_COLUMN_NULLABLE.
  • binding_element: str — the entity or relationship name in the binding.
  • binding_path: str — a binding.entities[3].properties[1].column-style path for tooling.
  • bq_ref: str — fully-qualified project.dataset.table[.column] the failure is about.
  • expected: Any — what the binding declared.
  • observed: Any — what BigQuery reports (may be None for missing-table cases).
  • detail: str — human-readable.

BindingValidationWarning carries the same fields as BindingValidationFailure (so callers can format them uniformly). The distinction is semantic: warnings do not flip report.ok to False.
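A possible shape for the failure/warning types, assuming plain dataclasses (this is a sketch of the proposal above, not existing SDK code):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any

class FailureCode(Enum):
    MISSING_TABLE = "MISSING_TABLE"
    MISSING_COLUMN = "MISSING_COLUMN"
    TYPE_MISMATCH = "TYPE_MISMATCH"
    ENDPOINT_TYPE_MISMATCH = "ENDPOINT_TYPE_MISMATCH"
    UNEXPECTED_REPEATED_MODE = "UNEXPECTED_REPEATED_MODE"
    MISSING_DATASET = "MISSING_DATASET"
    INSUFFICIENT_PERMISSIONS = "INSUFFICIENT_PERMISSIONS"
    KEY_COLUMN_NULLABLE = "KEY_COLUMN_NULLABLE"  # strict-only

@dataclass
class BindingValidationFailure:
    code: FailureCode
    binding_element: str   # entity or relationship name in the binding
    binding_path: str      # e.g. binding.entities[3].properties[1].column
    bq_ref: str            # fully-qualified project.dataset.table[.column]
    expected: Any
    observed: Any          # None for missing-table cases
    detail: str

# Same fields, distinct type, so callers can format both uniformly while
# warnings remain semantically advisory.
class BindingValidationWarning(BindingValidationFailure):
    pass
```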

BindingValidationReport:

  • failures: list[BindingValidationFailure] — hard failures (always present in default and strict mode).
  • warnings: list[BindingValidationWarning] — strict-only checks that fired in default mode (empty under strict=True, because those checks are escalated to failures).
  • ok returns True iff failures is empty. Warnings do not affect ok — they're advisory in default mode and are escalated by strict=True.
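The report itself is small; a sketch with ok as a derived property (field types elided for brevity):

```python
from dataclasses import dataclass, field

@dataclass
class BindingValidationReport:
    failures: list = field(default_factory=list)
    warnings: list = field(default_factory=list)

    @property
    def ok(self) -> bool:
        # Warnings are advisory; only failures flip ok.
        return not self.failures
```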

CLI integration

Two surfaces:

  1. Standalone: bq-agent-sdk binding-validate --ontology X.yaml --binding Y.yaml [--project-id ...] [--location ...] [--strict]. Exits 0 if report.ok (no failures; warnings allowed unless --strict), exits 1 with a printable failure list otherwise. Warnings are printed to stderr in default mode but do not affect the exit code. With --strict, the KEY_COLUMN_NULLABLE checks escalate from warnings to failures and therefore cause exit 1.
  2. Optional pre-flight on ontology-build: --validate-binding flag (and --validate-binding-strict for the strict variant) that runs this validator before phase 2 and exits early on failures. Off by default to preserve current speed; opt-in keeps the CLI fast for users who own their tables and have already validated.

Acceptance criteria

  • validate_binding_against_bigquery(binding, ontology, bq_client) exists in bigquery_agent_analytics.binding_validation (or similar) and returns BindingValidationReport.
  • All seven default-mode failure codes (MISSING_TABLE, MISSING_COLUMN, TYPE_MISMATCH, ENDPOINT_TYPE_MISMATCH, UNEXPECTED_REPEATED_MODE, MISSING_DATASET, INSUFFICIENT_PERMISSIONS) covered by unit tests against a fake BigQuery client. KEY_COLUMN_NULLABLE covered as: (a) default-mode test asserts no failure is raised against an SDK-created table whose key columns are NULLABLE — instead a BindingValidationWarning is appended to report.warnings and report.ok stays True; (b) strict-mode test (strict=True) asserts the same input produces a BindingValidationFailure with code=KEY_COLUMN_NULLABLE and report.ok is False.
  • validate_binding_against_bigquery(..., strict=False) is the default; explicit unit test covers the strict=True escalation path.
  • bq-agent-sdk binding-validate --strict exits 1 on key-nullable violations; without --strict exits 0 and prints warnings to stderr.
  • Default-mode regression test: running the validator against tables produced by OntologyMaterializer.create_tables() returns report.ok == True. Catches the "validator rejects SDK-created tables" trap.
  • Cross-project test: a binding whose entity.source is fully qualified to a project distinct from binding.target.project validates against the entity's project, not the target's.
  • One live integration test (gated on RUN_LIVE_BIGQUERY_TESTS=1, matching the existing ontology integration test pattern at tests/test_integration_ontology_binding.py:44) creates a binding pointing at a deliberately-mismatched fixture table and asserts the report flags the mismatch with the right code.
  • bq-agent-sdk binding-validate CLI exists and exits non-zero on any failure.
  • bq-agent-sdk ontology-build --validate-binding runs the validator pre-extraction and short-circuits with a printable error list.
  • docs/ontology/binding-validation.md documents the failure codes, the API, and the CI usage pattern.

Why this is separate from #76

Different inputs, different code path, different failure modes, different fix shapes (rewrite the extractor's prompt vs. fix the binding YAML or the table DDL). Sharing a name would obscure that.

Why this is separate from the --skip-property-graph flag issue

The --skip-property-graph flag changes orchestration behavior (don't write the graph object). This issue adds a new validation surface (don't start extraction if the bound tables don't match). They compose: a user with pre-defined tables and a pre-defined property graph wants both --skip-property-graph and --validate-binding. Filing them separately keeps each PR small.

Related

Effort

2–3 eng-days. The validator itself is mechanical (BQ schema lookups + a small comparison matrix), but the unit test fixtures take time to set up well, and the CLI threading is two surfaces.
