Feat: validate ontology binding against existing BigQuery table schemas (pre-flight) #105

@caohy1988

Description

Goal

Add a pre-flight validator that checks whether the BigQuery tables a binding YAML points at physically exist with the columns and types the binding requires, before the SDK starts extraction or materialization.

This is a different validator from #76. #76 validates ExtractedGraph against ResolvedGraph (extracted output vs. logical spec). This issue validates Binding+target tables against actual BigQuery schemas (logical spec vs. physical reality).

Motivation

Today the SDK supports populating user-pre-defined BigQuery property graphs from BQ AA traces (see CLI ontology-build and the Python composition path). The binding YAML expresses node/edge table topology cleanly via EntityBinding.source + properties[].column and RelationshipBinding.source + from_columns/to_columns.

But the loader does not check whether those tables and columns actually exist with compatible types. From bigquery_ontology/binding_models.py:71–84:

"Type compatibility is not checked here — the physical column type must already match the ontology property type, upstream."

Failure modes today, all surfacing late and with poor error messages:

  • Missing table — surfaces during OntologyMaterializer._batch_load_table when bq_client.get_table(table_ref) raises NotFound. Materialization for that table is then logged as delete_failed or insert_failed. Other tables may proceed; the run reports partial success.
  • Missing column on a node table — INSERT INTO <table> (col1, col2, ...) SELECT ... FROM staging fails with a BigQuery schema error after staging data has already been loaded. Cost is wasted; debugging falls to the user.
  • Type mismatch — surfaces at INSERT-from-staging time as a cast failure. The error names the BQ column, not the ontology property, so users have to map back to the binding by hand.
  • Wrong endpoint key column on an edge table — INSERT may succeed (if the column exists with a compatible type) but produces semantic garbage: edges referencing rows that don't exist on the endpoint table. No error raised; data quality drops silently.

For users with a pre-defined BigQuery property graph (their tables already exist, defined by Terraform / dbt / hand-authored DDL), this is the most common class of authoring mistake — the binding YAML and the physical tables drift out of sync, and the SDK only finds out after extraction has already run.

What this validator checks

The validator's first step is to call resolve(ontology, binding) and operate against the resulting ResolvedGraph. That gives it (a) the resolved per-element source via ResolvedEntity.source / ResolvedRelationship.source (which _qualify_source at resolved_spec.py:141 has already qualified, honoring fully-qualified project.dataset.table overrides over the binding target defaults), and (b) the column→SDK-type metadata via ResolvedProperty. The physical-vs-spec comparison then maps SDK property types to expected BigQuery DDL types via the materializer's _DDL_TYPE_MAP at ontology_materializer.py:125 — that's the same map the SDK uses when it generates DDL itself, so consistency is automatic.

Important: do not check non-nullability on key columns. The SDK's own create_tables() emits plain column definitions with no NOT NULL (ontology_materializer.py:206–213). A required-NULLABLE-mode check would reject tables created by the SDK itself. Treat REQUIRED mode as a --strict opt-in warning, not a hard failure.

Per ResolvedEntity (one per included entity binding)

  1. Table exists. bq_client.get_table(entity.source) resolves. The source is already fully qualified by resolve().
  2. Bound columns exist. Every ResolvedProperty.column on the entity corresponds to an actual column on the table.
  3. Column types are compatible. The BQ schema's column type matches _DDL_TYPE_MAP[property.sdk_type] (ontology_materializer.py:125). Direct match is required; near-match types (e.g., NUMERIC where BQ FLOAT64 was expected) are flagged.
  4. Key columns are not REPEATED. BQ ARRAY-mode columns can't carry a primary key. This is a hard failure; nullability is not.
  5. Strict-mode only: key columns are REQUIRED (non-nullable) in the BQ schema. Off by default; gated behind --strict because the SDK's own DDL doesn't enforce it.
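A minimal sketch of checks 2–5 (check 1, table existence, is the get_table call itself). Everything here is hypothetical shape, not the SDK's API: the Column tuple stands in for BigQuery's SchemaField, and the string codes mirror the FailureCode names proposed below.

```python
from collections import namedtuple

# Minimal stand-in for a BigQuery SchemaField: name, type, and mode
# (NULLABLE / REQUIRED / REPEATED).
Column = namedtuple("Column", "name field_type mode")

def check_entity_schema(bound, keys, schema, strict=False):
    """Run checks 2-5 against an already-fetched table schema.
    `bound` maps bound column name -> expected BQ DDL type; `keys` lists the
    key columns. Returns (failures, warnings) as (code, column) pairs."""
    by_name = {c.name: c for c in schema}
    failures, warnings = [], []
    for col, expected in bound.items():
        c = by_name.get(col)
        if c is None:
            failures.append(("MISSING_COLUMN", col))
        elif c.field_type != expected:
            failures.append(("TYPE_MISMATCH", col))
    for key in keys:
        c = by_name.get(key)
        if c is None:
            continue  # already reported as MISSING_COLUMN above
        if c.mode == "REPEATED":
            failures.append(("UNEXPECTED_REPEATED_MODE", key))  # hard failure
        elif c.mode != "REQUIRED":
            # Nullability is a warning by default, a failure only under --strict.
            (failures if strict else warnings).append(("KEY_COLUMN_NULLABLE", key))
    return failures, warnings
```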

Per ResolvedRelationship (one per included relationship binding)

  1. Edge table exists. bq_client.get_table(relationship.source) resolves.
  2. Endpoint columns exist. from_columns and to_columns exist on the edge table.
  3. Endpoint columns are not REPEATED. Same reason as entity check 4 above.
  4. Endpoint column types match the referenced entity's primary-key column types. Cross-table consistency check that today only surfaces at INSERT time as a cast failure.
  5. Property columns exist with compatible types (same comparison rule as entity checks 2–3).
  6. Strict-mode only: endpoint columns are REQUIRED. Off by default.
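The cross-table endpoint check (items 2–4) can be sketched like this; again hypothetical shapes, with the Column tuple standing in for a SchemaField and the entity's observed key types supplied positionally.

```python
from collections import namedtuple

Column = namedtuple("Column", "name field_type mode")

def check_endpoint_columns(edge_schema, endpoint_columns, entity_key_types):
    """`endpoint_columns` is from_columns or to_columns; `entity_key_types`
    are the observed BQ types of the referenced entity's key columns,
    positionally aligned with the endpoint columns."""
    by_name = {c.name: c for c in edge_schema}
    failures = []
    for col, want in zip(endpoint_columns, entity_key_types):
        c = by_name.get(col)
        if c is None:
            failures.append(("MISSING_COLUMN", col))
        elif c.mode == "REPEATED":
            failures.append(("UNEXPECTED_REPEATED_MODE", col))
        elif c.field_type != want:
            # Today this mismatch only surfaces at INSERT time as a cast failure.
            failures.append(("ENDPOINT_TYPE_MISMATCH", col))
    return failures
```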

Per binding root

  1. Every resolved table reference is accessible. For each unique entity.source and relationship.source, get_table(...) succeeds and the calling identity has at minimum bigquery.tables.get. Failures are reported best-effort with code=INSUFFICIENT_PERMISSIONS or code=MISSING_DATASET so users see actionable IAM errors before extraction starts.

(Replaces the previous root-level "target.project/target.dataset exists" framing — sources can be fully qualified and override the target defaults via _qualify_source, so per-source accessibility is the right unit.)
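The accessibility probe might look like the sketch below. The exception mapping is illustrative: KeyError and PermissionError stand in for the real client's NotFound and Forbidden, and the probe is injected as a callable so it can be faked in tests.

```python
def probe_sources(get_table, sources):
    """Best-effort accessibility probe over the unique resolved table refs.
    `get_table` stands in for bq_client.get_table; KeyError plays the role of
    NotFound and PermissionError of Forbidden."""
    failures = []
    for ref in sorted(set(sources)):  # each unique entity/relationship source once
        try:
            get_table(ref)
        except KeyError:
            failures.append(("MISSING_TABLE", ref))
        except PermissionError:
            failures.append(("INSUFFICIENT_PERMISSIONS", ref))
    return failures
```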

Out of scope for this validator:

Proposed API

from bigquery_agent_analytics.binding_validation import (
    validate_binding_against_bigquery,
    BindingValidationReport,
    BindingValidationFailure,
    BindingValidationWarning,
    FailureCode,
)

report = validate_binding_against_bigquery(
    binding=loaded_binding,
    ontology=loaded_ontology,
    bq_client=bigquery.Client(project="my-project", location="US"),
    strict=False,  # default; pass True to escalate strict-only checks
)

if not report.ok:
    for failure in report.failures:
        print(failure)
for warning in report.warnings:
    print(f"WARN: {warning}")

validate_binding_against_bigquery signature:

  • strict: bool = False — when False (default), strict-only checks (today: KEY_COLUMN_NULLABLE) emit BindingValidationWarning entries instead of failures. When True, the same checks emit BindingValidationFailure entries with the same code. Default is permissive so the validator does not reject tables produced by the SDK's own CREATE TABLE IF NOT EXISTS DDL.
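The escalation rule reduces to a single decision per emitted check, sketched here with an assumed set of strict-only codes:

```python
# Codes that are advisory by default and only become failures under strict=True.
STRICT_ONLY = {"KEY_COLUMN_NULLABLE"}

def classify(code: str, strict: bool) -> str:
    """Strict-only codes downgrade to warnings unless strict=True;
    everything else is always a failure."""
    return "warning" if (code in STRICT_ONLY and not strict) else "failure"
```

Keeping the code constant across both modes means tooling can match on KEY_COLUMN_NULLABLE regardless of whether it arrived as a warning or a failure.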

BindingValidationFailure carries:

  • code: FailureCode — typed enum: MISSING_TABLE, MISSING_COLUMN, TYPE_MISMATCH, ENDPOINT_TYPE_MISMATCH, UNEXPECTED_REPEATED_MODE, MISSING_DATASET, INSUFFICIENT_PERMISSIONS. Strict-only (escalates from warning to failure under strict=True): KEY_COLUMN_NULLABLE.
  • binding_element: str — the entity or relationship name in the binding.
  • binding_path: str — a binding.entities[3].properties[1].column-style path for tooling.
  • bq_ref: str — fully-qualified project.dataset.table[.column] the failure is about.
  • expected: Any — what the binding declared.
  • observed: Any — what BigQuery reports (may be None for missing-table cases).
  • detail: str — human-readable.

BindingValidationWarning carries the same fields as BindingValidationFailure (so callers can format them uniformly). The distinction is semantic: warnings do not flip report.ok to False.
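A possible shape for the failure/warning types, assuming plain dataclasses (this is a sketch of the proposal above, not existing SDK code):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any

class FailureCode(Enum):
    MISSING_TABLE = "MISSING_TABLE"
    MISSING_COLUMN = "MISSING_COLUMN"
    TYPE_MISMATCH = "TYPE_MISMATCH"
    ENDPOINT_TYPE_MISMATCH = "ENDPOINT_TYPE_MISMATCH"
    UNEXPECTED_REPEATED_MODE = "UNEXPECTED_REPEATED_MODE"
    MISSING_DATASET = "MISSING_DATASET"
    INSUFFICIENT_PERMISSIONS = "INSUFFICIENT_PERMISSIONS"
    KEY_COLUMN_NULLABLE = "KEY_COLUMN_NULLABLE"  # strict-only

@dataclass
class BindingValidationFailure:
    code: FailureCode
    binding_element: str   # entity or relationship name in the binding
    binding_path: str      # e.g. binding.entities[3].properties[1].column
    bq_ref: str            # fully-qualified project.dataset.table[.column]
    expected: Any
    observed: Any          # None for missing-table cases
    detail: str

# Same fields, distinct type, so callers can format both uniformly while
# warnings remain semantically advisory.
class BindingValidationWarning(BindingValidationFailure):
    pass
```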

BindingValidationReport:

  • failures: list[BindingValidationFailure] — hard failures (always present in default and strict mode).
  • warnings: list[BindingValidationWarning] — strict-only checks that fired in default mode (empty under strict=True, because those checks are escalated to failures).
  • ok returns True iff failures is empty. Warnings do not affect ok — they're advisory in default mode and are escalated by strict=True.
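The report itself is small; a sketch with ok as a derived property (field types elided for brevity):

```python
from dataclasses import dataclass, field

@dataclass
class BindingValidationReport:
    failures: list = field(default_factory=list)
    warnings: list = field(default_factory=list)

    @property
    def ok(self) -> bool:
        # Warnings are advisory; only failures flip ok.
        return not self.failures
```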

CLI integration

Two surfaces:

  1. Standalone: bq-agent-sdk binding-validate --ontology X.yaml --binding Y.yaml [--project-id ...] [--location ...] [--strict]. Exits 0 if report.ok (no failures; warnings allowed unless --strict), exits 1 with a printable failure list otherwise. Warnings are printed to stderr in default mode but do not affect the exit code. With --strict, the KEY_COLUMN_NULLABLE checks escalate from warnings to failures and therefore cause exit 1.
  2. Optional pre-flight on ontology-build: --validate-binding flag (and --validate-binding-strict for the strict variant) that runs this validator before phase 2 and exits early on failures. Off by default to preserve current speed; opt-in keeps the CLI fast for users who own their tables and have already validated.

Acceptance criteria

  • validate_binding_against_bigquery(binding, ontology, bq_client) exists in bigquery_agent_analytics.binding_validation (or similar) and returns BindingValidationReport.
  • All seven default-mode failure codes (MISSING_TABLE, MISSING_COLUMN, TYPE_MISMATCH, ENDPOINT_TYPE_MISMATCH, UNEXPECTED_REPEATED_MODE, MISSING_DATASET, INSUFFICIENT_PERMISSIONS) covered by unit tests against a fake BigQuery client. KEY_COLUMN_NULLABLE covered as: (a) default-mode test asserts no failure is raised against an SDK-created table whose key columns are NULLABLE — instead a BindingValidationWarning is appended to report.warnings and report.ok stays True; (b) strict-mode test (strict=True) asserts the same input produces a BindingValidationFailure with code=KEY_COLUMN_NULLABLE and report.ok is False.
  • validate_binding_against_bigquery(..., strict=False) is the default; explicit unit test covers the strict=True escalation path.
  • bq-agent-sdk binding-validate --strict exits 1 on key-nullable violations; without --strict exits 0 and prints warnings to stderr.
  • Default-mode regression test: running the validator against tables produced by OntologyMaterializer.create_tables() returns report.ok == True. Catches the "validator rejects SDK-created tables" trap.
  • Cross-project test: a binding whose entity.source is fully qualified to a project distinct from binding.target.project validates against the entity's project, not the target's.
  • One live integration test (gated on RUN_LIVE_BIGQUERY_TESTS=1, matching the existing ontology integration test pattern at tests/test_integration_ontology_binding.py:44) creates a binding pointing at a deliberately-mismatched fixture table and asserts the report flags the mismatch with the right code.
  • bq-agent-sdk binding-validate CLI exists and exits non-zero on any failure.
  • bq-agent-sdk ontology-build --validate-binding runs the validator pre-extraction and short-circuits with a printable error list.
  • docs/ontology/binding-validation.md documents the failure codes, the API, and the CI usage pattern.

Why this is separate from #76

Different inputs, different code path, different failure modes, different fix shapes (rewrite the extractor's prompt vs. fix the binding YAML or the table DDL). Sharing a name would obscure that.

Why this is separate from the --skip-property-graph flag issue

The --skip-property-graph flag changes orchestration behavior (don't write the graph object). This issue adds a new validation surface (don't start extraction if the bound tables don't match). They compose: a user with pre-defined tables and a pre-defined property graph wants both --skip-property-graph and --validate-binding. Filing them separately keeps each PR small.

Related

Effort

2–3 eng-days. The validator itself is mechanical (BQ schema lookups + a small comparison matrix), but the unit test fixtures take time to set up well, and the CLI threading is two surfaces.
