Feat: validate ontology binding against existing BigQuery table schemas (pre-flight)
Goal
Add a pre-flight validator that checks whether the BigQuery tables a binding YAML points at physically exist with the columns and types the binding requires, before the SDK starts extraction or materialization.
This is a different validator from #76. #76 validates ExtractedGraph against ResolvedGraph (extracted output vs. logical spec). This issue validates Binding+target tables against actual BigQuery schemas (logical spec vs. physical reality).
Motivation
Today the SDK supports populating user-pre-defined BigQuery property graphs from BQ AA traces (see CLI ontology-build and the Python composition path). The binding YAML expresses node/edge table topology cleanly via EntityBinding.source + properties[].column and RelationshipBinding.source + from_columns/to_columns.
But the loader does not check whether those tables and columns actually exist with compatible types. From bigquery_ontology/binding_models.py:71–84:
"Type compatibility is not checked here — the physical column type must already match the ontology property type, upstream."
Failure modes today, all surfacing late and with poor error messages:
Missing table — surfaces during OntologyMaterializer._batch_load_table when bq_client.get_table(table_ref) raises NotFound. Materialization for that table is then logged as delete_failed or insert_failed. Other tables may proceed; the run reports partial success.
Missing column on a node table — INSERT INTO <table> (col1, col2, ...) SELECT ... FROM staging fails with a BigQuery schema error after staging data has been loaded. Cost wasted; debugging is on the user.
Type mismatch — surfaces at INSERT-from-staging time as a cast failure. The error names the BQ column, not the ontology property, so users have to map back to the binding by hand.
Wrong endpoint key column on an edge table — INSERT may succeed (if the column exists with a compatible type) but produces semantic garbage: edges referencing rows that don't exist on the endpoint table. No error raised; data quality drops silently.
For users with a pre-defined BigQuery property graph (their tables already exist, defined by Terraform / dbt / hand-authored DDL), this is the most common class of authoring mistake — the binding YAML and the physical tables drift out of sync, and the SDK only finds out after extraction has already run.
What this validator checks
The validator's first step is to call resolve(ontology, binding) and operate against the resulting ResolvedGraph. That gives it (a) the resolved per-element source via ResolvedEntity.source / ResolvedRelationship.source (which _qualify_source at resolved_spec.py:141 has already qualified, honoring fully-qualified project.dataset.table overrides over the binding target defaults), and (b) the column→SDK-type metadata via ResolvedProperty. The physical-vs-spec comparison then maps SDK property types to expected BigQuery DDL types via the materializer's _DDL_TYPE_MAP at ontology_materializer.py:125 — that's the same map the SDK uses when it generates DDL itself, so consistency is automatic.
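As a sketch, the compatibility check reduces to a dictionary lookup plus a string comparison. The map entries below are illustrative stand-ins, not the real contents of _DDL_TYPE_MAP (the authoritative entries live in ontology_materializer.py), and the SDK-side type names are assumptions:

```python
# Illustrative stand-in for the materializer's _DDL_TYPE_MAP; the real
# entries are defined in ontology_materializer.py.
_DDL_TYPE_MAP = {
    "string": "STRING",
    "int": "INT64",
    "float": "FLOAT64",
    "bool": "BOOL",
    "timestamp": "TIMESTAMP",
}

def expected_bq_type(sdk_type: str) -> str:
    """Map an ontology property type to the BQ DDL type the SDK would emit."""
    return _DDL_TYPE_MAP[sdk_type]

def types_compatible(observed_bq_type: str, sdk_type: str) -> bool:
    # Direct match only; near-matches (e.g. NUMERIC where FLOAT64 was
    # expected) are deliberately flagged rather than accepted.
    return observed_bq_type.upper() == expected_bq_type(sdk_type)
```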
Important: do not check non-nullability on key columns. The SDK's own create_tables() emits plain column definitions with no NOT NULL (ontology_materializer.py:206–213). A required-NULLABLE-mode check would reject tables created by the SDK itself. Treat REQUIRED mode as a --strict opt-in warning, not a hard failure.
Per ResolvedEntity (one per included entity binding)
Table exists. bq_client.get_table(entity.source) resolves. The source is already fully qualified by resolve().
Bound columns exist. Every ResolvedProperty.column on the entity corresponds to an actual column on the table.
Column types are compatible. The BQ schema's column type matches _DDL_TYPE_MAP[property.sdk_type] (ontology_materializer.py:125). Direct match is required; near-match types (e.g., NUMERIC where BQ FLOAT64 was expected) are flagged.
Key columns are not REPEATED. BQ ARRAY-mode columns can't carry a primary key. This is a hard failure; nullability is not.
Strict-mode only: key columns are REQUIRED (non-nullable) in the BQ schema. Off by default; gated behind --strict because the SDK's own DDL doesn't enforce it.
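The per-entity checks above could be sketched as a single pass over the fetched schema. SchemaField here mirrors the (name, field_type, mode) triple that google-cloud-bigquery exposes on table schemas; the function name and input shapes are assumptions for the sketch:

```python
from dataclasses import dataclass

@dataclass
class SchemaField:
    """Local stand-in for google.cloud.bigquery.SchemaField."""
    name: str
    field_type: str  # e.g. "INT64"
    mode: str        # "NULLABLE" | "REQUIRED" | "REPEATED"

def check_entity_columns(bound_columns, schema, expected_types, key_columns):
    """bound_columns: {column: sdk_type}; expected_types: sdk_type -> BQ type.
    Returns (code, column) findings for the per-entity checks."""
    by_name = {f.name: f for f in schema}
    findings = []
    for column, sdk_type in bound_columns.items():
        field = by_name.get(column)
        if field is None:
            findings.append(("MISSING_COLUMN", column))
        elif field.field_type != expected_types[sdk_type]:
            findings.append(("TYPE_MISMATCH", column))
    for column in key_columns:
        field = by_name.get(column)
        # ARRAY-mode columns can't carry a key: hard failure regardless of mode flags.
        if field is not None and field.mode == "REPEATED":
            findings.append(("UNEXPECTED_REPEATED_MODE", column))
    return findings
```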
Per ResolvedRelationship (one per included relationship binding)
Table exists. bq_client.get_table(relationship.source) resolves.
Endpoint columns exist. Every column in from_columns and to_columns exists on the edge table.
Endpoint column types match the referenced entity's primary-key column types. Cross-table consistency check that today only surfaces at INSERT time as a cast failure.
Strict-mode only: endpoint columns are REQUIRED. Off by default.
Per binding root
Every resolved table reference is accessible. For each unique entity.source and relationship.source, get_table(...) succeeds and the calling identity has at minimum bigquery.tables.get. Failures are reported best-effort with code=INSUFFICIENT_PERMISSIONS or code=MISSING_DATASET so users see actionable IAM errors before extraction starts.
(Replaces the previous root-level "target.project/target.dataset exists" framing — sources can be fully qualified and override the target defaults via _qualify_source, so per-source accessibility is the right unit.)
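The accessibility probe might look like the following. NotFound and Forbidden are local stand-ins for google.api_core.exceptions.NotFound / Forbidden (kept local so the sketch is self-contained), and the dataset-vs-table disambiguation heuristic is an assumption, not SDK behavior:

```python
class NotFound(Exception):
    """Stand-in for google.api_core.exceptions.NotFound."""

class Forbidden(Exception):
    """Stand-in for google.api_core.exceptions.Forbidden."""

def probe_sources(get_table, sources):
    """Probe each unique resolved table reference once, best-effort.
    get_table is a callable raising NotFound/Forbidden on failure."""
    failures = []
    for ref in sorted(set(sources)):  # dedupe: entities may share tables
        try:
            get_table(ref)
        except Forbidden:
            failures.append(("INSUFFICIENT_PERMISSIONS", ref))
        except NotFound as exc:
            # A missing dataset and a missing table both raise NotFound;
            # telling them apart via the message is a heuristic assumed here.
            code = "MISSING_DATASET" if "dataset" in str(exc).lower() else "MISSING_TABLE"
            failures.append((code, ref))
    return failures
```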
Out of scope for this validator:
Whether the property graph object exists or is consistent with the binding. That is a separate "graph-vs-binding" concern; not all users have a property graph, and the base-table contract is the actual durable constraint.
Row-level data validation (e.g., do the existing rows satisfy the ontology). That is the materializer's behavior post-load.
Proposed API
validate_binding_against_bigquery(binding, ontology, bq_client, strict: bool = False) -> BindingValidationReport
strict: bool = False — when False (default), strict-only checks (today: KEY_COLUMN_NULLABLE) emit BindingValidationWarning entries instead of failures. When True, the same checks emit BindingValidationFailure entries with the same code. Default is permissive so the validator does not reject tables produced by the SDK's own CREATE TABLE IF NOT EXISTS DDL.
BindingValidationFailure carries:
code: FailureCode — typed enum: MISSING_TABLE, MISSING_COLUMN, TYPE_MISMATCH, ENDPOINT_TYPE_MISMATCH, UNEXPECTED_REPEATED_MODE, MISSING_DATASET, INSUFFICIENT_PERMISSIONS. Strict-only (escalates from warning to failure under strict=True): KEY_COLUMN_NULLABLE.
binding_element: str — the entity or relationship name in the binding.
binding_path: str — binding.entities[3].properties[1].column-style path for tooling.
bq_ref: str — fully-qualified project.dataset.table[.column] the failure is about.
expected: Any — what the binding declared.
observed: Any — what BigQuery reports (may be None for missing-table cases).
detail: str — human-readable.
BindingValidationWarning carries the same fields as BindingValidationFailure (so callers can format them uniformly). The distinction is semantic: warnings do not flip report.ok to False.
BindingValidationReport:
failures: list[BindingValidationFailure] — hard failures (always present in default and strict mode).
warnings: list[BindingValidationWarning] — warnings emitted by strict-only checks in default mode (empty under strict=True because those checks are escalated to failures).
ok returns True iff failures is empty. Warnings do not affect ok — they're advisory in default mode and are escalated by strict=True.
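One way these report types could be laid out, as a sketch following the fields above (the enum base class, the subclass relationship between warning and failure, and the field types are assumptions, not the SDK's actual definitions):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any

class FailureCode(str, Enum):
    MISSING_TABLE = "MISSING_TABLE"
    MISSING_COLUMN = "MISSING_COLUMN"
    TYPE_MISMATCH = "TYPE_MISMATCH"
    ENDPOINT_TYPE_MISMATCH = "ENDPOINT_TYPE_MISMATCH"
    UNEXPECTED_REPEATED_MODE = "UNEXPECTED_REPEATED_MODE"
    MISSING_DATASET = "MISSING_DATASET"
    INSUFFICIENT_PERMISSIONS = "INSUFFICIENT_PERMISSIONS"
    KEY_COLUMN_NULLABLE = "KEY_COLUMN_NULLABLE"  # strict-only escalation

@dataclass
class BindingValidationFailure:
    code: FailureCode
    binding_element: str   # entity/relationship name in the binding
    binding_path: str      # e.g. binding.entities[3].properties[1].column
    bq_ref: str            # project.dataset.table[.column]
    expected: Any
    observed: Any          # may be None for missing-table cases
    detail: str

# Same fields so callers can format both uniformly; only semantics differ.
class BindingValidationWarning(BindingValidationFailure):
    pass

@dataclass
class BindingValidationReport:
    failures: list = field(default_factory=list)
    warnings: list = field(default_factory=list)

    @property
    def ok(self) -> bool:
        # Warnings are advisory; only hard failures flip ok.
        return not self.failures
```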
CLI integration
Two surfaces:
Standalone: bq-agent-sdk binding-validate --ontology X.yaml --binding Y.yaml [--project-id ...] [--location ...] [--strict]. Exits 0 if report.ok (no failures; warnings allowed unless --strict), exits 1 with a printable failure list otherwise. Warnings are printed to stderr in default mode but do not flip the exit code. With --strict, warnings escalate to failures and the same KEY_COLUMN_NULLABLE checks become exit-1.
Optional pre-flight on ontology-build: --validate-binding flag (and --validate-binding-strict for the strict variant) that runs this validator before phase 2 and exits early on failures. Off by default to preserve current speed; opt-in keeps the CLI fast for users who own their tables and have already validated.
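The standalone surface's exit-code policy can be sketched as follows (function name hypothetical; the report is duck-typed, and strict escalation is assumed to have already happened inside the validator, so no extra branching is needed here):

```python
import sys

def run_binding_validate(report) -> int:
    """Print warnings/failures to stderr; return the process exit code.
    Warnings stay advisory (exit 0); any failure means exit 1."""
    for warning in report.warnings:
        print(f"warning: {warning}", file=sys.stderr)
    if report.failures:
        for failure in report.failures:
            print(f"failure: {failure}", file=sys.stderr)
        return 1
    return 0
```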
Acceptance criteria
validate_binding_against_bigquery(binding, ontology, bq_client) exists in bigquery_agent_analytics.binding_validation (or similar) and returns BindingValidationReport.
All seven default-mode failure codes (MISSING_TABLE, MISSING_COLUMN, TYPE_MISMATCH, ENDPOINT_TYPE_MISMATCH, UNEXPECTED_REPEATED_MODE, MISSING_DATASET, INSUFFICIENT_PERMISSIONS) covered by unit tests against a fake BigQuery client. KEY_COLUMN_NULLABLE covered as: (a) default-mode test asserts no failure is raised against an SDK-created table whose key columns are NULLABLE — instead a BindingValidationWarning is appended to report.warnings and report.ok stays True; (b) strict-mode test (strict=True) asserts the same input produces a BindingValidationFailure with code=KEY_COLUMN_NULLABLE and report.ok is False.
validate_binding_against_bigquery(..., strict=False) is the default; explicit unit test covers the strict=True escalation path.
bq-agent-sdk binding-validate --strict exits 1 on key-nullable violations; without --strict exits 0 and prints warnings to stderr.
Default-mode regression test: running the validator against tables produced by OntologyMaterializer.create_tables() returns report.ok == True. Catches the "validator rejects SDK-created tables" trap.
Cross-project test: a binding whose entity.source is fully qualified to a project distinct from binding.target.project validates against the entity's project, not the target's.
One live integration test (gated on RUN_LIVE_BIGQUERY_TESTS=1, matching the existing ontology integration test pattern at tests/test_integration_ontology_binding.py:44) creates a binding pointing at a deliberately-mismatched fixture table and asserts the report flags the mismatch with the right code.
bq-agent-sdk binding-validate CLI exists and exits non-zero on any failure.
bq-agent-sdk ontology-build --validate-binding runs the validator pre-extraction and short-circuits with a printable error list.
docs/ontology/binding-validation.md documents the failure codes, the API, and the CI usage pattern.
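The KEY_COLUMN_NULLABLE criteria reduce to asserting the warning-vs-failure routing. A minimal test sketch, with a plain dict standing in for the report and a hypothetical emit helper:

```python
def emit_key_nullable(report, finding, strict):
    """Route a strict-only finding: warning by default, failure under strict."""
    target = report["failures"] if strict else report["warnings"]
    target.append(finding)

def test_default_mode_warns():
    report = {"failures": [], "warnings": []}
    emit_key_nullable(report, "KEY_COLUMN_NULLABLE", strict=False)
    assert report["warnings"] == ["KEY_COLUMN_NULLABLE"]
    assert report["failures"] == []  # report.ok stays True

def test_strict_mode_fails():
    report = {"failures": [], "warnings": []}
    emit_key_nullable(report, "KEY_COLUMN_NULLABLE", strict=True)
    assert report["failures"] == ["KEY_COLUMN_NULLABLE"]
    assert report["warnings"] == []
```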
Why this is separate from #76
#76 validates extracted output against the logical spec; its inputs are ResolvedGraph + ExtractedGraph. This issue validates that the binding declared in YAML lines up with the physical tables it references on BigQuery. Run before extraction. Inputs: Binding + Ontology + a live BQ client.
Different inputs, different code path, different failure modes, different fix shapes (rewrite the extractor's prompt vs. fix the binding YAML or the table DDL). Sharing a name would obscure that.
Why this is separate from the --skip-property-graph flag issue
The --skip-property-graph flag changes orchestration behavior (don't write the graph object). This issue adds a new validation surface (don't start extraction if the bound tables don't match). They compose: a user with pre-defined tables and a pre-defined property graph wants both --skip-property-graph and --validate-binding. Filing them separately keeps each PR small.
Related
--skip-property-graph flag — companion issue (filed separately).
Effort
2–3 eng-days. The validator itself is mechanical (BQ schema lookups + a small comparison matrix), but the unit test fixtures take time to set up well, and the CLI threading is two surfaces.