Epic: Compile-time code generation for structured trace extractors (scoped rework)

# Epic: Compile-time code generation for structured trace extractors

> **Scope revision (based on review feedback, 2026-04-23)**: the original proposal here treated this as a single implementation. That was too broad. Review found (a) the per-field fallback depended on ontology-level validation that doesn't exist today, (b) the runtime execution target was underspecified — current extraction runs *inside BigQuery* as server-side `AI.GENERATE`, not client-side Python, and emitting Python bundles is an architectural choice, not an implementation detail, (c) the proposal overgeneralized the paper's wins on structured workflow tasks to free-text extraction from arbitrary `LLM_RESPONSE` prose, and (d) the "every trace → LLM call" framing was imprecise — `ontology_graph.py` already aggregates events per session before the LLM call, only `context_graph.py`'s BizNode path is row-level. Reworking as an epic with scoped phases, prerequisites, and an explicit Phase 1 limited to compiling deterministic extractors from **known structured event schemas** (where the paper's evidence maps cleanly). Runtime `AI.GENERATE` stays as the semantic fallback until Phase 1 has measured precision/recall on real traces.

## Motivation

[arXiv:2604.05150 — *Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation*](https://arxiv.org/abs/2604.05150) (Trooskens et al., 2026, submitted April 6, 2026) treats the LLM as a **compile-time** code generator whose output is constrained to fill validated templates. Reported results on comparable tasks: 96% completion with zero execution tokens on function-calling (BFCL, n=400), 57× token reduction at 1,000 transactions, 80.4% accuracy on structured document extraction (DocILE, n=5,680).

The paper's evidence is strongest on **structured-to-structured** transformations — the target schema is known, the input has exploitable shape, and the task repeats at scale. That's exactly the profile of the SDK's structured event extractors (`structured_extraction.py`). It is **not** the profile of free-text semantic extraction from `LLM_RESPONSE` prose, which this issue no longer tries to compile in Phase 1.

## Current architecture (accurate characterization)

Three distinct extraction paths, with different costs:

1. **`context_graph.py:299` — row-level BQ-native `AI.GENERATE`** inside SQL/MERGE, invoked per row to pull business entities from `LLM_RESPONSE` text. This is the expensive, open-ended, free-text path.
2. **`ontology_graph.py:100, 631` — session-aggregated BQ-native `AI.GENERATE`**. Events for a session are assembled into a transcript inside SQL, then a single `AI.GENERATE` call produces JSON shaped by the compiled ontology schema. One LLM call per session, not per row — already amortized. This path is what the paper's evidence most-closely matches, but it's also the path where the validator and runtime-target prerequisites bite.
3. **`structured_extraction.py` — pure-Python structured extractors**. A registry of typed extractors (e.g., `extract_bka_decision_event`) that convert specific event shapes into `ExtractedNode` / `ExtractedEdge` without calling an LLM at all. Already deterministic. This is the natural **Phase 1 expansion point**.

The SDK isn't uniformly "one LLM call per trace." It's a three-tier hierarchy — deterministic registry → session-aggregated LLM → row-level LLM — and the right compilation target is the middle tier, only after the prerequisites below are in place.

## Prerequisites (Phase 0) — must land before any compilation work

### P0.1 — Ontology-aware validator *(tracked at #76; this section is a summary, **#76 is the source of truth**)*

> **Status (2026-05-02):** P0.1 is now scoped at [#76](https://github.com/GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK/issues/76). The signature and key-validation scope below have shifted since this section was written. Read #76 for the current contract; the bullets below are kept for historical context only.
>
> **What changed in #76 vs. this section:**
> - **Signature**: `validate_extracted_graph(spec: ResolvedGraph, graph: ExtractedGraph)` — `ResolvedGraph` is the runtime-facing spec surface (per `resolved_spec.py`), not `OntologySpec`. A thin `validate_extracted_graph_from_ontology(ontology, binding, graph)` adapter is provided for callers holding upstream `Ontology` + `Binding` instead.
> - **Key-validation scope**: primary keys + edge endpoint keys only for first landing. **Alternate-key validation is deferred** because `ResolvedEntity` doesn't currently carry resolved alternate-key metadata (it would need a new `alternate_key_columns` field built by `resolve()`). Tracked as a follow-up.
> - **Enum membership** is also deferred for the same reason — `ResolvedProperty` carries no enum value list today.

`extracted_models.py:18` defines `ExtractedProperty.value: Any`. `ExtractedGraph` validates container shape (nodes/edges are lists, keys present) but not ontology correctness: unknown field names survive, type mismatches pass, unknown entity types aren't rejected. `ontology_materializer.py:263` silently drops unknown fields and lets missing edge keys become empty strings.

The validator (per #76) checks:

- Every `ExtractedNode.entity_name` matches a declared entity in the spec.
- Every `ExtractedProperty.name` on a node or edge exists on that entity/relationship in the spec, and `value` satisfies the declared type (per ontology v0: scalar-ish types plus `json`; arrays and structs are explicitly deferred in `docs/ontology/ontology.md:293` and should be modeled as separate entities + relationships, not as nested properties).
- Every edge's `from_node_id` and `to_node_id` resolve to nodes in the graph or to external node-refs matching the declared endpoint entity.
- **Required keys are present**. Per #76's first-landing scope, "required" means entity **primary keys** (from `ResolvedEntity.key_columns`) and **edge endpoint keys** (from `ResolvedRelationship.from_columns` / `to_columns`) only — not alternate keys, not every declared property. A property that isn't a key may legitimately be absent (partial extraction is valid and common).

#### ValidationReport must classify failures for fallback-granularity

The report returns a list of issues, each tagged with its **fallback scope**, because the runtime needs to know the smallest safe unit to replace:

- **`field`** — property type mismatch or unknown property on an otherwise well-formed node. Safe fallback unit: re-extract that one property. Compiled path keeps the rest of the node. *(Enum-membership miss is deferred from #76's first landing — `ResolvedProperty` does not carry enum value lists today. Once that field is added upstream, enum-miss becomes a future `field`-scope failure.)*
- **`node`** — missing key, malformed `node_id`, unknown `entity_name`. Safe fallback unit: re-extract the whole node (and any edges referencing it by `node_id`). Cannot recover at field level because the node's identity is broken.
- **`edge`** — unresolved `from_node_id` / `to_node_id`, missing endpoint key, wrong endpoint entity type. Safe fallback unit: re-extract the whole edge. Cannot recover at field level because the endpoints define the edge's identity.
- **`event`** — the compiled extractor's entire output for this event is structurally invalid (zero nodes when the event type guarantees at least one, or every emitted node/edge failed). Safe fallback unit: re-run the entire event through the fallback extractor.

Runtime consumes `ValidationReport.failures` by scope tag and invokes the smallest-unit fallback. Per-field fallback on a type mismatch does not trigger a whole-event LLM call; a missing node's key does.

**Why this is a prerequisite**: the per-field fallback model in the original proposal assumed "Pydantic validation miss → LLM fallback for that field." With `value: Any`, Pydantic never misses. There's nothing to fall back on. The validator is what makes per-field fallback meaningful — and, per the granularity model above, also what makes per-node / per-edge / per-event fallback meaningful. The validator is useful independently (validating any extraction, LLM or deterministic, against the ontology). **P0.1 ships as its own issue before any compilation work starts.**

### P0.2 — Runtime execution target decision

The current `AI.GENERATE` path runs **inside BigQuery as server-side SQL**. Any compiled extractor has to execute somewhere. Three options, with real tradeoffs:

| Option | Execution location | Latency | Cost model | Complexity |
|---|---|---|---|---|
| **A. Client-side Python** | SDK process pulls events back, runs extractor, re-writes results | BQ round-trip per extraction batch | No BQ LLM cost; client compute time | Lowest — plain Python |
| **B. BigQuery Remote Function** | Extractor wrapped as a deployed Cloud Run endpoint; called from SQL | In-SQL, one HTTP hop per row/batch | Cloud Run cost + deploy surface | Highest — deploy pipeline, IAM, versioning |
| **C. BigQuery SQL / JavaScript UDF** | Generated extractor compiled to SQL + UDFs; runs entirely in BQ | In-SQL, no network hop | Slot time only | Middle — UDF translation layer + SQL testability |

This is a design decision the epic must settle before Phase 1 begins, because the "compile target language" changes fundamentally across the three. Recommendation, subject to review:

- **Phase 1**: Option A (client-side Python). Fastest to build, easiest to test, matches the pattern already used by `structured_extraction.py`. Accepts a round-trip cost because Phase 1 traces are already being fetched client-side for materialization anyway.
- **Phase 2 (if Phase 1 validates)**: evaluate Option C for the BQ-native path, keeping Option A as the default. Option C unlocks running compiled extractors inside the same SQL that produces the graph tables — matches the current `AI.GENERATE`-in-SQL pattern and removes the round-trip.
- **Option B stays off the table** unless there's a concrete user need for it — the deploy surface is disproportionate to the problem.

**Why this is a prerequisite**: "emit a Python bundle" is only one of three plausible answers and the original proposal silently chose it. The tradeoffs are large enough that the choice needs to be explicit.

## Phase 1 — compile deterministic structured extractors from known event schemas

Narrowed from the original scope. **Not** touching `ontology_graph.py`'s session-aggregated path or `context_graph.py`'s free-text `LLM_RESPONSE` extraction yet. Phase 1 targets **the existing `structured_extraction.py` registry** as the expansion point.

### Phase 1 value proposition

Phase 1 does **not** reduce runtime token cost compared to the current hand-written registry, because both are already zero-token deterministic Python. The `AI.GENERATE` baseline is only relevant as a ground-truth comparison and as the Phase-0 fallback when no hand-written extractor exists.

What Phase 1 actually delivers:

- **Authoring scale** — writing a new structured extractor goes from "write + review a Python function" to "declare the event-payload shape and extraction rules, let the compiler emit the function." The LLM's productivity benefit moves from runtime to author time, and stays one-off rather than per-event.
- **Safety checks applied uniformly** — every compiled extractor runs through the AST validator + ontology validator before acceptance. Hand-written extractors today rely on reviewer diligence; compiled extractors can't skip the gate.
- **Reproducibility and fingerprinted provenance** — compiled bundles carry a deterministic hash of their inputs (see the Fingerprint section). Two compile runs on the same inputs produce byte-identical code; drift is detected, not assumed.
- **Shared validation infrastructure** — the smoke-test harness, the ontology validator, and the revalidation job built for Phase 1 are reused verbatim in Phase 2. Phase 1's infrastructure investment is what makes Phase 2 tractable.

**Token-cost reporting is still wired in**, because it's the measurement that matters for the `AI.GENERATE` fallback path and for Phase 2's session-aggregated target. Phase 1 just doesn't *lead* with a token-cost claim.

### What gets compiled

The structured extractors that today live as hand-written Python functions (`extract_bka_decision_event` and siblings). The **input** is a well-defined event payload shape (fields in `content`, `attributes`, `content_parts` carrying typed values). The **output** is typed `ExtractedNode` / `ExtractedEdge` instances with ontology-valid property values. This is the profile the paper's evidence covers cleanly.

### Compile phase (runs once per `(ontology, binding, event_schema, extraction_rules, compiler_version)`)

1. **Input**: the ontology YAML, binding YAML, event schema (which `event_type` values and their expected payload shapes), per-event extraction rules (`event_type X → entity Y with properties from fields {...}`), and a sample of ≥ 100 real events per covered `event_type`.
2. **LLM step**: prompt the LLM to fill validated Jinja templates — one template per field-kind supported by ontology v0: **scalar** (`string`, `int64`, `float64`, `bool`, `date`, `datetime`, `timestamp`, `time`, `bytes`, `numeric`, `decimal`), **json** (opaque structured blob), and **reference** (to another entity's key). Arrays and structs are explicitly deferred in ontology v0 (`docs/ontology/ontology.md:293`) and are modeled as separate entities + relationships, not as nested properties — so Phase 1 templates do not cover them. Enums are represented via `json` plus an in-template membership check against a declared value list; if the ontology model later grows a first-class `enum` type, it gets its own template then. Generation is constrained; the LLM cannot emit free-form code, only field-level template fills.

The distinction between the ontology model's field kinds (constrained by ontology v0) and the **event payload shape** (which may have richer structure, e.g., arrays of tool-result objects in a `TOOL_COMPLETED` event's `content`) is handled separately. Event-payload arrays are flattened into repeated `ExtractedEdge` or `ExtractedNode` emissions by the extraction rules, not carried through as array properties. `event_schema` and `extraction_rules` are the place to declare this shape; they are a separate input to the compiler, not part of the ontology.
3. **Static check**: AST-validate the emitted code. Reject anything that's not pure-Python, reads from unexpected fields, calls out-of-allowlist functions, or has side effects.
4. **Smoke test**: run the generated extractor on the sample set. For each event, compare the generated extractor's output to a reference output produced by (a) the hand-written extractor, if one exists, or (b) `AI.GENERATE` with the same prompt as Phase-0 fallback. Accept iff field-level F1 ≥ threshold (threshold TBD per extractor type; proposed start: 0.95).
5. **Ontology validation**: run the full `validate_extracted_graph` from P0.1 over the smoke-test outputs. Any validator failure blocks acceptance.

If any of stages 3–5 fails, the compile run fails and the hand-written or `AI.GENERATE` extractor continues to be used.

### Runtime phase

- `run_structured_extractors()` loads compiled bundles from `compiled_extractors/<fingerprint>/` if a bundle matches the active `(ontology, binding, event_schema, compiler_version)`. Otherwise falls back to the hand-written registry entries.
- Per-field fallback (now meaningful because P0.1 ships): if a compiled extractor produces a node where `validate_extracted_graph` flags a field, that field's value is replaced by the hand-written or LLM-based extraction result for that field only. Logged with a trace-shape signature so the next compile run can cover it.
- **C1** does not touch `ontology_graph.py` or `context_graph.py`. The compile harness, smoke tests, and measurement report are self-contained.
- **C2** plugs compiled-bundle loading into the existing `run_structured_extractors()` hook (`structured_extraction.py:198–232`). The pruning + hint behavior at `ontology_graph.py:540–571` is reused unchanged — compiled extractors emit `StructuredExtractionResult` with the right `fully_handled_span_ids` / `partially_handled_span_ids` partitioning, and the existing AI SQL semantics (`_EXTRACT_ONTOLOGY_AI_QUERY` at `ontology_graph.py:90–136`) are unchanged. C2's only change to the AI side is what's *passed* in the `excluded_span_ids` / `partial_span_ids` parameters, not how the query itself is built.

### Fingerprint (expanded per review)

`sha256(ontology, binding, event_schema, event_allowlist, transcript_builder_version, content_serialization_rules, extraction_rules, template_version, compiler_package_version)`

Changes in the trace-shape dependencies invalidate the bundle just as surely as changes in the ontology do. Stale bundles refuse to load and the runtime falls back to hand-written / LLM extraction until `gm compile-extractors` reruns.

### Measured outcomes before Phase 2 proceeds

Phase 1 must produce, against a reference ontology + real trace corpus:

- **Extractor F1** per compiled extractor, vs hand-written and vs `AI.GENERATE`. Measured at the field level on the smoke-test corpus.
- **Per-event extractor latency** for the deterministic compiled path. Measured locally; this is the compile-time-replacing path, not the AI billing path.
- **Fallback rate** per event type — how often validation-gated fallback kicks in (FIELD / NODE / EDGE / EVENT scope per #76).
- **C2-only (deferred to C2 measurement)**: session-level prompt-size estimates (transcript chars before/after pruning) and job-level BigQuery stats (`JOBS_BY_PROJECT.total_bytes_processed`, `total_slot_ms`) for the build's BQ jobs. Per-event token attribution is not available — billing is session-aggregated and the AI query (`ontology_graph.py:90–136`) returns only `(session_id, graph_json)` with no per-row usage metadata. (Aligned with #107's billing-honest column set.)
- Rate of per-field fallbacks actually triggered on a holdout trace set.

**If F1 < 0.95 or fallback rate > 10%, Phase 2 does not proceed.** The structured-extractor compilation has to beat or match the hand-written baseline before taking on the harder free-text case.

## Phase 2 — compile session-aggregated ontology-graph extractors (only after Phase 1 validates)

Only begun after Phase 1's measurements show the compile→validate loop works. Targets `ontology_graph.extract_graph()`'s session-aggregated `AI.GENERATE` path. This is the tier where the paper's "57× token reduction" has the most direct mapping, but it also has more open-ended inputs (the session transcript is not a known schema the way an individual event payload is).

Phase 2 scope is intentionally deferred until Phase 1 data is in. If Phase 1 shows deterministic extractors can match `AI.GENERATE` on structured events with > 0.95 F1 and < 10% fallback, Phase 2 can reuse the compiler infrastructure. If not, Phase 2 doesn't happen and the epic's scope contracts to Phase 1 only.

The runtime-target decision from P0.2 is re-evaluated here. Option A (client-side Python) may no longer be acceptable at session scale; Option C (SQL UDF) becomes the likely target, since session-aggregated extraction currently runs in SQL.

## Phase 3 — free-text extraction (explicitly out of scope)

`context_graph.py`'s row-level `AI.GENERATE` extraction of business entities from `LLM_RESPONSE` text is **not** proposed for compilation in this epic. The paper's evidence doesn't cover this profile, and the review correctly flagged that deterministic generated code is much less convincing there.

A separate issue may propose structured-NLU approaches (entity recognizers, intent classifiers) for that path later. This epic does not commit to it.

## Revalidation harness (shared across phases once compilation lands)

Scheduled or on-demand job:

1. Sample N recent events matching a covered event schema.
2. Run both the compiled extractor and a reference path (hand-written if one exists, else `AI.GENERATE`).
3. Report agreement rate + per-field disagreement table.
4. When agreement drops below threshold, surface a recompile recommendation in the SDK's health check: *"Compiled extractor for event_type X agreement dropped to 87% over the last 500 events. Recompile recommended."*

No auto-recompile — the compile-time LLM call is the trust boundary. Human decision to re-run, backed by the measurement.

## Risks + mitigations (revised)

- **Novel event-payload shapes.** Addressed per-field, now meaningful because of P0.1. Revalidation harness catches distributional drift.
- **Fingerprint too narrow.** Expanded per review to cover trace-shape dependencies; stale bundles fail loud.
- **Runtime target changes between phases.** P0.2 decision is reviewed again at Phase 2. Client-side Python for Phase 1 keeps the commitment small.
- **Phase 1 result doesn't support Phase 2.** Epic scope contracts to Phase 1 only. No sunk-cost pressure to force compilation into the session-aggregated or free-text paths.
- **Compile cost.** Front-loaded, amortized. Uses the unified token-budget config from [#69](https://github.com/GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK/issues/69).
- **Debugging.** Generated bundles are Python; stack traces point at the template that produced the bad extractor. Checked into the repo alongside the ontology or stored as a versioned sidecar dataset.

## Open questions

1. **P0.1 first, or in parallel with Phase 1 scaffolding?** Proposal: P0.1 first as a standalone PR, because it's useful independently (validates any extraction output) and unblocks meaningful discussion of the Phase 1 fallback semantics.
2. **Phase 1 F1 threshold.** Proposed 0.95 as the accept bar. Too strict if hand-written baselines themselves don't hit 0.95 on production traces? Worth measuring hand-written baselines first.
3. **Which event types get compiled first in Phase 1.** Today `structured_extraction.py` has exactly one hand-written extractor: `extract_bka_decision_event` at [structured_extraction.py:120](https://github.com/GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK/blob/main/src/bigquery_agent_analytics/structured_extraction.py#L120). The first implementation slice is therefore **"compile `extract_bka_decision_event` first,"** with the hand-written version as the F1 ground truth for its own compiled replacement. Before promoting compilation to a general solution, add one or two more hand-written extractors (likely `TOOL_COMPLETED` result shapes and a HITL event) as hand-written baselines, so the smoke-test harness has multi-extractor coverage and the F1 metric isn't a single-point measurement.
4. **Where compiled bundles live.** Checked into the SDK-using repo next to `ontology.yaml` (auditable, reviewable), emitted as a versioned BQ table (runtime-discoverable), or both. Leaning both — in-repo file is source of truth, BQ-table mirror is for runtime discovery.
5. **Revalidation cadence.** Scheduled (daily / weekly) vs on-demand vs triggered by the SDK's doctor check?

## Related work in-repo

- [#57](https://github.com/GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK/issues/57) — SKOS import support. Phase 1 compiled extractors must cover SKOS-derived abstract entities + `skos_`-prefixed relationships as part of the template set.
- [#58](https://github.com/GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK/issues/58) — Runtime entity-resolution primitives. Shares the fingerprint-versioned-artifact pattern with this epic; compiled extractor bundles should live under the same provenance contract (compile-id in a sidecar table).
- [#69](https://github.com/GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK/issues/69) — LLM judger improvements. The unified token-budget config proposed there applies to the compile-time LLM call here. The "compile rubrics into deterministic sub-checks" direction for judges is a parallel application of the same idea, not blocked by this epic.

## Reference

- Trooskens G., Karlsberg A., Sharma A., De Brouwer L., Van Puyvelde M., Young M., Thickstun J., Alterovitz G., De Brouwer W. A. *Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation*. arXiv:2604.05150 (2026). https://arxiv.org/abs/2604.05150




Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Epic: Compile-time code generation for structured trace extractors (scoped rework) #75

Epic: Compile-time code generation for structured trace extractors

Motivation

Current architecture (accurate characterization)

Prerequisites (Phase 0) — must land before any compilation work

P0.1 — Ontology-aware validator (tracked at #76; this section is a summary, #76 is the source of truth)

ValidationReport must classify failures for fallback-granularity

P0.2 — Runtime execution target decision

Phase 1 — compile deterministic structured extractors from known event schemas

Phase 1 value proposition

What gets compiled

Compile phase (runs once per `(ontology, binding, event_schema, extraction_rules, compiler_version)`)

Runtime phase

Fingerprint (expanded per review)

Measured outcomes before Phase 2 proceeds

Phase 2 — compile session-aggregated ontology-graph extractors (only after Phase 1 validates)

Phase 3 — free-text extraction (explicitly out of scope)

Revalidation harness (shared across phases once compilation lands)

Risks + mitigations (revised)

Open questions

Related work in-repo

Reference

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Option	Execution location	Latency	Cost model	Complexity
A. Client-side Python	SDK process pulls events back, runs extractor, re-writes results	BQ round-trip per extraction batch	No BQ LLM cost; client compute time	Lowest — plain Python
B. BigQuery Remote Function	Extractor wrapped as a deployed Cloud Run endpoint; called from SQL	In-SQL, one HTTP hop per row/batch	Cloud Run cost + deploy surface	Highest — deploy pipeline, IAM, versioning
C. BigQuery SQL / JavaScript UDF	Generated extractor compiled to SQL + UDFs; runs entirely in BQ	In-SQL, no network hop	Slot time only	Middle — UDF translation layer + SQL testability

Epic: Compile-time code generation for structured trace extractors (scoped rework) #75

Description

Epic: Compile-time code generation for structured trace extractors

Motivation

Current architecture (accurate characterization)

Prerequisites (Phase 0) — must land before any compilation work

P0.1 — Ontology-aware validator (tracked at #76; this section is a summary, #76 is the source of truth)

ValidationReport must classify failures for fallback-granularity

P0.2 — Runtime execution target decision

Phase 1 — compile deterministic structured extractors from known event schemas

Phase 1 value proposition

What gets compiled

Compile phase (runs once per (ontology, binding, event_schema, extraction_rules, compiler_version))

Runtime phase

Fingerprint (expanded per review)

Measured outcomes before Phase 2 proceeds

Phase 2 — compile session-aggregated ontology-graph extractors (only after Phase 1 validates)

Phase 3 — free-text extraction (explicitly out of scope)

Revalidation harness (shared across phases once compilation lands)

Risks + mitigations (revised)

Open questions

Related work in-repo

Reference

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Compile phase (runs once per `(ontology, binding, event_schema, extraction_rules, compiler_version)`)