Epic: Compile-time code generation for structured trace extractors
Scope revision (based on review feedback, 2026-04-23): the original proposal here treated this as a single implementation. That was too broad. Review found (a) the per-field fallback depended on ontology-level validation that doesn't exist today, (b) the runtime execution target was underspecified — current extraction runs inside BigQuery as server-side AI.GENERATE, not client-side Python, and emitting Python bundles is an architectural choice, not an implementation detail, (c) the proposal overgeneralized the paper's wins on structured workflow tasks to free-text extraction from arbitrary LLM_RESPONSE prose, and (d) the "every trace → LLM call" framing was imprecise — ontology_graph.py already aggregates events per session before the LLM call, only context_graph.py's BizNode path is row-level. Reworking as an epic with scoped phases, prerequisites, and an explicit Phase 1 limited to compiling deterministic extractors from known structured event schemas (where the paper's evidence maps cleanly). Runtime AI.GENERATE stays as the semantic fallback until Phase 1 has measured precision/recall on real traces.
Motivation
arXiv:2604.05150 — Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation (Trooskens et al., 2026, submitted April 6, 2026) treats the LLM as a compile-time code generator whose output is constrained to fill validated templates. Reported results on comparable tasks: 96% completion with zero execution tokens on function-calling (BFCL, n=400), 57× token reduction at 1,000 transactions, 80.4% accuracy on structured document extraction (DocILE, n=5,680).
The paper's evidence is strongest on structured-to-structured transformations — the target schema is known, the input has exploitable shape, and the task repeats at scale. That's exactly the profile of the SDK's structured event extractors (structured_extraction.py). It is not the profile of free-text semantic extraction from LLM_RESPONSE prose, which this issue no longer tries to compile in Phase 1.
Current architecture (accurate characterization)
Three distinct extraction paths, with different costs:
context_graph.py:299 — row-level BQ-native AI.GENERATE inside SQL/MERGE, invoked per row to pull business entities from LLM_RESPONSE text. This is the expensive, open-ended, free-text path.
ontology_graph.py:100, 631 — session-aggregated BQ-native AI.GENERATE. Events for a session are assembled into a transcript inside SQL, then a single AI.GENERATE call produces JSON shaped by the compiled ontology schema. One LLM call per session, not per row — already amortized. This path is what the paper's evidence most-closely matches, but it's also the path where the validator and runtime-target prerequisites bite.
structured_extraction.py — pure-Python structured extractors. A registry of typed extractors (e.g., extract_bka_decision_event) that convert specific event shapes into ExtractedNode / ExtractedEdge without calling an LLM at all. Already deterministic. This is the natural Phase 1 expansion point.
The SDK isn't uniformly "one LLM call per trace." It's a three-tier hierarchy — deterministic registry → session-aggregated LLM → row-level LLM — and the right compilation target is the middle tier, only after the prerequisites below are in place.
Prerequisites (Phase 0) — must land before any compilation work
P0.1 — Ontology-aware validator (tracked at #76; this section is a summary, #76 is the source of truth)
Status (2026-05-02): P0.1 is now scoped at #76. The signature and key-validation scope below have shifted since this section was written. Read #76 for the current contract; the bullets below are kept for historical context only.
Signature: validate_extracted_graph(spec: ResolvedGraph, graph: ExtractedGraph) — ResolvedGraph is the runtime-facing spec surface (per resolved_spec.py), not OntologySpec. A thin validate_extracted_graph_from_ontology(ontology, binding, graph) adapter is provided for callers holding upstream Ontology + Binding instead.
Key-validation scope: primary keys + edge endpoint keys only for first landing. Alternate-key validation is deferred because ResolvedEntity doesn't currently carry resolved alternate-key metadata (it would need a new alternate_key_columns field built by resolve()). Tracked as a follow-up.
Enum membership is also deferred for the same reason — ResolvedProperty carries no enum value list today.
extracted_models.py:18 defines ExtractedProperty.value: Any. ExtractedGraph validates container shape (nodes/edges are lists, keys present) but not ontology correctness: unknown field names survive, type mismatches pass, unknown entity types aren't rejected. ontology_materializer.py:263 silently drops unknown fields and lets missing edge keys become empty strings.
The validator (per #76) checks:
Every ExtractedNode.entity_name matches a declared entity in the spec.
Every ExtractedProperty.name on a node or edge exists on that entity/relationship in the spec, and value satisfies the declared type (per ontology v0: scalar-ish types plus json; arrays and structs are explicitly deferred in docs/ontology/ontology.md:293 and should be modeled as separate entities + relationships, not as nested properties).
Every edge's from_node_id and to_node_id resolve to nodes in the graph or to external node-refs matching the declared endpoint entity.
Required keys are present. Per #76's first-landing scope, "required" means entity primary keys (from ResolvedEntity.key_columns) and edge endpoint keys (from ResolvedRelationship.from_columns / to_columns) only — not alternate keys, not every declared property. A property that isn't a key may legitimately be absent (partial extraction is valid and common).
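The check loop the bullets above describe can be sketched as follows. This is illustrative only — the real contract (field names, ResolvedGraph shape, external node-refs) is defined by #76 and resolved_spec.py, and the spec/graph attribute names here are assumptions:

```python
# Illustrative sketch only — real shapes are defined by #76 / resolved_spec.py.
from dataclasses import dataclass


@dataclass
class Issue:
    scope: str      # "field" | "node" | "edge" | "event"
    message: str


def validate_extracted_graph(spec, graph) -> list[Issue]:
    issues = []
    node_ids = {n.node_id for n in graph.nodes}
    for node in graph.nodes:
        entity = spec.entities.get(node.entity_name)
        if entity is None:
            issues.append(Issue("node", f"unknown entity {node.entity_name!r}"))
            continue
        props = {p.name: p.value for p in node.properties}
        for name, value in props.items():
            decl = entity.properties.get(name)
            if decl is None:
                issues.append(Issue("field", f"{node.entity_name}.{name}: unknown property"))
            elif not isinstance(value, decl.python_type):
                issues.append(Issue("field", f"{node.entity_name}.{name}: type mismatch"))
        for key in entity.key_columns:   # primary keys only, per #76's first landing
            if key not in props:
                issues.append(Issue("node", f"{node.entity_name}: missing key {key!r}"))
    for edge in graph.edges:
        for endpoint in (edge.from_node_id, edge.to_node_id):
            if endpoint not in node_ids:
                issues.append(Issue("edge", f"unresolved endpoint {endpoint!r}"))
    return issues
```

The sketch omits external node-refs and edge-property checks; it exists only to make the field/node/edge scope split below concrete.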
ValidationReport must classify failures for fallback-granularity
The report returns a list of issues, each tagged with its fallback scope, because the runtime needs to know the smallest safe unit to replace:
field — property type mismatch or unknown property on an otherwise well-formed node. Safe fallback unit: re-extract that one property. Compiled path keeps the rest of the node. (Enum-membership miss is deferred from #76's first landing — ResolvedProperty does not carry enum value lists today. Once that field is added upstream, enum-miss becomes a future field-scope failure.)
node — missing key, malformed node_id, unknown entity_name. Safe fallback unit: re-extract the whole node (and any edges referencing it by node_id). Cannot recover at field level because the node's identity is broken.
edge — unresolved from_node_id / to_node_id, missing endpoint key, wrong endpoint entity type. Safe fallback unit: re-extract the whole edge. Cannot recover at field level because the endpoints define the edge's identity.
event — the compiled extractor's entire output for this event is structurally invalid (zero nodes when the event type guarantees at least one, or every emitted node/edge failed). Safe fallback unit: re-run the entire event through the fallback extractor.
Runtime consumes ValidationReport.failures by scope tag and invokes the smallest-unit fallback. Per-field fallback on a type mismatch does not trigger a whole-event LLM call; a missing node's key does.
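The smallest-unit dispatch can be sketched like this. Everything here is hypothetical — the real ValidationReport shape and fallback callables belong to the SDK, not this sketch:

```python
# Illustrative dispatch — fallback handler callables are hypothetical, not SDK API.
FALLBACK_ORDER = ("field", "node", "edge", "event")


def dispatch_fallbacks(failures, handlers):
    """Invoke the smallest safe fallback unit per failure scope.

    failures: list of (scope, payload) tuples from the ValidationReport.
    handlers: dict mapping scope -> callable(payload), e.g. "field" re-extracts
    one property while "event" re-runs the whole event through the fallback path.
    """
    counts = {scope: 0 for scope in FALLBACK_ORDER}
    for scope, payload in failures:
        handlers[scope](payload)
        counts[scope] += 1
    return counts
```

A field-scope failure only ever invokes the field handler; nothing in the dispatch escalates a type mismatch to a whole-event LLM call.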
Why this is a prerequisite: the per-field fallback model in the original proposal assumed "Pydantic validation miss → LLM fallback for that field." With value: Any, Pydantic never misses. There's nothing to fall back on. The validator is what makes per-field fallback meaningful — and, per the granularity model above, also what makes per-node / per-edge / per-event fallback meaningful. The validator is useful independently (validating any extraction, LLM or deterministic, against the ontology). P0.1 ships as its own issue before any compilation work starts.
P0.2 — Runtime execution target decision
The current AI.GENERATE path runs inside BigQuery as server-side SQL. Any compiled extractor has to execute somewhere. Three options, with real tradeoffs:
| Option | Execution location | Latency | Cost model | Complexity |
| --- | --- | --- | --- | --- |
| A. Client-side Python | SDK process pulls events back, runs extractor, re-writes results | BQ round-trip per extraction batch | No BQ LLM cost; client compute time | Lowest — plain Python |
| B. BigQuery Remote Function | Extractor wrapped as a deployed Cloud Run endpoint; called from SQL | In-SQL, one HTTP hop per row/batch | Cloud Run cost + deploy surface | Highest — deploy pipeline, IAM, versioning |
| C. BigQuery SQL / JavaScript UDF | Generated extractor compiled to SQL + UDFs; runs entirely in BQ | In-SQL, no network hop | Slot time only | Middle — UDF translation layer + SQL testability |
This is a design decision the epic must settle before Phase 1 begins, because the "compile target language" changes fundamentally across the three. Recommendation, subject to review:
Phase 1: Option A (client-side Python). Fastest to build, easiest to test, matches the pattern already used by structured_extraction.py. Accepts a round-trip cost because Phase 1 traces are already being fetched client-side for materialization anyway.
Phase 2 (if Phase 1 validates): evaluate Option C for the BQ-native path, keeping Option A as the default. Option C unlocks running compiled extractors inside the same SQL that produces the graph tables — matches the current AI.GENERATE-in-SQL pattern and removes the round-trip.
Option B stays off the table unless there's a concrete user need for it — the deploy surface is disproportionate to the problem.
Why this is a prerequisite: "emit a Python bundle" is only one of three plausible answers and the original proposal silently chose it. The tradeoffs are large enough that the choice needs to be explicit.
Phase 1 — compile deterministic structured extractors from known event schemas
Narrowed from the original scope. Not touching ontology_graph.py's session-aggregated path or context_graph.py's free-text LLM_RESPONSE extraction yet. Phase 1 targets the existing structured_extraction.py registry as the expansion point.
Phase 1 value proposition
Phase 1 does not reduce runtime token cost compared to the current hand-written registry, because both are already zero-token deterministic Python. The AI.GENERATE baseline is only relevant as a ground-truth comparison and as the Phase-0 fallback when no hand-written extractor exists.
What Phase 1 actually delivers:
Authoring scale — writing a new structured extractor goes from "write + review a Python function" to "declare the event-payload shape and extraction rules, let the compiler emit the function." The LLM's productivity benefit moves from runtime to author time, and stays one-off rather than per-event.
Safety checks applied uniformly — every compiled extractor runs through the AST validator + ontology validator before acceptance. Hand-written extractors today rely on reviewer diligence; compiled extractors can't skip the gate.
Reproducibility and fingerprinted provenance — compiled bundles carry a deterministic hash of their inputs (see the Fingerprint section). Two compile runs on the same inputs produce byte-identical code; drift is detected, not assumed.
Shared validation infrastructure — the smoke-test harness, the ontology validator, and the revalidation job built for Phase 1 are reused verbatim in Phase 2. Phase 1's infrastructure investment is what makes Phase 2 tractable.
Token-cost reporting is still wired in, because it's the measurement that matters for the AI.GENERATE fallback path and for Phase 2's session-aggregated target. Phase 1 just doesn't lead with a token-cost claim.
What gets compiled
The structured extractors that today live as hand-written Python functions (extract_bka_decision_event and siblings). The input is a well-defined event payload shape (fields in content, attributes, content_parts carrying typed values). The output is typed ExtractedNode / ExtractedEdge instances with ontology-valid property values. This is the profile the paper's evidence covers cleanly.
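The shape being compiled is, roughly, a function like the following — a hypothetical sibling of extract_bka_decision_event, using plain dicts where the real code uses ExtractedNode / ExtractedEdge from extracted_models.py:

```python
# Hypothetical structured extractor — mirrors the registry pattern, not actual SDK code.
def extract_tool_completed_event(event: dict):
    """Deterministically map a TOOL_COMPLETED payload onto one node + one edge."""
    content = event["content"]
    node = {
        "entity_name": "ToolResult",
        "node_id": f"tool_result:{event['attributes']['span_id']}",
        "properties": {"tool_name": content["tool_name"], "status": content["status"]},
    }
    edge = {
        "relationship": "produced_by",
        "from_node_id": node["node_id"],
        "to_node_id": f"session:{event['attributes']['session_id']}",
    }
    return [node], [edge]
```

Known input fields in, typed node/edge emissions out, no LLM in the loop — that is the profile the compiler targets.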
Compile phase (runs once per (ontology, binding, event_schema, extraction_rules, compiler_version))
1. Input: the ontology YAML, binding YAML, event schema (which event_type values and their expected payload shapes), per-event extraction rules (event_type X → entity Y with properties from fields {...}), and a sample of ≥ 100 real events per covered event_type.
2. LLM step: prompt the LLM to fill validated Jinja templates — one template per field-kind supported by ontology v0: scalar (string, int64, float64, bool, date, datetime, timestamp, time, bytes, numeric, decimal), json (opaque structured blob), and reference (to another entity's key). Arrays and structs are explicitly deferred in ontology v0 (docs/ontology/ontology.md:293) and are modeled as separate entities + relationships, not as nested properties — so Phase 1 templates do not cover them. Enums are represented via json plus an in-template membership check against a declared value list; if the ontology model later grows a first-class enum type, it gets its own template then. Generation is constrained; the LLM cannot emit free-form code, only field-level template fills.
The distinction between the ontology model's field kinds (constrained by ontology v0) and the event payload shape (which may have richer structure, e.g., arrays of tool-result objects in a TOOL_COMPLETED event's content) is handled separately. Event-payload arrays are flattened into repeated ExtractedEdge or ExtractedNode emissions by the extraction rules, not carried through as array properties. event_schema and extraction_rules are the place to declare this shape; they are a separate input to the compiler, not part of the ontology.
3. Static check: AST-validate the emitted code. Reject anything that's not pure-Python, reads from unexpected fields, calls out-of-allowlist functions, or has side effects.
4. Smoke test: run the generated extractor on the sample set. For each event, compare the generated extractor's output to a reference output produced by (a) the hand-written extractor, if one exists, or (b) AI.GENERATE with the same prompt as Phase-0 fallback. Accept iff field-level F1 ≥ threshold (threshold TBD per extractor type; proposed start: 0.95).
5. Ontology validation: run the full validate_extracted_graph from P0.1 over the smoke-test outputs. Any validator failure blocks acceptance.
If any of stages 3–5 fails, the compile run fails and the hand-written or AI.GENERATE extractor continues to be used.
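Stage 3's gate can be sketched with the stdlib ast module. The allowlist and the exact rejection rules here are illustrative, not the actual validator:

```python
import ast

# Illustrative allowlist — the real one would cover the extractor's permitted surface.
ALLOWED_CALLS = {"str", "int", "float", "bool"}


def ast_validate(source: str) -> list[str]:
    """Reject generated code that imports anything or calls outside the allowlist."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            problems.append("import statements are not allowed")
        elif isinstance(node, ast.Call):
            name = node.func.id if isinstance(node.func, ast.Name) else None
            if name is not None and name not in ALLOWED_CALLS:
                problems.append(f"call to {name!r} outside allowlist")
    return problems
```

A production gate would also reject attribute access outside the event payload, exec/eval indirection, and anything with side effects; the point is that acceptance is a static property of the emitted code, checked before the smoke test ever runs it.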
Runtime phase
run_structured_extractors() loads compiled bundles from compiled_extractors/<fingerprint>/ if a bundle matches the active (ontology, binding, event_schema, compiler_version). Otherwise falls back to the hand-written registry entries.
Per-field fallback (now meaningful because P0.1 ships): if a compiled extractor produces a node where validate_extracted_graph flags a field, that field's value is replaced by the hand-written or LLM-based extraction result for that field only. Logged with a trace-shape signature so the next compile run can cover it.
C1 does not touch ontology_graph.py or context_graph.py. The compile harness, smoke tests, and measurement report are self-contained.
C2 plugs compiled-bundle loading into the existing run_structured_extractors() hook (structured_extraction.py:198–232). The pruning + hint behavior at ontology_graph.py:540–571 is reused unchanged — compiled extractors emit StructuredExtractionResult with the right fully_handled_span_ids / partially_handled_span_ids partitioning, and the existing AI SQL semantics (_EXTRACT_ONTOLOGY_AI_QUERY at ontology_graph.py:90–136) are unchanged. C2's only change to the AI side is what's passed in the excluded_span_ids / partial_span_ids parameters, not how the query itself is built.
Fingerprint (expanded per review)
sha256(ontology, binding, event_schema, event_allowlist, transcript_builder_version, content_serialization_rules, extraction_rules, template_version, compiler_package_version)
Changes in the trace-shape dependencies invalidate the bundle just as surely as changes in the ontology do. Stale bundles refuse to load and the runtime falls back to hand-written / LLM extraction until gm compile-extractors reruns.
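A deterministic fingerprint over the compile inputs can be sketched as follows — a canonical serialization hashed with sha256, so byte-identical inputs always map to the same bundle directory (a sketch; the real input set and serialization are the compiler's to define):

```python
import hashlib
import json


def bundle_fingerprint(**inputs) -> str:
    """Deterministic hash of compile inputs: same inputs -> same bundle name."""
    # json.dumps with sort_keys gives a canonical byte serialization for
    # dict/list/scalar inputs, so argument order cannot change the hash.
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

The loader then refuses any bundle whose directory name doesn't match the fingerprint recomputed from the active inputs, which is what "stale bundles fail loud" means mechanically.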
Measured outcomes before Phase 2 proceeds
Phase 1 must produce, against a reference ontology + real trace corpus:
Extractor F1 per compiled extractor, vs hand-written and vs AI.GENERATE. Measured at the field level on the smoke-test corpus.
Per-event extractor latency for the deterministic compiled path. Measured locally; this is the compile-time-replacing path, not the AI billing path.
C2-only (deferred to C2 measurement): session-level prompt-size estimates (transcript chars before/after pruning) and job-level BigQuery stats (JOBS_BY_PROJECT.total_bytes_processed, total_slot_ms) for the build's BQ jobs. Per-event token attribution is not available — billing is session-aggregated and the AI query (ontology_graph.py:90–136) returns only (session_id, graph_json) with no per-row usage metadata. (Aligned with #107's billing-honest column set.)
Rate of per-field fallbacks actually triggered on a holdout trace set.
If F1 < 0.95 or fallback rate > 10%, Phase 2 does not proceed. The structured-extractor compilation has to beat or match the hand-written baseline before taking on the harder free-text case.
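Field-level F1 against the reference output can be computed by treating each (field, value) pair as a prediction — a field counts as a true positive only when both name and value match. A sketch, not the harness's actual metric code:

```python
def field_f1(predicted: dict, reference: dict) -> float:
    """F1 over (field, value) pairs from one extraction vs its reference."""
    pred = set(predicted.items())
    ref = set(reference.items())
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)
```

The per-extractor number reported against the 0.95 gate would be this metric averaged over the smoke-test corpus.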
Phase 2 — compile the session-aggregated path (gated on Phase 1)
Only begun after Phase 1's measurements show the compile→validate loop works. Targets ontology_graph.extract_graph()'s session-aggregated AI.GENERATE path. This is the tier where the paper's "57× token reduction" has the most direct mapping, but it also has more open-ended inputs (the session transcript is not a known schema the way an individual event payload is).
Phase 2 scope is intentionally deferred until Phase 1 data is in. If Phase 1 shows deterministic extractors can match AI.GENERATE on structured events with > 0.95 F1 and < 10% fallback, Phase 2 can reuse the compiler infrastructure. If not, Phase 2 doesn't happen and the epic's scope contracts to Phase 1 only.
The runtime-target decision from P0.2 is re-evaluated here. Option A (client-side Python) may no longer be acceptable at session scale; Option C (SQL UDF) becomes the likely target, since session-aggregated extraction currently runs in SQL.
Phase 3 — free-text extraction (explicitly out of scope)
context_graph.py's row-level AI.GENERATE extraction of business entities from LLM_RESPONSE text is not proposed for compilation in this epic. The paper's evidence doesn't cover this profile, and the review correctly flagged that deterministic generated code is much less convincing there.
A separate issue may propose structured-NLU approaches (entity recognizers, intent classifiers) for that path later. This epic does not commit to it.
Revalidation harness (shared across phases once compilation lands)
Scheduled or on-demand job:
Sample N recent events matching a covered event schema.
Run both the compiled extractor and a reference path (hand-written if one exists, else AI.GENERATE).
When agreement drops below threshold, surface a recompile recommendation in the SDK's health check: "Compiled extractor for event_type X agreement dropped to 87% over the last 500 events. Recompile recommended."
No auto-recompile — the compile-time LLM call is the trust boundary. Human decision to re-run, backed by the measurement.
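The agreement check behind the recompile recommendation can be sketched as follows (function names and the health-check wording are illustrative; the threshold default mirrors the Phase 1 gate):

```python
def agreement_rate(events, compiled, reference) -> float:
    """Fraction of sampled events where the compiled extractor's output
    exactly matches the reference path (hand-written or AI.GENERATE)."""
    matches = sum(1 for e in events if compiled(e) == reference(e))
    return matches / len(events)


def health_check(events, compiled, reference, event_type, threshold=0.95):
    """Return a recompile recommendation string, or None if agreement holds."""
    rate = agreement_rate(events, compiled, reference)
    if rate < threshold:
        return (f"Compiled extractor for event_type {event_type} agreement dropped "
                f"to {rate:.0%} over the last {len(events)} events. Recompile recommended.")
    return None
```

Note the asymmetry the section requires: the check only surfaces a recommendation; it never triggers a compile run itself.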
Risks + mitigations (revised)
Novel event-payload shapes. Addressed per-field, now meaningful because of P0.1. Revalidation harness catches distributional drift.
Fingerprint too narrow. Expanded per review to cover trace-shape dependencies; stale bundles fail loud.
Runtime target changes between phases. P0.2 decision is reviewed again at Phase 2. Client-side Python for Phase 1 keeps the commitment small.
Phase 1 result doesn't support Phase 2. Epic scope contracts to Phase 1 only. No sunk-cost pressure to force compilation into the session-aggregated or free-text paths.
Compile cost. Front-loaded, amortized. Uses the unified token-budget config from #69.
Debugging. Generated bundles are Python; stack traces point at the template that produced the bad extractor. Checked into the repo alongside the ontology or stored as a versioned sidecar dataset.
Open questions
P0.1 first, or in parallel with Phase 1 scaffolding? Proposal: P0.1 first as a standalone PR, because it's useful independently (validates any extraction output) and unblocks meaningful discussion of the Phase 1 fallback semantics.
Phase 1 F1 threshold. Proposed 0.95 as the accept bar. Too strict if hand-written baselines themselves don't hit 0.95 on production traces? Worth measuring hand-written baselines first.
Which event types get compiled first in Phase 1. Today structured_extraction.py has exactly one hand-written extractor: extract_bka_decision_event at structured_extraction.py:120. The first implementation slice is therefore "compile extract_bka_decision_event first," with the hand-written version as the F1 ground truth for its own compiled replacement. Before promoting compilation to a general solution, add one or two more hand-written extractors (likely TOOL_COMPLETED result shapes and a HITL event) as hand-written baselines, so the smoke-test harness has multi-extractor coverage and the F1 metric isn't a single-point measurement.
Where compiled bundles live. Checked into the SDK-using repo next to ontology.yaml (auditable, reviewable), emitted as a versioned BQ table (runtime-discoverable), or both. Leaning both — in-repo file is source of truth, BQ-table mirror is for runtime discovery.
Revalidation cadence. Scheduled (daily / weekly) vs on-demand vs triggered by the SDK's doctor check?
Related work in-repo
#57 — SKOS import support. Phase 1 compiled extractors must cover SKOS-derived abstract entities + skos_-prefixed relationships as part of the template set.
#58 — Runtime entity-resolution primitives. Shares the fingerprint-versioned-artifact pattern with this epic; compiled extractor bundles should live under the same provenance contract (compile-id in a sidecar table).
#69 — LLM judger improvements. The unified token-budget config proposed there applies to the compile-time LLM call here. The "compile rubrics into deterministic sub-checks" direction for judges is a parallel application of the same idea, not blocked by this epic.
Reference
Trooskens G., Karlsberg A., Sharma A., De Brouwer L., Van Puyvelde M., Young M., Thickstun J., Alterovitz G., De Brouwer W. A. Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation. arXiv:2604.05150 (2026). https://arxiv.org/abs/2604.05150
Epic: Compile-time code generation for structured trace extractors
Motivation
arXiv:2604.05150 — Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation (Trooskens et al., 2026, submitted April 6, 2026) treats the LLM as a compile-time code generator whose output is constrained to fill validated templates. Reported results on comparable tasks: 96% completion with zero execution tokens on function-calling (BFCL, n=400), 57× token reduction at 1,000 transactions, 80.4% accuracy on structured document extraction (DocILE, n=5,680).
The paper's evidence is strongest on structured-to-structured transformations — the target schema is known, the input has exploitable shape, and the task repeats at scale. That's exactly the profile of the SDK's structured event extractors (
structured_extraction.py). It is not the profile of free-text semantic extraction fromLLM_RESPONSEprose, which this issue no longer tries to compile in Phase 1.Current architecture (accurate characterization)
Three distinct extraction paths, with different costs:
context_graph.py:299— row-level BQ-nativeAI.GENERATEinside SQL/MERGE, invoked per row to pull business entities fromLLM_RESPONSEtext. This is the expensive, open-ended, free-text path.ontology_graph.py:100, 631— session-aggregated BQ-nativeAI.GENERATE. Events for a session are assembled into a transcript inside SQL, then a singleAI.GENERATEcall produces JSON shaped by the compiled ontology schema. One LLM call per session, not per row — already amortized. This path is what the paper's evidence most-closely matches, but it's also the path where the validator and runtime-target prerequisites bite.structured_extraction.py— pure-Python structured extractors. A registry of typed extractors (e.g.,extract_bka_decision_event) that convert specific event shapes intoExtractedNode/ExtractedEdgewithout calling an LLM at all. Already deterministic. This is the natural Phase 1 expansion point.The SDK isn't uniformly "one LLM call per trace." It's a three-tier hierarchy — deterministic registry → session-aggregated LLM → row-level LLM — and the right compilation target is the middle tier, only after the prerequisites below are in place.
Prerequisites (Phase 0) — must land before any compilation work
P0.1 — Ontology-aware validator (tracked at #76; this section is a summary, #76 is the source of truth)
extracted_models.py:18definesExtractedProperty.value: Any.ExtractedGraphvalidates container shape (nodes/edges are lists, keys present) but not ontology correctness: unknown field names survive, type mismatches pass, unknown entity types aren't rejected.ontology_materializer.py:263silently drops unknown fields and lets missing edge keys become empty strings.The validator (per #76) checks:
ExtractedNode.entity_namematches a declared entity in the spec.ExtractedProperty.nameon a node or edge exists on that entity/relationship in the spec, andvaluesatisfies the declared type (per ontology v0: scalar-ish types plusjson; arrays and structs are explicitly deferred indocs/ontology/ontology.md:293and should be modeled as separate entities + relationships, not as nested properties).from_node_idandto_node_idresolve to nodes in the graph or to external node-refs matching the declared endpoint entity.ResolvedEntity.key_columns) and edge endpoint keys (fromResolvedRelationship.from_columns/to_columns) only — not alternate keys, not every declared property. A property that isn't a key may legitimately be absent (partial extraction is valid and common).ValidationReport must classify failures for fallback-granularity
The report returns a list of issues, each tagged with its fallback scope, because the runtime needs to know the smallest safe unit to replace:
field— property type mismatch or unknown property on an otherwise well-formed node. Safe fallback unit: re-extract that one property. Compiled path keeps the rest of the node. (Enum-membership miss is deferred from Feat: Ontology-aware validate_extracted_graph with fallback-scope classification (prerequisite for #75) #76's first landing —ResolvedPropertydoes not carry enum value lists today. Once that field is added upstream, enum-miss becomes a futurefield-scope failure.)node— missing key, malformednode_id, unknownentity_name. Safe fallback unit: re-extract the whole node (and any edges referencing it bynode_id). Cannot recover at field level because the node's identity is broken.edge— unresolvedfrom_node_id/to_node_id, missing endpoint key, wrong endpoint entity type. Safe fallback unit: re-extract the whole edge. Cannot recover at field level because the endpoints define the edge's identity.event— the compiled extractor's entire output for this event is structurally invalid (zero nodes when the event type guarantees at least one, or every emitted node/edge failed). Safe fallback unit: re-run the entire event through the fallback extractor.Runtime consumes
ValidationReport.failuresby scope tag and invokes the smallest-unit fallback. Per-field fallback on a type mismatch does not trigger a whole-event LLM call; a missing node's key does.Why this is a prerequisite: the per-field fallback model in the original proposal assumed "Pydantic validation miss → LLM fallback for that field." With
value: Any, Pydantic never misses. There's nothing to fall back on. The validator is what makes per-field fallback meaningful — and, per the granularity model above, also what makes per-node / per-edge / per-event fallback meaningful. The validator is useful independently (validating any extraction, LLM or deterministic, against the ontology). P0.1 ships as its own issue before any compilation work starts.P0.2 — Runtime execution target decision
The current
AI.GENERATEpath runs inside BigQuery as server-side SQL. Any compiled extractor has to execute somewhere. Three options, with real tradeoffs:This is a design decision the epic must settle before Phase 1 begins, because the "compile target language" changes fundamentally across the three. Recommendation, subject to review:
structured_extraction.py. Accepts a round-trip cost because Phase 1 traces are already being fetched client-side for materialization anyway.AI.GENERATE-in-SQL pattern and removes the round-trip.Why this is a prerequisite: "emit a Python bundle" is only one of three plausible answers and the original proposal silently chose it. The tradeoffs are large enough that the choice needs to be explicit.
Phase 1 — compile deterministic structured extractors from known event schemas
Narrowed from the original scope. Not touching
ontology_graph.py's session-aggregated path orcontext_graph.py's free-textLLM_RESPONSEextraction yet. Phase 1 targets the existingstructured_extraction.pyregistry as the expansion point.Phase 1 value proposition
Phase 1 does not reduce runtime token cost compared to the current hand-written registry, because both are already zero-token deterministic Python. The
AI.GENERATEbaseline is only relevant as a ground-truth comparison and as the Phase-0 fallback when no hand-written extractor exists.What Phase 1 actually delivers:
Token-cost reporting is still wired in, because it's the measurement that matters for the
AI.GENERATEfallback path and for Phase 2's session-aggregated target. Phase 1 just doesn't lead with a token-cost claim.What gets compiled
The structured extractors that today live as hand-written Python functions (
extract_bka_decision_eventand siblings). The input is a well-defined event payload shape (fields incontent,attributes,content_partscarrying typed values). The output is typedExtractedNode/ExtractedEdgeinstances with ontology-valid property values. This is the profile the paper's evidence covers cleanly.Compile phase (runs once per
(ontology, binding, event_schema, extraction_rules, compiler_version))event_typevalues and their expected payload shapes), per-event extraction rules (event_type X → entity Y with properties from fields {...}), and a sample of ≥ 100 real events per coveredevent_type.string,int64,float64,bool,date,datetime,timestamp,time,bytes,numeric,decimal), json (opaque structured blob), and reference (to another entity's key). Arrays and structs are explicitly deferred in ontology v0 (docs/ontology/ontology.md:293) and are modeled as separate entities + relationships, not as nested properties — so Phase 1 templates do not cover them. Enums are represented viajsonplus an in-template membership check against a declared value list; if the ontology model later grows a first-classenumtype, it gets its own template then. Generation is constrained; the LLM cannot emit free-form code, only field-level template fills.The distinction between the ontology model's field kinds (constrained by ontology v0) and the event payload shape (which may have richer structure, e.g., arrays of tool-result objects in a
TOOL_COMPLETEDevent'scontent) is handled separately. Event-payload arrays are flattened into repeatedExtractedEdgeorExtractedNodeemissions by the extraction rules, not carried through as array properties.event_schemaandextraction_rulesare the place to declare this shape; they are a separate input to the compiler, not part of the ontology.3. Static check: AST-validate the emitted code. Reject anything that's not pure-Python, reads from unexpected fields, calls out-of-allowlist functions, or has side effects.
4. Smoke test: run the generated extractor on the sample set. For each event, compare the generated extractor's output to a reference output produced by (a) the hand-written extractor, if one exists, or (b) `AI.GENERATE` with the same prompt as the Phase-0 fallback. Accept iff field-level F1 ≥ threshold (threshold TBD per extractor type; proposed start: 0.95).
5. Ontology validation: run the full `validate_extracted_graph` from P0.1 over the smoke-test outputs. Any validator failure blocks acceptance.

If any of stages 3–5 fails, the compile run fails and the hand-written or `AI.GENERATE` extractor continues to be used.
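A minimal sketch of the stage-4 acceptance math, assuming extracted nodes are dicts with `key` and `properties` (the real `ExtractedNode` type is richer):

```python
def field_triples(nodes: list[dict]) -> set[tuple]:
    """Flatten extracted nodes into comparable (key, field, value) triples."""
    return {
        (node["key"], field, value)
        for node in nodes
        for field, value in node["properties"].items()
    }


def field_f1(candidate: list[dict], reference: list[dict]) -> float:
    """Field-level F1 of the generated extractor vs the reference output."""
    cand, ref = field_triples(candidate), field_triples(reference)
    if not cand and not ref:
        return 1.0
    tp = len(cand & ref)
    precision = tp / len(cand) if cand else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accept(candidate, reference, threshold=0.95) -> bool:
    # Proposed starting threshold from stage 4; TBD per extractor type.
    return field_f1(candidate, reference) >= threshold
```

Exact-match triples are deliberately strict; a per-type value normalizer (e.g. timestamp canonicalization) would slot in before the set comparison.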
### Runtime phase

- `run_structured_extractors()` loads compiled bundles from `compiled_extractors/<fingerprint>/` if a bundle matches the active `(ontology, binding, event_schema, compiler_version)`; otherwise it falls back to the hand-written registry entries.
- If `validate_extracted_graph` flags a field, that field's value is replaced by the hand-written or LLM-based extraction result for that field only, logged with a trace-shape signature so the next compile run can cover it.
- Nothing changes in `ontology_graph.py` or `context_graph.py`; the compile harness, smoke tests, and measurement report are self-contained.
- The integration point is the `run_structured_extractors()` hook (structured_extraction.py:198–232). The pruning + hint behavior at ontology_graph.py:540–571 is reused unchanged — compiled extractors emit `StructuredExtractionResult` with the right `fully_handled_span_ids`/`partially_handled_span_ids` partitioning, and the existing AI SQL semantics (`_EXTRACT_ONTOLOGY_AI_QUERY` at ontology_graph.py:90–136) are unchanged. C2's only change to the AI side is what's passed in the `excluded_span_ids`/`partial_span_ids` parameters, not how the query itself is built.
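The bundle-selection logic can be sketched as follows. The on-disk layout (one `<event_type>.py` per fingerprint directory) and the registry shape are assumptions for illustration, not the actual `structured_extraction.py` API.

```python
from pathlib import Path


def select_extractor(event_type: str, fingerprint: str,
                     handwritten_registry: dict,
                     bundle_root: Path = Path("compiled_extractors")):
    """Prefer a compiled bundle matching the active fingerprint; otherwise
    fall back to the hand-written registry entry for this event type."""
    compiled = bundle_root / fingerprint / f"{event_type}.py"
    if compiled.is_file():
        return ("compiled", compiled)
    # Stale or missing bundle: hand-written extractor (or None -> AI fallback).
    return ("handwritten", handwritten_registry.get(event_type))
```

A stale fingerprint simply fails the `is_file()` check, so the fallback path needs no special-casing.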
### Fingerprint (expanded per review)

`sha256(ontology, binding, event_schema, event_allowlist, transcript_builder_version, content_serialization_rules, extraction_rules, template_version, compiler_package_version)`

Changes in the trace-shape dependencies invalidate the bundle just as surely as changes in the ontology do. Stale bundles refuse to load, and the runtime falls back to hand-written / LLM extraction until `gm compile-extractors` reruns.
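As a sketch, the fingerprint could be computed over canonically serialized inputs; the nine keys mirror the list above, while the JSON canonicalization is an assumption, not a decided format.

```python
import hashlib
import json

# The nine fingerprint inputs from the expanded list above.
FINGERPRINT_INPUTS = (
    "ontology", "binding", "event_schema", "event_allowlist",
    "transcript_builder_version", "content_serialization_rules",
    "extraction_rules", "template_version", "compiler_package_version",
)


def bundle_fingerprint(inputs: dict) -> str:
    """sha256 over a canonical serialization of all nine inputs.
    Missing inputs are an error, never silently defaulted: a fingerprint
    that ignores an input could not invalidate bundles when it changes."""
    missing = [k for k in FINGERPRINT_INPUTS if k not in inputs]
    if missing:
        raise ValueError(f"fingerprint inputs missing: {missing}")
    canonical = json.dumps(
        {k: inputs[k] for k in FINGERPRINT_INPUTS},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```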
### Measured outcomes before Phase 2 proceeds

Phase 1 must produce, against a reference ontology + real trace corpus:
- Field-level F1 of compiled-extractor output vs `AI.GENERATE`, measured on the smoke-test corpus.
- Build cost from `JOBS_BY_PROJECT` (`total_bytes_processed`, `total_slot_ms`) for the build's BQ jobs. Per-event token attribution is not available — billing is session-aggregated and the AI query (ontology_graph.py:90–136) returns only `(session_id, graph_json)` with no per-row usage metadata. (Aligned with the billing-honest column set from #107, "Doc: V5 migration notebook storyboard for the four-guarantee decision-lineage narrative".)

If F1 < 0.95 or the fallback rate > 10%, Phase 2 does not proceed. The structured-extractor compilation has to beat or match the hand-written baseline before taking on the harder free-text case.
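The gate above can be stated as a predicate over the measurement report. The per-extractor report shape and field names are assumptions; only the two thresholds come from the proposal.

```python
def gate_phase2(reports: list[dict]) -> tuple[bool, list[str]]:
    """Return (proceed?, reasons) for the Phase 1 -> Phase 2 gate.
    Every covered extractor must clear both thresholds."""
    reasons = []
    for r in reports:
        if r["field_f1"] < 0.95:
            reasons.append(f"{r['extractor']}: F1 {r['field_f1']:.3f} < 0.95")
        if r["fallback_rate"] > 0.10:
            reasons.append(
                f"{r['extractor']}: fallback {r['fallback_rate']:.1%} > 10%")
    return (not reasons, reasons)
```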
### Phase 2 — compile session-aggregated ontology-graph extractors (only after Phase 1 validates)
Only begun after Phase 1's measurements show the compile→validate loop works. Targets `ontology_graph.extract_graph()`'s session-aggregated `AI.GENERATE` path. This is the tier where the paper's "57× token reduction" has the most direct mapping, but it also has more open-ended inputs (the session transcript is not a known schema the way an individual event payload is).

Phase 2 scope is intentionally deferred until Phase 1 data is in. If Phase 1 shows deterministic extractors can match `AI.GENERATE` on structured events with > 0.95 F1 and < 10% fallback, Phase 2 can reuse the compiler infrastructure. If not, Phase 2 doesn't happen and the epic's scope contracts to Phase 1 only.

The runtime-target decision from P0.2 is re-evaluated here. Option A (client-side Python) may no longer be acceptable at session scale; Option C (SQL UDF) becomes the likely target, since session-aggregated extraction currently runs in SQL.
### Phase 3 — free-text extraction (explicitly out of scope)
`context_graph.py`'s row-level `AI.GENERATE` extraction of business entities from `LLM_RESPONSE` text is not proposed for compilation in this epic. The paper's evidence doesn't cover this profile, and the review correctly flagged that deterministic generated code is much less convincing there.

A separate issue may propose structured-NLU approaches (entity recognizers, intent classifiers) for that path later. This epic does not commit to it.
### Revalidation harness (shared across phases once compilation lands)
Scheduled or on-demand job: re-run the compiled extractors on fresh traces and compare field-level output against the reference extraction (hand-written or `AI.GENERATE`).

No auto-recompile — the compile-time LLM call is the trust boundary. Re-running the compiler is a human decision, backed by the measurement.
### Risks + mitigations (revised)
### Open questions
- `structured_extraction.py` has exactly one hand-written extractor: `extract_bka_decision_event` at structured_extraction.py:120. The first implementation slice is therefore "compile `extract_bka_decision_event` first," with the hand-written version as the F1 ground truth for its own compiled replacement. Before promoting compilation to a general solution, add one or two more hand-written extractors (likely `TOOL_COMPLETED` result shapes and a HITL event) as baselines, so the smoke-test harness has multi-extractor coverage and the F1 metric isn't a single-point measurement.
- Where the extraction rules live: in `ontology.yaml` (auditable, reviewable), emitted as a versioned BQ table (runtime-discoverable), or both. Leaning both — in-repo file is source of truth, BQ-table mirror is for runtime discovery.
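For the rules-location question, one hypothetical shape for a declared rule ("event_type X → entity Y with properties from fields {...}") is shown below as a Python literal. Every name and field path here is illustrative, and whether this lives in `ontology.yaml`, a BQ table, or both is exactly the open question above.

```python
# Hypothetical declared extraction rule; field paths and names are made up
# for illustration and are not the SDK's actual rule schema.
BKA_DECISION_RULE = {
    "event_type": "BKA_DECISION",
    "entity": "bka_decision",
    "key_field": "attributes.decision_id",
    "properties": {
        # property name -> (ontology field kind, source path in the payload)
        "decision": ("string", "content.decision"),
        "decided_at": ("timestamp", "attributes.timestamp"),
    },
}
```

The same structure serializes losslessly to YAML for the in-repo copy and to a `STRUCT`-typed row for the BQ mirror, which is what makes "both" viable.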
`skos_`-prefixed relationships as part of the template set.

### Reference