Commit 0760b97

Merge pull request #59 from haiyuan-eng-google/main
Implement telemetry labels for BigQuery jobs across SDK phases
2 parents 806dba8 + f11e875 commit 0760b97

34 files changed

Lines changed: 3299 additions & 153 deletions

README.md

Lines changed: 8 additions & 0 deletions
@@ -54,6 +54,14 @@ regressions — all through BigQuery SQL or Python.
- Streaming evaluation (Cloud Scheduler + Cloud Run)
- Continuous query templates

**Usage Telemetry**

- Every job the SDK submits is labeled (`sdk`, `sdk_version`, `sdk_surface`, `sdk_feature`, and `sdk_ai_function` where relevant) so operators can attribute spend, latency, and adoption directly from `INFORMATION_SCHEMA.JOBS_BY_PROJECT`. No extra telemetry pipeline is required. See [docs/sdk_usage_tracking.md](docs/sdk_usage_tracking.md) for the label schema and ready-to-run tracking queries.

## Prerequisites

- Python 3.10+

SDK.md

Lines changed: 48 additions & 0 deletions
@@ -1941,6 +1941,54 @@ handle aggregation.

---

## 22. Usage Telemetry

Every BigQuery job the SDK submits is labeled so operators can attribute spend, latency, and adoption directly from `INFORMATION_SCHEMA.JOBS` without running a separate telemetry pipeline.

**Label schema**

| Key | Value |
| --- | ----- |
| `sdk` | constant `bigquery-agent-analytics` |
| `sdk_version` | package version, BQ-safe (e.g. `0-4-0`) |
| `sdk_surface` | `python` \| `cli` \| `remote-function` |
| `sdk_feature` | `trace-read` \| `eval-code` \| `eval-llm-judge` \| `eval-categorical` \| `insights` \| `drift` \| `memory` \| `context-graph` \| `ontology-build` \| `ontology-gql` \| `views` \| `ai-ml` \| `feedback` |
| `sdk_ai_function` | set only on AI/ML invocations: `ai-generate` \| `ai-embed` \| `ai-classify` \| `ai-forecast` \| `ai-detect-anomalies` \| `ml-generate-text` \| `ml-generate-embedding` \| `ml-detect-anomalies` \| `ml-forecast` |

All labels also apply to load jobs submitted by the SDK (e.g. the ontology materializer's batch-load path). Streaming inserts via `insert_rows_json` are not jobs and do not carry labels.

**Opt-in / opt-out**

- The default `Client(...)` constructor returns a labeled client.
- Explicit `make_bq_client(...)` lets you customize the underlying `bigquery.Client` (e.g. `default_query_job_config`) while keeping labels.
- Passing a vanilla `bigquery.Client` via `bq_client=...` is honored as-is; the SDK logs a one-shot `WARNING` and skips labeling so your caller-side client settings survive intact.

**Reserved namespace and user labels**

The `sdk*` keys are SDK-managed. If a caller pre-sets one, the SDK overrides it with a `WARNING`. Non-`sdk*` labels on the `QueryJobConfig.labels` dict (for example your team or cost-center tags) are preserved and coexist with the SDK labels — useful for joining SDK spend against your own dimensions.
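The override-and-preserve rule above can be sketched in a few lines (illustrative only: `merge_labels` and the hardcoded label values are hypothetical, not the SDK's actual internals):

```python
# Hypothetical sketch of the reserved-namespace rule: sdk* keys are
# SDK-managed and always win; all other caller labels pass through.
SDK_LABELS = {
    "sdk": "bigquery-agent-analytics",
    "sdk_version": "0-4-0",
    "sdk_surface": "python",
    "sdk_feature": "insights",
}

def merge_labels(user_labels: dict[str, str]) -> dict[str, str]:
    merged = dict(user_labels)  # non-sdk* keys survive unchanged
    merged.update(SDK_LABELS)   # sdk* keys are overridden by the SDK
    return merged

labels = merge_labels({"team": "search", "sdk": "spoofed"})
# team=search is preserved; the spoofed sdk key is overridden.
```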
**Tracking queries**

See [docs/sdk_usage_tracking.md](docs/sdk_usage_tracking.md) for ready-to-run SQL templates: feature adoption, AI/ML function cost breakdown, p50/p95 latency by feature, version-adoption histograms, and surface attribution.

---

## Module Architecture

```

docs/design.md

Lines changed: 0 additions & 1 deletion
@@ -1228,7 +1228,6 @@ results = client.query(formatted, job_config=job_config)
  | `ai_ml_integration.py` | `_DETECT_LATENCY_ANOMALIES_QUERY` | ARIMA anomaly detection |
  | `ai_ml_integration.py` | `_CREATE_BEHAVIOR_MODEL_QUERY` | Autoencoder model training DDL |
  | `ai_ml_integration.py` | `_BATCH_EVALUATION_QUERY` | Batch evaluation via AI.GENERATE |
- | `ai_ml_integration.py` | `_LEGACY_BATCH_EVALUATION_QUERY` | Legacy batch evaluation |

---

docs/sdk_usage_tracking.md

Lines changed: 258 additions & 0 deletions
@@ -0,0 +1,258 @@
# SDK Usage Tracking via `INFORMATION_SCHEMA.JOBS`

Every BigQuery job the SDK submits is labeled. Those labels land in BigQuery's native `INFORMATION_SCHEMA.JOBS` views, so you can attribute spend and usage back to the SDK without running a separate telemetry pipeline.

This document is the operator cookbook: what labels exist, how to read them, and ready-to-run SQL.

---
## Label schema

Applied by the SDK to every query job (`QueryJobConfig.labels`) and load job (`LoadJobConfig.labels`) it submits.

| Key | Value | Scope |
| --- | ----- | ----- |
| `sdk` | constant `bigquery-agent-analytics` | every SDK job |
| `sdk_version` | `__version__`, BQ-safe (e.g. `0-4-0`) | every SDK job |
| `sdk_surface` | `python` \| `cli` \| `remote-function` | every SDK job |
| `sdk_feature` | `trace-read` \| `eval-code` \| `eval-llm-judge` \| `eval-categorical` \| `insights` \| `drift` \| `memory` \| `context-graph` \| `ontology-build` \| `ontology-gql` \| `views` \| `ai-ml` \| `feedback` | per-call site |
| `sdk_ai_function` | `ai-generate` \| `ai-embed` \| `ai-classify` \| `ai-forecast` \| `ai-detect-anomalies` \| `ml-generate-text` \| `ml-generate-embedding` \| `ml-detect-anomalies` \| `ml-forecast` | AI/ML invocations only |
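The `sdk_version` row above says the package version is made "BQ-safe" before being used as a label value. Label values cannot contain dots, so a plausible mapping (a guess at the transformation; the helper name is ours, not the SDK's) is:

```python
def bq_safe_version(version: str) -> str:
    """Map a package version to a legal BigQuery label value.

    Hypothetical helper: label values allow only [a-z0-9_-], so the
    dots in a semver string are swapped for dashes, e.g. 0.4.0 -> 0-4-0.
    """
    return version.lower().replace(".", "-").replace("+", "-")

print(bq_safe_version("0.4.0"))  # 0-4-0
```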
**Reserved namespace.** All `sdk*` keys are managed by the SDK. If a caller pre-sets any of these on a `QueryJobConfig.labels` dict passed to the SDK, the SDK overrides them and logs a one-shot `WARNING`. This keeps telemetry trustworthy. Non-`sdk*` user labels (e.g. `team=search`) are preserved unchanged and show up alongside the SDK labels in `INFORMATION_SCHEMA` — useful for joining SDK spend against your own cost-center dimensions.

**Privacy.** SDK labels never contain `user_id`, `session_id`, `trace_id`, or any trace-extracted value. `INFORMATION_SCHEMA.JOBS` is readable by anyone with `bigquery.jobs.listAll`. The SDK enforces the `[a-z0-9_-]{1,63}` label-value format that BigQuery itself requires, which rejects most PII shapes (an email fails the format, though a lowercase UUID with dashes would pass), so avoid adding trace-derived values to any custom labels you set.
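The format check described above is easy to reproduce. A minimal sketch (the function name is ours, not part of the SDK's API):

```python
import re

# Label-value format the doc says the SDK enforces; it mirrors
# BigQuery's own rule for label values.
_LABEL_VALUE_RE = re.compile(r"[a-z0-9_-]{1,63}")

def is_valid_label_value(value: str) -> bool:
    """Return True if value is a legal (SDK-enforced) label value."""
    return _LABEL_VALUE_RE.fullmatch(value) is not None

print(is_valid_label_value("0-4-0"))              # True
print(is_valid_label_value("alice@example.com"))  # False: '@' and '.' rejected
```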
**Out of scope.** Streaming inserts via `insert_rows_json` / `tabledata.insertAll` are **not** jobs, do not support labels, and do not appear in `INFORMATION_SCHEMA.JOBS`. To observe those, use Cloud Audit Logs.

---

## Prerequisites

- Read access to `INFORMATION_SCHEMA.JOBS_BY_PROJECT` or `INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION` — typically `bigquery.jobs.listAll` plus appropriate dataset/organization IAM.
- Replace `region-us` in the queries below with your BigQuery region (e.g. `region-eu`, `region-asia-northeast1`). The region is the BigQuery **multi-region or location** of the dataset where jobs run.

---
## Queries

### 1. Feature adoption over the last 30 days

Which SDK features are being used, from which surface, and how much do they cost?

```sql
SELECT
  (SELECT value FROM UNNEST(labels) WHERE key = 'sdk_feature') AS feature,
  (SELECT value FROM UNNEST(labels) WHERE key = 'sdk_surface') AS surface,
  COUNT(*) AS jobs,
  SUM(total_bytes_billed) / POW(2, 40) AS tib_billed,
  SUM(TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) / 1000.0 / 60
    AS total_minutes
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND EXISTS (SELECT 1 FROM UNNEST(labels) WHERE key = 'sdk')
GROUP BY feature, surface
ORDER BY jobs DESC;
```
### 2. AI/ML function cost breakdown

Where is your `AI.GENERATE` / `AI.EMBED` / `AI.FORECAST` spend going?

```sql
SELECT
  (SELECT value FROM UNNEST(labels) WHERE key = 'sdk_ai_function')
    AS ai_function,
  (SELECT value FROM UNNEST(labels) WHERE key = 'sdk_feature') AS feature,
  COUNT(*) AS jobs,
  SUM(total_bytes_billed) / POW(2, 40) AS tib_billed,
  AVG(TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS avg_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND EXISTS (
    SELECT 1 FROM UNNEST(labels) WHERE key = 'sdk_ai_function'
  )
GROUP BY ai_function, feature
ORDER BY tib_billed DESC;
```
### 3. Slowest feature per day (p50 / p95 latency)

Which features are degrading or have runaway outliers?

```sql
SELECT
  DATE(creation_time) AS day,
  (SELECT value FROM UNNEST(labels) WHERE key = 'sdk_feature') AS feature,
  COUNT(*) AS jobs,
  APPROX_QUANTILES(
    TIMESTAMP_DIFF(end_time, start_time, MILLISECOND), 100
  )[OFFSET(50)] AS p50_ms,
  APPROX_QUANTILES(
    TIMESTAMP_DIFF(end_time, start_time, MILLISECOND), 100
  )[OFFSET(95)] AS p95_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 14 DAY)
  AND EXISTS (SELECT 1 FROM UNNEST(labels) WHERE key = 'sdk')
  AND state = 'DONE'
GROUP BY day, feature
HAVING jobs >= 5
ORDER BY day DESC, p95_ms DESC;
```
### 4. Version adoption after a release

How many jobs are still on the old version after you cut a new one?

```sql
SELECT
  (SELECT value FROM UNNEST(labels) WHERE key = 'sdk_version') AS sdk_version,
  DATE(creation_time) AS day,
  COUNT(*) AS jobs
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 14 DAY)
  AND EXISTS (SELECT 1 FROM UNNEST(labels) WHERE key = 'sdk')
GROUP BY sdk_version, day
ORDER BY day DESC, jobs DESC;
```
### 5. Surface attribution (who is calling the SDK?)

Split spend across direct Python users, CLI invocations, and the deployed remote-function runtime.

```sql
SELECT
  (SELECT value FROM UNNEST(labels) WHERE key = 'sdk_surface') AS surface,
  COUNT(*) AS jobs,
  SUM(total_bytes_billed) / POW(2, 40) AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND EXISTS (SELECT 1 FROM UNNEST(labels) WHERE key = 'sdk')
GROUP BY surface
ORDER BY tib_billed DESC;
```
### 6. Errors by feature

Are any SDK features failing disproportionately?

```sql
SELECT
  (SELECT value FROM UNNEST(labels) WHERE key = 'sdk_feature') AS feature,
  error_result.reason AS reason,
  COUNT(*) AS failed_jobs
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND EXISTS (SELECT 1 FROM UNNEST(labels) WHERE key = 'sdk')
  AND state = 'DONE'
  AND error_result.reason IS NOT NULL
GROUP BY feature, reason
ORDER BY failed_jobs DESC;
```
### 7. Custom caller labels joined with SDK labels

If your callers add their own labels (e.g. `team=search`, `env=prod`) before handing a `QueryJobConfig` to the SDK, those survive and coexist with the SDK's labels. You can slice SDK usage by your own cost-center dimensions:

```sql
SELECT
  (SELECT value FROM UNNEST(labels) WHERE key = 'team') AS team,
  (SELECT value FROM UNNEST(labels) WHERE key = 'sdk_feature') AS feature,
  COUNT(*) AS jobs,
  SUM(total_bytes_billed) / POW(2, 40) AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND EXISTS (SELECT 1 FROM UNNEST(labels) WHERE key = 'sdk')
  AND EXISTS (SELECT 1 FROM UNNEST(labels) WHERE key = 'team')
GROUP BY team, feature
ORDER BY tib_billed DESC;
```

---
## Opting in and out

### By default: opt-in

Constructing the SDK the normal way gets you labels on every job:

```python
from bigquery_agent_analytics import Client

# sdk_surface defaults to "python"; bq_client is lazily built via
# make_bq_client, which returns a LabeledBigQueryClient.
client = Client(project_id="my-proj", dataset_id="analytics")
```

### Explicitly construct the labeled client

If you need your own `google.cloud.bigquery.Client` configuration (custom `client_info`, `default_query_job_config`, transport, etc.) but still want SDK labels, use `make_bq_client`:

```python
from bigquery_agent_analytics import make_bq_client, Client

bq = make_bq_client(project="my-proj", location="US", sdk_surface="python")
# ... mutate bq.default_query_job_config, etc., if you want.

client = Client(project_id="my-proj", dataset_id="analytics", bq_client=bq)
```

### Pass your own client — labels are NOT applied

If you pass a vanilla `bigquery.Client` to `Client(bq_client=...)`, the SDK honors it as-is (no reconstruction, so your `default_query_job_config` and other settings survive) and logs a one-shot `WARNING` noting that SDK labels will not be applied:

```python
from google.cloud import bigquery
from bigquery_agent_analytics import Client

client = Client(
    project_id="my-proj",
    dataset_id="analytics",
    bq_client=bigquery.Client(project="my-proj"),
    # Jobs from this Client will NOT carry sdk_* labels.
    # The SDK logs one WARNING explaining how to opt in.
)
```

---

## Related

- See `SDK.md` for the full consumption-layer API reference.
- See [issue #52 on GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK][issue-52] for the design discussion and rollout history.

[issue-52]: https://github.com/GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK/issues/52

src/bigquery_agent_analytics/__init__.py

Lines changed: 16 additions & 0 deletions
@@ -46,6 +46,22 @@

__all__ = []

# --- Telemetry primitives (always available) ---
# Exposed as public API so operators who want SDK labels on a custom
# bigquery.Client configuration can opt in via make_bq_client, and
# advanced users can pass a LabeledBigQueryClient directly.
from ._telemetry import LabeledBigQueryClient
from ._telemetry import make_bq_client
from ._telemetry import with_sdk_labels

__all__.extend(
    [
        "LabeledBigQueryClient",
        "make_bq_client",
        "with_sdk_labels",
    ]
)

# --- SDK Client & Core ---
try:
    from .client import Client

src/bigquery_agent_analytics/_deploy_runtime.py

Lines changed: 17 additions & 2 deletions
@@ -29,8 +29,19 @@

 def resolve_client_options(
     user_defined_context: dict[str, Any] | None = None,
+    sdk_surface: str = "remote-function",
 ) -> dict[str, Any]:
-    """Resolve ``Client`` constructor kwargs from request context + env vars."""
+    """Resolve ``Client`` constructor kwargs from request context + env vars.
+
+    Args:
+        user_defined_context: Optional request-context dict forwarded by the
+            caller (e.g. BigQuery Remote Function ``userDefinedContext``).
+        sdk_surface: Value stamped on the ``sdk_surface`` telemetry label.
+            Defaults to ``"remote-function"`` because both shipped entry
+            points (BQ Remote Function dispatch, streaming-eval worker) are
+            remote runtimes. Callers that want a different surface (e.g. a
+            future ``"continuous-query"``) pass it explicitly.
+    """
     udc = user_defined_context or {}
     project_id = udc.get("project_id", os.environ.get("BQ_AGENT_PROJECT"))
     dataset_id = udc.get("dataset_id", os.environ.get("BQ_AGENT_DATASET"))
@@ -58,11 +69,15 @@ def resolve_client_options(
         "verify_schema": False,
         "endpoint": endpoint,
         "connection_id": connection_id,
+        "sdk_surface": sdk_surface,
     }


 def build_client_from_context(
     user_defined_context: dict[str, Any] | None = None,
+    sdk_surface: str = "remote-function",
 ) -> Client:
     """Build a ``Client`` from request context + deployment env vars."""
-    return Client(**resolve_client_options(user_defined_context))
+    return Client(
+        **resolve_client_options(user_defined_context, sdk_surface=sdk_surface)
+    )
