Commit a8ce338

feat: add bigquery ai-ml skills
1 parent 1bc7fbd commit a8ce338

12 files changed

Lines changed: 853 additions & 0 deletions

skills/bigquery-ai-ml/SKILL.md

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
1+
---
2+
name: bigquery-ai-ml
3+
license: Apache-2.0
4+
metadata:
5+
author: google-adk
6+
version: "1.0"
7+
description: |
8+
Skill for BigQuery AI and Machine Learning queries using standard SQL
9+
and `AI.*` functions (preferred over dedicated tools).
10+
11+
---
12+
13+
# Skill: bigquery-ai-ml
14+
15+
This skill defines the usage and rules for BigQuery AI/ML functions,
16+
preferring SQL-based Skills over dedicated BigQuery tools.
17+
18+
## 1. Skill vs Tool Preference (BigQuery AI/ML)
19+
20+
Agents should **prefer using the Skill (SQL via `execute_sql()`)** over
21+
dedicated BigQuery tools for functionalities like Forecasting and Anomaly
22+
Detection.
23+
24+
Use `execute_sql()` with the standard BigQuery `AI.*` functions for these tasks
25+
instead of the corresponding high-level tools.
26+
27+
## 2. Mandatory Reference Routing
28+
29+
This skill file does not contain the syntax for these functions. You **MUST**
30+
read the associated reference file before generating SQL.
31+
32+
**CRITICAL**: DO NOT GUESS filenames. You MUST only use the exact paths
33+
provided below.
34+
35+
| Function | Description | Required Reference File to Retrieve |
36+
| :--- | :--- | :--- |
37+
| **AI.FORECAST** | Time-series forecasting via the pre-trained TimesFM model | `references/bigquery_ai_forecast.md` |
38+
| **AI.CLASSIFY** | Categorize unstructured data into predefined labels | `references/bigquery_ai_classify.md` |
39+
| **AI.DETECT_ANOMALIES** | Identify deviations in time-series data via the pre-trained TimesFM model | `references/bigquery_ai_detect_anomalies.md` |
40+
| **AI.GENERATE** | General-purpose text and content generation | `references/bigquery_ai_generate.md` |
41+
| **AI.GENERATE_BOOL** | Generate a boolean value (TRUE/FALSE) based on a prompt | `references/bigquery_ai_generate_bool.md` |
42+
| **AI.GENERATE_DOUBLE** | Generate a floating-point number based on a prompt | `references/bigquery_ai_generate_double.md` |
43+
| **AI.GENERATE_INT** | Generate an integer value based on a prompt | `references/bigquery_ai_generate_int.md` |
44+
| **AI.IF** | Evaluate a natural-language boolean condition | `references/bigquery_ai_if.md` |
45+
| **AI.SCORE** | Rank items by semantic relevance (use with ORDER BY) | `references/bigquery_ai_score.md` |
46+
| **AI.SIMILARITY** | Compute cosine similarity between two inputs | `references/bigquery_ai_similarity.md` |
47+
| **AI.SEARCH** | Semantic search on tables with autonomous embedding generation | `references/bigquery_ai_search.md` |
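
The routing rule above can be sketched as a lookup that fails loudly instead of guessing a filename. This is a minimal illustration, not part of the skill itself: the paths are copied from the table, while `REFERENCE_FILES` and `reference_for` are hypothetical names.

```python
# Hypothetical helper: maps each AI.* function to the reference path from the table.
REFERENCE_FILES = {
    "AI.FORECAST": "references/bigquery_ai_forecast.md",
    "AI.CLASSIFY": "references/bigquery_ai_classify.md",
    "AI.DETECT_ANOMALIES": "references/bigquery_ai_detect_anomalies.md",
    "AI.GENERATE": "references/bigquery_ai_generate.md",
    "AI.GENERATE_BOOL": "references/bigquery_ai_generate_bool.md",
    "AI.GENERATE_DOUBLE": "references/bigquery_ai_generate_double.md",
    "AI.GENERATE_INT": "references/bigquery_ai_generate_int.md",
    "AI.IF": "references/bigquery_ai_if.md",
    "AI.SCORE": "references/bigquery_ai_score.md",
    "AI.SIMILARITY": "references/bigquery_ai_similarity.md",
    "AI.SEARCH": "references/bigquery_ai_search.md",
}


def reference_for(function_name: str) -> str:
    """Return the exact reference path for a function, or raise if none is listed."""
    key = function_name.strip().upper()
    if key not in REFERENCE_FILES:
        # Never derive a path from the function name; only listed paths are valid.
        raise ValueError(f"No reference file is listed for {function_name!r}")
    return REFERENCE_FILES[key]
```

Raising on an unknown name, rather than building a path from a naming pattern, mirrors the "DO NOT GUESS filenames" rule.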
Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
1+
# BigQuery AI.Classify
2+
3+
`AI.CLASSIFY` categorizes unstructured data into a predefined set of labels.
4+
5+
## Syntax Reference
6+
7+
```sql
8+
AI.CLASSIFY(
9+
[ input => ] 'INPUT',
10+
[ categories => ] 'CATEGORIES'
11+
[, connection_id => 'CONNECTION_ID' ]
12+
[, endpoint => 'ENDPOINT' ]
13+
[, output_mode => 'OUTPUT_MODE' ]
14+
)
15+
```
16+
17+
### Input Arguments
18+
19+
| Argument | Requirement | Type | Description |
| :--- | :--- | :--- | :--- |
| **`input`** | **Required** | String | The text content to classify. |
| **`categories`** | **Required** | `ARRAY<STRING>` | A list of target categories/labels. Can be `ARRAY<STRING>` or `ARRAY<STRUCT<STRING, STRING>>` (label, description). |
| **`connection_id`** | Optional | String | The connection ID to use for the LLM. |
| **`endpoint`** | Optional | String | The model name, e.g., `'gemini-2.5-flash'`. |
| **`output_mode`** | Optional | String | `'single'` or `'multi'`. Determines the output type. When omitted, the single best label is returned as a `STRING` (see Output Schema). |
38+
39+
### Output Schema
40+
41+
The output type depends on the `output_mode` argument:
42+
43+
| Output Mode | `output_mode` Value | Type | Description |
| :--- | :--- | :--- | :--- |
| **Single Label** | `NULL` (Default) | `STRING` | The single category that best fits the input. |
| **Single Label (Explicit)** | `'single'` | `ARRAY<STRING>` | An array containing exactly one category string. |
| **Multi Label** | `'multi'` | `ARRAY<STRING>` | An array containing zero or more matching categories. |
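
Because the result type changes with `output_mode`, client code that reads the column may want a single shape. The sketch below normalizes a result cell to a list of labels. It assumes the client returns `STRING` as `str` and `ARRAY<STRING>` as a list, and that a `NULL` cell should map to an empty list; `labels_from_result` is a hypothetical name.

```python
from typing import List, Optional, Sequence, Union


def labels_from_result(value: Optional[Union[str, Sequence[str]]]) -> List[str]:
    """Normalize an AI.CLASSIFY result cell to a list of labels."""
    if value is None:
        # Assumption: a NULL cell is treated as "no label".
        return []
    if isinstance(value, str):
        # Default mode (output_mode omitted): one STRING label.
        return [value]
    # 'single' or 'multi' mode: already an array of labels.
    return list(value)
```

For example, `labels_from_result("Spam")` and `labels_from_result(["Spam"])` both yield `["Spam"]`, so downstream code does not need to branch on the mode.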
55+
56+
## Examples
57+
58+
### Classify text into categories
59+
60+
```sql
61+
SELECT
62+
content,
63+
AI.CLASSIFY(
64+
content,
65+
categories => ['Spam', 'Not Spam', 'Urgent'],
66+
connection_id => 'my-project.us.my-connection'
67+
) as classification
68+
FROM `dataset.emails`;
69+
```
70+
71+
### Classify text into multiple topics
72+
73+
```sql
74+
SELECT
75+
title,
76+
body,
77+
AI.CLASSIFY(
78+
body,
79+
categories => ['tech', 'sport', 'business', 'politics', 'entertainment', 'other'],
80+
output_mode => 'multi') AS categories
81+
FROM
82+
`bigquery-public-data.bbc_news.fulltext`
83+
LIMIT 100;
84+
```
85+
86+
### Classify reviews by sentiment
87+
88+
```sql
SELECT
  AI.CLASSIFY(
    ('Classify the review by sentiment: ', review),
    categories => [
      ('green', 'The review is positive.'),
      ('yellow', 'The review is neutral.'),
      ('red', 'The review is negative.')]) AS ai_review_rating,
  reviewer_rating AS human_provided_rating,
  review
FROM `bigquery-public-data.imdb.reviews`
WHERE title = 'The English Patient';
```
Lines changed: 110 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,110 @@
1+
# BigQuery AI.Detect_Anomalies
2+
3+
`AI.DETECT_ANOMALIES` uses the pre-trained **TimesFM** model to identify
4+
deviations in time series data without needing to train a custom model.
5+
6+
## Syntax Reference
7+
8+
This function compares a target dataset against a historical dataset to identify
9+
anomalies.
10+
11+
```sql
12+
SELECT *
13+
FROM AI.DETECT_ANOMALIES(
14+
{ TABLE `project.dataset.history_table` | (SELECT * FROM history_query) },
15+
{ TABLE `project.dataset.target_table` | (SELECT * FROM target_query) },
16+
data_col => 'DATA_COL',
17+
timestamp_col => 'TIMESTAMP_COL'
18+
[, model => 'MODEL']
19+
[, id_cols => ID_COLS]
20+
[, anomaly_prob_threshold => ANOMALY_PROB_THRESHOLD]
21+
)
22+
23+
```
24+
25+
### Input Arguments
26+
27+
Argument | Requirement | Type | Description
28+
:--------------------------- | :----------- | :------------ | :----------
29+
**`historical_data`** | **Required** | Table/Query | The source table or subquery containing historical data for training context.
30+
**`target_data`** | **Required** | Table/Query | The source table or subquery containing data to analyze for anomalies.
31+
**`data_col`** | **Required** | String | The numeric column to analyze.
32+
**`timestamp_col`** | **Required** | String | The column containing dates/timestamps.
33+
**`id_cols`** | Optional | Array<String> | Grouping columns for multiple series (e.g., `['store_id']`).
34+
**`anomaly_prob_threshold`** | Optional | Float64 | Threshold for anomaly detection (0 to 1). Defaults to 0.95.
35+
**`model`** | Optional | String | Model version. Defaults to `'TimesFM 2.0'`.
36+
37+
### Output Schema
38+
39+
| Column | Type | Description |
| :--- | :--- | :--- |
| **`id_cols`** | (As Input) | Original identifiers for the series. |
| **`time_series_timestamp`** | TIMESTAMP | Timestamp for the analyzed points. |
| **`time_series_data`** | FLOAT64 | The original data value. |
| **`is_anomaly`** | BOOL | TRUE if the point is identified as an anomaly. |
| **`lower_bound`** | FLOAT64 | Lower bound of the expected range. |
| **`upper_bound`** | FLOAT64 | Upper bound of the expected range. |
| **`anomaly_probability`** | FLOAT64 | Probability that the point is an anomaly. |
| **`ai_detect_anomalies_status`** | STRING | Error messages or empty string on success. A minimum of 3 data points is required. |
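
When post-processing results, the status column should be checked before trusting `is_anomaly`. The sketch below separates flagged points from rows that report an error. It assumes rows arrive as dictionaries keyed by the output column names; `split_anomaly_rows` is a hypothetical name.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def split_anomaly_rows(rows: Iterable[Row]) -> Tuple[List[Row], List[Row]]:
    """Return (anomalous rows, rows whose status column reports an error)."""
    anomalies: List[Row] = []
    failed: List[Row] = []
    for row in rows:
        if row.get("ai_detect_anomalies_status"):
            # A non-empty status means the row carries an error message.
            failed.append(row)
        elif row.get("is_anomaly"):
            anomalies.append(row)
    return anomalies, failed
```

Rows that are neither failed nor anomalous are dropped, since they describe points inside the expected range.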
58+
59+
## Examples
60+
61+
### Basic Anomaly Detection
62+
63+
Detect anomalies in daily bike trips for a specific 2-month window based on
64+
prior history.
65+
66+
```sql
67+
WITH bike_trips AS (
68+
SELECT EXTRACT(DATE FROM starttime) AS date, COUNT(*) AS num_trips
69+
FROM `bigquery-public-data.new_york.citibike_trips`
70+
GROUP BY date
71+
)
72+
SELECT *
73+
FROM AI.DETECT_ANOMALIES(
74+
-- Historical context (Training data equivalent)
75+
(SELECT * FROM bike_trips WHERE date <= DATE('2016-06-30')),
76+
-- Target range (Data to inspect for anomalies)
77+
(SELECT * FROM bike_trips WHERE date BETWEEN '2016-07-01' AND '2016-09-01'),
78+
data_col => 'num_trips',
79+
timestamp_col => 'date'
80+
);
81+
82+
```
83+
84+
### Multiple Time Series Detection

Use `id_cols` to detect anomalies separately for each series in the same query,
here one series per combination of `usertype` (e.g., Subscriber vs. Customer)
and `gender`.
88+
89+
```sql
90+
WITH bike_trips AS (
91+
SELECT
92+
EXTRACT(DATE FROM starttime) AS date, usertype, gender,
93+
COUNT(*) AS num_trips
94+
FROM `bigquery-public-data.new_york.citibike_trips`
95+
GROUP BY date, usertype, gender
96+
)
97+
SELECT *
98+
FROM
99+
AI.DETECT_ANOMALIES(
100+
# Historical data from a query
101+
(SELECT * FROM bike_trips WHERE date <= DATE('2016-06-30')),
102+
# Target data from a query
103+
(SELECT * FROM bike_trips WHERE date BETWEEN '2016-07-01' AND '2016-09-01'),
104+
data_col => 'num_trips',
105+
timestamp_col => 'date',
106+
id_cols => ['usertype', 'gender'],
107+
model => "TimesFM 2.5",
108+
anomaly_prob_threshold => 0.8);
109+
110+
```
Lines changed: 106 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,106 @@
1+
# BigQuery AI.Forecast
2+
3+
`AI.FORECAST` leverages the pre-trained **TimesFM** foundation model to generate
4+
forecasts without the need to train and manage custom models.
5+
6+
## Syntax Reference
7+
8+
```sql
9+
SELECT
10+
*
11+
FROM
12+
AI.FORECAST(
13+
{ TABLE `project.dataset.table` | (QUERY_STATEMENT) },
14+
data_col => 'DATA_COL',
15+
timestamp_col => 'TIMESTAMP_COL'
16+
[, model => 'MODEL']
17+
[, id_cols => ID_COLS]
18+
[, horizon => HORIZON]
19+
[, confidence_level => CONFIDENCE_LEVEL]
20+
[, output_historical_time_series => OUTPUT_HISTORICAL_TIME_SERIES]
21+
[, context_window => CONTEXT_WINDOW]
22+
)
23+
```
24+
25+
### Input Arguments
26+
27+
| Argument | Requirement | Type | Description |
| :--- | :--- | :--- | :--- |
| **`input_data`** | **Required** | Table/Query | The source table or subquery containing historical data. |
| **`data_col`** | **Required** | String | The numeric column to predict. |
| **`timestamp_col`** | **Required** | String | The column containing dates/timestamps. |
| **`id_cols`** | Optional | `ARRAY<STRING>` | Grouping columns for multiple series (e.g., `['store_id']`). |
| **`horizon`** | Optional | Int64 | Number of future points to predict. Defaults to 10. The valid input range is [1, 10,000]. |
| **`confidence_level`** | Optional | Float64 | Confidence level for the prediction interval (0 to 1). Defaults to 0.95. |
| **`model`** | Optional | String | Model version. Defaults to `'TimesFM 2.0'`. |
| **`output_historical_time_series`** | Optional | Bool | Whether to return the historical rows along with the forecast. Selects the output schema (see Output Schema). |
| **`context_window`** | Optional | Int64 | The number of historical data points the model uses to forecast. The min value is 64 and the max value is 2048 for `'TimesFM 2.0'`. If not set, the model determines this automatically. |
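
The documented ranges can be checked before a query is sent. The sketch below assembles an `AI.FORECAST` call and rejects out-of-range values. It is an illustration only: `build_forecast_sql` is a hypothetical helper, the column-name pattern is an assumption, the `context_window` bounds are the ones stated for `'TimesFM 2.0'`, and treating the endpoints of "0 to 1" as valid is an assumption.

```python
import re
from typing import Optional, Sequence

# Assumption: column names are plain identifiers; anything else is rejected.
_COLUMN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _quoted(column: str) -> str:
    if not _COLUMN.fullmatch(column):
        raise ValueError(f"unexpected column name: {column!r}")
    return f"'{column}'"


def build_forecast_sql(
    source: str,  # e.g. "TABLE `project.dataset.table`" or "(SELECT ...)"
    data_col: str,
    timestamp_col: str,
    id_cols: Optional[Sequence[str]] = None,
    horizon: Optional[int] = None,
    confidence_level: Optional[float] = None,
    context_window: Optional[int] = None,
) -> str:
    """Build an AI.FORECAST query, validating the ranges listed in the table."""
    args = [
        source,
        f"data_col => {_quoted(data_col)}",
        f"timestamp_col => {_quoted(timestamp_col)}",
    ]
    if id_cols:
        args.append("id_cols => [" + ", ".join(_quoted(c) for c in id_cols) + "]")
    if horizon is not None:
        if not 1 <= horizon <= 10_000:
            raise ValueError("horizon must be in [1, 10000]")
        args.append(f"horizon => {horizon}")
    if confidence_level is not None:
        if not 0 <= confidence_level <= 1:
            raise ValueError("confidence_level must be between 0 and 1")
        args.append(f"confidence_level => {confidence_level}")
    if context_window is not None:
        if not 64 <= context_window <= 2048:
            raise ValueError("context_window must be in [64, 2048] for TimesFM 2.0")
        args.append(f"context_window => {context_window}")
    return "SELECT *\nFROM AI.FORECAST(\n  " + ",\n  ".join(args) + "\n)"
```

Calling `build_forecast_sql("TABLE citibike_trips", "num_trips", "date", id_cols=["usertype"], horizon=30)` produces a query with the same arguments as the example further down, minus `output_historical_time_series`.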
68+
69+
### Output Schema
70+
71+
The schema adjusts based on the `output_historical_time_series` flag.
72+
73+
Column | Type | Included if output_historical_time_series=FALSE | Included if output_historical_time_series=TRUE | Description
74+
:------------------------------------ | :--------- | :---------------------------------------------- | :--------------------------------------------- | :----------
75+
**`id_cols`** | (As Input) | Yes | Yes | Original identifiers for the series.
76+
**`forecast_timestamp`** | TIMESTAMP | **Yes** | No | Timestamp for predicted points.
77+
**`forecast_value`** | FLOAT64 | **Yes** | No | The 50% quantile (median) prediction.
78+
**`time_series_timestamp`** | TIMESTAMP | No | **Yes** | Uniform timestamp column for both history and forecast.
79+
**`time_series_data`** | FLOAT64 | No | **Yes** | Merged column: actual values for history, median for forecast.
80+
**`time_series_type`** | STRING | No | **Yes** | Label: `'history'` or `'forecast'`.
81+
**`prediction_interval_lower_bound`** | FLOAT64 | Yes | Yes | Lower bound (NULL for historical rows).
82+
**`prediction_interval_upper_bound`** | FLOAT64 | Yes | Yes | Upper bound (NULL for historical rows).
83+
**`confidence_level`** | FLOAT64 | Yes | Yes | The constant confidence level used.
84+
**`ai_forecast_status`** | STRING | Yes | Yes | Error messages or empty string on success. A minimum of 3 data points is required.
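
The schema switch can be summarized as a small function that lists the columns to expect for a given flag value. This is a sketch derived from the table above: `forecast_columns` is a hypothetical name, and the returned order simply follows the table rather than any guaranteed column order.

```python
from typing import List, Sequence

# Columns present regardless of the output_historical_time_series flag.
COMMON_COLUMNS = [
    "prediction_interval_lower_bound",
    "prediction_interval_upper_bound",
    "confidence_level",
    "ai_forecast_status",
]


def forecast_columns(
    output_historical_time_series: bool, id_cols: Sequence[str] = ()
) -> List[str]:
    """List the AI.FORECAST output columns expected for the given flag value."""
    if output_historical_time_series:
        # History and forecast rows share one timestamp/value pair plus a type label.
        specific = ["time_series_timestamp", "time_series_data", "time_series_type"]
    else:
        # Forecast-only output.
        specific = ["forecast_timestamp", "forecast_value"]
    return list(id_cols) + specific + COMMON_COLUMNS
```

A caller can use this to pick the right timestamp and value columns, for instance `forecast_value` when the flag is false and `time_series_data` when it is true.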
85+
86+
## Examples
87+
88+
### Forecasting with History
89+
90+
```sql
91+
WITH
92+
citibike_trips AS (
93+
SELECT EXTRACT(DATE FROM starttime) AS date, usertype, COUNT(*) AS num_trips
94+
FROM `bigquery-public-data.new_york.citibike_trips`
95+
GROUP BY date, usertype
96+
)
97+
SELECT *
98+
FROM
99+
AI.FORECAST(
100+
TABLE citibike_trips,
101+
data_col => 'num_trips',
102+
timestamp_col => 'date',
103+
id_cols => ['usertype'],
104+
horizon => 30,
105+
output_historical_time_series => true);
106+
```
