Commit 586ea7e

feat: add bigquery ai-ml skills (#119)

1 parent 1bc7fbd commit 586ea7e

14 files changed: 873 additions & 2 deletions

BIGQUERY.md

Lines changed: 12 additions & 0 deletions

@@ -23,3 +23,15 @@ This section covers connecting to BigQuery.

* If an operation fails due to permissions, identify the type of operation and recommend the appropriate role. You can provide these links for assistance:
  * Granting Roles: https://cloud.google.com/iam/docs/grant-role-console
  * BigQuery Permissions: https://cloud.google.com/iam/docs/roles-permissions/bigquery

### 2. BigQuery AI/ML Skills

These skills leverage BigQuery's built-in AI functions (`AI.*`) for tasks such as text generation, classification, and semantic search.

**Important**: Standard SQL-based `AI.*` functions (executed via `execute_sql()`) are preferred over dedicated BigQuery tools for tasks such as forecasting and anomaly detection.

1. **Prerequisites**:
   * Ensure your BigQuery project has the **Vertex AI API** enabled.
   * A [Cloud Resource Connection](https://docs.cloud.google.com/bigquery/docs/create-cloud-resource-connection) must be established in BigQuery to use `AI.*` functions.

2. **Handle Permission Errors**:
   * The service account associated with the BigQuery connection requires the **Vertex AI User** (`roles/aiplatform.user`) and **BigQuery Connection User** (`roles/bigquery.connectionUser`) roles.
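
As a quick smoke test of this setup (a sketch only — the project, region, and connection names below are placeholders, not part of the commit), a minimal `AI.GENERATE` call can confirm that the Vertex AI API and the resource connection are wired up before running larger AI/ML workloads:

```sql
-- Placeholder names: substitute your own project, region, and connection.
SELECT
  AI.GENERATE(
    prompt => 'Reply with the single word OK.',
    connection_id => 'my-project.us.my-connection',
    endpoint => 'gemini-2.5-flash'
  ).result AS smoke_test;
```

If this query fails with a permission error, the roles listed above are the first thing to check.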

README.md

Lines changed: 8 additions & 2 deletions

@@ -47,6 +47,11 @@ Before you begin, ensure you have the following:

- Ensure [Application Default Credentials](https://cloud.google.com/docs/authentication/gcloud) are available in your environment.
- IAM Permissions:
  - BigQuery User (`roles/bigquery.user`)
- (Optional) To use BigQuery AI/ML skills:
  - Ensure that the Vertex AI API is enabled.
  - IAM permissions:
    - BigQuery Connection User (`roles/bigquery.connectionUser`)
    - Vertex AI User (`roles/aiplatform.user`)

## Getting Started

@@ -235,8 +240,9 @@ Interact with BigQuery using natural language right from your IDE:

This extension provides a comprehensive set of skills:

* [bigquery-data](./skills/bigquery-data/SKILL.md): Use these skills when you need to handle large-scale data exploration and dataset management. Use when users need to find data assets or run SQL at scale. Provides metadata discovery and query execution across the data warehouse.
* [bigquery-analytics](./skills/bigquery-analytics/SKILL.md): Use these skills when you need to handle advanced data intelligence and predictive tasks. Use when a user asks "why" data changed or needs future projections. Provides automated insight generation and time-series forecasting.
* [bigquery-ai-ml](./skills/bigquery-ai-ml/SKILL.md): Use these skills for BigQuery AI and Machine Learning queries using standard SQL and `AI.*` functions. Provides capabilities for text generation, classification, semantic search, and forecasting using pre-trained models without needing to manage custom models.

## Additional Extensions
skills/bigquery-ai-ml/SKILL.md

Lines changed: 47 additions & 0 deletions

@@ -0,0 +1,47 @@

---
name: bigquery-ai-ml
license: Apache-2.0
metadata:
  author: google-adk
  version: "1.0"
description: |
  Skill for BigQuery AI and Machine Learning queries using standard SQL
  and `AI.*` functions (preferred over dedicated tools).
---

# Skill: bigquery-ai-ml

This skill defines the usage and rules for BigQuery AI/ML functions, preferring SQL-based skills over dedicated BigQuery tools.

## 1. Skill vs Tool Preference (BigQuery AI/ML)

Agents should **prefer using the Skill (SQL via `execute_sql()`)** over dedicated BigQuery tools for functionalities such as Forecasting and Anomaly Detection.

Use `execute_sql()` with the standard BigQuery `AI.*` functions for these tasks instead of the corresponding high-level tools.

## 2. Mandatory Reference Routing

This skill file does not contain the syntax for these functions. You **MUST** read the associated reference file before generating SQL.

**CRITICAL**: DO NOT GUESS filenames. You MUST only use the exact paths provided below.

| Function | Description | Required Reference File to Retrieve |
| :--- | :--- | :--- |
| **AI.FORECAST** | Time-series forecasting via the pre-trained TimesFM model | `references/bigquery_ai_forecast.md` |
| **AI.CLASSIFY** | Categorize unstructured data into predefined labels | `references/bigquery_ai_classify.md` |
| **AI.DETECT_ANOMALIES** | Identify deviations in time-series data via the pre-trained TimesFM model | `references/bigquery_ai_detect_anomalies.md` |
| **AI.GENERATE** | General-purpose text and content generation | `references/bigquery_ai_generate.md` |
| **AI.GENERATE_BOOL** | Generate a boolean value (TRUE/FALSE) based on a prompt | `references/bigquery_ai_generate_bool.md` |
| **AI.GENERATE_DOUBLE** | Generate a floating-point number based on a prompt | `references/bigquery_ai_generate_double.md` |
| **AI.GENERATE_INT** | Generate an integer value based on a prompt | `references/bigquery_ai_generate_int.md` |
| **AI.IF** | Evaluate a natural-language boolean condition | `references/bigquery_ai_if.md` |
| **AI.SCORE** | Rank items by semantic relevance (use with ORDER BY) | `references/bigquery_ai_score.md` |
| **AI.SIMILARITY** | Compute cosine similarity between two inputs | `references/bigquery_ai_similarity.md` |
| **AI.SEARCH** | Semantic search on tables with autonomous embedding generation | `references/bigquery_ai_search.md` |
skills/bigquery-ai-ml/references/bigquery_ai_classify.md

Lines changed: 92 additions & 0 deletions

@@ -0,0 +1,92 @@

# BigQuery AI.CLASSIFY

`AI.CLASSIFY` categorizes unstructured data into a predefined set of labels.

## Syntax Reference

```sql
AI.CLASSIFY(
  [ input => ] 'INPUT',
  [ categories => ] 'CATEGORIES'
  [, connection_id => 'CONNECTION_ID' ]
  [, endpoint => 'ENDPOINT' ]
  [, output_mode => 'OUTPUT_MODE' ]
)
```

### Input Arguments

| Argument | Requirement | Type | Description |
| :--- | :--- | :--- | :--- |
| **`input`** | **Required** | String | The text content to classify. |
| **`categories`** | **Required** | Array<String> | A list of target categories/labels. Can be `ARRAY<STRING>` or `ARRAY<STRUCT<STRING, STRING>>` (label, description). |
| **`connection_id`** | Optional | String | The connection ID to use for the LLM. |
| **`endpoint`** | Optional | String | The model name, e.g., `'gemini-2.5-flash'`. |
| **`output_mode`** | Optional | String | `'single'` (default) or `'multi'`. Determines the output type. |

### Output Schema

The output type depends on the `output_mode` argument:

| Output Mode | output_mode Value | Type | Description |
| :--- | :--- | :--- | :--- |
| **Single Label** | `NULL` (Default) | `STRING` | The single category that best fits the input. |
| **Single Label (Explicit)** | `'single'` | `ARRAY<STRING>` | An array containing exactly one category string. |
| **Multi Label** | `'multi'` | `ARRAY<STRING>` | An array containing zero or more matching categories. |

## Examples

### Classify text into categories

```sql
SELECT
  content,
  AI.CLASSIFY(
    content,
    categories => ['Spam', 'Not Spam', 'Urgent'],
    connection_id => 'my-project.us.my-connection'
  ) AS classification
FROM `dataset.emails`;
```

### Classify text into multiple topics

```sql
SELECT
  title,
  body,
  AI.CLASSIFY(
    body,
    categories => ['tech', 'sport', 'business', 'politics', 'entertainment', 'other'],
    output_mode => 'multi') AS categories
FROM
  `bigquery-public-data.bbc_news.fulltext`
LIMIT 100;
```

### Classify reviews by sentiment

```sql
SELECT
  AI.CLASSIFY(
    ('Classify the review by sentiment: ', review),
    categories => [
      ('green', 'The review is positive.'),
      ('yellow', 'The review is neutral.'),
      ('red', 'The review is negative.')
    ]
  ) AS ai_review_rating,
  reviewer_rating AS human_provided_rating,
  review
FROM `bigquery-public-data.imdb.reviews`
WHERE title = 'The English Patient';
```
skills/bigquery-ai-ml/references/bigquery_ai_detect_anomalies.md

Lines changed: 110 additions & 0 deletions

@@ -0,0 +1,110 @@

# BigQuery AI.DETECT_ANOMALIES

`AI.DETECT_ANOMALIES` uses the pre-trained **TimesFM** model to identify deviations in time-series data without needing to train a custom model.

## Syntax Reference

This function compares a target dataset against a historical dataset to identify anomalies.

```sql
SELECT *
FROM AI.DETECT_ANOMALIES(
  { TABLE `project.dataset.history_table` | (SELECT * FROM history_query) },
  { TABLE `project.dataset.target_table` | (SELECT * FROM target_query) },
  data_col => 'DATA_COL',
  timestamp_col => 'TIMESTAMP_COL'
  [, model => 'MODEL']
  [, id_cols => ID_COLS]
  [, anomaly_prob_threshold => ANOMALY_PROB_THRESHOLD]
)
```

### Input Arguments

| Argument | Requirement | Type | Description |
| :--- | :--- | :--- | :--- |
| **`historical_data`** | **Required** | Table/Query | The source table or subquery containing historical data for training context. |
| **`target_data`** | **Required** | Table/Query | The source table or subquery containing data to analyze for anomalies. |
| **`data_col`** | **Required** | String | The numeric column to analyze. |
| **`timestamp_col`** | **Required** | String | The column containing dates/timestamps. |
| **`id_cols`** | Optional | Array<String> | Grouping columns for multiple series (e.g., `['store_id']`). |
| **`anomaly_prob_threshold`** | Optional | Float64 | Threshold for anomaly detection (0 to 1). Defaults to 0.95. |
| **`model`** | Optional | String | Model version. Defaults to `'TimesFM 2.0'`. |

### Output Schema

| Column | Type | Description |
| :--- | :--- | :--- |
| **`id_cols`** | (As Input) | Original identifiers for the series. |
| **`time_series_timestamp`** | TIMESTAMP | Timestamp for the analyzed points. |
| **`time_series_data`** | FLOAT64 | The original data value. |
| **`is_anomaly`** | BOOL | TRUE if the point is identified as an anomaly. |
| **`lower_bound`** | FLOAT64 | Lower bound of the expected range. |
| **`upper_bound`** | FLOAT64 | Upper bound of the expected range. |
| **`anomaly_probability`** | FLOAT64 | Probability that the point is an anomaly. |
| **`ai_detect_anomalies_status`** | STRING | Error messages, or an empty string on success. A minimum of 3 data points is required. |

## Examples

### Basic Anomaly Detection

Detect anomalies in daily bike trips for a specific 2-month window based on prior history.

```sql
WITH bike_trips AS (
  SELECT EXTRACT(DATE FROM starttime) AS date, COUNT(*) AS num_trips
  FROM `bigquery-public-data.new_york.citibike_trips`
  GROUP BY date
)
SELECT *
FROM AI.DETECT_ANOMALIES(
  -- Historical context (training-data equivalent)
  (SELECT * FROM bike_trips WHERE date <= DATE('2016-06-30')),
  -- Target range (data to inspect for anomalies)
  (SELECT * FROM bike_trips WHERE date BETWEEN '2016-07-01' AND '2016-09-01'),
  data_col => 'num_trips',
  timestamp_col => 'date'
);
```

### Multiple Series Detection

Use `id_cols` to detect anomalies separately for different user types (e.g., Subscriber vs. Customer) in the same query.

```sql
WITH bike_trips AS (
  SELECT
    EXTRACT(DATE FROM starttime) AS date, usertype, gender,
    COUNT(*) AS num_trips
  FROM `bigquery-public-data.new_york.citibike_trips`
  GROUP BY date, usertype, gender
)
SELECT *
FROM
  AI.DETECT_ANOMALIES(
    -- Historical data from a query
    (SELECT * FROM bike_trips WHERE date <= DATE('2016-06-30')),
    -- Target data from a query
    (SELECT * FROM bike_trips WHERE date BETWEEN '2016-07-01' AND '2016-09-01'),
    data_col => 'num_trips',
    timestamp_col => 'date',
    id_cols => ['usertype', 'gender'],
    model => 'TimesFM 2.5',
    anomaly_prob_threshold => 0.8);
```
