Merged
12 changes: 12 additions & 0 deletions BIGQUERY.md
@@ -23,3 +23,15 @@ This section covers connecting to BigQuery.
* If an operation fails due to permissions, identify the type of operation and recommend the appropriate role. You can provide these links for assistance:
    * Granting Roles: https://cloud.google.com/iam/docs/grant-role-console
    * BigQuery Permissions: https://cloud.google.com/iam/docs/roles-permissions/bigquery

### 2. BigQuery AI/ML Skills
These skills leverage BigQuery's built-in AI functions (`AI.*`) for tasks like text generation, classification, and semantic search.

**Important**: Standard SQL-based `AI.*` functions (executed via `execute_sql()`) are preferred over dedicated BigQuery tools for tasks like Forecasting and Anomaly Detection.

1. **Prerequisites**:
    * Ensure your BigQuery project has the **Vertex AI API** enabled.
    * A [Cloud Resource Connection](https://docs.cloud.google.com/bigquery/docs/create-cloud-resource-connection) must be established in BigQuery to use `AI.*` functions.

2. **Handle Permission Errors**:
    * The service account associated with the BigQuery connection requires both the **Vertex AI User** (`roles/aiplatform.user`) and the **BigQuery Connection User** (`roles/bigquery.connectionUser`) roles.
10 changes: 8 additions & 2 deletions README.md
@@ -47,6 +47,11 @@ Before you begin, ensure you have the following:
- Ensure [Application Default Credentials](https://cloud.google.com/docs/authentication/gcloud) are available in your environment.
- IAM Permissions:
  - BigQuery User (`roles/bigquery.user`)
- (Optional) To use BigQuery AI/ML skills:
  - Ensure that the Vertex AI API is enabled.
  - IAM permissions:
    - BigQuery Connection User (`roles/bigquery.connectionUser`)
    - Vertex AI User (`roles/aiplatform.user`)

## Getting Started

@@ -235,8 +240,9 @@ Interact with BigQuery using natural language right from your IDE:

This extension provides a comprehensive set of skills:

* `bigquery-data`: Use these skills when you need to handle large-scale data exploration and dataset management. Use when users need to find data assets or run SQL at scale. Provides metadata discovery and query execution across the data warehouse.
* `bigquery-analytics`: Use these skills when you need to handle advanced data intelligence and predictive tasks. Use when a user asks "why" data changed or needs future projections. Provides automated insight generation and time-series forecasting.
* [bigquery-data](./skills/bigquery-data/SKILL.md): Use these skills when you need to handle large-scale data exploration and dataset management. Use when users need to find data assets or run SQL at scale. Provides metadata discovery and query execution across the data warehouse.
* [bigquery-analytics](./skills/bigquery-analytics/SKILL.md): Use these skills when you need to handle advanced data intelligence and predictive tasks. Use when a user asks "why" data changed or needs future projections. Provides automated insight generation and time-series forecasting.
* [bigquery-ai-ml](./skills/bigquery-ai-ml/SKILL.md): Use these skills for BigQuery AI and Machine Learning queries using standard SQL and `AI.*` functions. Provides capabilities for text generation, classification, semantic search, and forecasting using pre-trained models without needing to manage custom models.

## Additional Extensions

47 changes: 47 additions & 0 deletions skills/bigquery-ai-ml/SKILL.md
@@ -0,0 +1,47 @@
---
name: bigquery-ai-ml
license: Apache-2.0
metadata:
  author: google-adk
  version: "1.0"
description: |
  Skill for BigQuery AI and Machine Learning queries using standard SQL
  and `AI.*` functions (preferred over dedicated tools).

---

# Skill: bigquery-ai-ml

This skill defines the usage and rules for BigQuery AI/ML functions,
preferring SQL-based Skills over dedicated BigQuery tools.

## 1. Skill vs Tool Preference (BigQuery AI/ML)

Agents should **prefer using the Skill (SQL via `execute_sql()`)** over
dedicated BigQuery tools for functionalities like Forecasting and Anomaly
Detection.

Use `execute_sql()` with the standard BigQuery `AI.*` functions for these tasks
instead of the corresponding high-level tools.

## 2. Mandatory Reference Routing

This skill file does not contain the syntax for these functions. You **MUST**
read the associated reference file before generating SQL.

**CRITICAL**: DO NOT GUESS filenames. You MUST only use the exact paths
provided below.

| Function | Description | Required Reference File to Retrieve |
| :--- | :--- | :--- |
| **AI.FORECAST** | Time-series forecasting via the pre-trained TimesFM model | `references/bigquery_ai_forecast.md` |
| **AI.CLASSIFY** | Categorize unstructured data into predefined labels | `references/bigquery_ai_classify.md` |
| **AI.DETECT_ANOMALIES** | Identify deviations in time-series data via the pre-trained TimesFM model | `references/bigquery_ai_detect_anomalies.md` |
| **AI.GENERATE** | General-purpose text and content generation | `references/bigquery_ai_generate.md` |
| **AI.GENERATE_BOOL** | Generate a boolean value (TRUE/FALSE) based on a prompt | `references/bigquery_ai_generate_bool.md` |
| **AI.GENERATE_DOUBLE** | Generate a floating-point number based on a prompt | `references/bigquery_ai_generate_double.md` |
| **AI.GENERATE_INT** | Generate an integer value based on a prompt | `references/bigquery_ai_generate_int.md` |
| **AI.IF** | Evaluate a natural-language boolean condition | `references/bigquery_ai_if.md` |
| **AI.SCORE** | Rank items by semantic relevance (use with ORDER BY) | `references/bigquery_ai_score.md` |
| **AI.SIMILARITY** | Compute cosine similarity between two inputs | `references/bigquery_ai_similarity.md` |
| **AI.SEARCH** | Semantic search on tables with autonomous embedding generation | `references/bigquery_ai_search.md` |
92 changes: 92 additions & 0 deletions skills/bigquery-ai-ml/references/bigquery_ai_classify.md
@@ -0,0 +1,92 @@
# BigQuery AI.Classify

`AI.CLASSIFY` categorizes unstructured data into a predefined set of labels.

## Syntax Reference

```sql
AI.CLASSIFY(
  [ input => ] 'INPUT',
  [ categories => ] 'CATEGORIES'
  [, connection_id => 'CONNECTION_ID' ]
  [, endpoint => 'ENDPOINT' ]
  [, output_mode => 'OUTPUT_MODE' ]
)

### Input Arguments

| Argument | Requirement | Type | Description |
| :--- | :--- | :--- | :--- |
| **`input`** | **Required** | String | The text content to classify. |
| **`categories`** | **Required** | `ARRAY<STRING>` | A list of target categories/labels. Can be `ARRAY<STRING>` or `ARRAY<STRUCT<STRING, STRING>>` (label, description). |
| **`connection_id`** | Optional | String | The connection ID to use for the LLM. |
| **`endpoint`** | Optional | String | The model name, e.g., `'gemini-2.5-flash'`. |
| **`output_mode`** | Optional | String | `'single'` or `'multi'`. Determines the output type (see Output Schema); when omitted, a single `STRING` is returned. |

### Output Schema

The output type depends on the `output_mode` argument:

| Output Mode | `output_mode` Value | Type | Description |
| :--- | :--- | :--- | :--- |
| **Single Label** | `NULL` (Default) | `STRING` | The single category that best fits the input. |
| **Single Label (Explicit)** | `'single'` | `ARRAY<STRING>` | An array containing exactly one category string. |
| **Multi Label** | `'multi'` | `ARRAY<STRING>` | An array containing zero or more matching categories. |
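
Because `'multi'` returns an array, downstream SQL usually needs to flatten it. The following is a minimal sketch of that pattern, not a definitive recipe: the table `dataset.emails`, the column `content`, and the connection ID are placeholder names, and the CTE name `classified` is introduced here only for illustration.

```sql
-- Sketch: produce one output row per (email, matched label).
-- All table, column, and connection names are placeholders.
WITH classified AS (
  SELECT
    content,
    AI.CLASSIFY(
      content,
      categories => ['Spam', 'Not Spam', 'Urgent'],
      connection_id => 'my-project.us.my-connection',
      output_mode => 'multi'
    ) AS labels  -- ARRAY<STRING> with zero or more categories
  FROM `dataset.emails`
)
SELECT
  content,
  label
FROM classified
CROSS JOIN UNNEST(labels) AS label;  -- rows with an empty array are dropped
```

If rows with no matching category should be kept, `LEFT JOIN UNNEST(labels) AS label` can be used in place of the `CROSS JOIN`, in which case `label` is `NULL` for those rows.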

## Examples

### Classify text into categories

```sql
SELECT
  content,
  AI.CLASSIFY(
    content,
    categories => ['Spam', 'Not Spam', 'Urgent'],
    connection_id => 'my-project.us.my-connection'
  ) AS classification
FROM `dataset.emails`;
```

### Classify text into multiple topics

```sql
SELECT
  title,
  body,
  AI.CLASSIFY(
    body,
    categories => ['tech', 'sport', 'business', 'politics', 'entertainment', 'other'],
    output_mode => 'multi'
  ) AS categories
FROM
  `bigquery-public-data.bbc_news.fulltext`
LIMIT 100;
```

### Classify reviews by sentiment

```sql
SELECT
  AI.CLASSIFY(
    ('Classify the review by sentiment: ', review),
    categories => [
      ('green', 'The review is positive.'),
      ('yellow', 'The review is neutral.'),
      ('red', 'The review is negative.')
    ]
  ) AS ai_review_rating,
  reviewer_rating AS human_provided_rating,
  review
FROM `bigquery-public-data.imdb.reviews`
WHERE title = 'The English Patient';
```
110 changes: 110 additions & 0 deletions skills/bigquery-ai-ml/references/bigquery_ai_detect_anomalies.md
@@ -0,0 +1,110 @@
# BigQuery AI.Detect_Anomalies

`AI.DETECT_ANOMALIES` uses the pre-trained **TimesFM** model to identify
deviations in time series data without needing to train a custom model.

## Syntax Reference

This function compares a target dataset against a historical dataset to identify
anomalies.

```sql
SELECT *
FROM AI.DETECT_ANOMALIES(
  { TABLE `project.dataset.history_table` | (SELECT * FROM history_query) },
  { TABLE `project.dataset.target_table` | (SELECT * FROM target_query) },
  data_col => 'DATA_COL',
  timestamp_col => 'TIMESTAMP_COL'
  [, model => 'MODEL']
  [, id_cols => ID_COLS]
  [, anomaly_prob_threshold => ANOMALY_PROB_THRESHOLD]
)
```

### Input Arguments

Argument | Requirement | Type | Description
:--------------------------- | :----------- | :------------ | :----------
**`historical_data`** | **Required** | Table/Query | The source table or subquery containing historical data for training context.
**`target_data`** | **Required** | Table/Query | The source table or subquery containing data to analyze for anomalies.
**`data_col`** | **Required** | String | The numeric column to analyze.
**`timestamp_col`** | **Required** | String | The column containing dates/timestamps.
**`id_cols`** | Optional | Array<String> | Grouping columns for multiple series (e.g., `['store_id']`).
**`anomaly_prob_threshold`** | Optional | Float64 | Threshold for anomaly detection (0 to 1). Defaults to 0.95.
**`model`** | Optional | String | Model version. Defaults to `'TimesFM 2.0'`.
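
As an illustration of how these arguments fit together when both inputs are existing tables, here is a minimal sketch using the `TABLE` form from the syntax reference. Every table and column name below (`my-project.sales.daily_sales_history`, `my-project.sales.daily_sales_recent`, `revenue`, `sale_date`, `store_id`) is hypothetical; substitute names from your own dataset.

```sql
-- Sketch: per-store anomaly detection using the TABLE form of both inputs.
-- All table and column names here are hypothetical.
SELECT *
FROM AI.DETECT_ANOMALIES(
  TABLE `my-project.sales.daily_sales_history`,  -- historical_data
  TABLE `my-project.sales.daily_sales_recent`,   -- target_data
  data_col => 'revenue',
  timestamp_col => 'sale_date',
  id_cols => ['store_id'],          -- one series per store
  anomaly_prob_threshold => 0.9     -- lower than the 0.95 default, so more points are flagged
);
```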

### Output Schema

| Column | Type | Description |
| :--- | :--- | :--- |
| **`id_cols`** | (As Input) | Original identifiers for the series. |
| **`time_series_timestamp`** | TIMESTAMP | Timestamp for the analyzed points. |
| **`time_series_data`** | FLOAT64 | The original data value. |
| **`is_anomaly`** | BOOL | TRUE if the point is identified as an anomaly. |
| **`lower_bound`** | FLOAT64 | Lower bound of the expected range. |
| **`upper_bound`** | FLOAT64 | Upper bound of the expected range. |
| **`anomaly_probability`** | FLOAT64 | Probability that the point is an anomaly. |
| **`ai_detect_anomalies_status`** | STRING | Error messages or empty string on success. A minimum of 3 data points is required. |
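
The output columns above can be filtered like those of any other table-valued result. As a minimal sketch, the query below keeps only the flagged points and ranks them by `anomaly_probability`; the choice of columns and ordering is an illustrative assumption, not a required pattern.

```sql
-- Sketch: return only the points flagged as anomalies, most likely first.
WITH bike_trips AS (
  SELECT EXTRACT(DATE FROM starttime) AS date, COUNT(*) AS num_trips
  FROM `bigquery-public-data.new_york.citibike_trips`
  GROUP BY date
)
SELECT
  time_series_timestamp,
  time_series_data,
  lower_bound,
  upper_bound,
  anomaly_probability
FROM AI.DETECT_ANOMALIES(
  (SELECT * FROM bike_trips WHERE date <= DATE('2016-06-30')),
  (SELECT * FROM bike_trips WHERE date BETWEEN '2016-07-01' AND '2016-09-01'),
  data_col => 'num_trips',
  timestamp_col => 'date'
)
WHERE is_anomaly
ORDER BY anomaly_probability DESC;
```

Selecting `ai_detect_anomalies_status` alongside these columns can help when results look incomplete, since a non-empty status indicates an error for that series.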

## Examples

### Basic Anomaly Detection

Detect anomalies in daily bike trips for a specific 2-month window based on
prior history.

```sql
WITH bike_trips AS (
  SELECT EXTRACT(DATE FROM starttime) AS date, COUNT(*) AS num_trips
  FROM `bigquery-public-data.new_york.citibike_trips`
  GROUP BY date
)
SELECT *
FROM AI.DETECT_ANOMALIES(
  -- Historical context (Training data equivalent)
  (SELECT * FROM bike_trips WHERE date <= DATE('2016-06-30')),
  -- Target range (Data to inspect for anomalies)
  (SELECT * FROM bike_trips WHERE date BETWEEN '2016-07-01' AND '2016-09-01'),
  data_col => 'num_trips',
  timestamp_col => 'date'
);
```

### Detection Across Multiple Series

Use `id_cols` to detect anomalies separately for each combination of user type
(e.g., Subscriber vs. Customer) and gender in the same query.

```sql
WITH bike_trips AS (
  SELECT
    EXTRACT(DATE FROM starttime) AS date, usertype, gender,
    COUNT(*) AS num_trips
  FROM `bigquery-public-data.new_york.citibike_trips`
  GROUP BY date, usertype, gender
)
SELECT *
FROM AI.DETECT_ANOMALIES(
  -- Historical data from a query
  (SELECT * FROM bike_trips WHERE date <= DATE('2016-06-30')),
  -- Target data from a query
  (SELECT * FROM bike_trips WHERE date BETWEEN '2016-07-01' AND '2016-09-01'),
  data_col => 'num_trips',
  timestamp_col => 'date',
  id_cols => ['usertype', 'gender'],
  model => 'TimesFM 2.5',
  anomaly_prob_threshold => 0.8
);
```