Commit 5e3ade2

feat: Add AUDIT_ONLY model kind for multi-table validation
Introduces a new model kind that validates data relationships across multiple tables without materializing results. Combines model benefits (DAG participation, dependencies) with audit behavior (validation only).

- Add AUDIT_ONLY to ModelKindName enum and create AuditOnlyKind class
- Implement AuditOnlyStrategy for execution without materialization
- Add comprehensive unit and integration tests
- Update documentation with usage examples and best practices
- Add three example models to sushi project demonstrating use cases
1 parent 3bef91a commit 5e3ade2

12 files changed

Lines changed: 889 additions & 3 deletions

File tree

docs/concepts/audits.md

Lines changed: 101 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -722,3 +722,104 @@ MODEL (
722722
)
723723
);
724724
```
725+
726+
### AUDIT_ONLY Models
727+
728+
In addition to traditional audits, SQLMesh provides a special model kind called `AUDIT_ONLY` for validating data relationships across multiple tables without materializing any results.
729+
730+
#### When to Use AUDIT_ONLY Models
731+
732+
Use `AUDIT_ONLY` models when you need to:
733+
- Validate relationships between multiple tables (e.g., referential integrity)
734+
- Run complex validation queries that don't belong to a single model
735+
- Create validation logic that participates in the model DAG with proper dependencies
736+
- Avoid creating unnecessary materialized tables just for validation
737+
738+
Unlike traditional audits that are scoped to a single model, `AUDIT_ONLY` models can depend on multiple models and validate relationships between them.
739+
740+
#### Creating AUDIT_ONLY Models
741+
742+
AUDIT_ONLY models are defined like regular models but with `kind AUDIT_ONLY`:
743+
744+
```sql
745+
MODEL (
746+
name data_quality.order_validation,
747+
kind AUDIT_ONLY (
748+
blocking TRUE, -- Fail pipeline if validation fails (default)
749+
max_failing_rows 20 -- Number of sample rows to show in error (default: 10)
750+
),
751+
depends_on [orders, customers],
752+
cron '@daily',
753+
owner 'data_quality_team'
754+
);
755+
756+
-- Query should return 0 rows for success
757+
-- Any returned rows indicate validation failures
758+
SELECT
759+
o.order_id,
760+
o.customer_id,
761+
'Missing customer record' as issue_type
762+
FROM orders o
763+
LEFT JOIN customers c ON o.customer_id = c.customer_id
764+
WHERE c.customer_id IS NULL;
765+
```
766+
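The "zero rows means success" rule can be tried outside SQLMesh. The following is a minimal sketch, assuming two hypothetical in-memory SQLite tables that stand in for the `orders` and `customers` models. It is not how SQLMesh executes the model. It only shows that the anti-join returns exactly the orphaned orders.

```python
import sqlite3

# In-memory tables standing in for the `orders` and `customers` models (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 3);  -- customer 3 does not exist
    """
)

# Same anti-join as the validation query in the example above.
VALIDATION_QUERY = """
    SELECT o.order_id, o.customer_id, 'Missing customer record' AS issue_type
    FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
    WHERE c.customer_id IS NULL
"""

failing_rows = conn.execute(VALIDATION_QUERY).fetchall()
passed = len(failing_rows) == 0  # zero rows would mean the validation passes
print(passed, failing_rows)
```

Here the query returns the single order that references customer 3, so the check would count as a failure.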
767+
#### Key Differences from Regular Audits
768+
769+
| Feature | Traditional Audits | AUDIT_ONLY Models |
770+
|---------|-------------------|-------------------|
771+
| **Scope** | Single model | Multiple models |
772+
| **Dependencies** | Implicit (via `@this_model`) | Explicit (via `depends_on`) |
773+
| **Materialization** | N/A | Never materializes |
774+
| **Location** | `audits/` directory or inline | `models/` directory |
775+
| **Scheduling** | With parent model | Independent cron schedule |
776+
| **DAG Participation** | Attached to model | Full model in DAG |
777+
778+
#### Configuration Options
779+
780+
AUDIT_ONLY models support these configuration options:
781+
782+
- **`blocking`** (default: `TRUE`): Whether validation failures should stop the pipeline
783+
- **`max_failing_rows`** (default: `10`): Maximum number of failing rows to show in error messages
784+
785+
Example with non-blocking configuration:
786+
787+
```sql
788+
MODEL (
789+
name data_quality.revenue_anomalies,
790+
kind AUDIT_ONLY (
791+
blocking FALSE, -- Log warnings but don't stop pipeline
792+
max_failing_rows 50 -- Show up to 50 failing rows
793+
),
794+
depends_on [revenue_by_day]
795+
);
796+
797+
-- Detect revenue anomalies
798+
WITH stats AS (
799+
SELECT AVG(revenue) as avg_rev, STDDEV(revenue) as stddev_rev
800+
FROM revenue_by_day
801+
)
802+
SELECT
803+
day,
804+
revenue,
805+
CASE WHEN revenue < 0 THEN 'Negative revenue' ELSE 'Anomaly: >3 standard deviations' END as issue
806+
FROM revenue_by_day
807+
CROSS JOIN stats
808+
WHERE revenue > avg_rev + (3 * stddev_rev)
809+
OR revenue < 0;
810+
```
811+
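The threshold arithmetic in the query can be checked by hand. This sketch uses made-up revenue values and assumes `STDDEV` means the sample standard deviation, which is engine-dependent.

```python
import statistics

# Hypothetical daily revenue values; the last one is an injected outlier.
revenue = [100.0] * 20 + [1000.0]

avg_rev = statistics.mean(revenue)
stddev_rev = statistics.stdev(revenue)  # sample standard deviation (assumed to match STDDEV)
upper_bound = avg_rev + 3 * stddev_rev

# Mirrors the WHERE clause: above the 3-sigma bound, or negative.
flagged = [r for r in revenue if r > upper_bound or r < 0]
print(round(upper_bound, 2), flagged)
```

The outlier is part of the data the statistics are computed from, so it inflates both the mean and the standard deviation. With ten or fewer rows, no value can exceed the mean by more than three sample standard deviations, so this style of check needs a reasonably long history to fire at all.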
812+
#### How AUDIT_ONLY Models Work
813+
814+
1. **No Table Creation**: The model's query executes but doesn't create or update any tables
815+
2. **Validation Logic**: The validation fails if the query returns any rows (similar to audits); with `blocking FALSE` the failure is logged as a warning instead of stopping the pipeline
816+
3. **Error Reporting**: Shows a sample of failing rows in the error message
817+
4. **Pipeline Integration**: Participates in plan/apply workflow with proper dependency ordering
818+
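The four steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the `AuditOnlyStrategy` added by this commit. `run_audit_only` and `AuditOnlyFailure` are hypothetical names, and SQLite stands in for the warehouse.

```python
import logging
import sqlite3

logger = logging.getLogger("audit_only_sketch")


class AuditOnlyFailure(Exception):
    """Raised by this sketch when a blocking validation returns rows."""


def run_audit_only(conn, query, blocking=True, max_failing_rows=10):
    """Run a validation query without creating or updating any table.

    Returns True when the query yields no rows. Otherwise raises when
    blocking, or logs a warning and returns False when non-blocking.
    """
    cursor = conn.execute(query)
    sample = cursor.fetchmany(max_failing_rows)  # only a sample is reported
    if not sample:
        return True
    message = f"validation returned rows; sample of {len(sample)}: {sample}"
    if blocking:
        raise AuditOnlyFailure(message)
    logger.warning(message)
    return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue_by_day (day TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO revenue_by_day VALUES (?, ?)",
    [("2024-01-01", 120.0), ("2024-01-02", -5.0)],
)

query = "SELECT day, revenue FROM revenue_by_day WHERE revenue < 0"
non_blocking_result = run_audit_only(conn, query, blocking=False)
```

Calling the same function with `blocking=True` raises `AuditOnlyFailure` for this data, which mirrors the pipeline stopping.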
819+
#### Best Practices
820+
821+
1. **Use descriptive names**: Name your AUDIT_ONLY models clearly (e.g., `audit_order_integrity`, `validate_user_consistency`)
822+
2. **Set appropriate blocking**: Use `blocking TRUE` for critical validations, `FALSE` for warnings
823+
3. **Include context in output**: Return columns that help identify and debug issues
824+
4. **Group related validations**: Consider combining related checks in a single AUDIT_ONLY model
825+
5. **Document validation logic**: Use model descriptions to explain what's being validated and why

docs/concepts/models/model_kinds.md

Lines changed: 109 additions & 0 deletions
@@ -860,6 +860,115 @@ SELECT DISTINCT
860860
FROM db.employees;
861861
```
862862

863+
## AUDIT_ONLY
864+
865+
The `AUDIT_ONLY` model kind is designed for data validation across multiple tables without materializing any results. These models execute validation queries and fail if any rows are returned, similar to [audits](../audits.md#audit_only-models) but with the ability to participate as full models in the DAG.
866+
867+
### Purpose
868+
869+
`AUDIT_ONLY` models are ideal for:
870+
- Validating referential integrity between multiple tables
871+
- Detecting data quality issues across different models
872+
- Running complex validation queries that don't belong to a single model
873+
- Avoiding unnecessary table materialization for validation purposes
874+
875+
### Configuration
876+
877+
The `AUDIT_ONLY` kind supports two configuration parameters:
878+
879+
- **`blocking`** (default: `TRUE`): Determines whether validation failures stop the pipeline
880+
- **`max_failing_rows`** (default: `10`): Maximum number of failing rows to display in error messages
881+
882+
### Example: Referential Integrity Check
883+
884+
This example validates that all orders reference existing customers:
885+
886+
```sql linenums="1"
887+
MODEL (
888+
name data_quality.order_integrity,
889+
kind AUDIT_ONLY (
890+
blocking TRUE,
891+
max_failing_rows 20
892+
),
893+
depends_on [orders, customers],
894+
cron '@daily',
895+
owner 'data_quality_team'
896+
);
897+
898+
-- Query should return 0 rows for validation to pass
899+
SELECT
900+
o.order_id,
901+
o.customer_id,
902+
o.order_date,
903+
'Missing customer record' as issue_type
904+
FROM orders o
905+
LEFT JOIN customers c ON o.customer_id = c.customer_id
906+
WHERE c.customer_id IS NULL;
907+
```
908+
909+
### Example: Non-Blocking Anomaly Detection
910+
911+
This example detects revenue anomalies but doesn't stop the pipeline:
912+
913+
```sql linenums="1"
914+
MODEL (
915+
name data_quality.revenue_anomalies,
916+
kind AUDIT_ONLY (
917+
blocking FALSE, -- Log warnings but continue
918+
max_failing_rows 100
919+
),
920+
depends_on [daily_revenue],
921+
cron '@hourly'
922+
);
923+
924+
WITH stats AS (
925+
SELECT
926+
AVG(revenue) as avg_revenue,
927+
STDDEV(revenue) as stddev_revenue
928+
FROM daily_revenue
929+
WHERE revenue > 0
930+
)
931+
SELECT
932+
date,
933+
revenue,
934+
CASE
935+
WHEN revenue < 0 THEN 'Negative revenue'
936+
WHEN revenue > avg_revenue + (5 * stddev_revenue) THEN 'Extreme outlier'
937+
END as anomaly_type
938+
FROM daily_revenue
939+
CROSS JOIN stats
940+
WHERE revenue < 0
941+
OR revenue > avg_revenue + (5 * stddev_revenue);
942+
```
943+
944+
### Behavior
945+
946+
1. **No Materialization**: AUDIT_ONLY models never create or update tables
947+
2. **Validation Logic**: The model succeeds if the query returns 0 rows, fails otherwise
948+
3. **Error Reporting**: When validation fails, shows a sample of failing rows (up to `max_failing_rows`)
949+
4. **DAG Integration**: Fully participates in the model DAG with proper dependency tracking
950+
5. **Scheduling**: Can be scheduled independently using cron expressions
951+
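Dependency ordering (item 4) can be pictured with the standard library's topological sorter. The dependency map below is a hypothetical stand-in for the DAG SQLMesh builds from `depends_on`. It only shows that a validation model is ordered after every model it reads.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key lists the models it depends on.
dependencies = {
    "orders": set(),
    "customers": set(),
    "data_quality.order_integrity": {"orders", "customers"},
}

# static_order() yields every node after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

The relative order of `orders` and `customers` is not significant here. What matters is that the validation model comes last.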
952+
### Best Practices
953+
954+
- **Naming Convention**: Use descriptive names like `audit_*` or `validate_*` to clearly indicate the model's purpose
955+
- **Include Context**: Add columns that describe what validation failed for easier debugging
956+
- **Optimize Performance**: These queries run during every plan/apply, so ensure they're efficient
957+
- **Set Appropriate Blocking**: Use `blocking TRUE` for critical validations, `FALSE` for monitoring
958+
- **Document Purpose**: Use the `description` field to explain what the validation checks
959+
960+
### Comparison with Traditional Audits
961+
962+
While both AUDIT_ONLY models and traditional audits validate data, they serve different purposes:
963+
964+
| Aspect | Traditional Audits | AUDIT_ONLY Models |
965+
|--------|-------------------|-------------------|
966+
| **Scope** | Single model | Multiple models |
967+
| **Location** | `audits/` directory or inline | `models/` directory |
968+
| **Dependencies** | Implicit via parent model | Explicit via `depends_on` |
969+
| **Scheduling** | With parent model | Independent cron |
970+
| **Use Case** | Validate model output | Validate cross-model relationships |
971+
863972
## SEED
864973
The `SEED` model kind is used to specify [seed models](./seed_models.md) for using static CSV datasets in your SQLMesh project.
865974

docs/reference/model_configuration.md

Lines changed: 11 additions & 0 deletions
@@ -306,6 +306,17 @@ Configuration options for [`SCD_TYPE_2_BY_COLUMN` models](../concepts/models/mod
306306

307307
Python model kind `name` enum value: [ModelKindName.SCD_TYPE_2_BY_COLUMN](https://sqlmesh.readthedocs.io/en/stable/_readthedocs/html/sqlmesh/core/model/kind.html#ModelKindName)
308308

309+
### `AUDIT_ONLY` models
310+
311+
Configuration options for [`AUDIT_ONLY` models](../concepts/models/model_kinds.md#audit_only) (in addition to [general model properties](#general-model-properties)).
312+
313+
| Option | Description | Type | Required |
314+
| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- | :--: | :------: |
315+
| `blocking` | If set to true, the pipeline will fail when the validation query returns any rows. If false, only warnings are logged. (Default: `True`) | bool | N |
316+
| `max_failing_rows` | Maximum number of failing rows to display in error messages when a validation fails. (Default: `10`) | int | N |
317+
318+
Python model kind `name` enum value: [ModelKindName.AUDIT_ONLY](https://sqlmesh.readthedocs.io/en/stable/_readthedocs/html/sqlmesh/core/model/kind.html#ModelKindName)
319+
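As a rough picture of the two options and their documented defaults, here is a sketch using a plain dataclass. `AuditOnlyOptions` is a hypothetical stand-in, not the `AuditOnlyKind` class this commit adds, and the type checks are an assumption about what reasonable validation might look like.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditOnlyOptions:
    """Hypothetical container for the two documented options."""

    blocking: bool = True        # documented default
    max_failing_rows: int = 10   # documented default

    def __post_init__(self):
        if not isinstance(self.blocking, bool):
            raise TypeError("blocking must be a bool")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(self.max_failing_rows, bool) or not isinstance(self.max_failing_rows, int):
            raise TypeError("max_failing_rows must be an int")


defaults = AuditOnlyOptions()
monitoring = AuditOnlyOptions(blocking=False, max_failing_rows=50)
print(defaults, monitoring)
```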
309320
### `SEED` models
310321

311322
Configuration options for [`SEED` models](../concepts/models/model_kinds.md#seed). `SEED` models do not support all the general properties supported by other models; they only support the properties listed in this table.
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
1+
MODEL (
2+
name sushi.audit_duplicate_orders,
3+
kind AUDIT_ONLY (
4+
blocking FALSE,
5+
max_failing_rows 100
6+
),
7+
depends_on [sushi.orders],
8+
cron '@hourly',
9+
owner 'data_engineering',
10+
tags ['validation', 'duplicates', 'data_quality'],
11+
description 'Detects potential duplicate orders based on customer, waiter, and timing'
12+
);
13+
14+
-- Find potential duplicate orders
15+
-- Orders from the same customer to the same waiter within 5 minutes might be duplicates
16+
WITH potential_duplicates AS (
17+
SELECT
18+
o1.id as order_id_1,
19+
o2.id as order_id_2,
20+
o1.customer_id,
21+
o1.waiter_id,
22+
o1.start_ts as order_1_time,
23+
o2.start_ts as order_2_time,
24+
ABS(o1.start_ts - o2.start_ts) as seconds_apart
25+
FROM sushi.orders o1
26+
INNER JOIN sushi.orders o2
27+
ON o1.customer_id = o2.customer_id
28+
AND o1.waiter_id = o2.waiter_id
29+
AND o1.id < o2.id -- Avoid comparing order with itself and duplicating pairs
30+
AND o1.event_date = o2.event_date -- Same day
31+
WHERE ABS(o1.start_ts - o2.start_ts) <= 300 -- Within 5 minutes (300 seconds)
32+
)
33+
SELECT
34+
order_id_1,
35+
order_id_2,
36+
customer_id,
37+
waiter_id,
38+
seconds_apart,
39+
CONCAT('Orders ', order_id_1::TEXT, ' and ', order_id_2::TEXT,
40+
' from customer ', customer_id::TEXT,
41+
' are only ', seconds_apart::TEXT, ' seconds apart') as issue_description
42+
FROM potential_duplicates
43+
ORDER BY seconds_apart, order_id_1
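The self-join in this model can be exercised on a few rows to see which pairs it reports. The sketch below uses SQLite and made-up data, and assumes `start_ts` holds integer epoch seconds, as the 300-second comment suggests.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, customer_id INTEGER, waiter_id INTEGER, "
    "start_ts INTEGER, event_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, 7, 3, 1000, "2024-01-01"),
        (2, 7, 3, 1120, "2024-01-01"),  # 120 s after order 1: flagged
        (3, 7, 3, 5000, "2024-01-01"),  # far apart: not flagged
        (4, 8, 3, 1010, "2024-01-01"),  # different customer: not flagged
    ],
)

# Same join conditions as the model; o1.id < o2.id keeps each pair once.
pairs = conn.execute(
    """
    SELECT o1.id, o2.id, ABS(o1.start_ts - o2.start_ts) AS seconds_apart
    FROM orders o1
    INNER JOIN orders o2
      ON o1.customer_id = o2.customer_id
     AND o1.waiter_id = o2.waiter_id
     AND o1.id < o2.id
     AND o1.event_date = o2.event_date
    WHERE ABS(o1.start_ts - o2.start_ts) <= 300
    ORDER BY seconds_apart, o1.id
    """
).fetchall()
print(pairs)
```

Only orders 1 and 2 are reported, and the pair appears once rather than in both directions.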
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
1+
MODEL (
2+
name sushi.audit_order_integrity,
3+
kind AUDIT_ONLY (
4+
blocking FALSE, -- Set to non-blocking for example/demo purposes
5+
max_failing_rows 20
6+
),
7+
depends_on [sushi.orders, sushi.customers],
8+
cron '@daily',
9+
owner 'data_quality_team',
10+
tags ['validation', 'referential_integrity', 'critical'],
11+
description 'Validates referential integrity between orders and customers tables'
12+
);
13+
14+
-- Check for orders with non-existent customer IDs
15+
-- This should return no rows if all orders have valid customers
16+
SELECT
17+
o.id as order_id,
18+
o.customer_id,
19+
o.event_date,
20+
'Missing customer record' as issue_type,
21+
CONCAT('Order ', o.id::TEXT, ' references non-existent customer ', o.customer_id::TEXT) as issue_description
22+
FROM sushi.orders o
23+
LEFT JOIN sushi.customers c
24+
ON o.customer_id = c.customer_id
25+
WHERE c.customer_id IS NULL
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
1+
MODEL (
2+
name sushi.audit_waiter_revenue_anomalies,
3+
kind AUDIT_ONLY (
4+
blocking FALSE,
5+
max_failing_rows 50
6+
),
7+
depends_on [sushi.waiter_revenue_by_day],
8+
cron '@daily',
9+
owner 'analytics_team',
10+
tags ['validation', 'revenue', 'daily'],
11+
description 'Detects anomalies in daily waiter revenue that may indicate data quality issues'
12+
);
13+
14+
-- Detect anomalies in waiter daily revenue
15+
-- Only flag extreme outliers (>5 std dev) or negative revenue
16+
WITH revenue_stats AS (
17+
SELECT
18+
AVG(revenue) as avg_revenue,
19+
STDDEV(revenue) as stddev_revenue
20+
FROM sushi.waiter_revenue_by_day
21+
WHERE revenue > 0 -- Exclude zeros from stats calculation
22+
),
23+
anomalies AS (
24+
SELECT
25+
w.waiter_id,
26+
w.event_date,
27+
w.revenue,
28+
r.avg_revenue,
29+
r.stddev_revenue,
30+
CASE
31+
WHEN w.revenue < 0 THEN 'Negative revenue'
32+
WHEN w.revenue > r.avg_revenue + (5 * r.stddev_revenue) THEN 'Extremely high revenue (>5 std dev)'
33+
END as anomaly_type
34+
FROM sushi.waiter_revenue_by_day w
35+
CROSS JOIN revenue_stats r
36+
WHERE
37+
w.revenue < 0
38+
OR w.revenue > r.avg_revenue + (5 * r.stddev_revenue) -- Only flag extreme outliers
39+
)
40+
SELECT
41+
waiter_id,
42+
event_date,
43+
revenue,
44+
anomaly_type,
45+
CONCAT('Waiter ', waiter_id::TEXT, ' has ', anomaly_type, ' on ', event_date::TEXT) as issue_description
46+
FROM anomalies
47+
ORDER BY event_date DESC, waiter_id

sqlmesh/core/dialect.py

Lines changed: 1 addition & 0 deletions
@@ -621,6 +621,7 @@ def parse(self: Parser) -> t.Optional[exp.Expression]:
621621
ModelKindName.SCD_TYPE_2_BY_TIME,
622622
ModelKindName.SCD_TYPE_2_BY_COLUMN,
623623
ModelKindName.CUSTOM,
624+
ModelKindName.AUDIT_ONLY,
624625
) and self._match(TokenType.L_PAREN, advance=False):
625626
props = self._parse_wrapped_csv(functools.partial(_parse_props, self))
626627
else:
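The hunk above adds `AUDIT_ONLY` to the kinds that may be followed by a parenthesized property list. The sketch below illustrates that idea with a small regex-based function. It is not the sqlglot-based parser in `dialect.py`: `parse_kind` and `KINDS_WITH_PROPS` are hypothetical names, and the set contains only the kinds visible in the hunk.

```python
import re

# Kinds visible in the hunk that accept parenthesized properties (illustrative subset).
KINDS_WITH_PROPS = {"SCD_TYPE_2_BY_TIME", "SCD_TYPE_2_BY_COLUMN", "CUSTOM", "AUDIT_ONLY"}


def parse_kind(text):
    """Parse 'KIND' or 'KIND (key value, ...)' into (name, props)."""
    match = re.fullmatch(r"\s*(\w+)\s*(?:\((.*)\))?\s*", text, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"cannot parse kind: {text!r}")
    name, body = match.group(1).upper(), match.group(2)
    if body is None:
        return name, {}
    if name not in KINDS_WITH_PROPS:
        # Behavior chosen for this sketch only; the real parser has its own fallback branch.
        raise ValueError(f"{name} does not accept properties in this sketch")
    props = {}
    for item in body.split(","):
        if not item.strip():
            continue
        key, value = item.split(None, 1)  # split on the first run of whitespace
        props[key.lower()] = value.strip()  # values are kept as raw strings
    return name, props


parsed = parse_kind("AUDIT_ONLY (blocking FALSE, max_failing_rows 50)")
print(parsed)
```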
