perf: Job acquisition performance — upgrade get_job, fix queue_name default, add partial indexes

## Problem

The `@pgpm/jobs` and `@pgpm/database-jobs` packages (v0.15.5) have three compounding performance issues in the job acquisition path that, together, produce the worst-case scenario documented in upstream [Graphile Worker issue #240](https://github.com/graphile/worker/issues/240). Under queue contention, **job acquisition can be 20x–50x slower than it needs to be**.

### 1. `queue_name` defaults to `gen_random_uuid()::text` instead of `NULL`

Both `jobs` and `scheduled_jobs` tables define:

```sql
queue_name text DEFAULT (public.gen_random_uuid())::text,
```

Upstream Graphile Worker made `queue_name` **nullable with no default** in v0.4.0 (migration 000003), describing it as a **"significant performance enhancement"** because it *"changes database schema such that a job_queue record is only added/checked when necessary."*

With the random UUID default, every job creates a unique queue — meaning:
- The `increase_job_queue_count` trigger fires on every insert, creating a `job_queues` row per job
- `get_job` must evaluate the expensive `EXISTS ... FOR UPDATE SKIP LOCKED` subquery on `job_queues` for every candidate row
- The `job_queues` table grows unboundedly with unique entries
- Queue bookkeeping runs at maximum cost for zero benefit

**Note**: The queue-count triggers already have `WHEN (NEW.queue_name IS NOT NULL)` guards — so the fix is just changing the default.

### 2. `get_job` uses the old row-by-row queue checking pattern (Strategy 1)

The current `get_job` checks queue eligibility per candidate job row:

```sql
AND (jobs.queue_name IS NULL
  OR EXISTS (
    SELECT 1 FROM app_jobs.job_queues
    WHERE job_queues.queue_name = jobs.queue_name
      AND (job_queues.locked_at IS NULL
        OR job_queues.locked_at < (v_now - job_expiry))
    FOR UPDATE SKIP LOCKED
  ))
```

This is **Strategy 1** (per-row EXISTS) in the upstream nomenclature. Upstream issue #240 documented this exact query going from **242ms to 13ms** (18x) when replaced with Strategy 2 (pre-computed eligible queue set). The upstream benchmarks show:

| Strategy | Description | Jobs/sec |
|---|---|---|
| 0 | Skip named queues entirely | 11,800 |
| **1** | **Per-row EXISTS (current pgpm approach)** | **~40** |
| **2** | **Pre-lock queues upfront (recommended)** | **843** |
| 3 | Exclude locked queues (unsafe) | 600 |

### 3. Legacy ordering index

The current index:
```sql
CREATE INDEX priority_run_at_id_idx ON app_jobs.jobs (priority, run_at, id);
```

Upstream v0.14 replaced this with partial covering indexes filtered on an `is_available` generated column:
```sql
CREATE INDEX jobs_main_index ON jobs USING btree (priority, run_at)
  INCLUDE (id, task_id, job_queue_id) WHERE (is_available = true);
```

The partial index excludes locked/exhausted jobs from the index entirely, making it smaller and faster to scan.

---

## Solution

### Phase 1: Quick Wins (High Impact, Low Risk)

**1a. Fix `queue_name` default to NULL**

In both `@pgpm/jobs` and `@pgpm/database-jobs`, change the `jobs` and `scheduled_jobs` table definitions:

```sql
-- Before:
queue_name text DEFAULT (public.gen_random_uuid())::text,

-- After:
queue_name text DEFAULT NULL,
```

Audit any downstream code that assumes `queue_name` is always non-null. The existing `get_job`, `complete_job`, `fail_job`, and triggers already handle NULL correctly.

**1b. Rewrite `get_job` with pre-computed queue set (Strategy 2)**

Replace the row-by-row `EXISTS` with a single pre-computed subquery:

```sql
-- Key change in the WHERE clause:
AND (jobs.queue_name IS NULL
  OR jobs.queue_name IN (
    SELECT jq.queue_name
    FROM app_jobs.job_queues jq
    WHERE (jq.locked_at IS NULL
        OR jq.locked_at < (v_now - job_expiry))
    FOR UPDATE SKIP LOCKED
  ))
```

The subquery is evaluated once and PostgreSQL uses a hashed semi-join, eliminating the per-row queue-lock check.

### Phase 2: Index Improvements

**2a. Add `is_available` generated column**

```sql
ALTER TABLE app_jobs.jobs ADD COLUMN is_available boolean
  GENERATED ALWAYS AS ((locked_at IS NULL) AND (attempts < max_attempts)) STORED;
```

**2b. Replace legacy index with partial covering indexes**

```sql
CREATE INDEX CONCURRENTLY jobs_main_index
  ON app_jobs.jobs USING btree (priority, run_at)
  INCLUDE (id, queue_name) WHERE (is_available = true);

CREATE INDEX CONCURRENTLY jobs_no_queue_index
  ON app_jobs.jobs USING btree (priority, run_at)
  INCLUDE (id) WHERE (is_available = true AND queue_name IS NULL);

-- After benchmarking:
DROP INDEX priority_run_at_id_idx;
```

### Phase 3: Modernization Features

- Add `revision integer DEFAULT 0` and `flags jsonb` columns
- Add `forbidden_flags` parameter to `get_job`
- Change notify trigger from `FOR EACH ROW` to `FOR EACH STATEMENT`
- Add `remove_job(job_key text)` function
- Add `force_unlock_workers(worker_ids text[])` function

### Phase 4: Optional — Full v0.14 Table Rebuild

- `tasks` lookup table with integer FK (replaces text `task_identifier` on hot path)
- `job_queue_id` integer FK (replaces text `queue_name` join)
- `smallint` for priority/attempts/max_attempts

Phase 4 is optional — the performance gains are marginal compared to Phases 1–2, and the migration risk is higher (especially preserving the `database_id` multi-tenant design).

---

## Expected Impact

| Change | Expected Improvement |
|---|---|
| Fix `queue_name` default | ~2–5x for `add_job` (eliminates queue bookkeeping) |
| Rewrite `get_job` (Strategy 2) | **~20x** acquisition speedup under contention |
| Partial indexes with `is_available` | ~2–3x for `get_job` on large tables |
| Combined Phase 1 + 2 | **Estimated 10–50x** throughput improvement |

## References

- [Graphile Worker issue #240: Improve performance of queue locking](https://github.com/graphile/worker/issues/240) — the 243ms → 13ms benchmark
- [Graphile Worker migration 000003](https://github.com/graphile/worker/blob/main/sql/000003.sql) — v0.4.0 queue_name nullable change
- [Graphile Worker migration 000011](https://github.com/graphile/worker/blob/main/sql/000011.sql) — v0.14 table rebuild with partial indexes
- [Graphile Worker getJobs.ts](https://github.com/graphile/worker/blob/main/src/sql/getJobs.ts) — Strategy 2 implementation and benchmarks

Change	Expected Improvement
Fix `queue_name` default	~2–5x for `add_job` (eliminates queue bookkeeping)
Rewrite `get_job` (Strategy 2)	~20x acquisition speedup under contention
Partial indexes with `is_available`	~2–3x for `get_job` on large tables
Combined Phase 1 + 2	Estimated 10–50x throughput improvement

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: Job acquisition performance — upgrade get_job, fix queue_name default, add partial indexes #60

Problem

1. `queue_name` defaults to `gen_random_uuid()::text` instead of `NULL`

2. `get_job` uses the old row-by-row queue checking pattern (Strategy 1)

3. Legacy ordering index

Solution

Phase 1: Quick Wins (High Impact, Low Risk)

Phase 2: Index Improvements

Phase 3: Modernization Features

Phase 4: Optional — Full v0.14 Table Rebuild

Expected Impact

References

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Strategy	Description	Jobs/sec
0	Skip named queues entirely	11,800
1	Per-row EXISTS (current pgpm approach)	~40
2	Pre-lock queues upfront (recommended)	843
3	Exclude locked queues (unsafe)	600

perf: Job acquisition performance — upgrade get_job, fix queue_name default, add partial indexes #60

Description

Problem

1. queue_name defaults to gen_random_uuid()::text instead of NULL

2. get_job uses the old row-by-row queue checking pattern (Strategy 1)

3. Legacy ordering index

Solution

Phase 1: Quick Wins (High Impact, Low Risk)

Phase 2: Index Improvements

Phase 3: Modernization Features

Phase 4: Optional — Full v0.14 Table Rebuild

Expected Impact

References

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

1. `queue_name` defaults to `gen_random_uuid()::text` instead of `NULL`

2. `get_job` uses the old row-by-row queue checking pattern (Strategy 1)