Skip to content

perf: Job acquisition performance — upgrade get_job, fix queue_name default, add partial indexes #60

@devin-ai-integration

Description

@devin-ai-integration

Problem

The @pgpm/jobs and @pgpm/database-jobs packages (v0.15.5) have three compounding performance issues in the job acquisition path that, together, produce the worst-case scenario documented in upstream Graphile Worker issue #240. Under queue contention, job acquisition can be 20x–50x slower than it needs to be.

1. queue_name defaults to gen_random_uuid()::text instead of NULL

Both jobs and scheduled_jobs tables define:

queue_name text DEFAULT (public.gen_random_uuid())::text,

Upstream Graphile Worker made queue_name nullable with no default in v0.4.0 (migration 000003), describing it as a "significant performance enhancement" because it "changes database schema such that a job_queue record is only added/checked when necessary."

With the random UUID default, every job creates a unique queue — meaning:

  • The increase_job_queue_count trigger fires on every insert, creating a job_queues row per job
  • get_job must evaluate the expensive EXISTS ... FOR UPDATE SKIP LOCKED subquery on job_queues for every candidate row
  • The job_queues table grows unboundedly with unique entries
  • Queue bookkeeping runs at maximum cost for zero benefit

Note: The queue-count triggers already have WHEN (NEW.queue_name IS NOT NULL) guards — so the fix is just changing the default.

2. get_job uses the old row-by-row queue checking pattern (Strategy 1)

The current get_job checks queue eligibility per candidate job row:

AND (jobs.queue_name IS NULL
  OR EXISTS (
    SELECT 1 FROM app_jobs.job_queues
    WHERE job_queues.queue_name = jobs.queue_name
      AND (job_queues.locked_at IS NULL
        OR job_queues.locked_at < (v_now - job_expiry))
    FOR UPDATE SKIP LOCKED
  ))

This is Strategy 1 (per-row EXISTS) in the upstream nomenclature. Upstream issue #240 documented this exact query going from 242ms to 13ms (18x) when replaced with Strategy 2 (pre-computed eligible queue set). The upstream benchmarks show:

Strategy Description Jobs/sec
0 Skip named queues entirely 11,800
1 Per-row EXISTS (current pgpm approach) ~40
2 Pre-lock queues upfront (recommended) 843
3 Exclude locked queues (unsafe) 600

3. Legacy ordering index

The current index:

CREATE INDEX priority_run_at_id_idx ON app_jobs.jobs (priority, run_at, id);

Upstream v0.14 replaced this with partial covering indexes filtered on an is_available generated column:

CREATE INDEX jobs_main_index ON jobs USING btree (priority, run_at)
  INCLUDE (id, task_id, job_queue_id) WHERE (is_available = true);

The partial index excludes locked/exhausted jobs from the index entirely, making it smaller and faster to scan.


Solution

Phase 1: Quick Wins (High Impact, Low Risk)

1a. Fix queue_name default to NULL

In both @pgpm/jobs and @pgpm/database-jobs, change the jobs and scheduled_jobs table definitions:

-- Before:
queue_name text DEFAULT (public.gen_random_uuid())::text,

-- After:
queue_name text DEFAULT NULL,

Audit any downstream code that assumes queue_name is always non-null. The existing get_job, complete_job, fail_job, and triggers already handle NULL correctly.

1b. Rewrite get_job with pre-computed queue set (Strategy 2)

Replace the row-by-row EXISTS with a single pre-computed subquery:

-- Key change in the WHERE clause:
AND (jobs.queue_name IS NULL
  OR jobs.queue_name IN (
    SELECT jq.queue_name
    FROM app_jobs.job_queues jq
    WHERE (jq.locked_at IS NULL
        OR jq.locked_at < (v_now - job_expiry))
    FOR UPDATE SKIP LOCKED
  ))

The subquery is evaluated once and PostgreSQL uses a hashed semi-join, eliminating the per-row queue-lock check.

Phase 2: Index Improvements

2a. Add is_available generated column

ALTER TABLE app_jobs.jobs ADD COLUMN is_available boolean
  GENERATED ALWAYS AS ((locked_at IS NULL) AND (attempts < max_attempts)) STORED;

2b. Replace legacy index with partial covering indexes

CREATE INDEX CONCURRENTLY jobs_main_index
  ON app_jobs.jobs USING btree (priority, run_at)
  INCLUDE (id, queue_name) WHERE (is_available = true);

CREATE INDEX CONCURRENTLY jobs_no_queue_index
  ON app_jobs.jobs USING btree (priority, run_at)
  INCLUDE (id) WHERE (is_available = true AND queue_name IS NULL);

-- After benchmarking:
DROP INDEX priority_run_at_id_idx;

Phase 3: Modernization Features

  • Add revision integer DEFAULT 0 and flags jsonb columns
  • Add forbidden_flags parameter to get_job
  • Change notify trigger from FOR EACH ROW to FOR EACH STATEMENT
  • Add remove_job(job_key text) function
  • Add force_unlock_workers(worker_ids text[]) function

Phase 4: Optional — Full v0.14 Table Rebuild

  • tasks lookup table with integer FK (replaces text task_identifier on hot path)
  • job_queue_id integer FK (replaces text queue_name join)
  • smallint for priority/attempts/max_attempts

Phase 4 is optional — the performance gains are marginal compared to Phases 1–2, and the migration risk is higher (especially preserving the database_id multi-tenant design).


Expected Impact

Change Expected Improvement
Fix queue_name default ~2–5x for add_job (eliminates queue bookkeeping)
Rewrite get_job (Strategy 2) ~20x acquisition speedup under contention
Partial indexes with is_available ~2–3x for get_job on large tables
Combined Phase 1 + 2 Estimated 10–50x throughput improvement

References

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions