
/api/client/register causes PostgreSQL lock contention (Unleash OSS 7.4.1) #11390

@davispalomino

Description


Describe the bug

We are observing sustained PostgreSQL lock contention (blocked connections / lock waits) caused by concurrent writes from the Unleash server to the `client_applications` table. During peak events, multiple sessions wait on `wait_event_type='Lock'` / `wait_event='transactionid'` for tens of seconds.

Using `pg_blocking_pids()`, the top blockers consistently show a batch `INSERT INTO client_applications (app_name, seen_at, updated_at)` (likely an upsert, given the primary key on `app_name`). The problem becomes pronounced when 20+ pods register in parallel, for example during rollouts/restarts or autoscaling events.

Per the Unleash docs, SDKs call `POST /api/client/register` on startup to register their existence (`appName`, strategies, SDK version, etc.).
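
For context, the registration payload looks roughly like this. Field names follow the Unleash client specification; the concrete values below are made up for illustration:

```typescript
// Rough shape of what an SDK POSTs to /api/client/register on startup.
// Field names per the Unleash client specification; values are illustrative.
const registration = {
  appName: "payments-service",            // shared by every pod of a deployment
  instanceId: "payments-7d9f-abc12",      // unique per pod/process
  sdkVersion: "unleash-client-node:6.9.4",
  strategies: ["default", "gradualRolloutUserId"],
  started: new Date().toISOString(),      // when this instance came up
  interval: 15_000,                       // metrics reporting interval (ms)
};
```

Note that only `appName` is the primary key of `client_applications`, so every pod of the same deployment upserts the same row, which is exactly the shape that produces `transactionid` wait chains under concurrency.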

Steps to reproduce the bug

  1. Deploy Unleash OSS 7.4.1 pointing to PostgreSQL.
  2. Run many client services/SDKs in Kubernetes with the same appName (or multiple appNames) and high replica counts (e.g., 10–20+ pods per app).
  3. Trigger a high-concurrency event (e.g., rollout restart, mass redeploy, or HPA scale-up) so that 20+ pods register in parallel and hit POST /api/client/register around the same time.
  4. Observe in PostgreSQL:
    • increased sessions with `wait_event_type='Lock'`
    • `wait_event='transactionid'`
    • blockers whose query is `insert into "client_applications"...`
  5. (Optional) Confirm in the Unleash UI under Project → Applications that some applications report high “instances” counts.

Expected behavior

Under high instance count (many pods/SDKs), client registration should not cause sustained DB lock contention or long lock waits (tens of seconds). Ideally, the system should debounce/batch these writes and/or minimize row-level conflicts when many instances register concurrently.
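
The debounce/batch idea above could be sketched as follows. This is a hypothetical server-side pattern, not existing Unleash code; `RegistrationBuffer` and the flush callback are made-up names:

```typescript
// Hypothetical sketch: coalesce concurrent /api/client/register writes so
// N registrations inside one flush window become a single batched upsert.
type Registration = { appName: string; seenAt: Date };

class RegistrationBuffer {
  private pending = new Map<string, Registration>();

  // Called on every /api/client/register; last write wins per appName,
  // so concurrent duplicates collapse to one row instead of one upsert each.
  add(reg: Registration): void {
    this.pending.set(reg.appName, reg);
  }

  // Called on a timer; hands all distinct rows to one batched upsert
  // and returns how many rows were flushed.
  flush(upsertBatch: (rows: Registration[]) => void): number {
    const rows = [...this.pending.values()];
    this.pending.clear();
    if (rows.length > 0) upsertBatch(rows);
    return rows.length;
  }
}
```

With this shape, 20 pods of the same `appName` registering in the same window collapse into a single row written by one transaction, so the `transactionid` wait chain on `client_applications_pkey` never forms.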

Logs, error output, etc.

### PostgreSQL: constraint (PK) on `client_applications`

```sql
SELECT conname, contype, pg_get_constraintdef(oid) AS def
FROM pg_constraint
WHERE conrelid = 'client_applications'::regclass;
```

Expected output:
- `client_applications_pkey` PRIMARY KEY (`app_name`)

### PostgreSQL: lock summary (snapshot)

```sql
SELECT
  now() AS ts,
  count(*) AS waiting_sessions,
  max(now() - query_start) AS max_wait,
  avg(extract(epoch FROM (now() - query_start)))::int AS avg_wait_seconds
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```


Example observed during incidents:
- `waiting_sessions`: ~15–17
- `max_wait`: ~25–47s
- `avg_wait_seconds`: ~8–12s

### PostgreSQL: top blockers (confirms blocking query)

```sql
SELECT
  now() AS ts,
  b.pid AS blocker_pid,
  b.client_addr,
  b.application_name,
  now() - b.xact_start AS blocker_xact_age,
  now() - b.query_start AS blocker_query_age,
  b.wait_event_type,
  b.wait_event,
  left(b.query, 200) AS blocker_query,
  count(*) AS blocked_sessions
FROM pg_stat_activity a
JOIN pg_stat_activity b
  ON b.pid = ANY (pg_blocking_pids(a.pid))
WHERE a.wait_event_type = 'Lock'
GROUP BY 1,2,3,4,5,6,7,8,9
ORDER BY blocked_sessions DESC, blocker_xact_age DESC;
```


Example (truncated):
- `insert into "client_applications" ("app_name","seen_at","updated_at") values (...),(...),...`
- `wait_event='transactionid'`
- one blocker can block ~10–12 sessions

### Note on `relation = n/a` in lock breakdowns
When analyzing locks via `pg_locks`, the `relation` column may appear as `NULL` (“n/a”): `transactionid` waits are taken on a transaction ID rather than on a table, so they do not map to a relation OID in `pg_locks.relation`.

Screenshots

No response

Additional context

  • Unleash version: 7.4.1 (Open Source)
  • Hosting: self-hosted on Kubernetes
  • Database: PostgreSQL (managed)
  • Traffic pattern: multiple client services/SDKs; 20+ pods registering in parallel during rollouts/restarts/autoscaling.
  • UI correlation: in Project → Applications, some apps show high “instances” counts (10–20+), correlating with increased registration concurrency.

Questions for maintainers

  1. Is this pattern (transactionid lock waits caused by concurrent client_applications inserts/upserts) a known issue at scale?
  2. Are there recommended settings/patterns for high-scale Kubernetes deployments to avoid sustained contention?
    • e.g., server-side debouncing/batching for registration writes, different upsert strategy, async persistence, etc.
  3. Any recommended configuration for:
    • DB pool sizing (DATABASE_POOL_MIN/MAX)
    • rate limiting for /api/client/register
    • reducing registration write pressure without losing critical visibility?
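
On the client side, one mitigation we would also like feedback on is startup jitter: delaying each pod's registration by a small random amount so a rollout does not produce a synchronized burst of inserts. A minimal sketch, where `registerWithJitter` is a hypothetical wrapper and not an `unleash-client` API:

```typescript
// Hypothetical wrapper: spread registration calls from simultaneously
// restarting pods across a jitter window instead of a single instant.
function registrationJitterMs(maxJitterMs: number = 5_000): number {
  // Uniform random delay in [0, maxJitterMs).
  return Math.floor(Math.random() * maxJitterMs);
}

async function registerWithJitter(
  register: () => Promise<void>, // e.g. whatever triggers the SDK's registration
  maxJitterMs: number = 5_000,
): Promise<void> {
  const delay = registrationJitterMs(maxJitterMs);
  await new Promise<void>((resolve) => setTimeout(resolve, delay));
  await register();
}
```

Even a 5-second window turns 20 simultaneous upserts into roughly 4 per second, which a single-row upsert path should absorb without long `transactionid` waits.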

What we can provide if needed

  • Full (untruncated) blocker_query text
  • Time series of waiting_sessions / max_wait
  • Unleash replica count and DB pool configuration
  • Ingress counts for POST /api/client/register during incident windows

Unleash version

7.4.1 (Open Source)

Subscription type

Open source

Hosting type

Self-hosted

SDK information (language and version)

Node.js unleash-client ^6.9.4
