🏗️ RevertIQ — System Architecture & Deployment Blueprint
- High-Level Components (service map)

```
[Client (API/CLI/Widget)]
        | HTTPS/JSON
        v
[API Gateway + Auth] ──> [Rate Limiter]
        |
        v
[RevertIQ API (App Tier)]
        ├──> [Job Orchestrator] ──> [Work Queue]
        │                     └──> [Scheduler/TTL]
        ├──> [Result Cache] (Redis/KeyDB)
        |
        v
[Storage Layer]
        ├─ Data Lake (Parquet/Arrow on object store)
        ├─ Metadata DB (Postgres)
        └─ Provenance/Artifacts (reports, hashes)
        |
        v
[Compute Cluster: Analysis Workers]
        ├─ Bars/Quotes Ingestor (Polygon adapters)
        ├─ Sessionizer (calendar, RTH)
        ├─ Signal Engine (z-score, EMA/VWAP)
        ├─ Walk-Forward Evaluator
        ├─ Bootstrap/FDR/Diagnostics
        └─ Report Generator (JSON/CSV/PNG)
        |
        v
[Notification Layer]
        ├─ Webhook Dispatcher (HMAC)
        └─ Email (ops only)
```
- Data Flow (end-to-end)
Request → Client calls POST /v1/analyze.
Auth & Limits → API Gateway validates token, enforces per-tenant quotas.
Job Create → App writes analysis_id, params, and status=queued to Postgres; enqueues job in Work Queue.
Ingest & Cache → Worker fetches Polygon bars/quotes, normalizes, writes Parquet shards (by ticker/interval/day), updates data_hash.
Analysis → Worker runs sessionization → signal calc → walk-forward → bootstrap → FDR → diagnostics.
Persist Results → JSON results to Object Store, summary rows to Postgres, hot subset to Redis.
Notify → Webhook (if provided) with compact summary; client can GET /v1/analysis/{id}.
Serve → Subsequent reads are fulfilled from Redis then Object Store; API always returns provenance.
- Storage & Formats
Object Store (S3/GCS/Azure Blob)
lake/market/{provider}/bars/{ticker}/{interval}/{date}.parquet
results/{analysis_id}/result.json
artifacts/{analysis_id}/{viz}.png
Parquet/Arrow for columnar speed, compression (ZSTD), schema evolution.
Postgres (metadata): tenants, API keys, jobs, status, summaries, webhook configs.
Redis/KeyDB (cache): hot results, job progress, rate tokens.
Determinism: every run stores params.json and data_hash (concat of parquet shards + params) → reproducible.
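The provenance hash can be sketched as follows; the exact shard ordering and digest RevertIQ uses may differ, but the idea is a single digest over the parquet shard bytes plus a canonical serialization of the params:

```python
import hashlib
import json

def data_hash(shard_bytes: list[bytes], params: dict) -> str:
    """SHA-256 over concatenated parquet shards plus canonical params.

    Shards must be fed in a fixed (e.g. sorted-by-path) order, and params
    serialized with sorted keys, so identical inputs always reproduce the
    same hash.
    """
    h = hashlib.sha256()
    for shard in shard_bytes:
        h.update(shard)
    h.update(json.dumps(params, sort_keys=True).encode())
    return h.hexdigest()
```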
- Compute Topology
Workers: stateless containers; horizontal autoscale on queue depth.
Queues: durable (e.g., SQS/PubSub/Rabbit); dead-letter for failures.
Concurrency model:
Level 1: fan-out by (day-of-week × intraday window).
Level 2: inside each, parallelize walk-forward folds.
Affinity: shard by ticker#interval to exploit data locality (node cache).
Vectorized math: SIMD-friendly numeric libs; block bootstrap.
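The two-level fan-out can be sketched as plain enumeration: level 1 yields one work unit per (day-of-week × window) pair, and level 2 lists the walk-forward folds inside each unit. Real workers would pull these units off the queue and parallelize the folds:

```python
from itertools import product

def fan_out(days: list[str], windows: list[str], n_folds: int) -> list[dict]:
    """Enumerate level-1 units (day x window), each carrying level-2 folds."""
    tasks = []
    for day, window in product(days, windows):              # level 1
        folds = [(day, window, k) for k in range(n_folds)]  # level 2
        tasks.append({"unit": (day, window), "folds": folds})
    return tasks
```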
- Caching Strategy
Result cache: 24–72h TTL for identical requests (same params + data_hash).
Data cache: parquet shards persisted; local disk LRU on workers.
Partial caching: intermediate aggregates (z-score, volatility) keyed by (ticker, interval, lookback).
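A minimal sketch of the cache keying and TTL behavior described above; the key formats and the in-process dict standing in for Redis are assumptions for illustration:

```python
import time

TTL_SECONDS = 48 * 3600  # mid-range of the 24-72h window above

def result_cache_key(params_hash: str, data_hash: str) -> str:
    # Identical requests share a key: same params against the same data.
    return f"result:{params_hash}:{data_hash}"

def aggregate_cache_key(ticker: str, interval: str, lookback: int) -> str:
    # Intermediate aggregates are reusable across analyses of the same series.
    return f"agg:{ticker}:{interval}:{lookback}"

_cache: dict[str, tuple[float, object]] = {}  # stand-in for Redis

def cache_get(key: str):
    entry = _cache.get(key)
    if entry is None or time.monotonic() - entry[0] > TTL_SECONDS:
        return None  # miss or expired
    return entry[1]

def cache_put(key: str, value) -> None:
    _cache[key] = (time.monotonic(), value)
```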
- Security & Compliance
AuthN: Bearer API keys; per-tenant scopes; rotation endpoints.
AuthZ: plan-based quotas/limits; RBAC for dashboard.
Transport: TLS 1.2+; HSTS; secure ciphers.
At-rest: SSE (object store), TDE (Postgres), encrypted Redis.
Webhooks: HMAC signature header; replay protection via Idempotency-Key.
Secrets: KMS/SM; no secrets in images.
PII: none by design.
Provenance: every payload includes revertiq_version, data_hash, polygon metadata.
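Webhook signing and verification can be sketched with HMAC-SHA256; the header name and exact signing contract are assumptions to be matched against the real dispatcher:

```python
import hashlib
import hmac

def sign_webhook(secret: bytes, body: bytes) -> str:
    """HMAC-SHA256 signature the dispatcher would attach to the payload."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_webhook(secret: bytes, body: bytes, signature: str) -> bool:
    # compare_digest is constant-time, avoiding timing side channels.
    return hmac.compare_digest(sign_webhook(secret, body), signature)
```

Replay protection layers on top: the receiver records seen Idempotency-Key values and rejects duplicates even when the signature is valid.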
- Observability
Metrics (Prometheus/OpenTelemetry):
API: RPS, p50/p95 latency, 4xx/5xx rates, quota rejections.
Workers: jobs/s, success/fail %, runtime per phase, cache hit ratio.
Math: bootstrap samples/sec, FDR compute time, window fan-out size.
Logs: structured JSON with analysis_id, tenant_id, trace_id.
Traces: end-to-end spans from API → worker → storage.
Dashboards: SLO burn rates, queue depth, autoscale events.
Alerts: on SLO breaches, high DLQ, cache miss spikes, provider errors.
- SLOs, SLAs, Quotas
SLOs
Availability: 99.9% (API), 99.5% (compute)
P95 latency: ≤ 3s (cached), ≤ 15s (standard run), ≤ 60s (heavy quotes mode)
Data freshness (Polygon ingest): ≤ 24h for backfill
Tenant Quotas (example tiers)
Starter: 200 analyses/day, 1 concurrent job, 2-year horizon cap
Pro: 2k/day, 5 concurrent, 5-year horizon
Enterprise: custom; priority queue, VPC peering
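Quota enforcement at the gateway can be sketched as a per-tenant token bucket; capacity and refill rate here are illustrative, with real values coming from the tenant's plan tier:

```python
import time

class TokenBucket:
    """Per-tenant rate limiter: tokens refill continuously up to capacity."""

    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # request rejected; surfaces as a quota 429
```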
- Failure Handling & Idempotency
Idempotent POST via Idempotency-Key (dedupe in Postgres).
Retries: exponential backoff; jitter; max attempts; DLQ on permanent failures.
Compensation: cleanup partial artifacts on failure; mark status=failed with reason.
Provider faults (Polygon): classify transient vs. permanent errors; negatively cache 5xx responses and slow down requests via token bucket.
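The retry schedule can be sketched as exponential backoff with full jitter; base, cap, and attempt count here are illustrative, and a job exhausting its attempts would land in the DLQ:

```python
import random

def backoff_delays(base: float = 0.5, cap: float = 30.0,
                   attempts: int = 5) -> list[float]:
    """Exponential backoff with full jitter: delay ~ U(0, min(cap, base*2^n))."""
    delays = []
    for attempt in range(attempts):
        delays.append(random.uniform(0, min(cap, base * 2 ** attempt)))
    return delays
```

Full jitter (rather than a fixed multiplier) decorrelates retries across workers, which keeps a transient provider fault from turning into a synchronized thundering herd.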
- Runbooks (Ops)
Degraded compute: scale workers; drain DLQ; bump queue visibility timeout.
Cache stampedes: enable request coalescing (single-flight) on analysis_id.
Schema evolution: versioned Parquet; write-new/read-old; blue/green on API.
Key leak: revoke, rotate; audit logs to confirm scope.
Bad release: canary 5%; auto-rollback on error rate/latency thresholds.
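The request-coalescing (single-flight) runbook item can be sketched as follows: the first caller for an analysis_id computes, and concurrent callers for the same key wait and share that result. A production variant would evict completed entries so later requests re-check the cache:

```python
import threading

class SingleFlight:
    """Coalesce concurrent computations for the same key (e.g. analysis_id)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight: dict[str, threading.Event] = {}
        self._results: dict[str, object] = {}

    def do(self, key: str, fn):
        with self._lock:
            if key in self._results:          # already computed
                return self._results[key]
            ev = self._inflight.get(key)
            leader = ev is None
            if leader:
                ev = self._inflight[key] = threading.Event()
        if leader:
            result = fn()                     # only the leader computes
            with self._lock:
                self._results[key] = result
                del self._inflight[key]
            ev.set()
            return result
        ev.wait()                             # followers wait for the leader
        return self._results[key]
```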
- Cost Model (FinOps)
Major drivers: market data egress, compute minutes (bootstrap + walk-forward), object storage I/O.
Levers:
Cache parquet once; dedupe same horizon/interval.
Limit window expansion (≤ 40 bins).
Adaptive bootstrap (fewer samples when effect size is large).
Precompute common intervals (1m, 5m) for top tickers.
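The adaptive-bootstrap lever above can be sketched as a sample-count schedule: large, obvious effects need fewer resamples to bound the confidence interval. The thresholds and counts here are illustrative, not RevertIQ's tuned values:

```python
def bootstrap_samples(effect_size: float, base: int = 10_000,
                      floor: int = 1_000) -> int:
    """Pick a bootstrap sample count from a coarse effect-size estimate."""
    if effect_size >= 1.0:    # unambiguous effect: floor is enough
        return floor
    if effect_size >= 0.5:    # moderate effect: trim the budget
        return base // 4
    return base               # small/unclear effect: full budget
```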
- Deployment & Environments
Envs: dev / staging / prod; separate projects and secrets.
Infra: containerized (K8s or ECS); IaC (Terraform); GitOps for manifests.
CI/CD: unit + property tests → integration (mock Polygon) → load test → canary → prod.
Regions: multi-AZ primary; optional multi-region active/passive for enterprise.
- Data Governance & Reproducibility
Manifest per run: params.json, input_shards[], hashes[], engine_version, calendar_version.
Deterministic seeds for bootstrap and FDR ordering.
Audit API: GET /v1/analysis/{id}/manifest for regulators/clients.
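Deterministic seeding can be sketched by deriving a per-(analysis_id, phase) seed from a hash, so the same run always replays the same random stream for bootstrap and FDR ordering. The derivation scheme is an assumption for illustration:

```python
import hashlib
import random

def seeded_rng(analysis_id: str, phase: str) -> random.Random:
    """Deterministic RNG per (analysis_id, phase), e.g. phase='bootstrap'."""
    digest = hashlib.sha256(f"{analysis_id}:{phase}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))
```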
- Roadmap Hooks (technical)
Live Scout: add stream processor (Kafka/PubSub) to compare live z-scores vs top windows → push alerts.
Regime Classifier: background job computes volatility/trend regimes nightly; tag analyses.
Portfolio Mode: cross-ticker fan-out with correlation constraints; extra worker pool.
Quotes-first Mode: specialized workers with NBBO tick ingestion and microstructure modeling.
- Non-Functional Requirements (NFRs)
Determinism: identical inputs → byte-identical outputs (except timestamps).
Throughput: ≥ 50 completed analyses/minute at scale with warm cache.
Extensibility: new metrics/features must be additive to schemas.
Accessibility: dashboard WCAG AA; keyboard navigable.
- Risks & Mitigations

| Risk | Mitigation |
| --- | --- |
| Overfitting accusations | Walk-forward, bootstrap CIs, FDR, Reality-Check (roadmap), full provenance |
| Provider outages | Multi-day caching; graceful degradation; queue pausing |
| Cold-start latency | Warm popular tickers; pre-compute common windows nightly |
| Exploding search space | Bounded grids; min_trades filters; early-stopping heuristics |
| Cost blow-ups | Tiered quotas; adaptive bootstrap; parquet reuse |
- Release Plan (MVP → GA)
MVP (Month 1–2): /analyze sync/async, bars-mode costs, heatmap + top windows, bootstrap + FDR, provenance.
Beta (Month 3): quotes-mode costs, webhooks, drift monitor, explain mode.
GA (Month 4–5): batch API, widget SDK, audit endpoint, enterprise quotas/SAML.