🏗️ RevertIQ — System Architecture & Deployment Blueprint
- High-Level Components (service map)

```
[Client (API/CLI/Widget)]
        | HTTPS/JSON
        v
[API Gateway + Auth] ──> [Rate Limiter]
        |
        v
[RevertIQ API (App Tier)]
        ├──> [Job Orchestrator] ──> [Work Queue]
        │                     └──> [Scheduler/TTL]
        ├──> [Result Cache] (Redis/KeyDB)
        |
        v
[Storage Layer]
        ├─ Data Lake (Parquet/Arrow on object store)
        ├─ Metadata DB (Postgres)
        └─ Provenance/Artifacts (reports, hashes)
        |
        v
[Compute Cluster: Analysis Workers]
        ├─ Bars/Quotes Ingestor (Polygon adapters)
        ├─ Sessionizer (calendar, RTH)
        ├─ Signal Engine (z-score, EMA/VWAP)
        ├─ Walk-Forward Evaluator
        ├─ Bootstrap/FDR/Diagnostics
        └─ Report Generator (JSON/CSV/PNG)
        |
        v
[Notification Layer]
        ├─ Webhook Dispatcher (HMAC)
        └─ Email (ops only)
```
- Data Flow (end-to-end)
Request → Client calls POST /v1/analyze.
Auth & Limits → API Gateway validates token, enforces per-tenant quotas.
Job Create → App writes analysis_id, params, and status=queued to Postgres; enqueues job in Work Queue.
Ingest & Cache → Worker fetches Polygon bars/quotes, normalizes, writes Parquet shards (by ticker/interval/day), updates data_hash.
Analysis → Worker runs sessionization → signal calc → walk-forward → bootstrap → FDR → diagnostics.
Persist Results → JSON results to Object Store, summary rows to Postgres, hot subset to Redis.
Notify → Webhook (if provided) with compact summary; client can GET /v1/analysis/{id}.
Serve → Subsequent reads are fulfilled from Redis then Object Store; API always returns provenance.
- Storage & Formats
Object Store (S3/GCS/Azure Blob)
lake/market/{provider}/bars/{ticker}/{interval}/{date}.parquet
results/{analysis_id}/result.json
artifacts/{analysis_id}/{viz}.png
Parquet/Arrow for columnar speed, compression (ZSTD), schema evolution.
Postgres (metadata): tenants, API keys, jobs, status, summaries, webhook configs.
Redis/KeyDB (cache): hot results, job progress, rate tokens.
Determinism: every run stores params.json and data_hash (concat of parquet shards + params) → reproducible.
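The provenance hash can be sketched as follows; the exact shard ordering and digest RevertIQ uses may differ, but the idea is a single digest over the parquet shard bytes plus a canonical serialization of the params:

```python
import hashlib
import json

def data_hash(shard_bytes: list[bytes], params: dict) -> str:
    """SHA-256 over concatenated parquet shards plus canonical params.

    Shards must be fed in a fixed (e.g. sorted-by-path) order, and params
    serialized with sorted keys, so identical inputs always reproduce the
    same hash.
    """
    h = hashlib.sha256()
    for shard in shard_bytes:
        h.update(shard)
    h.update(json.dumps(params, sort_keys=True).encode())
    return h.hexdigest()
```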
- Compute Topology
Workers: stateless containers; horizontal autoscale on queue depth.
Queues: durable (e.g., SQS/PubSub/Rabbit); dead-letter for failures.
Concurrency model:
Level 1: fan-out by (day-of-week × intraday window).
Level 2: inside each, parallelize walk-forward folds.
Affinity: shard by ticker#interval to exploit data locality (node cache).
Vectorized math: SIMD-friendly numeric libs; block bootstrap.
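The two-level fan-out can be sketched as plain enumeration: level 1 yields one work unit per (day-of-week × window) pair, and level 2 lists the walk-forward folds inside each unit. Real workers would pull these units off the queue and parallelize the folds:

```python
from itertools import product

def fan_out(days: list[str], windows: list[str], n_folds: int) -> list[dict]:
    """Enumerate level-1 units (day x window), each carrying level-2 folds."""
    tasks = []
    for day, window in product(days, windows):              # level 1
        folds = [(day, window, k) for k in range(n_folds)]  # level 2
        tasks.append({"unit": (day, window), "folds": folds})
    return tasks
```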
- Caching Strategy
Result cache: 24–72h TTL for identical requests (same params + data_hash).
Data cache: parquet shards persisted; local disk LRU on workers.
Partial caching: intermediate aggregates (z-score, volatility) keyed by (ticker, interval, lookback).
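A minimal sketch of the cache keying and TTL behavior described above; the key formats and the in-process dict standing in for Redis are assumptions for illustration:

```python
import time

TTL_SECONDS = 48 * 3600  # mid-range of the 24-72h window above

def result_cache_key(params_hash: str, data_hash: str) -> str:
    # Identical requests share a key: same params against the same data.
    return f"result:{params_hash}:{data_hash}"

def aggregate_cache_key(ticker: str, interval: str, lookback: int) -> str:
    # Intermediate aggregates are reusable across analyses of the same series.
    return f"agg:{ticker}:{interval}:{lookback}"

_cache: dict[str, tuple[float, object]] = {}  # stand-in for Redis

def cache_get(key: str):
    entry = _cache.get(key)
    if entry is None or time.monotonic() - entry[0] > TTL_SECONDS:
        return None  # miss or expired
    return entry[1]

def cache_put(key: str, value) -> None:
    _cache[key] = (time.monotonic(), value)
```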
- Security & Compliance
AuthN: Bearer API keys; per-tenant scopes; rotation endpoints.
AuthZ: plan-based quotas/limits; RBAC for dashboard.
Transport: TLS 1.2+; HSTS; secure ciphers.
At-rest: SSE (object store), TDE (Postgres), encrypted Redis.
Webhooks: HMAC signature header; replay protection via Idempotency-Key.
Secrets: KMS/SM; no secrets in images.
PII: none by design.
Provenance: every payload includes revertiq_version, data_hash, polygon metadata.
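Webhook signing and verification can be sketched with HMAC-SHA256; the header name and exact signing contract are assumptions to be matched against the real dispatcher:

```python
import hashlib
import hmac

def sign_webhook(secret: bytes, body: bytes) -> str:
    """HMAC-SHA256 signature the dispatcher would attach to the payload."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_webhook(secret: bytes, body: bytes, signature: str) -> bool:
    # compare_digest is constant-time, avoiding timing side channels.
    return hmac.compare_digest(sign_webhook(secret, body), signature)
```

Replay protection layers on top: the receiver records seen Idempotency-Key values and rejects duplicates even when the signature is valid.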
- Observability
Metrics (Prometheus/OpenTelemetry):
API: RPS, p50/p95 latency, 4xx/5xx rates, quota rejections.
Workers: jobs/s, success/fail %, runtime per phase, cache hit ratio.
Math: bootstrap samples/sec, FDR compute time, window fan-out size.
Logs: structured JSON with analysis_id, tenant_id, trace_id.
Traces: end-to-end spans from API → worker → storage.
Dashboards: SLO burn rates, queue depth, autoscale events.
Alerts: on SLO breaches, high DLQ, cache miss spikes, provider errors.
- SLOs, SLAs, Quotas
SLOs
Availability: 99.9% (API), 99.5% (compute)
P95 latency: ≤ 3s (cached), ≤ 15s (standard run), ≤ 60s (heavy quotes mode)
Data freshness (Polygon ingest): ≤ 24h for backfill
Tenant Quotas (example tiers)
Starter: 200 analyses/day, 1 concurrent job, 2-year horizon cap
Pro: 2k/day, 5 concurrent, 5-year horizon
Enterprise: custom; priority queue, VPC peering
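Quota enforcement at the gateway can be sketched as a per-tenant token bucket; capacity and refill rate here are illustrative, with real values coming from the tenant's plan tier:

```python
import time

class TokenBucket:
    """Per-tenant rate limiter: tokens refill continuously up to capacity."""

    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # request rejected; surfaces as a quota 429
```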
- Failure Handling & Idempotency
Idempotent POST via Idempotency-Key (dedupe in Postgres).
Retries: exponential backoff; jitter; max attempts; DLQ on permanent failures.
Compensation: cleanup partial artifacts on failure; mark status=failed with reason.
Provider faults (Polygon): classify transient vs. permanent errors; negatively cache 5xx responses and slow down requests via token bucket.
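The retry schedule can be sketched as exponential backoff with full jitter; base, cap, and attempt count here are illustrative, and a job exhausting its attempts would land in the DLQ:

```python
import random

def backoff_delays(base: float = 0.5, cap: float = 30.0,
                   attempts: int = 5) -> list[float]:
    """Exponential backoff with full jitter: delay ~ U(0, min(cap, base*2^n))."""
    delays = []
    for attempt in range(attempts):
        delays.append(random.uniform(0, min(cap, base * 2 ** attempt)))
    return delays
```

Full jitter (rather than a fixed multiplier) decorrelates retries across workers, which keeps a transient provider fault from turning into a synchronized thundering herd.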
- Runbooks (Ops)
Degraded compute: scale workers; drain DLQ; bump queue visibility timeout.
Cache stampedes: enable request coalescing (single-flight) on analysis_id.
Schema evolution: versioned Parquet; write-new/read-old; blue/green on API.
Key leak: revoke, rotate; audit logs to confirm scope.
Bad release: canary 5%; auto-rollback on error rate/latency thresholds.
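The request-coalescing (single-flight) runbook item can be sketched as follows: the first caller for an analysis_id computes, and concurrent callers for the same key wait and share that result. A production variant would evict completed entries so later requests re-check the cache:

```python
import threading

class SingleFlight:
    """Coalesce concurrent computations for the same key (e.g. analysis_id)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight: dict[str, threading.Event] = {}
        self._results: dict[str, object] = {}

    def do(self, key: str, fn):
        with self._lock:
            if key in self._results:          # already computed
                return self._results[key]
            ev = self._inflight.get(key)
            leader = ev is None
            if leader:
                ev = self._inflight[key] = threading.Event()
        if leader:
            result = fn()                     # only the leader computes
            with self._lock:
                self._results[key] = result
                del self._inflight[key]
            ev.set()
            return result
        ev.wait()                             # followers wait for the leader
        return self._results[key]
```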
- Cost Model (FinOps)
Major drivers: market data egress, compute minutes (bootstrap + walk-forward), object storage I/O.
Levers:
Cache parquet once; dedupe same horizon/interval.
Limit window expansion (≤ 40 bins).
Adaptive bootstrap (fewer samples when effect size is large).
Precompute common intervals (1m, 5m) for top tickers.
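The adaptive-bootstrap lever above can be sketched as a sample-count schedule: large, obvious effects need fewer resamples to bound the confidence interval. The thresholds and counts here are illustrative, not RevertIQ's tuned values:

```python
def bootstrap_samples(effect_size: float, base: int = 10_000,
                      floor: int = 1_000) -> int:
    """Pick a bootstrap sample count from a coarse effect-size estimate."""
    if effect_size >= 1.0:    # unambiguous effect: floor is enough
        return floor
    if effect_size >= 0.5:    # moderate effect: trim the budget
        return base // 4
    return base               # small/unclear effect: full budget
```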
- Deployment & Environments
Envs: dev / staging / prod; separate projects and secrets.
Infra: containerized (K8s or ECS); IaC (Terraform); GitOps for manifests.
CI/CD: unit + property tests → integration (mock Polygon) → load test → canary → prod.
Regions: multi-AZ primary; optional multi-region active/passive for enterprise.
- Data Governance & Reproducibility
Manifest per run: params.json, input_shards[], hashes[], engine_version, calendar_version.
Deterministic seeds for bootstrap and FDR ordering.
Audit API: GET /v1/analysis/{id}/manifest for regulators/clients.
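Deterministic seeding can be sketched by deriving a per-(analysis_id, phase) seed from a hash, so the same run always replays the same random stream for bootstrap and FDR ordering. The derivation scheme is an assumption for illustration:

```python
import hashlib
import random

def seeded_rng(analysis_id: str, phase: str) -> random.Random:
    """Deterministic RNG per (analysis_id, phase), e.g. phase='bootstrap'."""
    digest = hashlib.sha256(f"{analysis_id}:{phase}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))
```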
- Roadmap Hooks (technical)
Live Scout: add stream processor (Kafka/PubSub) to compare live z-scores vs top windows → push alerts.
Regime Classifier: background job computes volatility/trend regimes nightly; tag analyses.
Portfolio Mode: cross-ticker fan-out with correlation constraints; extra worker pool.
Quotes-first Mode: specialized workers with NBBO tick ingestion and microstructure modeling.
- Non-Functional Requirements (NFRs)
Determinism: identical inputs → byte-identical outputs (except timestamps).
Throughput: ≥ 50 completed analyses/minute at scale with warm cache.
Extensibility: new metrics/features must be additive to schemas.
Accessibility: dashboard WCAG AA; keyboard navigable.
- Risks & Mitigations

| Risk | Mitigation |
| --- | --- |
| Overfitting accusations | Walk-forward, bootstrap CIs, FDR, Reality-Check (roadmap), full provenance |
| Provider outages | Multi-day caching; graceful degradation; queue pausing |
| Cold-start latency | Warm popular tickers; pre-compute common windows nightly |
| Exploding search space | Bounded grids; min_trades filters; early-stopping heuristics |
| Cost blow-ups | Tiered quotas; adaptive bootstrap; parquet reuse |
- Release Plan (MVP → GA)
MVP (Month 1–2): /analyze sync/async, bars-mode costs, heatmap + top windows, bootstrap + FDR, provenance.
Beta (Month 3): quotes-mode costs, webhooks, drift monitor, explain mode.
GA (Month 4–5): batch API, widget SDK, audit endpoint, enterprise quotas/SAML.