All notable changes to Headroom will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- `headroom unwrap codex` now actually undoes `headroom wrap codex` — previously there was no `unwrap codex` subcommand at all, so the injected `model_provider = "headroom"` / `[model_providers.headroom]` block stayed in `~/.codex/config.toml` forever and Codex continued routing through the (potentially stopped) proxy, surfacing as `Missing environment variable: OPENAI_API_KEY`. `wrap codex` now snapshots the pre-wrap `config.toml` to `config.toml.headroom-backup` before its first injection, and `unwrap codex` restores that snapshot byte-for-byte (or, if the backup is missing, strips only the Headroom-managed block and leaves surrounding user content intact). Safe no-op when run without a prior wrap. Reported by @raenaryl in Discord.
- `headroom learn` no longer clobbers prior recommendations on re-run — the marker block in `CLAUDE.md`/`MEMORY.md` is now merged with the prior block instead of wholesale-replaced. Sections re-surfaced by the new run win; sections not re-surfaced are carried forward so learnings accumulate across runs instead of disappearing. To fully rebuild the block, delete it manually and re-run. (#231)
- `headroom learn` no longer emits dangling cross-references when a section is re-surfaced — the analyzer now includes the project's current `<!-- headroom:learn -->` block (from `CLAUDE.md` and `MEMORY.md`) in the LLM digest as a "Prior Learned Patterns" section, and the system prompt instructs the LLM that re-emitting a section replaces the prior one wholesale. Prevents bullets like "`X` is also large — same rule as `Y`, `Z`" from appearing after `Y` and `Z` got dropped during per-section replacement. The writer's section-level carry-forward from #231 remains in place as a safety net for sections the LLM omits entirely. New helper `extract_marker_block` added to `headroom.learn.writer`.
- `turn_id` linking agent-loop API calls to a single user prompt — a new `compute_turn_id(model, system, messages)` helper in `headroom/proxy/helpers.py` hashes the message prefix up to and including the last user-text message, yielding an id that is stable across every agent-loop iteration of one prompt but rolls over when the user sends a new prompt (or runs `/compact`, `/clear`). `RequestLog` gained a `turn_id: str | None` field, which is stamped at every log site (anthropic handler bedrock + direct branches, and the streaming handler) and surfaced as `turn_id` in `/transformations/feed`. Lets downstream consumers (e.g. the Headroom Desktop Activity tab) aggregate savings per user prompt rather than per API call.
- Live flush of traffic-learned patterns to CLAUDE.md / MEMORY.md — the `TrafficLearner` now writes to agent-native context files continuously during proxy operation, not just at shutdown. A new dirty-flag debounced `_flush_worker` (10s window, `FLUSH_DEBOUNCE_SECONDS`) calls `flush_to_file()` whenever `_accumulate()` marks the learner dirty, so patterns surface in `CLAUDE.md`/`MEMORY.md` in near real-time. Flushes read both persisted rows (via `_load_persisted_patterns_from_sqlite`) and the in-memory accumulator, bucket patterns by project via the learn plugin registry (`plugin.discover_projects()` + longest-path anchoring in `_project_for_pattern`), and route by `PatternCategory` to the correct file (`_patterns_to_recommendations` + `_CATEGORY_TO_TARGET`). Live flushes require `evidence_count >= 2`; the shutdown flush accepts single-evidence rows.
- Traffic-learner evidence count stuck at 1; duplicate DB rows across restarts — `_accumulate` queued patterns with the default `ExtractedPattern.evidence_count = 1` regardless of how many times the pattern was actually seen, so every persisted row landed at `1` and never crossed the live-flush gate (`evidence_count >= 2`). Worse, once a pattern was in `_saved_hashes` it was early-returned on every re-sighting, and `_saved_hashes` reset on process restart — so a second sighting in a later session inserted a duplicate row rather than bumping the existing one. Now: `_accumulate` writes the real accumulated count at save time, `start()` hydrates `_saved_hashes` + a new `_persisted_ids` map from the DB, and re-sightings bump the persisted row's `metadata.evidence_count` via an atomic `json_set` UPDATE (`_bump_persisted_evidence`). `_load_persisted_patterns_from_sqlite` now filters via `json_extract(metadata, '$.source')` instead of a LIKE on the raw JSON string, so rows survive metadata rewrites.
- Telemetry stack & install-mode identity fields — anonymous beacon now reports `headroom_stack` (how Headroom is invoked: `proxy`, `wrap_claude`, `adapter_ts_openai`, ...) and `install_mode` (`wrapped`/`persistent`/`on_demand`), plus `requests_by_stack` for proxies that serve multiple integrations. Proxy exposes a `by_stack` bucket alongside `by_provider`/`by_model` on `/stats`, a matching `headroom_requests_by_stack` Prometheus counter, and an `X-Headroom-Stack` header honored by the FastAPI middleware. `headroom wrap <tool>` sets `HEADROOM_STACK=wrap_<agent>`; the TS SDK and all four adapters (`openai`, `anthropic`, `gemini`, `vercel-ai`) tag their compress calls. Schema migration: `sql/upgrade_telemetry_stack_context.sql`.
- Canonical filesystem contract (issue #175) — new `HEADROOM_CONFIG_DIR` (default `~/.headroom/config`, read-mostly) and `HEADROOM_WORKSPACE_DIR` (default `~/.headroom`, read-write state) env vars recognized by the Python proxy/CLI and the npm SDK. Additive; all existing per-resource env vars (`HEADROOM_SAVINGS_PATH`, `HEADROOM_TOIN_PATH`, `HEADROOM_SUBSCRIPTION_STATE_PATH`, `HEADROOM_MODEL_LIMITS`) continue to work with identical semantics. Docker install scripts and `docker-compose.native.yml` forward the new vars into containers so savings, logs, and telemetry resolve to the bind-mounted `.headroom` path. See `wiki/filesystem-contract.md`.
- `/stats-history` now returns compact checkpoint history by default — the JSON response keeps recent checkpoints dense while evenly sampling older checkpoints so long-running installs do not return ever-growing payloads. Add `history_mode=full` to fetch the full retained checkpoint list, or `history_mode=none` to skip it entirely while still receiving the derived hourly/daily/weekly/monthly rollups. Responses now include a `history_summary` block describing stored versus returned points.
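The dense-recent / sampled-older compaction can be sketched as a pure function. The window sizes below are illustrative assumptions, not Headroom's actual defaults:

```python
def compact_history(checkpoints: list[dict], keep_recent: int = 48, max_older: int = 100) -> list[dict]:
    """Keep the most recent checkpoints dense and evenly sample the rest.

    Sketch of the default /stats-history behavior: the payload is bounded at
    keep_recent + max_older points no matter how long the install has run,
    while the oldest and newest checkpoints are always retained.
    """
    if len(checkpoints) <= keep_recent + max_older:
        return list(checkpoints)
    older, recent = checkpoints[:-keep_recent], checkpoints[-keep_recent:]
    step = len(older) / max_older  # even stride across the older span
    sampled = [older[int(i * step)] for i in range(max_older)]
    return sampled + recent
```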
- Streaming Anthropic requests are now visible to `/stats.recent_requests` and `/transformations/feed` — `_finalize_stream_response` did not call `self.logger.log(...)`, so the entire streaming Anthropic code path (the one Claude Code uses) silently bypassed the request logger. Only the non-streaming Anthropic path and the Bedrock streaming path were logged. As a consequence, `--log-messages` had no observable effect on the live transformations feed for typical traffic. The streaming finalizer now emits the same `RequestLog` shape the other paths do, including `request_messages` when `log_full_messages` is enabled.
- Cross-agent memory — Claude saves a fact, Codex reads it back. All agents sharing one proxy share one memory store. Project-scoped DB at `.headroom/memory.db`, auto user_id from `$USER`.
- Agent provenance tracking — every memory records which agent saved it (`source_agent`, `source_provider`, `created_via`), with edit history on updates.
- LLM-mediated dedup — on `memory_save`, enriched response hints similar existing memories to the LLM. Background async dedup auto-removes >92% cosine duplicates. Zero extra LLM calls.
- Memory for OpenAI and Gemini handlers — context injection + tool handling wired into all three provider handlers (Anthropic, OpenAI, Gemini).
- Plugin architecture for `headroom learn` — each agent (Claude, Codex, Gemini) is a self-contained plugin. External plugins register via `headroom.learn_plugin` entry points. `--agent` flag for CLI.
- GeminiScanner for `headroom learn` — reads `~/.gemini/tmp/*/chats/session-*.json` and `.jsonl`.
- Code graph integration — `headroom wrap claude --code-graph` auto-indexes the project via codebase-memory-mcp for call-chain traversal, impact analysis, and architectural queries. Opt-in, ~200 token overhead with Claude Code's MCP Tool Search.
- OpenAI embedder auto-detection — memory backend uses OpenAI embeddings when `sentence-transformers` is unavailable (no torch/2GB dependency needed).
- Live traffic learning flush — `headroom wrap <agent> --learn` flushes learned patterns to the correct agent-native file (MEMORY.md / AGENTS.md / GEMINI.md) at proxy shutdown.
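The ">92% cosine duplicates" rule from the dedup entry above reduces to a similarity threshold over memory embeddings. A minimal sketch (Headroom's actual dedup is asynchronous and embedding-backend-driven, not this brute-force scan):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def find_duplicates(embeddings: dict[str, list[float]], threshold: float = 0.92) -> list[tuple[str, str]]:
    """Return memory-id pairs whose cosine similarity exceeds the threshold.

    Sketch of the >0.92 dedup gate; pairs returned here would be candidates
    for auto-removal.
    """
    ids = list(embeddings)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) > threshold:
                pairs.append((a, b))
    return pairs
```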
- CodeCompressor disabled by default — AST-based code compression produced invalid syntax on 40% of real files. Code now passes through uncompressed. Use `--code-graph` for code intelligence instead, or re-enable with `--code-aware`.
- Shared tool name map — consolidated tool normalization across all learn plugins into `_shared.py`.
- Dynamic CLI agent detection — `headroom learn` discovers agents via plugin registry, no hardcoded choices.
- CodeCompressor statement-based truncation — body truncation now walks AST statements (not lines), never cuts mid-expression. Fixes syntax errors on multi-line dict literals and function calls.
- Docstring FIRST_LINE mode — uses source lines directly instead of reconstructing from byte offsets. Properly handles all quote styles.
- Memory shutdown queue drain — patterns in the save queue were lost on proxy shutdown. Now drained before exit.
- Codex-proxy resilience hardening — reduces event-loop starvation under cold-start reconnect storms
- Stage-timing instrumentation — per-stage durations for both Codex WS accept and Anthropic `/v1/messages` pre-upstream phases emitted as a single `STAGE_TIMINGS` structured log line per request plus Prometheus histograms
- Per-pipeline shared warmup — Anthropic + OpenAI pipelines eagerly load compressors/parsers once at startup; status merged into `WarmupRegistry` for `/debug/warmup` and `/readyz`
- WS session registry — first-class tracking of active Codex WS sessions with deterministic relay-task cancellation and termination-cause classification (`client_disconnect`, `upstream_error`, `client_timeout`, etc.)
- Bounded pre-upstream Anthropic concurrency — `--anthropic-pre-upstream-concurrency` / `HEADROOM_ANTHROPIC_PRE_UPSTREAM_CONCURRENCY` caps simultaneous `/v1/messages` pre-upstream work (body read, deep copy, first compression stage, memory-context lookup, upstream connect) so replay storms cannot starve `/livez`, `/readyz`, and new Codex WS opens. Default: auto `max(2, min(8, cpu_count))`; `0` or negative disables (unbounded)
- Loopback-only debug endpoints — `/debug/tasks`, `/debug/ws-sessions`, `/debug/warmup` return `404` (not `403`) to non-loopback callers so external scanners cannot enumerate them
- Reconnect-storm repro harness — `scripts/repro_codex_replay.py` drives concurrent WS + HTTP replay traffic against a local proxy and asserts `/livez` p99 under threshold; `--json` output routes JSON to stdout and the human summary to stderr
- Proxy liveness and readiness health checks
- Adds `GET /livez` for process liveness and `GET /readyz` for traffic readiness
- Keeps `GET /health` backward compatible while expanding it with readiness details and subsystem checks
- Eagerly initializes configured memory backends during proxy startup so readiness reflects real serving capability
- Wires `/readyz` into the Docker image `HEALTHCHECK` and the example `docker-compose.yml`
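Wiring `/readyz` into a container healthcheck might look like the fragment below. This is an illustrative sketch — the intervals, port, and curl invocation are assumptions, not necessarily what Headroom's shipped Dockerfile uses:

```dockerfile
# Mark the container unhealthy until the proxy reports traffic readiness.
# Interval/timeout/port values here are illustrative assumptions.
HEALTHCHECK --interval=15s --timeout=3s --start-period=30s --retries=3 \
  CMD curl -fsS http://localhost:8787/readyz || exit 1
```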
- Durable proxy savings history
- Persists proxy compression savings history locally at `~/.headroom/proxy_savings.json`
- Supports `HEADROOM_SAVINGS_PATH` to override the storage location
- Adds `/stats-history` with lifetime totals plus hourly/daily/weekly/monthly rollups
- Supports JSON and CSV export from `/stats-history`
- Extends `/stats` with a `persistent_savings` block while keeping `savings_history` backward compatible
- Adds a historical mode to `/dashboard` backed by `/stats-history`, including export actions
- Proxy telemetry SDK override via `HEADROOM_SDK`
- Downstream apps can override the anonymous telemetry `sdk` field without patching installed files
- Blank values fall back to the default `proxy` label
- `headroom learn` — Offline failure learning for coding agents
- Analyzes past conversation history (Claude Code, extensible to Cursor/Codex)
- Success correlation: for each failure, finds what succeeded after and extracts the specific correction
- 5 analyzers: Environment, Structure, Command Patterns, Retry Prevention, Cross-Session
- Writes specific learnings to CLAUDE.md (stable project facts) and MEMORY.md (session patterns)
- Generic architecture: tool-agnostic `ToolCall` model, pluggable Scanner/Writer adapters
- Dry-run by default, `--apply` to write, `--all` for all projects
- Example output: "FirstClassEntity.java is not at axion-formats/ — actually at axion-scala-common/"
- Read Lifecycle Management — Event-driven compression of stale/superseded Read outputs
- Detects when a Read output becomes stale (file was edited after) or superseded (file was re-read)
- Replaces stale/superseded content with compact CCR markers, stores originals for retrieval
- 75% of Read output bytes are provably stale or redundant (from real-world analysis of 66K tool calls)
- Fresh Reads (latest read, no subsequent edit) are never touched — Edit safety preserved
- Opt-in via `ReadLifecycleConfig(enabled=True)`, disabled by default
- Handles both OpenAI and Anthropic message formats
- any-llm backend — Route requests through 38+ LLM providers (OpenAI, Mistral, Groq, Ollama, etc.) via any-llm
- Enable with `--backend anyllm --anyllm-provider <provider>`
- Install with: `pip install 'headroom-ai[anyllm]'`
- Production-ready proxy server with caching, rate limiting, and metrics
- CLI command `headroom proxy` to start the proxy server
- IntelligentContextManager (semantic-aware context management)
- Multi-factor importance scoring: recency, semantic similarity, TOIN importance, error indicators, forward references, token density
- No hardcoded patterns - all importance signals learned from TOIN or computed from metrics
- TOIN integration for retrieval_rate and field_semantics-based scoring
- Strategy selection: NONE, COMPRESS_FIRST, DROP_BY_SCORE based on budget overage
- Atomic tool unit handling (call + response dropped together)
- Configurable scoring weights via `ScoringWeights` dataclass
- `IntelligentContextConfig` for full configuration control
- Backwards compatible with `RollingWindowConfig`
- LLMLingua-2 Integration (opt-in ML-based compression)
- `LLMLinguaCompressor` transform using Microsoft's LLMLingua-2 model
- Content-aware compression rates (code: 0.4, JSON: 0.35, text: 0.3)
- Memory management utilities: `unload_llmlingua_model()`, `is_llmlingua_model_loaded()`
- Proxy integration via `--llmlingua` flag
- Device selection: `--llmlingua-device` (auto/cuda/cpu/mps)
- Custom compression rate: `--llmlingua-rate`
- Helpful startup hints when llmlingua is available but not enabled
- Install with: `pip install headroom-ai[llmlingua]`
- Code-Aware Compression (AST-based, syntax-preserving)
- `CodeAwareCompressor` transform using tree-sitter for AST parsing
- Supports Python, JavaScript, TypeScript, Go, Rust, Java, C, C++
- Preserves imports, function signatures, type annotations, error handlers
- Compresses function bodies while maintaining structural integrity
- Guarantees syntactically valid output (no broken code)
- Automatic language detection from code patterns
- Memory management: `is_tree_sitter_available()`, `unload_tree_sitter()`
- Uses `tree-sitter-language-pack` for broad language support
- Install with: `pip install headroom-ai[code]`
- ContentRouter (intelligent compression orchestrator)
- Auto-routes content to optimal compressor based on type detection
- Source hint support for high-confidence routing (file paths, tool names)
- Handles mixed content (e.g., markdown with code blocks)
- Strategies: CODE_AWARE, SMART_CRUSHER, SEARCH, LOG, TEXT, LLMLINGUA
- Configurable strategy preferences and fallbacks
- Routing decision log for transparency and debugging
- Custom Model Configuration
- Support for new models: Claude 4.5 (Opus), Claude 4 (Sonnet, Haiku), o3, o3-mini
- Pattern-based inference for unknown models (opus/sonnet/haiku tiers)
- Custom model config via `HEADROOM_MODEL_LIMITS` environment variable
- Config file support: `~/.headroom/models.json`
- Graceful fallback for unknown models (no crashes)
- Updated pricing data for all current models
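A custom model entry in `~/.headroom/models.json` might look like the fragment below. The schema is not documented in this changelog, so every field name here is an assumption for illustration only:

```json
{
  "my-org/custom-model": {
    "context_window": 200000,
    "max_output_tokens": 8192
  }
}
```

The same JSON object could presumably be supplied inline through the `HEADROOM_MODEL_LIMITS` environment variable; consult the project documentation for the actual keys.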
- Event.wait task leak in subscription trackers — `asyncio.shield` pattern prevents cancellation of the outer `wait_for` from leaking the inner `Event.wait` task
- Python 3.10 compatibility for memory-context fail-open — catches `asyncio.TimeoutError` (the 3.10-compatible alias) rather than `TimeoutError` to preserve behaviour on older runtimes
- uvicorn `proxy_headers=False` — refuses `Forwarded`/`X-Forwarded-For` rewrites so the loopback guard on `/debug/*` cannot be spoofed by a misconfigured reverse proxy
- First-frame timeout for Codex WS accepts — guards against a client that opens a handshake and never sends the first frame; relays cancel deterministically with `client_timeout`
- Semaphore leak on unexpected exception in Anthropic pre-upstream path — the finalizer now releases the pre-upstream semaphore on every exit path (early 4xx, cache hit, upstream error, streaming handoff)
- `active_relay_tasks` gauge double-decrement — `deregister_and_count` returns `(handle, released_task_count)` atomically so the handler decrements the Prometheus gauge by the exact number it registered, eliminating drift
- IPv6-mapped loopback recognition — the loopback guard parses `::ffff:127.0.0.1` and other dual-stack literals through `ipaddress.ip_address(...).is_loopback`
- Lock-free stage-timing accumulators — `record_stage_timings` writes to per-path counters that do not contend with `/metrics` export or `record_request`
- Narrow `contextlib.suppress` in relay classification — only `CancelledError` is suppressed where we reclassify it; other exceptions propagate so termination cause stays truthful
- `jitter_delay_ms` helper — shared exponential-backoff + 50-150% jitter formula in `headroom/proxy/helpers.py`; used by three proxy retry sites and mirrored inline in the repro harness
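The "exponential-backoff + 50-150% jitter" formula behind `jitter_delay_ms` can be sketched as below. The base and cap parameters are assumptions; the real helper lives in `headroom/proxy/helpers.py`:

```python
import random


def jitter_delay_ms(attempt: int, base_ms: float = 100.0, cap_ms: float = 30_000.0) -> float:
    """Exponential backoff with 50-150% jitter.

    Sketch: the raw delay doubles per attempt up to a cap, then is scaled by
    a uniform factor in [0.5, 1.5] so retrying clients spread out instead of
    retrying in lockstep.
    """
    raw = min(cap_ms, base_ms * (2 ** attempt))
    return raw * random.uniform(0.5, 1.5)
```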
## 0.2.0 - 2025-01-07
- SmartCrusher: Statistical compression for tool outputs
- Keeps first/last K items, errors, anomalies, and relevance matches
- Variance-based change point detection
- Pattern detection (time series, logs, search results)
- Relevance Scoring Engine: ML-powered item relevance
- `BM25Scorer`: Fast keyword matching (zero dependencies)
- `EmbeddingScorer`: Semantic similarity with sentence-transformers
- `HybridScorer`: Adaptive combination of both methods
- CacheAligner: Prefix stabilization for better cache hits
- Dynamic date extraction
- Whitespace normalization
- Stable prefix hashing
- RollingWindow: Context management within token limits
- Drops oldest tool units first
- Never orphans tool results
- Preserves recent turns
- Multi-Provider Support:
- Anthropic with official `count_tokens` API
- Google with official `countTokens` API
- Cohere with official `tokenize` API
- Mistral with official tokenizer
- LiteLLM for unified interface
- Integrations:
- LangChain callback handler (`HeadroomOptimizer`)
- MCP (Model Context Protocol) utilities
- Proxy Server (`headroom.proxy`):
- Semantic caching with LRU eviction
- Token bucket rate limiting
- Retry with exponential backoff
- Cost tracking with budget enforcement
- Prometheus metrics endpoint
- Request logging (JSONL)
- Pricing Registry: Centralized model pricing with staleness tracking
- Benchmarks: Performance benchmarks for transforms and relevance scoring
- Improved token counting accuracy across all providers
- Enhanced tool output compression with relevance-aware selection
- Mistral tokenizer API compatibility
- Google token counting for multi-turn conversations
## 0.1.0 - 2025-01-05
- Initial release
- `HeadroomClient`: OpenAI-compatible client wrapper
- `ToolCrusher`: Basic tool output compression
- Audit mode for observation without modification
- Optimize mode for applying transforms
- Simulate mode for previewing changes
- SQLite and JSONL storage backends
- HTML report generation
- Streaming support
- Never removes human content
- Never breaks tool ordering
- Parse failures are no-ops
- Preserves recency (last N turns)
The 0.2.0 release is backward compatible. New features are opt-in:
```python
# Old code still works
from headroom import HeadroomClient, OpenAIProvider

# New SmartCrusher (replaces ToolCrusher for better compression)
from headroom import SmartCrusher, SmartCrusherConfig

config = SmartCrusherConfig(
    min_tokens_to_crush=200,
    max_items_after_crush=50,
)
crusher = SmartCrusher(config)

# New relevance scoring
from headroom import create_scorer

scorer = create_scorer("hybrid")  # or "bm25" for zero deps
```

New in 0.2.0 - run Headroom as a proxy server:
```bash
# Start the proxy
python -m headroom.proxy.server --port 8787

# Use with Claude Code
ANTHROPIC_BASE_URL=http://localhost:8787 claude
```