Commit 0f8d99e
feat: quant-server-unified — server built directly on quant.h
Add a new server binary that compiles against quant.h instead of
libturboquant, eliminating the sync-divergence bug (#77, #78).
Key results (Apple M3, 16 GB):
- SmolLM2-1.7B: 23 tok/s (was: garbage output via libturboquant)
- Phi-3.5-mini: 6.5 tok/s (was: crash or garbage output via libturboquant)
Build:
  cc -O2 -o quant-server-unified tools/quant_server_unified.c -lm -lpthread
Features:
- OpenAI-compatible API (/v1/chat/completions, /v1/models, /health)
- SSE streaming (stream: true)
- CORS headers
- Auto-detect Phi-3 chat template vs ChatML
- Template token filtering (<|im_end|>, <|end|>, etc.)
- Mutex-serialized inference (safe for concurrent HTTP clients)
- Graceful port-in-use error
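
Two of the features above, template token filtering and mutex-serialized inference, can be sketched as follows. This is a minimal illustration, not the code in tools/quant_server_unified.c: the names k_template_tokens, strip_template_tokens, fake_generate, and handle_request are hypothetical, the marker list is an assumed example, and fake_generate merely stands in for the real quant.h generation call.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Assumed example list of chat-template markers to hide from clients. */
static const char *const k_template_tokens[] = {
    "<|im_start|>", "<|im_end|>", "<|end|>", "<|assistant|>", "<|user|>",
};

/* Remove every occurrence of each marker from buf, in place. */
static void strip_template_tokens(char *buf)
{
    size_t n = sizeof k_template_tokens / sizeof k_template_tokens[0];
    for (size_t i = 0; i < n; i++) {
        size_t tlen = strlen(k_template_tokens[i]);
        char *hit;
        while ((hit = strstr(buf, k_template_tokens[i])) != NULL) {
            /* Shift the tail (including the NUL) over the marker. */
            memmove(hit, hit + tlen, strlen(hit + tlen) + 1);
        }
    }
}

/* One global lock: only one request runs inference at a time. */
static pthread_mutex_t g_infer_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical stand-in for the model; echoes the prompt plus an end marker. */
static void fake_generate(const char *prompt, char *out, size_t cap)
{
    snprintf(out, cap, "echo: %s<|im_end|>", prompt);
}

/* Called from each HTTP worker thread; serializes access to the model. */
static void handle_request(const char *prompt, char *out, size_t cap)
{
    assert(out != NULL && cap > 0);
    pthread_mutex_lock(&g_infer_lock);
    fake_generate(prompt, out, cap);
    pthread_mutex_unlock(&g_infer_lock);
    /* Filtering touches only the caller's buffer, so it runs outside the lock. */
    strip_template_tokens(out);
}
```

With streaming enabled, a marker may be split across two chunks, so a filter of this kind would need to buffer a short tail of each chunk before emitting it. The sketch does not handle that case.
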
No libturboquant dependency. No Metal or CUDA: inference is CPU-only, using NEON.
Single file, with no external dependencies beyond the system C library
(libm and pthreads, per the build line above).
Fixes #77 (SmolLM2 numerical instability in libturboquant)
Refs #78 (quant.h as single source of truth)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent e8f9087 commit 0f8d99e
1 file changed
Lines changed: 589 additions & 0 deletions