Commit 0f8d99e

feat: quant-server-unified — server built directly on quant.h

New server binary that compiles against quant.h instead of libturboquant,
eliminating the sync-divergence bug (#77, #78).

Key results (Apple M3, 16GB):
  SmolLM2-1.7B: 23 tok/s (was: garbage via libturboquant)
  Phi-3.5-mini: 6.5 tok/s (was: crash or garbage via libturboquant)

Build:
  cc -O2 -o quant-server-unified tools/quant_server_unified.c -lm -lpthread

Features:
- OpenAI-compatible API (/v1/chat/completions, /v1/models, /health)
- SSE streaming (stream: true)
- CORS headers
- Auto-detect Phi-3 chat template vs ChatML
- Template token filtering (<|im_end|>, <|end|>, etc.)
- Mutex-serialized inference (safe for concurrent HTTP clients)
- Graceful port-in-use error

No libturboquant dependency. No Metal/CUDA (pure CPU NEON).
Single file, zero external dependencies beyond libc.

Fixes #77 (SmolLM2 numerical instability in libturboquant)
Refs #78 (quant.h as single source of truth)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
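As an illustration of the "template token filtering" feature listed in the commit message, here is a minimal sketch of how chat-template markers could be stripped from generated text before it is returned to the client. It is not taken from tools/quant_server_unified.c: the function name `strip_template_tokens` and the array `k_template_tokens` are hypothetical, and only `<|im_end|>` and `<|end|>` are named in the commit message; the other markers in the list are assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical marker list. The commit message names only <|im_end|> and
 * <|end|> ("etc."); the remaining entries are assumptions for illustration. */
static const char *const k_template_tokens[] = {
    "<|im_end|>", "<|im_start|>", "<|end|>", "<|endoftext|>",
};

/* Remove every occurrence of each marker from `text`, in place.
 * Returns the length of the filtered string. */
size_t strip_template_tokens(char *text)
{
    const size_t n_tokens =
        sizeof k_template_tokens / sizeof k_template_tokens[0];
    char *src = text; /* read cursor */
    char *dst = text; /* write cursor, never ahead of src */

    while (*src != '\0') {
        size_t matched = 0;
        for (size_t i = 0; i < n_tokens; i++) {
            size_t len = strlen(k_template_tokens[i]);
            if (strncmp(src, k_template_tokens[i], len) == 0) {
                matched = len;
                break;
            }
        }
        if (matched > 0) {
            src += matched;  /* skip the whole marker */
        } else {
            *dst++ = *src++; /* keep an ordinary byte */
        }
    }
    *dst = '\0';
    return (size_t)(dst - text);
}
```

This sketch operates on a complete, NUL-terminated string. With SSE streaming a marker may be split across two chunks, so a streaming server would presumably need to hold back a short tail of each chunk before emitting it; that buffering is not shown here.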

1 parent e8f9087, commit 0f8d99e

1 file changed

tools
- quant_server_unified.c
