Paste a URL or upload a PDF. Yapit renders the document and reads it aloud.
- Handles the documents other TTS tools can't: academic papers with math, citations, figures, tables, messy formatting. Math is rendered visually but gets spoken alt text. Citations and figure labels are silently displayed or naturalized for speech. Page numbers and headers are removed. All driven by a customizable prompt.
- 58 Kokoro voices across 9 languages. Runs locally in your browser (WebGPU), on CPU, or on GPU. Any OpenAI-compatible TTS server also supported.
- Vim-style keyboard shortcuts, document outliner, media key support, adjustable speed, dark mode, share by link.
- Markdown export: append `/md` to any document URL to get clean markdown via curl. `/md-annotated` includes TTS annotations.
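For example, assuming a hypothetical document URL on the hosted instance (substitute the path your own instance shows for the document):

```shell
# Clean markdown
curl https://yapit.md/<document-path>/md -o doc.md
# Markdown with TTS annotations
curl https://yapit.md/<document-path>/md-annotated -o doc-annotated.md
```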
```
git clone --depth 1 https://github.com/yapit-tts/yapit.git && cd yapit
cp .env.selfhost.example .env.selfhost  # edit to enable optional features (AI extraction, custom TTS models)
make self-host
```

Open http://localhost. Data persists across restarts. To stop: `make self-host-down`.
By default, yapit runs in single-user mode: no login required, all features unlocked. All requests share one user, so everyone on the network sees the same document library. `.env.selfhost` is self-documenting; see its comments for optional features (AI extraction, custom TTS models).

If you want user accounts with login (e.g., for a family or small team), set `AUTH_ENABLED=true` in `.env.selfhost`, uncomment the Stack Auth section below it, and use `make self-host-auth` instead. This adds Stack Auth and ClickHouse containers.
Use any server implementing the OpenAI `/v1/audio/speech` API (vLLM-Omni, Kokoro-FastAPI, AllTalk, Chatterbox TTS, etc.).
Add to `.env.selfhost`:

```
OPENAI_TTS_BASE_URL=http://your-tts-server:8091/v1
OPENAI_TTS_API_KEY=your-key-or-empty
OPENAI_TTS_MODEL=your-model-name
```

Voices are auto-discovered if the server supports `GET /v1/audio/voices`. Otherwise set `OPENAI_TTS_VOICES=voice1,voice2,...`.
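You can check auto-discovery by hand by querying the voices endpoint directly (a sketch; host and port are the placeholders from the example above):

```shell
curl http://your-tts-server:8091/v1/audio/voices
```

If this returns a voice list, yapit can discover voices; if it errors or 404s, set `OPENAI_TTS_VOICES` manually.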
Example: OpenAI TTS
OpenAI doesn't support voice auto-discovery, so `OPENAI_TTS_VOICES` is required.

```
OPENAI_TTS_BASE_URL=https://api.openai.com/v1
OPENAI_TTS_API_KEY=sk-...
OPENAI_TTS_MODEL=tts-1
OPENAI_TTS_VOICES=alloy,echo,fable,nova,onyx,shimmer
```

Example: Qwen3-TTS via vLLM-Omni
Requires GPU. The default stage config assumes >=16GB VRAM. For 8GB cards (e.g., RTX 3070 Ti), create a custom config with lower sequence lengths and memory utilization — see the stage config reference.
```
pip install vllm-omni
vllm-omni serve Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice \
  --omni --port 8091 --trust-remote-code --enforce-eager \
  --stage-configs-path /path/to/stage_configs.yaml  # only needed for low VRAM; `max_model_len: 1024` should work on 8GB
```

Then configure yapit:

```
OPENAI_TTS_BASE_URL=http://your-gpu-host:8091/v1
OPENAI_TTS_API_KEY=EMPTY
OPENAI_TTS_MODEL=Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice
```

Voices are auto-discovered from the server (9 built-in speakers for CustomVoice models).
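Once the server is up, a quick smoke test against the OpenAI-compatible speech endpoint (a sketch; substitute a voice name reported by the server, and note the response is raw audio written straight to a file):

```shell
curl http://your-gpu-host:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice", "input": "Hello from yapit.", "voice": "<voice>"}' \
  --output test-audio
```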
Vision-based PDF/image processing works with any OpenAI-compatible API.
Add to `.env.selfhost`:

```
AI_PROCESSOR=openai
AI_PROCESSOR_BASE_URL=https://openrouter.ai/api/v1  # or your vLLM/Ollama endpoint
AI_PROCESSOR_API_KEY=your-key
AI_PROCESSOR_MODEL=qwen/qwen3-vl-235b-a22b-instruct  # any vision-capable model
```

Or use Google Gemini directly (with batch-mode support): `AI_PROCESSOR=gemini` + `GOOGLE_API_KEY=your-key`.
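A quick connectivity check for an OpenAI-compatible endpoint can be done with a plain chat-completions request (a sketch; this verifies the endpoint and key, not yapit's extraction itself):

```shell
curl "$AI_PROCESSOR_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $AI_PROCESSOR_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$AI_PROCESSOR_MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}]}"
```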
Kokoro and YOLO run as pull-based workers — any machine with Redis access can join. Connect from the local network or via Tailscale. GPU and CPU workers run side-by-side; faster workers naturally pull more jobs. Scale by running more containers on any machine that can reach Redis.
Prereq: Docker 25+, nvidia-container-toolkit with CDI enabled, network access to the Redis instance.
```
# One-time GPU setup: generate CDI spec + enable CDI in Docker
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# Add {"features": {"cdi": true}} to /etc/docker/daemon.json, then:
sudo systemctl restart docker

git clone --depth 1 https://github.com/yapit-tts/yapit.git && cd yapit

# Pull only the images you need
docker compose -f docker-compose.worker.yml pull kokoro-gpu yolo-gpu

# Start 2 Kokoro + 1 YOLO worker
REDIS_URL=redis://<host>:6379/0 docker compose -f docker-compose.worker.yml up -d \
  --scale kokoro-gpu=2 --scale yolo-gpu=1 kokoro-gpu yolo-gpu
```

Adjust `--scale` to your GPU. A 4GB card fits 2 Kokoro + 1 YOLO comfortably.
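To confirm the workers came up and connected, the standard Compose status and log commands work (a sketch using the service names above):

```shell
# Show container state for the worker services
docker compose -f docker-compose.worker.yml ps
# Follow recent logs to see workers pulling jobs from Redis
docker compose -f docker-compose.worker.yml logs -f --tail=50 kokoro-gpu yolo-gpu
```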
NVIDIA MPS (recommended for multiple workers per GPU)
MPS lets multiple workers share one GPU context — less VRAM overhead, no context switching. Without MPS, each worker gets its own CUDA context (~300MB each). The compose file mounts the MPS pipe automatically; just start the daemon.
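Before wiring it into systemd, the daemon can be started and inspected by hand using the standard MPS control interface (a sketch):

```shell
# Start the MPS control daemon in the background
sudo nvidia-cuda-mps-control -d
# List active MPS servers (empty until the first CUDA client connects)
echo get_server_list | sudo nvidia-cuda-mps-control
# Shut it down again
echo quit | sudo nvidia-cuda-mps-control
```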
```
sudo tee /etc/systemd/system/nvidia-mps.service > /dev/null <<'EOF'
[Unit]
Description=NVIDIA Multi-Process Service (MPS)
After=nvidia-persistenced.service

[Service]
Type=forking
ExecStart=/usr/bin/nvidia-cuda-mps-control -d
ExecStop=/bin/sh -c 'echo quit | /usr/bin/nvidia-cuda-mps-control'
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now nvidia-mps
```

Next:
- Support exporting audio as MP3.
Later:
- Support thinking parameter for AI extraction.
```
uv sync                                 # install Python dependencies
npm install --prefix frontend           # install frontend dependencies
make dev-env 2>/dev/null || touch .env  # decrypt secrets, or create empty .env
make dev-cpu                            # start backend services (Docker Compose)
cd frontend && npm run dev              # start frontend
make test-local                         # run tests
```

See agent/knowledge/dev-setup.md for full setup instructions.
The agent/knowledge/ directory is the project's in-depth knowledge base, maintained jointly with Claude during development.
Built with Kokoro, defuddle, DocLayout-YOLO. The hosted version at yapit.md also uses Gemini.