Paste a URL or upload a PDF. Yapit renders the document and reads it aloud.
- Handles the documents other TTS tools can't: academic papers with math, citations, figures, tables, messy formatting. Math is rendered visually but gets spoken alt text. Citations and figure labels are silently displayed or naturalized for speech. Page numbers and headers are removed. All driven by a customizable prompt.
- 58 Kokoro voices across 9 languages. Runs locally in your browser (WebGPU), on CPU, or on GPU. Any OpenAI-compatible TTS server also supported.
- Vim-style keyboard shortcuts, document outliner, media key support, adjustable speed, dark mode, share by link.
- Markdown export: append `/md` to any document URL to get clean markdown via curl. `/md-annotated` includes TTS annotations.
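For example, assuming a hypothetical document URL on the hosted instance (substitute the path your own instance shows for the document):

```shell
# Clean markdown
curl https://yapit.md/<document-path>/md -o doc.md
# Markdown with TTS annotations
curl https://yapit.md/<document-path>/md-annotated -o doc-annotated.md
```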
```
git clone --depth 1 https://github.com/yapit-tts/yapit.git && cd yapit
cp .env.selfhost.example .env.selfhost  # edit to enable optional features (AI extraction, custom TTS models)
make self-host
```

Open http://localhost. Data persists across restarts. To stop: `make self-host-down`.
By default, yapit runs in single-user mode: no login required, all features unlocked. All requests share one user, so everyone on the network sees the same document library. `.env.selfhost` is self-documenting; see its comments for optional features (AI extraction, custom TTS models).

If you want user accounts with login (e.g., for a family or small team), set `AUTH_ENABLED=true` in `.env.selfhost`, uncomment the Stack Auth section below it, and use `make self-host-auth` instead. This adds Stack Auth and ClickHouse containers.
Use any server implementing the OpenAI `/v1/audio/speech` API (vLLM-Omni, Kokoro-FastAPI, AllTalk, Chatterbox TTS, etc.).
Add to `.env.selfhost`:

```
OPENAI_TTS_BASE_URL=http://your-tts-server:8091/v1
OPENAI_TTS_API_KEY=your-key-or-empty
OPENAI_TTS_MODEL=your-model-name
```

Voices are auto-discovered if the server supports `GET /v1/audio/voices`. Otherwise set `OPENAI_TTS_VOICES=voice1,voice2,...`.
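You can check auto-discovery by hand by querying the voices endpoint directly (a sketch; host and port are the placeholders from the example above):

```shell
curl http://your-tts-server:8091/v1/audio/voices
```

If this returns a voice list, yapit can discover voices; if it errors or 404s, set `OPENAI_TTS_VOICES` manually.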
Example: OpenAI TTS
OpenAI doesn't support voice auto-discovery, so `OPENAI_TTS_VOICES` is required.

```
OPENAI_TTS_BASE_URL=https://api.openai.com/v1
OPENAI_TTS_API_KEY=sk-...
OPENAI_TTS_MODEL=tts-1
OPENAI_TTS_VOICES=alloy,echo,fable,nova,onyx,shimmer
```

Example: Qwen3-TTS via vLLM-Omni
Requires GPU. The default stage config assumes >=16GB VRAM. For 8GB cards (e.g., RTX 3070 Ti), create a custom config with lower sequence lengths and memory utilization — see the stage config reference.
```
pip install vllm-omni
vllm-omni serve Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice \
  --omni --port 8091 --trust-remote-code --enforce-eager \
  --stage-configs-path /path/to/stage_configs.yaml  # only needed for low VRAM; `max_model_len: 1024` should work on 8GB
```

Then configure yapit:

```
OPENAI_TTS_BASE_URL=http://your-gpu-host:8091/v1
OPENAI_TTS_API_KEY=EMPTY
OPENAI_TTS_MODEL=Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice
```

Voices are auto-discovered from the server (9 built-in speakers for CustomVoice models).
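Once the server is up, a quick smoke test against the OpenAI-compatible speech endpoint (a sketch; substitute a voice name reported by the server, and note the response is raw audio written straight to a file):

```shell
curl http://your-gpu-host:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice", "input": "Hello from yapit.", "voice": "<voice>"}' \
  --output test-audio
```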
Vision-based PDF/image processing works with any OpenAI-compatible API.
Add to `.env.selfhost`:

```
AI_PROCESSOR=openai
AI_PROCESSOR_BASE_URL=https://openrouter.ai/api/v1  # or your vLLM/Ollama endpoint
AI_PROCESSOR_API_KEY=your-key
AI_PROCESSOR_MODEL=qwen/qwen3-vl-235b-a22b-instruct  # any vision-capable model
```

Or use Google Gemini directly (with batch-mode support): `AI_PROCESSOR=gemini` + `GOOGLE_API_KEY=your-key`.
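A quick connectivity check for an OpenAI-compatible endpoint can be done with a plain chat-completions request (a sketch; this verifies the endpoint and key, not yapit's extraction itself):

```shell
curl "$AI_PROCESSOR_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $AI_PROCESSOR_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$AI_PROCESSOR_MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}]}"
```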
Kokoro and YOLO run as pull-based workers — any machine with Redis access can join. Connect from the local network or via Tailscale. GPU and CPU workers run side-by-side; faster workers naturally pull more jobs. Scale by running more containers on any machine that can reach Redis.
Prereq: Docker 25+, nvidia-container-toolkit with CDI enabled, network access to the Redis instance.
```
# One-time GPU setup: generate CDI spec + enable CDI in Docker
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# Add {"features": {"cdi": true}} to /etc/docker/daemon.json, then:
sudo systemctl restart docker

git clone --depth 1 https://github.com/yapit-tts/yapit.git && cd yapit

# Pull only the images you need
docker compose -f docker-compose.worker.yml pull kokoro-gpu yolo-gpu

# Start 2 Kokoro + 1 YOLO worker
REDIS_URL=redis://<host>:6379/0 docker compose -f docker-compose.worker.yml up -d \
  --scale kokoro-gpu=2 --scale yolo-gpu=1 kokoro-gpu yolo-gpu
```

Adjust `--scale` to your GPU. A 4GB card fits 2 Kokoro + 1 YOLO comfortably.
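To confirm the workers came up and connected, the standard Compose status and log commands work (a sketch using the service names above):

```shell
# Show container state for the worker services
docker compose -f docker-compose.worker.yml ps
# Follow recent logs to see workers pulling jobs from Redis
docker compose -f docker-compose.worker.yml logs -f --tail=50 kokoro-gpu yolo-gpu
```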
NVIDIA MPS (recommended for multiple workers per GPU)
MPS lets multiple workers share one GPU context — less VRAM overhead, no context switching. Without MPS, each worker gets its own CUDA context (~300MB each). The compose file mounts the MPS pipe automatically; just start the daemon.
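Before wiring it into systemd, the daemon can be started and inspected by hand using the standard MPS control interface (a sketch):

```shell
# Start the MPS control daemon in the background
sudo nvidia-cuda-mps-control -d
# List active MPS servers (empty until the first CUDA client connects)
echo get_server_list | sudo nvidia-cuda-mps-control
# Shut it down again
echo quit | sudo nvidia-cuda-mps-control
```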
```
sudo tee /etc/systemd/system/nvidia-mps.service > /dev/null <<'EOF'
[Unit]
Description=NVIDIA Multi-Process Service (MPS)
After=nvidia-persistenced.service

[Service]
Type=forking
ExecStart=/usr/bin/nvidia-cuda-mps-control -d
ExecStop=/bin/sh -c 'echo quit | /usr/bin/nvidia-cuda-mps-control'
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now nvidia-mps
```

Next:
- Support exporting audio as MP3.
Later:
- Support thinking parameter for AI extraction.
```
uv sync                                 # install Python dependencies
npm install --prefix frontend           # install frontend dependencies
make dev-env 2>/dev/null || touch .env  # decrypt secrets, or create empty .env
make dev-cpu                            # start backend services (Docker Compose)
cd frontend && npm run dev              # start frontend
make test-local                         # run tests
```

See agent/knowledge/dev-setup.md for full setup instructions.
The agent/knowledge/ directory is the project's in-depth knowledge base, maintained jointly with Claude during development.
Built with Kokoro, defuddle, DocLayout-YOLO. The hosted version at yapit.md also uses Gemini.