SynthBanshee

SynthBanshee is a config-driven pipeline for generating large-scale synthetic Hebrew audio datasets. It is part of the AVDP (Audio Violence Dataset Project) initiative run by DataHack, which produces training data for two AI safety products:

She-Proves — a smartphone app that passively monitors for domestic violence incidents and preserves audio evidence for legal use
Elephant in the Room (הפיל שבחדר) — a Raspberry Pi–class device in social-work offices that alerts security when a social worker is being physically threatened

The goal of SynthBanshee is to supply AI teams with a wide, deliberate distribution of voices, acoustic conditions, and violence types while real-data (actor recording) pipelines are built in parallel. Synthetic-to-real gap is expected and documented.

How it works

Every clip is produced by four sequential pipeline stages:

SceneConfig (YAML)
  → [1] Script Generator   LLM fills a Jinja2 template → dialogue JSON
  → [2] TTS Renderer       Azure he-IL SSML → per-speaker WAV segments
  → [3] Acoustic Augmenter Room IR + device profile + noise → scene WAV
  → [4] Label Generator    Script structure + augmentation log → AVDP-schema JSONL

Output per clip: {clip_id}.wav + {clip_id}.txt (Hebrew transcript) + {clip_id}.json (metadata).

Audio spec: 16 kHz · mono · 16-bit PCM · −1.0 dBFS peak · ≥ 0.5 s silence pad.

Dataset tiers

Tier	Description	Target (per project)
A	Clean TTS, no acoustic augmentation	1,000 clips
B	Room simulation + device profile + background noise	2,000 clips
C	Hard negatives / confusors (arguments that de-escalate, ambient sounds)	1,000 clips

Two projects — She-Proves (3–6 min clips, apartment rooms) and Elephant in the Room (1–4 min clips, clinic/welfare offices) — each receive the full tier stack.

Label taxonomy

Labels use a three-level hierarchy (no binary Violence/Non-Violence):

Violence typology (scene): SV · IT · NEG · NEU
Tier 1 category (event): PHYS · VERB · DIST · ACOU · EMOT · NONE
Tier 2 subtype (event): e.g. PHYS_HARD · VERB_THREAT · DIST_SCREAM

Full taxonomy: configs/taxonomy.yaml.

Current status

Phase 0 (pipeline foundation) and Phase 1 (full Tier A pipeline) are complete — AI teams can now use the delivered Tier A dataset to start baseline model development.

Phase	Deliverable	Status
0	Single spec-compliant clip end-to-end	Done
1	Script templates + multi-speaker TTS + LLM generator + batch generation + QA suite	Done
2	1,000–1,500 Tier B clips/project	Planned
3	4,000 clips/project, all tiers	Planned

Quick start

# Install (requires Python ≥ 3.11 and uv)
uv pip install -e .

# Generate a single clip from a scene config
synthbanshee generate --config configs/scenes/test_scene_001.yaml

# Generate a full Tier A batch from a run config
synthbanshee generate-batch \
  --run-config configs/run_configs/tier_a_500_she_proves.yaml \
  --output-dir data/he

# Run the automated QA suite on a dataset directory
synthbanshee qa-report data/he

# Validate an existing clip
synthbanshee validate data/he/clip_001.wav

API credentials required (set in environment or .env):

AZURE_TTS_KEY=...
AZURE_TTS_REGION=...
ANTHROPIC_API_KEY=...   # for script generation

Docs

Document	Contents
`docs/spec.md`	Audio format, file naming, label schema, IAA protocol
`docs/implementation_plan.md`	Phased milestones, module map, API cost estimates
`docs/design_approaches.md`	Design decisions and rationale
`CLAUDE.md`	Claude Code context guide (pipeline constraints, conventions)

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
.claude/commands		.claude/commands
.github		.github
configs		configs
docs		docs
memory		memory
synthbanshee		synthbanshee
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.python-version		.python-version
AGENTS.md		AGENTS.md
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SynthBanshee

How it works

Dataset tiers

Label taxonomy

Current status

Quick start

Docs

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SynthBanshee

How it works

Dataset tiers

Label taxonomy

Current status

Quick start

Docs

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages