podscripter-project

This is the HuggingFace organization for podscripter, a Dockerized local-first transcription tool built on OpenAI Whisper, pyannote.audio speaker diarization, and sentence-transformers punctuation restoration. Primary language focus: English, Spanish, French.

This org doesn't publish models — Whisper and pyannote live in their own upstream orgs. What lives here is the supporting data that the podscripter project owns and republishes under permissive licenses, primarily for testing and reproducibility.

What's published here

Datasets

podscripter-project/test-fixtures — small, curated EN/ES/FR audio clips (CC-BY 4.0) used by podscripter's Tier 1 regression tests. Audio is sourced from permissively licensed public corpora (LibriSpeech, FLEURS, Common Voice, VoxPopuli, AMI, MLS) and trimmed/concatenated to exercise specific pipeline code paths (single-speaker ASR, multi-speaker diarization, chunked-mode transcription). Each clip ships with verbatim transcripts, speaker turns, source attribution, and per-fixture WER/DER thresholds.
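As a rough illustration of how a per-fixture WER threshold can gate a regression test, the sketch below computes word error rate with a plain word-level edit distance and compares it to a threshold. The field names (transcript, max_wer), the sample sentence, and the normalization are assumptions made for this example; the actual fixture schema and scoring code live in the podscripter repo and may differ.

```python
def normalize(text: str) -> list[str]:
    """Lowercase and strip surrounding punctuation (illustrative normalization only)."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical fixture record; the key names are assumptions, not the real schema.
expected = {"transcript": "The quick brown fox jumps over the lazy dog.", "max_wer": 0.15}
hypothesis = "the quick brown fox jumped over the lazy dog"

wer = word_error_rate(expected["transcript"], hypothesis)
passed = wer <= expected["max_wer"]
```

Here one substituted word out of nine gives a WER of about 0.11, which stays under the assumed 0.15 threshold. DER thresholds would be checked the same way, with a diarization scorer in place of the edit distance.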

License posture

Everything published here is permissively licensed (CC-BY 4.0 or CC0 1.0). An aggregate artifact takes the license of its most restrictive component, typically CC-BY 4.0, which requires attribution and an indication of changes when redistributed. Per-source attribution lives in each artifact's dataset card and, for test-fixtures, in tests/fixtures/audio/LICENSES.md in the podscripter repo.

NC/ND-licensed sources are deliberately excluded so artifacts here can be freely redistributed.
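The "most restrictive component" rule can be sketched as a small lookup. This is only an illustration of the policy described above: the function name, the ranking table, and the use of SPDX-style identifiers are assumptions, not code from the podscripter repo.

```python
# Accepted licenses ranked from least to most restrictive (assumed ranking table).
RESTRICTIVENESS = {"CC0-1.0": 0, "CC-BY-4.0": 1}


def aggregate_license(component_licenses: list[str]) -> str:
    """Return the license an aggregate artifact should carry."""
    if not component_licenses:
        raise ValueError("at least one component license is required")
    # Anything outside the accepted set (e.g. NC or ND variants) is rejected outright.
    rejected = [lic for lic in component_licenses if lic not in RESTRICTIVENESS]
    if rejected:
        raise ValueError(f"license not accepted: {rejected}")
    return max(component_licenses, key=RESTRICTIVENESS.__getitem__)


mixed = aggregate_license(["CC0-1.0", "CC-BY-4.0", "CC0-1.0"])
```

A clip concatenated from CC0 and CC-BY sources therefore comes out as CC-BY-4.0, while a CC-BY-NC component raises an error instead of being silently included.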

Contributing

Issues, fixture proposals, and bug-reproduction clips all go through the podscripter GitHub repo. The contribution workflow for new audio fixtures covers trimming, licensing requirements, the .expected.json schema, and bumping HF_REVISION so the dataset and tests stay in lockstep.
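To show why a pinned revision keeps the dataset and tests in lockstep, here is a minimal sketch of fetching the fixtures at one fixed revision with huggingface_hub.snapshot_download. The HF_REVISION value below is a placeholder, and the helper names are hypothetical; how the podscripter tests actually read HF_REVISION and download fixtures is defined in the repo and may differ.

```python
# Placeholder only: the real pinned value is defined in the podscripter repo.
HF_REVISION = "0123456789abcdef0123456789abcdef01234567"


def fixture_download_kwargs(revision: str = HF_REVISION) -> dict:
    """Build arguments for snapshot_download, pinned to a single revision."""
    return {
        "repo_id": "podscripter-project/test-fixtures",
        "repo_type": "dataset",
        "revision": revision,  # a commit hash or tag, never a moving branch
    }


def fetch_fixtures(revision: str = HF_REVISION) -> str:
    """Download the pinned snapshot and return its local directory."""
    # Imported lazily so this module loads without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**fixture_download_kwargs(revision))
```

Because the revision is fixed, a new fixture pushed to the dataset is invisible to the tests until HF_REVISION is bumped in the same change that adds its expectations.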