podscripter-project

This is the HuggingFace organization for podscripter, a Dockerized local-first transcription tool built on OpenAI Whisper, pyannote.audio speaker diarization, and sentence-transformers punctuation restoration. Primary language focus: English, Spanish, French.

This org doesn't publish models; Whisper and pyannote live in their own upstream orgs. What lives here is the supporting data that the podscripter project owns and republishes under permissive licenses, primarily for testing and reproducibility.

What's published here

Datasets

License posture

Everything published here is permissively licensed (CC-BY 4.0 or CC0 1.0). An aggregate's license matches its most restrictive component, typically CC-BY 4.0, which requires attribution and indication of changes when redistributed. Per-source attribution lives in each artifact's dataset card and, for the test fixtures, in tests/fixtures/audio/LICENSES.md in the podscripter repo.

NC/ND-licensed sources are deliberately excluded so artifacts here can be freely redistributed.
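The "most restrictive component" rule can be sketched in a few lines of Python. This is only an illustration, not tooling shipped by the project: the function name `aggregate_license` and the use of SPDX identifiers as keys are assumptions, and the ranking simply encodes that CC0 1.0 imposes no conditions while CC-BY 4.0 adds attribution.

```python
# Hypothetical helper: not part of podscripter, shown only to illustrate the rule.
# Ranking of the two accepted licenses, lowest to highest restrictiveness.
ACCEPTED_RANK = {"CC0-1.0": 0, "CC-BY-4.0": 1}


def aggregate_license(component_licenses):
    """Return the most restrictive license among the components.

    Any license outside the accepted set (for example an NC or ND variant)
    is rejected instead of being ranked.
    """
    if not component_licenses:
        raise ValueError("no component licenses given")
    for lic in component_licenses:
        if lic not in ACCEPTED_RANK:
            raise ValueError(f"license not accepted: {lic}")
    return max(component_licenses, key=ACCEPTED_RANK.__getitem__)
```

Under these assumptions, a bundle mixing CC0 1.0 and CC-BY 4.0 clips comes out as `CC-BY-4.0`, and a bundle containing `CC-BY-NC-4.0` raises `ValueError` before anything is aggregated.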

Contributing

Issues, fixture proposals, and bug-reproduction clips all go through the podscripter GitHub repo. The contribution workflow for new audio fixtures covers trimming, licensing requirements, the .expected.json schema, and bumping HF_REVISION so the dataset and tests stay in lockstep.
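To illustrate why bumping HF_REVISION keeps the dataset and tests in lockstep, here is a minimal sketch of resolving fixture files against a pinned revision. It relies on the Hub's `resolve/<revision>/<path>` URL layout for dataset repos. The dataset name, the revision value, and the helper `pinned_fixture_url` are placeholders invented for this example; how the podscripter tests actually store and consume HF_REVISION is not described here.

```python
from urllib.parse import quote

# Placeholder values for illustration only; the real dataset id and
# revision live in the podscripter repo.
REPO_ID = "podscripter-project/example-fixtures"  # hypothetical dataset name
HF_REVISION = "0000000"  # placeholder commit hash or tag


def pinned_fixture_url(filename, repo_id=REPO_ID, revision=HF_REVISION):
    """Build a download URL for one file at a fixed dataset revision.

    Because the revision is part of the URL, tests keep fetching the same
    bytes until HF_REVISION is deliberately bumped.
    """
    return (
        f"https://huggingface.co/datasets/{repo_id}"
        f"/resolve/{quote(revision, safe='')}/{quote(filename)}"
    )
```

With this shape, adding a fixture is a two-step change: publish the new dataset commit, then bump the pinned revision in the same pull request that adds the matching expected-output file.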