feat: add Common Voice Explorer #24

Merged
wavekat-eason merged 53 commits into main from feat/cv-dataset-explorer on Apr 4, 2026

wavekat-eason (Contributor) commented Apr 4, 2026

Summary

  • Add a full-stack Common Voice dataset explorer: Cloudflare Worker API + React web app for browsing, filtering, and playing audio clips
  • Add sync script to download Common Voice datasets and store metadata in D1 / audio in R2
  • Add CI workflows for ephemeral Azure VM runner provisioning, dataset sync, and Cloudflare deployment
  • Include GitHub OAuth + terms acceptance gate for access control

Test plan

  • Verify CI workflows run successfully (provision, sync, deploy)
  • Test the web app: login, browse datasets, filter clips, play audio
  • Confirm D1 metadata and R2 audio are populated correctly via sync script

🤖 Generated with Claude Code

wavekat-eason and others added 30 commits April 3, 2026 11:57
Browser-based tool to filter and listen to Common Voice clips for turn
detection model training. Architecture: Cloudflare D1/R2 for data,
ephemeral Azure VM as GitHub Actions runner for dataset sync.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- cv-runner-provision.yml: spins up Azure VM as self-hosted GH Actions
  runner with configurable size, disk, and auto-shutdown timer
- cv-sync.yml: runs Common Voice dataset sync on the runner, then
  cleans up the VM automatically after completion

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add .github/workflows/README.md with setup guide and bash variables
- Use repository variables (vars.) instead of secrets for
  AZURE_RESOURCE_GROUP and AZURE_LOCATION
- Remove deprecated --sdk-auth flag from docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- azure/login@v2 → v3, azure/cli@v2 → v3 (Node.js 24 support)
- VM size options updated from D*s_v3 to D*s_v5 (current gen)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Workflow README: Wrangler setup guide, CF token permissions,
  useful commands for VM management
- cv-sync.yml: dropdown options for locale and split inputs,
  use vars for non-sensitive Cloudflare config
- gitignore .wrangler/ directory

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TypeScript script that downloads a CV dataset from Mozilla Data
Collective, parses TSV metadata into Cloudflare D1, and uploads
MP3 clips to R2. Supports resumable runs (INSERT OR IGNORE, skip
existing R2 objects) and parallel uploads.

- tools/cv-explorer/scripts/sync.ts — main sync logic
- cv-sync.yml — updated with dataset_id input and R2 credentials
- README.md — R2 API token setup guide

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Note that R2 S3-compatible tokens require the dashboard (no wrangler
CLI support), and the secret is only shown once.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The D1 /query API expects `{ batch: [...] }`, not a raw array.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove --locale arg, auto-detect from archive directory
- Parse version from top-level dir (e.g. cv-corpus-25.0-2026-03-09)
- Add datasets table to track sync status
- Support --split all to sync every TSV found
- Add --force flag to re-sync already synced datasets
- DELETE + INSERT for clean re-syncs (no stale rows)
- Add version column to clips table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
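
The version parsing described in the commit above could look roughly like the sketch below. It assumes the top-level directory always follows the `cv-corpus-<version>-<date>` pattern shown in the example; `parseCorpusDir` and `CorpusInfo` are hypothetical names, not taken from sync.ts.

```typescript
// Hypothetical helper: the "cv-corpus-<version>-<date>" layout is inferred
// from the single example in the commit message.
interface CorpusInfo {
  version: string; // e.g. "25.0"
  releaseDate: string; // e.g. "2026-03-09"
}

function parseCorpusDir(dirName: string): CorpusInfo | null {
  // Version is one or more dot-separated numbers; date is YYYY-MM-DD.
  const match = /^cv-corpus-(\d+(?:\.\d+)*)-(\d{4}-\d{2}-\d{2})$/.exec(dirName);
  if (!match) return null;
  return { version: match[1], releaseDate: match[2] };
}
```

Returning `null` for an unrecognized name lets the caller decide whether to fail the sync or fall back to another source for the version.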
ALTER TABLE to add version/dataset_id columns that may be absent
from older schema, since CREATE TABLE IF NOT EXISTS won't alter
existing tables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
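
One way to express the migration described above is to compare the columns a table already has against the ones the script needs and emit an `ALTER TABLE` only for the missing ones. This is a sketch under assumptions: the `TEXT` column types are guesses, `missingColumnMigrations` is a hypothetical name, and the list of existing columns is assumed to come from something like `PRAGMA table_info(clips)`.

```typescript
// Hypothetical helper: returns the ALTER TABLE statements needed to bring an
// existing table up to the wanted schema. Column types are assumptions.
function missingColumnMigrations(
  table: string,
  existingColumns: string[],
  wanted: Record<string, string>,
): string[] {
  const present = new Set(existingColumns);
  return Object.entries(wanted)
    .filter(([name]) => !present.has(name))
    .map(([name, type]) => `ALTER TABLE ${table} ADD COLUMN ${name} ${type}`);
}

// Example: an older clips table that predates both columns.
const migrations = missingColumnMigrations("clips", ["id", "path", "sentence"], {
  version: "TEXT",
  dataset_id: "TEXT",
});
```

Running the helper against an up-to-date table yields an empty list, so the step is safe to repeat on every sync.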
Verify D1 and R2 access at startup before downloading or
processing any data, so bad credentials fail fast with a
clear error message.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reduce default concurrency from 20 to 8 and add 3-attempt retry
with exponential backoff to handle transient timeouts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
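
The retry behaviour in the commit above can be sketched as a small wrapper. The three attempts come from the commit message; the 500 ms base delay and the names `withRetry` and `backoffDelayMs` are assumptions for illustration.

```typescript
// Delay before the next attempt: baseMs, 2 * baseMs, 4 * baseMs, ...
function backoffDelayMs(attempt: number, baseMs = 500): number {
  return baseMs * 2 ** (attempt - 1);
}

// Runs task up to maxAttempts times, sleeping between failed attempts.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
  baseMs = 500, // assumed base delay, not stated in the commit
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        const delay = backoffDelayMs(attempt, baseMs);
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}
```

An upload would then be wrapped as `withRetry(() => uploadClip(path))`, where `uploadClip` stands in for whatever function performs the R2 upload.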
Streams are non-retryable by the AWS SDK since they can't be
replayed after consumption. Read files into Buffer so both SDK
internal retries and our manual retry loop work correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add has_audio column to clips table and update it after
successful R2 uploads so the DB knows which clips have
audio available.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Query D1 for already-synced splits before downloading the
dataset archive. Add --force workflow input to override.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hono API worker with D1/R2 bindings for browsing Common Voice clips,
plus a React frontend with filtering, sorting, and audio playback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously, all datasets were extracted into a shared directory, so
locale detection could pick the wrong locale when multiple datasets
shared the same corpus version. Each dataset is now extracted into its
own subdirectory, and stale extractions are cleaned up on each run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Show elapsed time and estimated remaining time during R2 uploads.
Add r2_concurrency as a GitHub Actions workflow input (default 32).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pack 76 rows per INSERT statement × 100 statements per API call,
reducing API calls from ~941 to ~13 for 94k rows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
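
The drop in API calls follows from ceiling arithmetic over rows, statements, and calls. The sketch below assumes a round 94,000 rows and assumes the earlier approach sent one row per statement, which the commit message does not state; `apiCallsNeeded` is a hypothetical name.

```typescript
// Number of API calls needed when rows are packed into statements and
// statements are packed into batched calls.
function apiCallsNeeded(
  totalRows: number,
  rowsPerStatement: number,
  statementsPerCall: number,
): number {
  const statements = Math.ceil(totalRows / rowsPerStatement);
  return Math.ceil(statements / statementsPerCall);
}

// 94000 is a stand-in for the "94k rows" in the commit message.
const callsBefore = apiCallsNeeded(94000, 1, 100); // assumed: one row per statement
const callsAfter = apiCallsNeeded(94000, 76, 100); // 76 rows per statement
```

With these inputs the function gives 940 calls before and 13 after, in line with the approximate figures quoted in the commit.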
wavekat-eason and others added 23 commits April 3, 2026 18:24
D1 has a stricter SQL variable limit than standard SQLite.
Use 100 params per statement (7 rows) instead of 999.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
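
Capping the bound parameters per statement can be sketched as follows. The 13-column width in the example is an inference, not something the commits state: floor(999 / 13) = 76 and floor(100 / 13) = 7 would match both figures quoted. `buildInsertStatements` and `Statement` are hypothetical names.

```typescript
interface Statement {
  sql: string;
  params: unknown[];
}

// Splits rows into multi-row INSERTs so that no statement binds more than
// maxParams values. Table and column names are interpolated directly, so
// they must come from trusted code, never from user input.
function buildInsertStatements(
  table: string,
  columns: string[],
  rows: unknown[][],
  maxParams = 100,
): Statement[] {
  if (columns.length === 0) throw new Error("At least one column is required");
  const rowsPerStatement = Math.floor(maxParams / columns.length);
  if (rowsPerStatement < 1) {
    throw new Error(`One row needs ${columns.length} params, limit is ${maxParams}`);
  }
  const placeholderRow = `(${columns.map(() => "?").join(", ")})`;
  const statements: Statement[] = [];
  for (let i = 0; i < rows.length; i += rowsPerStatement) {
    const chunk = rows.slice(i, i + rowsPerStatement);
    const params: unknown[] = [];
    for (const row of chunk) params.push(...row);
    const values = chunk.map(() => placeholderRow).join(", ");
    statements.push({
      sql: `INSERT INTO ${table} (${columns.join(", ")}) VALUES ${values}`,
      params,
    });
  }
  return statements;
}
```

Deriving the rows-per-statement figure from the limit and the column count means a later schema change adjusts the batch size automatically instead of silently exceeding the limit.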
Display datasets with syncing/failed status indicators. Add has_audio
filter to clips query. Handle audio playback errors gracefully.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use wavesurfer.js to render an interactive waveform visualization
instead of the plain progress bar in the audio player.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously has_audio was only updated after all R2 uploads finished.
If the script crashed mid-upload, no clips would be marked. Now flushes
to D1 every 500 clips so progress is preserved on interruption.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
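
The periodic flush described above amounts to buffering uploaded clip IDs and writing them out whenever the buffer reaches a threshold. A minimal sketch, assuming a synchronous `flush` callback for brevity (the real D1 update would be asynchronous); `createAudioFlagBuffer` is a hypothetical name.

```typescript
// Buffers clip IDs whose audio has been uploaded and hands them to flush()
// in batches, so progress survives a crash part-way through the uploads.
function createAudioFlagBuffer(
  flush: (clipIds: string[]) => void,
  flushEvery = 500, // threshold taken from the commit message
) {
  let pending: string[] = [];

  const flushNow = (): void => {
    if (pending.length === 0) return;
    const batch = pending;
    pending = [];
    flush(batch);
  };

  return {
    markUploaded(clipId: string): void {
      pending.push(clipId);
      if (pending.length >= flushEvery) flushNow();
    },
    // Call once after the last upload to write out the remainder.
    flushNow,
  };
}
```

At most `flushEvery - 1` uploaded clips can be left unmarked after a crash, and a resumed run that skips existing R2 objects can mark them again.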
Large dataset downloads can fail mid-stream when the remote server
closes the connection. Use HTTP Range headers to resume from the
partial file, with up to 10 retries and incremental backoff.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add authentication and terms acceptance to Common Voice Explorer
to comply with Common Voice usage terms (no re-hosting).

Worker:
- D1 migration for users + refresh_tokens tables
- GitHub OAuth code exchange, JWT access tokens, refresh token rotation
- Auth + terms middleware on all data endpoints
- Audio cache changed to private, no-store

Frontend:
- Login page with GitHub OAuth flow
- Terms acceptance gate with Mozilla/CC license links
- Auth callback handler (StrictMode-safe)
- All API requests carry JWT via authFetch wrapper
- Audio player fetches blobs with auth headers
- Explorer extracted from App with user info in header

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Counting words by splitting on spaces yields 1 for Chinese sentences.
Switch to character length, which works correctly for all languages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
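
The difference is easy to reproduce. The sample sentences and helper names below are illustrative; `Array.from` counts Unicode code points, which may differ from the exact length measure the PR uses.

```typescript
// Whitespace-based word count: meaningful only for space-delimited scripts.
const wordCount = (s: string): number => s.trim().split(/\s+/).length;

// Code-point count: independent of how the script separates words.
const charCount = (s: string): number => Array.from(s).length;

const english = "the quick brown fox";
const chinese = "今天天气很好";

const englishWords = wordCount(english);
const chineseWords = wordCount(chinese); // no spaces, so the whole sentence is one "word"
const chineseChars = charCount(chinese);
```

Here `englishWords` is 4 while `chineseWords` is 1, even though the Chinese sentence has 6 characters, which is why a length-based measure is the safer filter across locales.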
Replace readFileSync with readline streaming in parseTsv to handle
large Common Voice TSV files that exceed Node.js's ~512MB string limit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
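
Streaming a TSV line by line keeps memory use bounded by the longest line rather than the file size. The sketch below uses Node's `readline` module; `readTsvRows` and `splitTsvLine` are hypothetical names and not the PR's `parseTsv`, and the file is assumed to have a header row with no quoted or escaped tabs.

```typescript
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

// Splits one TSV line into fields (assumes no quoting or escaped tabs).
function splitTsvLine(line: string): string[] {
  return line.split("\t");
}

// Yields one record per data line, keyed by the header row, without ever
// holding the whole file in a single string.
async function* readTsvRows(path: string): AsyncGenerator<Record<string, string>> {
  const rl = createInterface({
    input: createReadStream(path, { encoding: "utf8" }),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });
  let header: string[] | null = null;
  for await (const line of rl) {
    if (line.length === 0) continue;
    const fields = splitTsvLine(line);
    if (header === null) {
      header = fields;
      continue;
    }
    const row: Record<string, string> = {};
    header.forEach((name, i) => {
      row[name] = fields[i] ?? "";
    });
    yield row;
  }
}
```

A caller consumes it with `for await (const row of readTsvRows(path)) { ... }`, which also makes it straightforward to batch rows for insertion as they arrive.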
When resuming a download and the file is already fully downloaded,
the server returns HTTP 416 (Range Not Satisfiable). Treat this as
a successful download instead of a fatal error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
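
The resume logic from this commit and the earlier Range-header commit can be summarized as a request header plus a decision on the response status. A sketch with hypothetical names (`rangeHeaders`, `classifyResponse`, `ResumeAction`); the status-code meanings are standard HTTP, but how sync.ts structures this is not shown in the PR.

```typescript
type ResumeAction = "write-from-start" | "append" | "already-complete" | "retry";

// Ask only for the bytes that are not on disk yet.
function rangeHeaders(bytesOnDisk: number): Record<string, string> {
  return bytesOnDisk > 0 ? { Range: `bytes=${bytesOnDisk}-` } : {};
}

// Decide what to do with the response to a (possibly ranged) request.
function classifyResponse(status: number, bytesOnDisk: number): ResumeAction {
  if (status === 206) return "append"; // partial content: continue the file
  if (status === 200) return "write-from-start"; // Range ignored: full body sent
  // 416 on a resumed request: the requested start is at or past the end of
  // the file, which the commit treats as "already fully downloaded".
  if (status === 416 && bytesOnDisk > 0) return "already-complete";
  return "retry";
}
```

Treating 416 as success only when a partial file exists avoids masking a genuine error on a fresh download.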
Serve the frontend from the Cloudflare Worker via static assets,
add a GitHub Actions workflow for automated deployment on push to
main, and document the full setup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@wavekat-eason wavekat-eason merged commit 86fd2c5 into main Apr 4, 2026
4 checks passed
@wavekat-eason wavekat-eason deleted the feat/cv-dataset-explorer branch April 4, 2026 05:03
wavekat-eason pushed a commit that referenced this pull request Apr 4, 2026
🤖 I have created a release *beep* *boop*
---


## [0.0.10](v0.0.9...v0.0.10) (2026-04-04)

### Features

* add Common Voice Explorer ([#24](#24)) ([86fd2c5](86fd2c5))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>