feat: add Common Voice Explorer by wavekat-eason · Pull Request #24 · wavekat/wavekat-lab

wavekat-eason · 2026-04-04T04:28:19Z

Summary

- Add a full-stack Common Voice dataset explorer: Cloudflare Worker API + React web app for browsing, filtering, and playing audio clips
- Add sync script to download Common Voice datasets and store metadata in D1 / audio in R2
- Add CI workflows for ephemeral Azure VM runner provisioning, dataset sync, and Cloudflare deployment
- Include GitHub OAuth + terms acceptance gate for access control

Test plan

- Verify CI workflows run successfully (provision, sync, deploy)
- Test the web app: login, browse datasets, filter clips, play audio
- Confirm D1 metadata and R2 audio are populated correctly via sync script

🤖 Generated with Claude Code

Browser-based tool to filter and listen to Common Voice clips for turn detection model training. Architecture: Cloudflare D1/R2 for data, ephemeral Azure VM as GitHub Actions runner for dataset sync. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- cv-runner-provision.yml: spins up Azure VM as self-hosted GH Actions runner with configurable size, disk, and auto-shutdown timer
- cv-sync.yml: runs Common Voice dataset sync on the runner, then cleans up the VM automatically after completion

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add .github/workflows/README.md with setup guide and bash variables
- Use repository variables (vars.) instead of secrets for AZURE_RESOURCE_GROUP and AZURE_LOCATION
- Remove deprecated --sdk-auth flag from docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Workflow README: Wrangler setup guide, CF token permissions, useful commands for VM management
- cv-sync.yml: dropdown options for locale and split inputs, use vars for non-sensitive Cloudflare config
- gitignore .wrangler/ directory

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

TypeScript script that downloads a CV dataset from Mozilla Data Collective, parses TSV metadata into Cloudflare D1, and uploads MP3 clips to R2. Supports resumable runs (INSERT OR IGNORE, skip existing R2 objects) and parallel uploads.

- tools/cv-explorer/scripts/sync.ts: main sync logic
- cv-sync.yml: updated with dataset_id input and R2 credentials
- README.md: R2 API token setup guide

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Remove --locale arg, auto-detect from archive directory
- Parse version from top-level dir (e.g. cv-corpus-25.0-2026-03-09)
- Add datasets table to track sync status
- Support --split all to sync every TSV found
- Add --force flag to re-sync already synced datasets
- DELETE + INSERT for clean re-syncs (no stale rows)
- Add version column to clips table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
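
The directory-name parsing mentioned in this commit could be sketched as below. The helper name, the return shape, and the choice to split the release number from the date are assumptions for illustration; the commit only gives the example directory name.

```typescript
// Hypothetical helper: not taken from sync.ts.
interface CorpusDirInfo {
  version: string;     // e.g. "25.0"
  releaseDate: string; // e.g. "2026-03-09"
}

// Parse a top-level archive directory such as "cv-corpus-25.0-2026-03-09".
// Returns null when the name does not follow that pattern.
function parseCorpusDir(dirName: string): CorpusDirInfo | null {
  const match = /^cv-corpus-(\d+(?:\.\d+)*)-(\d{4}-\d{2}-\d{2})$/.exec(dirName);
  if (!match) return null;
  return { version: match[1], releaseDate: match[2] };
}

console.log(parseCorpusDir("cv-corpus-25.0-2026-03-09"));
```

Returning null instead of throwing lets the caller decide whether an unexpected directory name is fatal.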

Streams are non-retryable by the AWS SDK since they can't be replayed after consumption. Read files into Buffer so both SDK internal retries and our manual retry loop work correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Previously all datasets extracted into a shared directory, causing locale detection to pick the wrong locale when multiple datasets shared the same corpus version. Each dataset now extracts into its own subdirectory, and stale extractions are cleaned up on each run. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Previously has_audio was only updated after all R2 uploads finished. If the script crashed mid-upload, no clips would be marked. Now flushes to D1 every 500 clips so progress is preserved on interruption. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
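
A minimal sketch of the periodic flush described above, assuming clip IDs are strings and that the actual D1 update is supplied as a callback. The class name and its methods are hypothetical; only the 500-clip batch size comes from the commit message.

```typescript
// Hypothetical batching helper: collects uploaded clip IDs and hands them
// to `flush` whenever `batchSize` IDs have accumulated.
class HasAudioFlusher {
  private pending: string[] = [];
  private readonly flush: (clipIds: string[]) => Promise<void>;
  private readonly batchSize: number;

  constructor(flush: (clipIds: string[]) => Promise<void>, batchSize = 500) {
    this.flush = flush;
    this.batchSize = batchSize;
  }

  // Record one successfully uploaded clip; flush when the batch is full.
  async add(clipId: string): Promise<void> {
    this.pending.push(clipId);
    if (this.pending.length >= this.batchSize) await this.drain();
  }

  // Flush whatever is pending; call once more after the last upload.
  async drain(): Promise<void> {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = []; // swap first so concurrent add() calls start a new batch
    await this.flush(batch);
  }
}
```

If the process dies mid-upload, at most one unflushed batch of clips is left unmarked rather than the whole run.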

Large dataset downloads can fail mid-stream when the remote server closes the connection. Use HTTP Range headers to resume from the partial file, with up to 10 retries and incremental backoff. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add authentication and terms acceptance to Common Voice Explorer to comply with Common Voice usage terms (no re-hosting).

Worker:
- D1 migration for users + refresh_tokens tables
- GitHub OAuth code exchange, JWT access tokens, refresh token rotation
- Auth + terms middleware on all data endpoints
- Audio cache changed to private, no-store

Frontend:
- Login page with GitHub OAuth flow
- Terms acceptance gate with Mozilla/CC license links
- Auth callback handler (StrictMode-safe)
- All API requests carry JWT via authFetch wrapper
- Audio player fetches blobs with auth headers
- Explorer extracted from App with user info in header

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
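
As an illustration of the authFetch idea named in this commit, a wrapper along these lines would attach the token to every request. This is a sketch, not the PR's code: the factory name `createAuthFetch`, the token getter, the `Bearer` scheme, and the `/api/clips` path in the usage line are all assumptions.

```typescript
// A fetch-compatible function type, so a fake can be injected in tests.
type FetchLike = (input: string, init?: RequestInit) => Promise<Response>;

// Hypothetical factory: wraps a fetch implementation so that each request
// carries the current access token in an Authorization header.
function createAuthFetch(
  getToken: () => string | null,
  fetchImpl: FetchLike = fetch,
): FetchLike {
  return (input, init = {}) => {
    const headers = new Headers(init.headers); // copy caller-supplied headers
    const token = getToken();
    if (token) headers.set("Authorization", `Bearer ${token}`);
    return fetchImpl(input, { ...init, headers });
  };
}

// Usage (hypothetical path): const authFetch = createAuthFetch(() => currentToken);
// const res = await authFetch("/api/clips");
```

Reading the token through a getter on every call means a rotated token is picked up without rebuilding the wrapper.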

When resuming a download and the file is already fully downloaded, the server returns HTTP 416 (Range Not Satisfiable). Treat this as a successful download instead of a fatal error. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
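
The resume logic from the two download commits (Range requests, then the HTTP 416 case) can be summarized as a small decision table. The sketch below is illustrative only: the function names, the action labels, and the 2-second backoff step are assumptions, while the status-code meanings follow standard HTTP range-request behaviour.

```typescript
type ResumeAction = "append" | "restart" | "complete" | "retry";

// Build the Range header for a partial file of `bytesOnDisk` bytes.
// A fresh download sends no Range header at all.
function rangeHeaders(bytesOnDisk: number): Record<string, string> {
  return bytesOnDisk > 0 ? { Range: `bytes=${bytesOnDisk}-` } : {};
}

// Decide what to do with the response to a (possibly ranged) request.
function classifyResponse(status: number, bytesOnDisk: number): ResumeAction {
  if (status === 206) return "append";   // server honoured the Range request
  if (status === 200) return "restart";  // full body sent: write from byte 0
  if (status === 416 && bytesOnDisk > 0) return "complete"; // nothing left to fetch
  return "retry";                        // anything else: back off and try again
}

// Incremental (linear) backoff; the step size is an assumed value.
function backoffMs(attempt: number, stepMs = 2000): number {
  return attempt * stepMs;
}
```

Treating 416 as success only when a partial file exists avoids masking a genuinely bad Range on a fresh download.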

Serve the frontend from the Cloudflare Worker via static assets, add a GitHub Actions workflow for automated deployment on push to main, and document the full setup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

🤖 I have created a release *beep* *boop*

---

## [0.0.10](v0.0.9...v0.0.10) (2026-04-04)

### Features

* add Common Voice Explorer ([#24](#24)) ([86fd2c5](86fd2c5))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

wavekat-eason and others added 30 commits April 3, 2026 11:57

ci: upgrade azure actions to v3 (Node.js 24), use v5 VM sizes

3c0f7c5

- azure/login@v2 → v3, azure/cli@v2 → v3 (Node.js 24 support)
- VM size options updated from D*s_v3 to D*s_v5 (current gen)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

docs: clarify R2 token setup in workflow README

5ddf1a1

Note that R2 S3-compatible tokens require the dashboard (no wrangler CLI support), and the secret is only shown once. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: wrap D1 batch insert body in object

69daeaa

The D1 /query API expects `{ batch: [...] }`, not a raw array. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
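
To make the fix concrete, here is a sketch of the request body shape the commit describes. The `D1Statement` type and the `buildBatchBody` helper are hypothetical names, and the per-statement `sql`/`params` fields are an assumption about how each batch entry is laid out.

```typescript
// One SQL statement plus its bound parameters (assumed entry shape).
interface D1Statement {
  sql: string;
  params?: (string | number | null)[];
}

// Wrap the statements in an object with a `batch` key instead of
// sending the bare array as the JSON body.
function buildBatchBody(statements: D1Statement[]): string {
  return JSON.stringify({ batch: statements });
}

const body = buildBatchBody([
  { sql: "INSERT OR IGNORE INTO clips (id) VALUES (?)", params: ["clip-1"] },
  { sql: "INSERT OR IGNORE INTO clips (id) VALUES (?)", params: ["clip-2"] },
]);
console.log(body);
```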

docs: add datasets table, auto-detect locale, split all

0aaed74

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: clean up NSG, public IP, VNET after VM deletion

bc69a1e

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: upgrade Azure VM sizes to v6

8236e3e

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: disable cleanup job during debugging

7fd13c9

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: remove locale input, add split all option

64259eb

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: make runner persistent for multiple jobs

b668e6c

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: switch runner to ubuntu-latest-m

36d4c54

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: migrate D1 tables for missing columns

55c6a2d

ALTER TABLE to add version/dataset_id columns that may be absent from older schema, since CREATE TABLE IF NOT EXISTS won't alter existing tables. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
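
One way to express this migration is to compare the columns that already exist against the columns the script needs and emit an ALTER TABLE for each gap. The sketch below assumes the existing column names were read beforehand (for example with SQLite's `PRAGMA table_info`); the helper name and the `TEXT` column types are illustrative, not taken from the PR.

```typescript
// A column the schema should have, with its type/constraint clause.
interface ColumnSpec {
  name: string;
  ddl: string; // e.g. "TEXT" (assumed type)
}

// Return ALTER TABLE statements for every wanted column that is missing.
// Needed because CREATE TABLE IF NOT EXISTS leaves an existing table untouched.
function missingColumnStatements(
  table: string,
  existing: string[],
  wanted: ColumnSpec[],
): string[] {
  const have = new Set(existing);
  return wanted
    .filter((col) => !have.has(col.name))
    .map((col) => `ALTER TABLE ${table} ADD COLUMN ${col.name} ${col.ddl}`);
}

console.log(
  missingColumnStatements("clips", ["id", "path", "version"], [
    { name: "version", ddl: "TEXT" },
    { name: "dataset_id", ddl: "TEXT" },
  ]),
);
```

Because only missing columns produce statements, running the migration twice is harmless.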

feat: preflight D1/R2 credential checks

00bab97

Verify D1 and R2 access at startup before downloading or processing any data, so bad credentials fail fast with a clear error message. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: R2 upload retry with backoff, lower concurrency

7f2ffb9

Reduce default concurrency from 20 to 8 and add 3-attempt retry with exponential backoff to handle transient timeouts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
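
A generic version of the retry described here might look like the following. It is a sketch under assumptions: the `withRetry` name, the 500 ms base delay, and the injectable `sleep` parameter are not from the PR; only the three attempts and the exponential growth are.

```typescript
// Run `task` up to `attempts` times, doubling the wait after each failure.
// `sleep` is injectable so the delays can be observed or skipped in tests.
async function withRetry<T>(
  task: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // Wait baseDelayMs, 2x, 4x, ... but not after the final attempt.
      if (attempt < attempts - 1) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

With the defaults, a task that keeps failing waits 500 ms and then 1000 ms before the third and last attempt rethrows the error.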

feat: track has_audio in D1 after R2 upload

e96355c

Add has_audio column to clips table and update it after successful R2 uploads so the DB knows which clips have audio available. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Runs on "cv-sync"

ef5445c

feat: check sync status before download, add force option

cf93abc

Query D1 for already-synced splits before downloading the dataset archive. Add --force workflow input to override. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add cv-explorer worker and web app

6efd245

Hono API worker with D1/R2 bindings for browsing Common Voice clips, plus a React frontend with filtering, sorting, and audio playback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add ETA to R2 upload progress and concurrency option

21fbc8e

Show elapsed time and estimated remaining time during R2 uploads. Add r2_concurrency as a GitHub Actions workflow input (default 32). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
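
The remaining-time estimate can be derived from the average time per completed item. The sketch below assumes a simple linear extrapolation; the function names and the output format are illustrative and may differ from what the script prints.

```typescript
// Estimate remaining milliseconds from progress so far.
// Returns null until at least one item has completed (no rate to extrapolate).
function estimateRemainingMs(done: number, total: number, elapsedMs: number): number | null {
  if (done >= total) return 0;
  if (done <= 0) return null;
  return (elapsedMs / done) * (total - done);
}

// Render a duration as minutes and seconds.
function formatDuration(ms: number): string {
  const totalSeconds = Math.round(ms / 1000);
  return `${Math.floor(totalSeconds / 60)}m ${totalSeconds % 60}s`;
}

// 250 of 1000 uploads done after 10 s: 40 ms per item, 750 items left.
const remaining = estimateRemainingMs(250, 1000, 10_000);
console.log(remaining === null ? "estimating..." : formatDuration(remaining)); // prints "0m 30s"
```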

feat: add runner selection to cv-sync workflow

2bada18

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add ubuntu-latest to runner options

e61d801

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add ETA to D1 insert progress

b1f7eaa

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: use multi-row INSERT for D1 batch inserts

494359b

Pack 76 rows per INSERT statement × 100 statements per API call, reducing API calls from ~941 to ~13 for 94k rows. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

wavekat-eason and others added 23 commits April 3, 2026 18:24

fix: reduce D1 params per statement to fit D1 limit

5f364e1

D1 has a stricter SQL variable limit than standard SQLite. Use 100 params per statement (7 rows) instead of 999. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
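
The row counts in these two commits follow from dividing the parameter limit by the number of columns per row: 76 rows under a 999-parameter limit and 7 rows under a 100-parameter limit are both consistent with 13 columns per row. That column count is inferred from the arithmetic, not stated in the PR. A sketch of the chunking, with hypothetical helper names:

```typescript
type SqlValue = string | number | null;

interface Statement {
  sql: string;
  params: SqlValue[];
}

// How many rows fit in one statement without exceeding the parameter limit.
function rowsPerStatement(columnCount: number, maxParams: number): number {
  return Math.max(1, Math.floor(maxParams / columnCount));
}

// Split `rows` into multi-row INSERT statements, each within `maxParams`.
function buildMultiRowInserts(
  table: string,
  columns: string[],
  rows: SqlValue[][],
  maxParams = 100,
): Statement[] {
  const perStatement = rowsPerStatement(columns.length, maxParams);
  const rowPlaceholder = `(${columns.map(() => "?").join(", ")})`;
  const statements: Statement[] = [];
  for (let i = 0; i < rows.length; i += perStatement) {
    const chunk = rows.slice(i, i + perStatement);
    statements.push({
      sql: `INSERT INTO ${table} (${columns.join(", ")}) VALUES ${chunk
        .map(() => rowPlaceholder)
        .join(", ")}`,
      params: ([] as SqlValue[]).concat(...chunk), // flatten rows in order
    });
  }
  return statements;
}

console.log(rowsPerStatement(13, 999), rowsPerStatement(13, 100)); // prints "76 7"
```

Deriving the row count from the limit keeps the statement valid if columns are added later.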

docs: add auth implementation plan and rebrand to Common Voice Explorer

a0ad824

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: show syncing datasets and add audio filter

de0069c

Display datasets with syncing/failed status indicators. Add has_audio filter to clips query. Handle audio playback errors gracefully. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: replace audio progress bar with waveform display

17930bc

Use wavesurfer.js to render an interactive waveform visualization instead of the plain progress bar in the audio player. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add 128 and 256 R2 concurrency options

680e761

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add download button to audio player

84645a4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: use char length instead of word count for CJK languages

9a87146

Word count split by spaces shows 1 for Chinese sentences. Switch to character length which works correctly for all languages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
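
The difference is easy to see side by side. In this sketch both function names are hypothetical, and counting Unicode code points with `Array.from` is one reasonable reading of "character length"; the PR may simply use `.length`.

```typescript
// Space-delimited word count: fails for scripts written without spaces.
function wordCountBySpaces(sentence: string): number {
  return sentence.trim().split(/\s+/).filter(Boolean).length;
}

// Character length counted in Unicode code points.
function charLength(sentence: string): number {
  return Array.from(sentence).length;
}

console.log(wordCountBySpaces("今天天气很好")); // prints 1
console.log(charLength("今天天气很好"));        // prints 6
```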

feat: add v7 AMD VM size options for runner provisioning

9581410

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: stream TSV parsing to avoid string size limit

2b705c1

Replace readFileSync with readline streaming in parseTsv to handle large Common Voice TSV files that exceed Node.js's ~512MB string limit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chore: add GitHub OAuth client ID to wrangler config

d5a44bd

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add ASSETS binding to wrangler config for SPA routing

85e007f

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: disable assets html_handling to prevent SPA redirect

0687daa

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: handle base64url encoding in JWT decode

70c8be0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
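
For context on this fix: JWT segments use the base64url alphabet (`-` and `_` instead of `+` and `/`, with padding usually omitted), which a plain `atob` call rejects or misreads. A minimal decoding sketch follows; the function names are hypothetical, and it only reads the payload without verifying the signature.

```typescript
// Convert a base64url segment to a UTF-8 string.
function base64UrlToString(segment: string): string {
  const base64 = segment.replace(/-/g, "+").replace(/_/g, "/");
  // Restore the "=" padding that base64url normally strips.
  const padded = base64 + "=".repeat((4 - (base64.length % 4)) % 4);
  const binary = atob(padded);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

// Decode (not verify) the payload segment of a JWT.
function decodeJwtPayload(token: string): Record<string, unknown> {
  const parts = token.split(".");
  if (parts.length !== 3) throw new Error("expected three dot-separated JWT segments");
  return JSON.parse(base64UrlToString(parts[1]));
}

console.log(base64UrlToString("Pz8_")); // prints "???"
```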

chore: add CV Explorer to README and CI

cb36bf3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chore: use Common Voice Explorer instead of CV

fdba759

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add Google Analytics to Common Voice Explorer

32384de

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add GitHub repo link to Explorer header

d3a4f8f

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: redesign CV Explorer login page with card layout

36b4887

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

wavekat-eason merged commit 86fd2c5 into main Apr 4, 2026
4 checks passed

wavekat-eason deleted the feat/cv-dataset-explorer branch April 4, 2026 05:03

github-actions bot mentioned this pull request Apr 4, 2026

chore(main): release 0.0.10 #25

Merged

feat: add Common Voice Explorer #24
wavekat-eason merged 53 commits into main from feat/cv-dataset-explorer
