Merged
Browser-based tool to filter and listen to Common Voice clips for turn detection model training. Architecture: Cloudflare D1/R2 for data, ephemeral Azure VM as GitHub Actions runner for dataset sync. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- cv-runner-provision.yml: spins up an Azure VM as a self-hosted GitHub Actions runner with configurable size, disk, and auto-shutdown timer
- cv-sync.yml: runs the Common Voice dataset sync on the runner, then cleans up the VM automatically after completion

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add .github/workflows/README.md with setup guide and bash variables
- Use repository variables (vars.) instead of secrets for AZURE_RESOURCE_GROUP and AZURE_LOCATION
- Remove deprecated --sdk-auth flag from docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- azure/login@v2 → v3, azure/cli@v2 → v3 (Node.js 24 support)
- VM size options updated from D*s_v3 to D*s_v5 (current gen)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Workflow README: Wrangler setup guide, CF token permissions, useful commands for VM management
- cv-sync.yml: dropdown options for locale and split inputs, use vars for non-sensitive Cloudflare config
- gitignore .wrangler/ directory

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TypeScript script that downloads a CV dataset from Mozilla Data Collective, parses TSV metadata into Cloudflare D1, and uploads MP3 clips to R2. Supports resumable runs (INSERT OR IGNORE, skip existing R2 objects) and parallel uploads.

- tools/cv-explorer/scripts/sync.ts: main sync logic
- cv-sync.yml: updated with dataset_id input and R2 credentials
- README.md: R2 API token setup guide

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Note that R2 S3-compatible tokens require the dashboard (no wrangler CLI support), and the secret is only shown once. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The D1 /query API expects `{ batch: [...] }`, not a raw array.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove --locale arg, auto-detect from archive directory
- Parse version from top-level dir (e.g. cv-corpus-25.0-2026-03-09)
- Add datasets table to track sync status
- Support --split all to sync every TSV found
- Add --force flag to re-sync already synced datasets
- DELETE + INSERT for clean re-syncs (no stale rows)
- Add version column to clips table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
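Parsing the version from the top-level directory could look roughly like the sketch below. The helper name `parseCorpusDir` and the regular expression are assumptions derived from the single example name `cv-corpus-25.0-2026-03-09`; the actual sync.ts may parse it differently.

```typescript
// Hypothetical helper: splits a Common Voice archive directory name
// into a version and a release date. The pattern is inferred from the
// one example in the commit message and may not cover every release.
interface CorpusDirInfo {
  version: string;
  releaseDate: string;
}

function parseCorpusDir(dirName: string): CorpusDirInfo | null {
  const match = /^cv-corpus-(\d+(?:\.\d+)*)-(\d{4}-\d{2}-\d{2})$/.exec(dirName);
  if (match === null) {
    return null; // not a recognised corpus directory
  }
  return { version: match[1], releaseDate: match[2] };
}

const info = parseCorpusDir("cv-corpus-25.0-2026-03-09");
console.log(info);
```

Returning `null` for unrecognised names lets the caller fail with a clear message instead of syncing under a wrong version.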
ALTER TABLE to add version/dataset_id columns that may be absent from older schema, since CREATE TABLE IF NOT EXISTS won't alter existing tables. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
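A minimal sketch of that migration step, assuming the set of existing column names has already been read from the database (for example via a `PRAGMA table_info` query, if the database allows it). The function name and the `TEXT` column types are assumptions; only the `clips` table and the `version` and `dataset_id` column names come from this PR.

```typescript
// Columns that newer code expects. Types are assumed for illustration.
const REQUIRED_COLUMNS: ReadonlyArray<[name: string, type: string]> = [
  ["version", "TEXT"],
  ["dataset_id", "TEXT"],
];

// Returns one ALTER TABLE statement per column missing from an older schema.
function missingColumnStatements(
  table: string,
  existingColumns: ReadonlySet<string>,
): string[] {
  return REQUIRED_COLUMNS
    .filter(([name]) => !existingColumns.has(name))
    .map(([name, type]) => `ALTER TABLE ${table} ADD COLUMN ${name} ${type}`);
}

// An older table that already has `version` but not `dataset_id`.
console.log(missingColumnStatements("clips", new Set(["id", "sentence", "version"])));
```

Checking first keeps the migration idempotent, since adding a column that already exists is an error in SQLite.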
Verify D1 and R2 access at startup before downloading or processing any data, so bad credentials fail fast with a clear error message. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reduce default concurrency from 20 to 8 and add 3-attempt retry with exponential backoff to handle transient timeouts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
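The retry behaviour described here can be sketched as follows. The helper names and the 500 ms base delay are assumptions; the commit only states three attempts with exponential backoff.

```typescript
// Delay before the next attempt; `attempt` is 1-based, so the delays
// are base, 2*base, 4*base, ... (base value is an assumption).
function backoffDelayMs(attempt: number, baseMs = 500): number {
  return baseMs * 2 ** (attempt - 1);
}

// Runs `task` up to `maxAttempts` times, sleeping between failures.
// Rethrows the last error if every attempt fails.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
  baseMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        const delay = backoffDelayMs(attempt, baseMs);
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

An upload call would then be wrapped as `withRetry(() => uploadClip(path))`, where `uploadClip` stands in for whatever function performs the R2 upload.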
Streams are non-retryable by the AWS SDK since they can't be replayed after consumption. Read files into Buffer so both SDK internal retries and our manual retry loop work correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add has_audio column to clips table and update it after successful R2 uploads so the DB knows which clips have audio available. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Query D1 for already-synced splits before downloading the dataset archive. Add --force workflow input to override. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hono API worker with D1/R2 bindings for browsing Common Voice clips, plus a React frontend with filtering, sorting, and audio playback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously all datasets extracted into a shared directory, causing locale detection to pick the wrong locale when multiple datasets shared the same corpus version. Each dataset now extracts into its own subdirectory, and stale extractions are cleaned up on each run. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Show elapsed time and estimated remaining time during R2 uploads. Add r2_concurrency as a GitHub Actions workflow input (default 32). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
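One simple way to estimate the remaining time is to assume the average upload rate so far continues. The sketch below uses that linear estimate; the function names and output format are illustrative and not taken from the script.

```typescript
// Linear estimate: time per finished item multiplied by items left.
// Returns null until at least one item has finished.
function estimateRemainingMs(
  done: number,
  total: number,
  elapsedMs: number,
): number | null {
  if (done <= 0 || total <= 0) {
    return null;
  }
  const remaining = Math.max(total - done, 0);
  return (elapsedMs / done) * remaining;
}

// Formats milliseconds as e.g. "3m05s".
function formatDuration(ms: number): string {
  const totalSeconds = Math.round(ms / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${minutes}m${seconds < 10 ? "0" : ""}${seconds}s`;
}

// 250 of 1000 clips uploaded after one minute.
const eta = estimateRemainingMs(250, 1000, 60_000);
console.log(eta === null ? "estimating..." : formatDuration(eta));
```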
Pack 76 rows per INSERT statement × 100 statements per API call, reducing API calls from ~941 to ~13 for 94k rows. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
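The arithmetic behind those numbers can be checked with a two-level chunking sketch: rows are grouped into statements, and statements into API calls. The helper names are hypothetical; the sizes 76 and 100 and the 94k row count come from the commit message.

```typescript
// Splits `items` into consecutive groups of at most `size` elements.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (size < 1) {
    throw new RangeError("size must be >= 1");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Outer array: API calls. Middle: INSERT statements. Inner: rows.
function planBatches<T>(
  rows: readonly T[],
  rowsPerStatement = 76,
  statementsPerCall = 100,
): T[][][] {
  return chunk(chunk(rows, rowsPerStatement), statementsPerCall);
}

const rows = Array.from({ length: 94_000 }, (_, i) => i);
const plan = planBatches(rows);
console.log(plan.length); // number of API calls
```

With 94,000 rows this yields 1,237 statements (the last one holding 64 rows) spread over 13 calls, which matches the "~13" figure above.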
D1 has a stricter SQL variable limit than standard SQLite. Use 100 params per statement (7 rows) instead of 999. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
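The row counts follow from dividing the parameter limit by the number of bound columns per row. The sketch below assumes 13 columns per row, which is inferred from the figures in these commits (76 rows under 999 parameters, 7 rows under 100) rather than stated anywhere; the helper names are hypothetical.

```typescript
// Largest number of rows whose bound parameters fit under the limit.
function rowsPerStatement(maxParams: number, columnsPerRow: number): number {
  if (columnsPerRow < 1) {
    throw new RangeError("columnsPerRow must be >= 1");
  }
  const rows = Math.floor(maxParams / columnsPerRow);
  if (rows < 1) {
    throw new RangeError("a single row exceeds the parameter limit");
  }
  return rows;
}

// Builds the VALUES placeholder list for a multi-row INSERT,
// e.g. "(?, ?, ?), (?, ?, ?)" for 2 rows of 3 columns.
function placeholders(rowCount: number, columnsPerRow: number): string {
  const oneRow = `(${Array.from({ length: columnsPerRow }, () => "?").join(", ")})`;
  return Array.from({ length: rowCount }, () => oneRow).join(", ");
}

const COLUMNS = 13; // assumption inferred from the commit messages
console.log(rowsPerStatement(999, COLUMNS)); // standard SQLite default limit
console.log(rowsPerStatement(100, COLUMNS)); // limit used for D1 here
```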
Display datasets with syncing/failed status indicators. Add has_audio filter to clips query. Handle audio playback errors gracefully. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use wavesurfer.js to render an interactive waveform visualization instead of the plain progress bar in the audio player. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously has_audio was only updated after all R2 uploads finished. If the script crashed mid-upload, no clips would be marked. Now flushes to D1 every 500 clips so progress is preserved on interruption. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
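A minimal sketch of the periodic flush, assuming a small buffer class that hands off a batch whenever 500 clip IDs have accumulated. The class name is hypothetical and the flush callback is synchronous here for brevity; in the real script it would presumably be an async D1 update.

```typescript
// Collects items and passes them to `flush` in batches of `threshold`.
class FlushBuffer<T> {
  private pending: T[] = [];
  private readonly threshold: number;
  private readonly flush: (batch: T[]) => void;

  constructor(threshold: number, flush: (batch: T[]) => void) {
    this.threshold = threshold;
    this.flush = flush;
  }

  // Adds one item and flushes as soon as the threshold is reached.
  add(item: T): void {
    this.pending.push(item);
    if (this.pending.length >= this.threshold) {
      this.drain();
    }
  }

  // Flushes whatever is left; call once after the last upload.
  drain(): void {
    if (this.pending.length === 0) {
      return;
    }
    const batch = this.pending;
    this.pending = [];
    this.flush(batch);
  }
}

const batchSizes: number[] = [];
const uploaded = new FlushBuffer<string>(500, (batch) => batchSizes.push(batch.length));
for (let i = 0; i < 1200; i++) {
  uploaded.add(`clip_${i}.mp3`);
}
uploaded.drain();
console.log(batchSizes);
```

If the process dies between flushes, at most 499 successfully uploaded clips remain unmarked, and a resumed run can pick them up again.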
Large dataset downloads can fail mid-stream when the remote server closes the connection. Use HTTP Range headers to resume from the partial file, with up to 10 retries and incremental backoff. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
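The two pieces of a resumable download, the `Range` request header and the retry schedule, can be sketched as below. The helper names and the 2-second step are assumptions, and "incremental backoff" is interpreted here as a linearly growing delay.

```typescript
// Headers for resuming from the bytes already on disk. An empty object
// means "start from the beginning".
function resumeHeaders(bytesOnDisk: number): Record<string, string> {
  return bytesOnDisk > 0 ? { Range: `bytes=${bytesOnDisk}-` } : {};
}

// Linearly increasing delay: step, 2*step, 3*step, ...
// (the step size is an assumption).
function incrementalDelayMs(attempt: number, stepMs = 2000): number {
  return attempt * stepMs;
}

const MAX_RETRIES = 10; // from the commit message
const schedule = Array.from({ length: MAX_RETRIES }, (_, i) => incrementalDelayMs(i + 1));

console.log(resumeHeaders(1_048_576));
console.log(schedule);
```

Before each retry, the downloader would check the size of the partial file and pass it to `resumeHeaders`, so only the missing tail is requested.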
Add authentication and terms acceptance to Common Voice Explorer to comply with Common Voice usage terms (no re-hosting).

Worker:
- D1 migration for users + refresh_tokens tables
- GitHub OAuth code exchange, JWT access tokens, refresh token rotation
- Auth + terms middleware on all data endpoints
- Audio cache changed to private, no-store

Frontend:
- Login page with GitHub OAuth flow
- Terms acceptance gate with Mozilla/CC license links
- Auth callback handler (StrictMode-safe)
- All API requests carry JWT via authFetch wrapper
- Audio player fetches blobs with auth headers
- Explorer extracted from App with user info in header

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Word count split by spaces shows 1 for Chinese sentences. Switch to character length which works correctly for all languages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
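The difference is easy to see on a sentence without spaces. In this sketch the function names are illustrative, and counting by code points with `Array.from` is one possible reading of "character length"; the actual change may simply use `.length`.

```typescript
// Counts whitespace-separated tokens; breaks down for unsegmented scripts.
function wordCountBySpaces(sentence: string): number {
  return sentence.trim().split(/\s+/).length;
}

// Counts Unicode code points, so characters outside the BMP count once.
function charCount(sentence: string): number {
  return Array.from(sentence).length;
}

const english = "hello world";
const chinese = "今天天气很好";

console.log(wordCountBySpaces(english), charCount(english));
console.log(wordCountBySpaces(chinese), charCount(chinese));
```

The Chinese sentence has six characters but counts as a single "word", which is why a space-based count is not comparable across languages.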
Replace readFileSync with readline streaming in parseTsv to handle large Common Voice TSV files that exceed Node.js's ~512MB string limit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When resuming a download and the file is already fully downloaded, the server returns HTTP 416 (Range Not Satisfiable). Treat this as a successful download instead of a fatal error. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
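How a resumed request's status code might be interpreted is sketched below. The function and type names are hypothetical, and the handling of 200 and 206 is general HTTP behaviour rather than something stated in this PR.

```typescript
type ResumeAction = "append" | "restart" | "complete" | "fail";

// Decides what to do with the response to a Range request.
function resumeAction(status: number, bytesOnDisk: number): ResumeAction {
  if (status === 206) {
    return "append"; // server honoured the Range header
  }
  if (status === 200) {
    return "restart"; // server ignored Range and is sending the whole file
  }
  if (status === 416 && bytesOnDisk > 0) {
    // Requested offset is at or past the end: treat the local file as complete.
    return "complete";
  }
  return "fail";
}

console.log(resumeAction(416, 5_000_000));
```

A stricter variant could compare the local size with the total reported in the `Content-Range` header of the 416 response before declaring success, since a 416 only says the offset is out of range.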
Serve the frontend from the Cloudflare Worker via static assets, add a GitHub Actions workflow for automated deployment on push to main, and document the full setup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
wavekat-eason pushed a commit that referenced this pull request Apr 4, 2026
🤖 I have created a release *beep* *boop*

---

## [0.0.10](v0.0.9...v0.0.10) (2026-04-04)

### Features

* add Common Voice Explorer ([#24](#24)) ([86fd2c5](86fd2c5))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Summary
Test plan
🤖 Generated with Claude Code