feat(cli): v2.0 — batch mode, NDJSON, retry, rate limiting#8
Conversation
Add `ezycopy batch` subcommand for multi-URL extraction pipelines: - Concurrent worker pool with configurable parallelism (--concurrency) - Per-domain rate limiting to avoid overwhelming servers - Exponential backoff retry for transient failures - NDJSON streaming output with inline errors - Content-Type pre-check to skip non-HTML (PDFs, images) - HTTP response body size limit (--max-body-size) - Browser pool: single Chrome instance reused across batch URLs - Signal handling: graceful Ctrl-C, force exit on second - Structured JSON error output on stderr - --json flag for root command structured output - --skip-existing for batch resume after partial failure - 19 unit tests covering rate limiter, retry, HTTP hardening Restructures codebase: main.go -> cmd/ package (Cobra best practice). Root command is fully backward-compatible with v0.4.0. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 240968931d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if !quiet { | ||
| fmt.Fprintln(os.Stderr, "Fetching page (browser)...") | ||
| } | ||
| pageResult, err = extractor.FetchPage(inputURL, timeout) |
There was a problem hiding this comment.
Use --browser-ws for root command fetches
The root path ignores the --browser-ws flag even though it is exposed as a persistent flag: when browserFlag is true it always calls extractor.FetchPage(inputURL, timeout), which launches local Chrome/profile logic instead of connecting to the provided DevTools websocket. In practice, ezycopy --browser-ws ws://... <url> does not use the supplied browser and can fail on machines without a local Chrome profile.
Useful? React with 👍 / 👎.
| for i, u := range urls { | ||
| seq := i | ||
| rawURL := u | ||
|
|
||
| g.Go(func() error { | ||
| result := fetchAndExtract(gctx, seq, rawURL, limiter, retryCfg, pool, bodyLimit) |
There was a problem hiding this comment.
Honor fail-fast by stopping new job scheduling
With --fail-fast, workers return an error and cancel the errgroup context, but this loop still submits every URL to g.Go unconditionally, so later URLs are still scheduled and typically emit fetch_failed from the canceled context. That violates the advertised “Stop on first error” behavior and can flood output with synthetic failures on large batches.
Useful? React with 👍 / 👎.
| } | ||
|
|
||
| // Extract article | ||
| article, err := extractor.ExtractArticle(pageResult.HTML, pageResult.URL) |
There was a problem hiding this comment.
Reject negative retries to prevent nil dereference
A negative --retries value makes MaxAttempts zero (retries + 1), so RetryConfig.Do returns nil without ever setting pageResult; execution then dereferences pageResult.HTML here and panics. This turns a user input validation issue into a runtime crash instead of a clean usage error.
Useful? React with 👍 / 👎.
- Root command now uses BrowserPool when --browser-ws is provided instead of always launching local Chrome profile - Batch --fail-fast stops scheduling new jobs once context is cancelled - Negative --retries value is rejected with a clean usage error Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The code-review plugin needs explicit --allowedTools for the inline comment MCP tool and gh CLI commands. Without this, all tool calls were denied (37 permission denials) and no review comments posted. Also added --comment flag to the prompt so the plugin actually posts its findings rather than just printing to the action log. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| func extractDomain(rawURL string) string { | ||
| u, err := url.Parse(rawURL) | ||
| if err != nil { | ||
| return rawURL | ||
| } | ||
| return u.Hostname() | ||
| } |
There was a problem hiding this comment.
Bug: extractDomain returns empty string for scheme-less URLs, causing incorrect rate limiting
Go's url.Parse is very permissive — it never returns an error for scheme-less inputs like "example.com/page". Instead, it parses the entire string as a relative path with an empty host, so u.Hostname() returns "". This means all scheme-less URLs (a realistic user input) share the same "" key in the rate limiter's last map, causing requests to completely unrelated domains to be unnecessarily serialized against each other.
The return rawURL fallback on error is effectively dead code since url.Parse almost never errors.
| func extractDomain(rawURL string) string { | |
| u, err := url.Parse(rawURL) | |
| if err != nil { | |
| return rawURL | |
| } | |
| return u.Hostname() | |
| } | |
| func extractDomain(rawURL string) string { | |
| u, err := url.Parse(rawURL) | |
| if err != nil || u.Hostname() == "" { | |
| return rawURL | |
| } | |
| return u.Hostname() | |
| } |
|
|
||
| // Write to file if output dir is set and extraction succeeded | ||
| if outputFlag != "" && result.Status == "ok" && result.Title != nil { | ||
| writeResultToFile(result) |
There was a problem hiding this comment.
Bug: TOCTOU race condition in concurrent file writes
writeResultToFile is called directly from worker goroutines (inside g.Go(...)) without any synchronization. The function uses resolveCollision which performs a check-then-act sequence: os.Stat to test whether a file exists, then returns the path for the caller to write. Between the check and the write, another concurrent worker can execute the same sequence on the same path.
Concrete failure scenario: two URLs produce the same sanitized title. Both workers call resolveCollision concurrently, both observe via os.Stat that title.md does not yet exist, both receive the same path, and one silently overwrites the other's output with no error.
Suggested fix: Move writeResultToFile into the serialized writer goroutine (the one already reading from the results channel). Since that goroutine processes results one at a time, file writes there are inherently sequential and the race disappears. Alternatively, protect the resolveCollision + write sequence with a sync.Mutex.
- extractDomain: return rawURL when hostname is empty (scheme-less URLs were all bucketed under "" in the rate limiter) - Move writeResultToFile into serialized writer goroutine to eliminate TOCTOU race on resolveCollision from concurrent workers - Add junk/ to .gitignore Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. |
| // isRetryable returns true for transient errors worth retrying. | ||
| func isRetryable(err error) bool { | ||
| var ee *extractor.ExtractionError | ||
| if errors.As(err, &ee) { | ||
| // Don't retry content-type or body-size errors | ||
| switch ee.Code { | ||
| case "unsupported_content_type", "body_too_large": | ||
| return false | ||
| } | ||
| // Retry 5xx errors | ||
| if strings.HasPrefix(ee.Code, "http_5") { | ||
| return true | ||
| } | ||
| } | ||
| // Retry generic fetch errors (timeouts, connection resets) | ||
| return true | ||
| } |
There was a problem hiding this comment.
Bug: 4xx HTTP errors are incorrectly retried
When an ExtractionError has an HTTP 4xx code (e.g. http_404, http_403, http_401), the logic falls through: it doesn't match the switch cases and doesn't match the "http_5" prefix check, so it exits the if errors.As block and reaches the final return true. This means 404 Not Found, 403 Forbidden, 401 Unauthorized, etc. are all retried up to MaxAttempts times with exponential backoff — wasting time and generating unnecessary load for errors that will never succeed.
The comment on the final return true says "Retry generic fetch errors (timeouts, connection resets)", which was the intent — but ExtractionError types with 4xx codes reach it too.
Relevant code path in extractor/http.go: 4xx responses produce &ExtractionError{Code: "http_404", ...} which then routes through isRetryable → return true.
| // isRetryable returns true for transient errors worth retrying. | |
| func isRetryable(err error) bool { | |
| var ee *extractor.ExtractionError | |
| if errors.As(err, &ee) { | |
| // Don't retry content-type or body-size errors | |
| switch ee.Code { | |
| case "unsupported_content_type", "body_too_large": | |
| return false | |
| } | |
| // Retry 5xx errors | |
| if strings.HasPrefix(ee.Code, "http_5") { | |
| return true | |
| } | |
| } | |
| // Retry generic fetch errors (timeouts, connection resets) | |
| return true | |
| } | |
| // isRetryable returns true for transient errors worth retrying. | |
| func isRetryable(err error) bool { | |
| var ee *extractor.ExtractionError | |
| if errors.As(err, &ee) { | |
| // Don't retry content-type or body-size errors | |
| switch ee.Code { | |
| case "unsupported_content_type", "body_too_large": | |
| return false | |
| } | |
| // Retry 5xx errors | |
| if strings.HasPrefix(ee.Code, "http_5") { | |
| return true | |
| } | |
| // Don't retry other known HTTP errors (4xx, etc.) | |
| return false | |
| } | |
| // Retry generic fetch errors (timeouts, connection resets) | |
| return true | |
| } |
ExtractionError with http_4xx codes fell through the errors.As block and hit the outer `return true`, causing 404/403/401 to be retried with exponential backoff. Added `return false` at end of the errors.As block so only explicit 5xx and generic non-HTTP errors are retried. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
ezycopy batchsubcommand for multi-URL extraction pipelines with concurrent workers, per-domain rate limiting, exponential backoff retry, and NDJSON streaming outputmain.gotocmd/package following Cobra best practices; root command stays fully backward-compatible with v0.4.0New capabilities
New files
cmd/root.go--jsoncmd/batch.gocmd/types.gobatch/ratelimit.gobatch/retry.goextractor/pool.goTest plan
go build ./...— compiles cleango vet ./...— no warningsgo test ./...— 19 tests passing (rate limiter, retry, HTTP hardening, batch utils)ezycopy https://example.com— markdown output, backward-compatibleezycopy https://example.com --json— structured JSON outputezycopy batch url1 url2— NDJSON with progressunsupported_content_typeerror🤖 Generated with Claude Code