feat(cli): v2.0 — batch mode, NDJSON, retry, rate limiting by gupsammy · Pull Request #8 · gupsammy/EzyCopy

gupsammy · 2026-03-04T12:36:48Z

Summary

Adds ezycopy batch subcommand for multi-URL extraction pipelines with concurrent workers, per-domain rate limiting, exponential backoff retry, and NDJSON streaming output
Hardens HTTP extractor with context support, Content-Type pre-check (skips PDFs/images), and response body size limits
Adds browser pool for batch mode: single Chrome instance reused across all URLs via concurrent tabs (replaces one-Chrome-per-URL)
Restructures codebase from monolithic main.go to cmd/ package following Cobra best practices; root command stays fully backward-compatible with v0.4.0

New capabilities

# Batch from file — 3 concurrent, NDJSON streamed to stdout
ezycopy batch --file urls.txt -j 3

# Batch from stdin
cat urls.txt | ezycopy batch -

# Single URL with JSON output
ezycopy https://example.com --json | jq '.title'

# Batch with browser + rate limiting
ezycopy batch --file sites.txt --browser -j 2 --rate-limit 2s

# Resume after partial failure
ezycopy batch --file urls.txt -o ./extracted/ --skip-existing

New files

File	Lines	Purpose
`cmd/root.go`	236	Root command, persistent flags, signal handling, `--json`
`cmd/batch.go`	444	Batch subcommand: worker pool, NDJSON writer, progress
`cmd/types.go`	61	JSON structs, exit codes
`batch/ratelimit.go`	62	Per-domain rate limiter
`batch/retry.go`	41	Exponential backoff retry
`extractor/pool.go`	102	Shared Chrome browser pool

Test plan

go build ./... — compiles clean
go vet ./... — no warnings
go test ./... — 19 tests passing (rate limiter, retry, HTTP hardening, batch utils)
ezycopy https://example.com — markdown output, backward-compatible
ezycopy https://example.com --json — structured JSON output
ezycopy batch url1 url2 — NDJSON with progress
Batch with non-HTML URL — inline unsupported_content_type error
Exit codes: 0 (all ok), 1 (partial failure)

🤖 Generated with Claude Code

Add `ezycopy batch` subcommand for multi-URL extraction pipelines: - Concurrent worker pool with configurable parallelism (--concurrency) - Per-domain rate limiting to avoid overwhelming servers - Exponential backoff retry for transient failures - NDJSON streaming output with inline errors - Content-Type pre-check to skip non-HTML (PDFs, images) - HTTP response body size limit (--max-body-size) - Browser pool: single Chrome instance reused across batch URLs - Signal handling: graceful Ctrl-C, force exit on second - Structured JSON error output on stderr - --json flag for root command structured output - --skip-existing for batch resume after partial failure - 19 unit tests covering rate limiter, retry, HTTP hardening Restructures codebase: main.go -> cmd/ package (Cobra best practice). Root command is fully backward-compatible with v0.4.0. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 240968931d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-04T12:41:13Z

+		if !quiet {
+			fmt.Fprintln(os.Stderr, "Fetching page (browser)...")
+		}
+		pageResult, err = extractor.FetchPage(inputURL, timeout)


Use --browser-ws for root command fetches

The root path ignores the --browser-ws flag even though it is exposed as a persistent flag: when browserFlag is true it always calls extractor.FetchPage(inputURL, timeout), which launches local Chrome/profile logic instead of connecting to the provided DevTools websocket. In practice, ezycopy --browser-ws ws://... <url> does not use the supplied browser and can fail on machines without a local Chrome profile.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-03-04T12:41:13Z

+	for i, u := range urls {
+		seq := i
+		rawURL := u
+
+		g.Go(func() error {
+			result := fetchAndExtract(gctx, seq, rawURL, limiter, retryCfg, pool, bodyLimit)


Honor fail-fast by stopping new job scheduling

With --fail-fast, workers return an error and cancel the errgroup context, but this loop still submits every URL to g.Go unconditionally, so later URLs are still scheduled and typically emit fetch_failed from the canceled context. That violates the advertised “Stop on first error” behavior and can flood output with synthetic failures on large batches.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-03-04T12:41:13Z

+	}
+
+	// Extract article
+	article, err := extractor.ExtractArticle(pageResult.HTML, pageResult.URL)


Reject negative retries to prevent nil dereference

A negative --retries value makes MaxAttempts zero (retries + 1), so RetryConfig.Do returns nil without ever setting pageResult; execution then dereferences pageResult.HTML here and panics. This turns a user input validation issue into a runtime crash instead of a clean usage error.

Useful? React with 👍 / 👎.

- Root command now uses BrowserPool when --browser-ws is provided instead of always launching local Chrome profile - Batch --fail-fast stops scheduling new jobs once context is cancelled - Negative --retries value is rejected with a clean usage error Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The code-review plugin needs explicit --allowedTools for the inline comment MCP tool and gh CLI commands. Without this, all tool calls were denied (37 permission denials) and no review comments posted. Also added --comment flag to the prompt so the plugin actually posts its findings rather than just printing to the action log. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude · 2026-03-04T13:54:26Z

+func extractDomain(rawURL string) string {
+	u, err := url.Parse(rawURL)
+	if err != nil {
+		return rawURL
+	}
+	return u.Hostname()
+}


Bug: extractDomain returns empty string for scheme-less URLs, causing incorrect rate limiting

Go's url.Parse is very permissive — it never returns an error for scheme-less inputs like "example.com/page". Instead, it parses the entire string as a relative path with an empty host, so u.Hostname() returns "". This means all scheme-less URLs (a realistic user input) share the same "" key in the rate limiter's last map, causing requests to completely unrelated domains to be unnecessarily serialized against each other.

The return rawURL fallback on error is effectively dead code since url.Parse almost never errors.

Suggested change

func extractDomain(rawURL string) string {

u, err := url.Parse(rawURL)

if err != nil {

return rawURL

}

return u.Hostname()

}

func extractDomain(rawURL string) string {

u, err := url.Parse(rawURL)

if err != nil || u.Hostname() == "" {

return rawURL

}

return u.Hostname()

}

claude · 2026-03-04T13:54:37Z

+
+			// Write to file if output dir is set and extraction succeeded
+			if outputFlag != "" && result.Status == "ok" && result.Title != nil {
+				writeResultToFile(result)


Bug: TOCTOU race condition in concurrent file writes

writeResultToFile is called directly from worker goroutines (inside g.Go(...)) without any synchronization. The function uses resolveCollision which performs a check-then-act sequence: os.Stat to test whether a file exists, then returns the path for the caller to write. Between the check and the write, another concurrent worker can execute the same sequence on the same path.

Concrete failure scenario: two URLs produce the same sanitized title. Both workers call resolveCollision concurrently, both observe via os.Stat that title.md does not yet exist, both receive the same path, and one silently overwrites the other's output with no error.

Suggested fix: Move writeResultToFile into the serialized writer goroutine (the one already reading from the results channel). Since that goroutine processes results one at a time, file writes there are inherently sequential and the race disappears. Alternatively, protect the resolveCollision + write sequence with a sync.Mutex.

- extractDomain: return rawURL when hostname is empty (scheme-less URLs were all bucketed under "" in the rate limiter) - Move writeResultToFile into serialized writer goroutine to eliminate TOCTOU race on resolveCollision from concurrent workers - Add junk/ to .gitignore Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude · 2026-03-04T14:23:10Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

claude · 2026-03-04T14:26:57Z

+// isRetryable returns true for transient errors worth retrying.
+func isRetryable(err error) bool {
+	var ee *extractor.ExtractionError
+	if errors.As(err, &ee) {
+		// Don't retry content-type or body-size errors
+		switch ee.Code {
+		case "unsupported_content_type", "body_too_large":
+			return false
+		}
+		// Retry 5xx errors
+		if strings.HasPrefix(ee.Code, "http_5") {
+			return true
+		}
+	}
+	// Retry generic fetch errors (timeouts, connection resets)
+	return true
+}


Bug: 4xx HTTP errors are incorrectly retried

When an ExtractionError has an HTTP 4xx code (e.g. http_404, http_403, http_401), the logic falls through: it doesn't match the switch cases and doesn't match the "http_5" prefix check, so it exits the if errors.As block and reaches the final return true. This means 404 Not Found, 403 Forbidden, 401 Unauthorized, etc. are all retried up to MaxAttempts times with exponential backoff — wasting time and generating unnecessary load for errors that will never succeed.

The comment on the final return true says "Retry generic fetch errors (timeouts, connection resets)", which was the intent — but ExtractionError types with 4xx codes reach it too.

Relevant code path in extractor/http.go: 4xx responses produce &ExtractionError{Code: "http_404", ...} which then routes through isRetryable → return true.

Suggested change

// isRetryable returns true for transient errors worth retrying.

func isRetryable(err error) bool {

var ee *extractor.ExtractionError

if errors.As(err, &ee) {

// Don't retry content-type or body-size errors

switch ee.Code {

case "unsupported_content_type", "body_too_large":

return false

}

// Retry 5xx errors

if strings.HasPrefix(ee.Code, "http_5") {

return true

}

}

// Retry generic fetch errors (timeouts, connection resets)

return true

}

// isRetryable returns true for transient errors worth retrying.

func isRetryable(err error) bool {

var ee *extractor.ExtractionError

if errors.As(err, &ee) {

// Don't retry content-type or body-size errors

switch ee.Code {

case "unsupported_content_type", "body_too_large":

return false

}

// Retry 5xx errors

if strings.HasPrefix(ee.Code, "http_5") {

return true

}

// Don't retry other known HTTP errors (4xx, etc.)

return false

}

// Retry generic fetch errors (timeouts, connection resets)

return true

}

ExtractionError with http_4xx codes fell through the errors.As block and hit the outer `return true`, causing 404/403/401 to be retried with exponential backoff. Added `return false` at end of the errors.As block so only explicit 5xx and generic non-HTTP errors are retried. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chatgpt-codex-connector Bot reviewed Mar 4, 2026

View reviewed changes

gupsammy and others added 4 commits March 4, 2026 18:46

Merge branch 'main' into feat/cli-v2-batch

28448b6

Merge branch 'main' into feat/cli-v2-batch

ac52e6d

claude Bot reviewed Mar 4, 2026

View reviewed changes

gupsammy and others added 2 commits March 4, 2026 19:46

Merge branch 'main' into feat/cli-v2-batch

0b0e87c

claude Bot reviewed Mar 4, 2026

View reviewed changes

gupsammy merged commit 049db1a into main Mar 5, 2026
1 check passed

gupsammy deleted the feat/cli-v2-batch branch March 5, 2026 11:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(cli): v2.0 — batch mode, NDJSON, retry, rate limiting#8

feat(cli): v2.0 — batch mode, NDJSON, retry, rate limiting#8
gupsammy merged 8 commits into
mainfrom
feat/cli-v2-batch

gupsammy commented Mar 4, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Mar 4, 2026

Uh oh!

chatgpt-codex-connector Bot Mar 4, 2026

Uh oh!

chatgpt-codex-connector Bot Mar 4, 2026

Uh oh!

claude Bot Mar 4, 2026

Uh oh!

claude Bot Mar 4, 2026

Uh oh!

claude Bot commented Mar 4, 2026

Uh oh!

claude Bot Mar 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

gupsammy commented Mar 4, 2026

Summary

New capabilities

New files

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot commented Mar 4, 2026

Code review

Uh oh!

claude Bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant