Skip to content

feat(cli): v2.0 — batch mode, NDJSON, retry, rate limiting#8

Merged
gupsammy merged 8 commits into
mainfrom
feat/cli-v2-batch
Mar 5, 2026
Merged

feat(cli): v2.0 — batch mode, NDJSON, retry, rate limiting#8
gupsammy merged 8 commits into
mainfrom
feat/cli-v2-batch

Conversation

@gupsammy
Copy link
Copy Markdown
Owner

@gupsammy gupsammy commented Mar 4, 2026

Summary

  • Adds ezycopy batch subcommand for multi-URL extraction pipelines with concurrent workers, per-domain rate limiting, exponential backoff retry, and NDJSON streaming output
  • Hardens HTTP extractor with context support, Content-Type pre-check (skips PDFs/images), and response body size limits
  • Adds browser pool for batch mode: single Chrome instance reused across all URLs via concurrent tabs (replaces one-Chrome-per-URL)
  • Restructures codebase from monolithic main.go to cmd/ package following Cobra best practices; root command stays fully backward-compatible with v0.4.0

New capabilities

# Batch from file — 3 concurrent, NDJSON streamed to stdout
ezycopy batch --file urls.txt -j 3

# Batch from stdin
cat urls.txt | ezycopy batch -

# Single URL with JSON output
ezycopy https://example.com --json | jq '.title'

# Batch with browser + rate limiting
ezycopy batch --file sites.txt --browser -j 2 --rate-limit 2s

# Resume after partial failure
ezycopy batch --file urls.txt -o ./extracted/ --skip-existing

New files

File Lines Purpose
cmd/root.go 236 Root command, persistent flags, signal handling, --json
cmd/batch.go 444 Batch subcommand: worker pool, NDJSON writer, progress
cmd/types.go 61 JSON structs, exit codes
batch/ratelimit.go 62 Per-domain rate limiter
batch/retry.go 41 Exponential backoff retry
extractor/pool.go 102 Shared Chrome browser pool

Test plan

  • go build ./... — compiles clean
  • go vet ./... — no warnings
  • go test ./... — 19 tests passing (rate limiter, retry, HTTP hardening, batch utils)
  • ezycopy https://example.com — markdown output, backward-compatible
  • ezycopy https://example.com --json — structured JSON output
  • ezycopy batch url1 url2 — NDJSON with progress
  • Batch with non-HTML URL — inline unsupported_content_type error
  • Exit codes: 0 (all ok), 1 (partial failure)

🤖 Generated with Claude Code

Add `ezycopy batch` subcommand for multi-URL extraction pipelines:
- Concurrent worker pool with configurable parallelism (--concurrency)
- Per-domain rate limiting to avoid overwhelming servers
- Exponential backoff retry for transient failures
- NDJSON streaming output with inline errors
- Content-Type pre-check to skip non-HTML (PDFs, images)
- HTTP response body size limit (--max-body-size)
- Browser pool: single Chrome instance reused across batch URLs
- Signal handling: graceful Ctrl-C, force exit on second
- Structured JSON error output on stderr
- --json flag for root command structured output
- --skip-existing for batch resume after partial failure
- 19 unit tests covering rate limiter, retry, HTTP hardening

Restructures codebase: main.go -> cmd/ package (Cobra best practice).
Root command is fully backward-compatible with v0.4.0.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 240968931d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread cmd/root.go Outdated
if !quiet {
fmt.Fprintln(os.Stderr, "Fetching page (browser)...")
}
pageResult, err = extractor.FetchPage(inputURL, timeout)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Use --browser-ws for root command fetches

The root path ignores the --browser-ws flag even though it is exposed as a persistent flag: when browserFlag is true it always calls extractor.FetchPage(inputURL, timeout), which launches local Chrome/profile logic instead of connecting to the provided DevTools websocket. In practice, ezycopy --browser-ws ws://... <url> does not use the supplied browser and can fail on machines without a local Chrome profile.

Useful? React with 👍 / 👎.

Comment thread cmd/batch.go
Comment on lines +131 to +136
for i, u := range urls {
seq := i
rawURL := u

g.Go(func() error {
result := fetchAndExtract(gctx, seq, rawURL, limiter, retryCfg, pool, bodyLimit)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Honor fail-fast by stopping new job scheduling

With --fail-fast, workers return an error and cancel the errgroup context, but this loop still submits every URL to g.Go unconditionally, so later URLs are still scheduled and typically emit fetch_failed from the canceled context. That violates the advertised “Stop on first error” behavior and can flood output with synthetic failures on large batches.

Useful? React with 👍 / 👎.

Comment thread cmd/batch.go
}

// Extract article
article, err := extractor.ExtractArticle(pageResult.HTML, pageResult.URL)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reject negative retries to prevent nil dereference

A negative --retries value makes MaxAttempts zero (retries + 1), so RetryConfig.Do returns nil without ever setting pageResult; execution then dereferences pageResult.HTML here and panics. This turns a user input validation issue into a runtime crash instead of a clean usage error.

Useful? React with 👍 / 👎.

gupsammy and others added 4 commits March 4, 2026 18:46
- Root command now uses BrowserPool when --browser-ws is provided
  instead of always launching local Chrome profile
- Batch --fail-fast stops scheduling new jobs once context is cancelled
- Negative --retries value is rejected with a clean usage error

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The code-review plugin needs explicit --allowedTools for the inline
comment MCP tool and gh CLI commands. Without this, all tool calls
were denied (37 permission denials) and no review comments posted.

Also added --comment flag to the prompt so the plugin actually posts
its findings rather than just printing to the action log.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread batch/ratelimit.go
Comment on lines +56 to +62
func extractDomain(rawURL string) string {
u, err := url.Parse(rawURL)
if err != nil {
return rawURL
}
return u.Hostname()
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: extractDomain returns empty string for scheme-less URLs, causing incorrect rate limiting

Go's url.Parse is very permissive — it never returns an error for scheme-less inputs like "example.com/page". Instead, it parses the entire string as a relative path with an empty host, so u.Hostname() returns "". This means all scheme-less URLs (a realistic user input) share the same "" key in the rate limiter's last map, causing requests to completely unrelated domains to be unnecessarily serialized against each other.

The return rawURL fallback on error is effectively dead code since url.Parse almost never errors.

Suggested change
func extractDomain(rawURL string) string {
u, err := url.Parse(rawURL)
if err != nil {
return rawURL
}
return u.Hostname()
}
func extractDomain(rawURL string) string {
u, err := url.Parse(rawURL)
if err != nil || u.Hostname() == "" {
return rawURL
}
return u.Hostname()
}

Comment thread cmd/batch.go Outdated

// Write to file if output dir is set and extraction succeeded
if outputFlag != "" && result.Status == "ok" && result.Title != nil {
writeResultToFile(result)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: TOCTOU race condition in concurrent file writes

writeResultToFile is called directly from worker goroutines (inside g.Go(...)) without any synchronization. The function uses resolveCollision which performs a check-then-act sequence: os.Stat to test whether a file exists, then returns the path for the caller to write. Between the check and the write, another concurrent worker can execute the same sequence on the same path.

Concrete failure scenario: two URLs produce the same sanitized title. Both workers call resolveCollision concurrently, both observe via os.Stat that title.md does not yet exist, both receive the same path, and one silently overwrites the other's output with no error.

Suggested fix: Move writeResultToFile into the serialized writer goroutine (the one already reading from the results channel). Since that goroutine processes results one at a time, file writes there are inherently sequential and the race disappears. Alternatively, protect the resolveCollision + write sequence with a sync.Mutex.

gupsammy and others added 2 commits March 4, 2026 19:46
- extractDomain: return rawURL when hostname is empty (scheme-less URLs
  were all bucketed under "" in the rate limiter)
- Move writeResultToFile into serialized writer goroutine to eliminate
  TOCTOU race on resolveCollision from concurrent workers
- Add junk/ to .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Mar 4, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Comment thread cmd/batch.go
Comment on lines +268 to +284
// isRetryable returns true for transient errors worth retrying.
func isRetryable(err error) bool {
var ee *extractor.ExtractionError
if errors.As(err, &ee) {
// Don't retry content-type or body-size errors
switch ee.Code {
case "unsupported_content_type", "body_too_large":
return false
}
// Retry 5xx errors
if strings.HasPrefix(ee.Code, "http_5") {
return true
}
}
// Retry generic fetch errors (timeouts, connection resets)
return true
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: 4xx HTTP errors are incorrectly retried

When an ExtractionError has an HTTP 4xx code (e.g. http_404, http_403, http_401), the logic falls through: it doesn't match the switch cases and doesn't match the "http_5" prefix check, so it exits the if errors.As block and reaches the final return true. This means 404 Not Found, 403 Forbidden, 401 Unauthorized, etc. are all retried up to MaxAttempts times with exponential backoff — wasting time and generating unnecessary load for errors that will never succeed.

The comment on the final return true says "Retry generic fetch errors (timeouts, connection resets)", which was the intent — but ExtractionError types with 4xx codes reach it too.

Relevant code path in extractor/http.go: 4xx responses produce &ExtractionError{Code: "http_404", ...} which then routes through isRetryablereturn true.

Suggested change
// isRetryable returns true for transient errors worth retrying.
func isRetryable(err error) bool {
var ee *extractor.ExtractionError
if errors.As(err, &ee) {
// Don't retry content-type or body-size errors
switch ee.Code {
case "unsupported_content_type", "body_too_large":
return false
}
// Retry 5xx errors
if strings.HasPrefix(ee.Code, "http_5") {
return true
}
}
// Retry generic fetch errors (timeouts, connection resets)
return true
}
// isRetryable returns true for transient errors worth retrying.
func isRetryable(err error) bool {
var ee *extractor.ExtractionError
if errors.As(err, &ee) {
// Don't retry content-type or body-size errors
switch ee.Code {
case "unsupported_content_type", "body_too_large":
return false
}
// Retry 5xx errors
if strings.HasPrefix(ee.Code, "http_5") {
return true
}
// Don't retry other known HTTP errors (4xx, etc.)
return false
}
// Retry generic fetch errors (timeouts, connection resets)
return true
}

ExtractionError with http_4xx codes fell through the errors.As block
and hit the outer `return true`, causing 404/403/401 to be retried
with exponential backoff. Added `return false` at end of the errors.As
block so only explicit 5xx and generic non-HTTP errors are retried.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@gupsammy gupsammy merged commit 049db1a into main Mar 5, 2026
1 check passed
@gupsammy gupsammy deleted the feat/cli-v2-batch branch March 5, 2026 11:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant