feat: improve scraping reliability and same-company account handling #845
daniel-hauser wants to merge 10 commits into main from
Conversation
📝 Walkthrough
Increased session-storage wait timeouts in the Visa CAL scraper and changed scraping scheduling to group accounts by company, processing each company's accounts sequentially while companies run concurrently; added a retry flag and bumped the TS target.
Sequence Diagram(s)
sequenceDiagram
participant Orchestrator as Orchestrator
participant Grouper as GroupingLogic
participant Parallel as parallelLimit
participant CompanyTask as CompanyTask
participant Scraper as scrapeAccount
participant Emitter as scrapeStatusChanged
Orchestrator->>Grouper: group accounts by companyId
Grouper-->>Orchestrator: companyGroups
Orchestrator->>Emitter: emit initial status (grouped waiting)
Orchestrator->>Parallel: run CompanyTask for each company concurrently
Parallel->>CompanyTask: start company task
CompanyTask->>Emitter: update status for current account
CompanyTask->>Scraper: scrape current account
Scraper-->>CompanyTask: result / error
alt error & retry enabled & type Generic
CompanyTask->>Emitter: update status to retrying
CompanyTask->>CompanyTask: wait RETRY_DELAY_MS
CompanyTask->>Scraper: retry scrape
Scraper-->>CompanyTask: result
end
CompanyTask->>Emitter: set next account status to waiting
CompanyTask->>CompanyTask: wait SAME_COMPANY_DELAY_MS
CompanyTask->>Scraper: scrape next account
CompanyTask-->>Parallel: return company results (array)
Parallel-->>Orchestrator: all company results (arrays)
Orchestrator->>Orchestrator: flatten results, finalize, emit final status
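The scheduling flow in the diagram can be sketched in TypeScript. This is a minimal sketch with stand-in names — `ScraperAccount`, `ScrapeResult`, `scrapeOne`, `parallelLimit`, and `scrapeAll` are simplified placeholders rather than the PR's exact implementations, and the retry and delay steps are omitted:

```typescript
// Stand-in types and functions; the real project types differ.
interface ScraperAccount { companyId: string; username: string; }
interface ScrapeResult { companyId: string; username: string; success: boolean; }

// Minimal concurrency limiter: at most `limit` tasks in flight at once.
async function parallelLimit<T>(tasks: Array<() => Promise<T>>, limit: number): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  };
  await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, worker));
  return results;
}

// Stand-in for the real per-account scrape.
const scrapeOne = async (account: ScraperAccount): Promise<ScrapeResult> =>
  ({ ...account, success: true });

async function scrapeAll(accounts: ScraperAccount[], maxParallel: number): Promise<ScrapeResult[]> {
  // Group accounts by company so same-company accounts never run concurrently.
  const groups = new Map<string, ScraperAccount[]>();
  for (const account of accounts) {
    const members = groups.get(account.companyId) ?? [];
    members.push(account);
    groups.set(account.companyId, members);
  }
  // One task per company: sequential inside a group, concurrent across groups.
  const tasks = [...groups.values()].map((members) => async () => {
    const out: ScrapeResult[] = [];
    for (const account of members) out.push(await scrapeOne(account));
    return out;
  });
  return (await parallelLimit(tasks, maxParallel)).flat();
}

const allResults = await scrapeAll(
  [
    { companyId: "visaCal", username: "a" },
    { companyId: "visaCal", username: "b" },
    { companyId: "hapoalim", username: "c" },
  ],
  2,
);
```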
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
🐳 Docker image built and pushed. 📏 Size: 1.14GB
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/scraper/index.ts (1)
56-66: Clarify `parallelScrapers` semantics after scheduler change. This now limits concurrent company groups, not concurrent individual accounts. Please update naming and/or the config description to avoid tuning confusion.
Also applies to: 124-127
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/scraper/index.ts` around lines 56 - 66, The config name/variable parallelScrapers is now misleading because the scheduler limits concurrent companyGroups (accounts grouped by companyId) rather than individual accounts; rename the setting and any variables/usages from parallelScrapers to something like parallelCompanyGroups (and update related references in scheduler setup and any docs/README/config schema), update the config description to explicitly state it limits concurrent company groups (not accounts), and ensure any code/comments referencing companyGroups, scheduler, or account-level concurrency reflect the new name and semantics (also apply the same rename/update to the other occurrence referenced around the scheduler logic where parallelScrapers is used).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/scraper/index.ts`:
- Around line 108-129: The flattened groupResults from groupTasks (built from
companyGroups) currently yields company-group order instead of the original
input order; change the aggregation so each scrapeOne(account, index) result is
placed into a preallocated results array at its original index (use the provided
index from members) rather than pushing/group-order flattening—i.e., create an
output array sized to the total inputs, have each group task write its
AccountScrapeResult into results[index], and after parallelLimit(...) return
results (or map over indices) so callers receive results in the original input
order; adjust references to groupResults/results accordingly.
const groupTasks = [...companyGroups.entries()].map(
  ([, members]) =>
    async () => {
      const groupResults: AccountScrapeResult[] = [];
      for (let j = 0; j < members.length; j++) {
        const { account, index } = members[j];
        if (j > 0) {
          status[index] = `[${account.companyId}] ⏳ Waiting`;
          await scrapeStatusChanged?.(status);
        }
        groupResults.push(await scrapeOne(account, index));
      }
      return groupResults;
    },
);

const groupResults = await parallelLimit<
  (typeof groupTasks)[number],
  AccountScrapeResult[][]
>(groupTasks, Number(parallelScrapers));

const results = groupResults.flat();
Preserve original account order in returned results.
The new flattening returns company-group order, not original input order. For interleaved companies, this can misalign any caller that correlates by index.
💡 Proposed fix (keep grouped execution, restore stable output order)
 const groupTasks = [...companyGroups.entries()].map(
   ([, members]) =>
     async () => {
-      const groupResults: AccountScrapeResult[] = [];
+      const groupResults: Array<{
+        index: number;
+        result: AccountScrapeResult;
+      }> = [];
       for (let j = 0; j < members.length; j++) {
         const { account, index } = members[j];
         if (j > 0) {
           status[index] = `[${account.companyId}] ⏳ Waiting`;
           await scrapeStatusChanged?.(status);
         }
-        groupResults.push(await scrapeOne(account, index));
+        groupResults.push({ index, result: await scrapeOne(account, index) });
       }
       return groupResults;
     },
 );
 const groupResults = await parallelLimit<
   (typeof groupTasks)[number],
-  AccountScrapeResult[][]
+  Array<{ index: number; result: AccountScrapeResult }>[]
 >(groupTasks, Number(parallelScrapers));
-const results = groupResults.flat();
+const results = groupResults
+  .flat()
+  .sort((a, b) => a.index - b.index)
+  .map(({ result }) => result);
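The index-preserving aggregation the review asks for can also be sketched with a preallocated output array. The names here (`Member`, `runGroups`, `scrapeOne`) are illustrative stand-ins, not the project's API, and the concurrency cap is omitted:

```typescript
// Each group task writes results at the account's original index, so the
// returned array always lines up with the caller's input order.
interface Member { index: number; name: string; }

// Stand-in for the real scraper call.
const scrapeOne = async (member: Member): Promise<string> => `scraped:${member.name}`;

async function runGroups(groups: Member[][], total: number): Promise<string[]> {
  const results: string[] = new Array(total);
  const tasks = groups.map((members) => async () => {
    for (const member of members) {
      results[member.index] = await scrapeOne(member); // write at original index
    }
  });
  await Promise.all(tasks.map((task) => task())); // concurrency cap omitted for brevity
  return results;
}

// Interleaved companies: group order differs from the original input order.
const ordered = await runGroups(
  [
    [{ index: 0, name: "calA" }, { index: 2, name: "calB" }],
    [{ index: 1, name: "poalim" }],
  ],
  3,
);
```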
Force-pushed from 09fb6bb to 50bea7e
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/scraper/index.ts (1)
121-126: ⚠️ Potential issue | 🟠 Major
Still unresolved: return `results` in the original `accounts` order. This is still flattening by company group, so the returned array no longer lines up with the caller's input order. Any consumer that correlates `results[i]` with `accounts[i]` will now be wrong; please place each `scrapeOne(account, index)` result back at its original index before returning.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/scraper/index.ts` around lines 121 - 126, The flattened groupResults currently returns results grouped by company rather than matching the original accounts order; change the aggregation so each scrapeOne(account, index) stores its AccountScrapeResult at its original index and then return an array assembled by original indices. Specifically, when building groupTasks and handling parallelLimit(groupTasks, ...), ensure each task calls scrapeOne(account, index) with the original account index and writes its result into a preallocated results array (using the index) instead of relying on groupResults.flat(); return that indexed results array so results[i] corresponds to accounts[i].
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/scraper/index.ts`:
- Around line 106-115: The current groupTasks implementation sleeps with await
setTimeout(SAME_COMPANY_DELAY_MS) inside the worker, occupying one of the
limited global worker slots; change to enforce same-company serialization
without blocking the worker by implementing a per-company chain/queue: create a
companyChains map keyed by account.companyId (or a per-company mutex/Promise
chain) and, for each account, enqueue its work by appending to
companyChains[companyId] a Promise that runs the scrape for that account then
returns after SAME_COMPANY_DELAY_MS; in groupTasks (and functions like the
account processing loop that call scrapeStatusChanged) remove the in-worker
await sleep and instead await the per-company chain Promise so the global worker
is free between chained tasks; reference symbols: groupTasks, companyGroups,
SAME_COMPANY_DELAY_MS, scrapeStatusChanged, and the account processing loop to
locate and replace the blocking await with a per-company chaining mechanism.
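The per-company chaining the prompt describes can be sketched as follows. This is a sketch under the prompt's assumptions — `enqueue` and the recorded payload are illustrative, the delay is shortened, and the real code would run the scrape inside the chained step:

```typescript
// Per-company promise chains: each account appends to its company's chain, so
// same-company work serializes without a sleeping task occupying a global
// worker slot. Names follow the prompt (companyChains, SAME_COMPANY_DELAY_MS)
// but the implementation is illustrative.
const SAME_COMPANY_DELAY_MS = 10; // shortened for the example
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

const companyChains = new Map<string, Promise<void>>();
const startOrder: string[] = [];

function enqueue(companyId: string, name: string): Promise<void> {
  const prev = companyChains.get(companyId) ?? Promise.resolve();
  // Chain this account after the previous same-company account plus a delay.
  const next = prev.then(async () => {
    startOrder.push(`${companyId}:${name}`); // the real code would scrape here
    await sleep(SAME_COMPANY_DELAY_MS);
  });
  companyChains.set(companyId, next);
  return next;
}

// Same-company tasks serialize; different companies overlap freely.
await Promise.all([
  enqueue("visaCal", "a"),
  enqueue("hapoalim", "c"),
  enqueue("visaCal", "b"),
]);
```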
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2b954238-260e-4c51-ad93-0d2f4a1f9742
📒 Files selected for processing (2)
src/scraper/index.ts
tsconfig.json
✅ Files skipped from review due to trivial changes (1)
- tsconfig.json
🧹 Nitpick comments (1)
src/scraper/index.ts (1)
142-146: Redundant status update for "Waiting" accounts. Line 143 sets `status[index]` to "⏳ Waiting" again, but this was already set in the initial loop at lines 72-74 for all non-first members. The status is correct, but this line is duplicative. However, there may be a subtle difference: if `parallelScrapers > 1` and groups run concurrently, by the time we reach a subsequent member in `rest`, its status might have been displayed as "Waiting" initially, but the user may benefit from seeing the update again after the previous member completes. This makes the redundancy acceptable for UX clarity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/scraper/index.ts` around lines 142 - 146, The status assignment status[index] = `[${account.companyId}] ⏳ Waiting`; inside the rest loop is redundant because non-first members were already initialized to "Waiting", so either remove that assignment or guard it so it only runs when it will change UI: update it only if status[index] !== `[${account.companyId}] ⏳ Waiting` or when running with concurrent scrapers (e.g., parallelScrapers > 1) to refresh the UI; keep the subsequent await scrapeStatusChanged?.(status), SAME_COMPANY_DELAY_MS and groupResults.push(await scrapeOne(account, index)) unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 07c4b18f-d9bd-419a-9e0b-0bae515514c1
📒 Files selected for processing (5)
src/bot/storage/sheets.test.ts
src/config.schema.ts
src/config.ts
src/scraper/index.ts
src/types.ts
✅ Files skipped from review due to trivial changes (2)
- src/types.ts
- src/bot/storage/sheets.test.ts
Force-pushed from 802cb97 to 5a5e685
Apply change from eshaham/israeli-bank-scrapers#1066:
- Increase init data timeout from 10s to 20s
- Increase auth header timeout from 10s to 15s
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Accounts sharing the same companyId now run sequentially within their group while different companies still run in parallel (up to maxParallelScrapers). This avoids conflicts when multiple accounts for the same company share a single browser/IP. Queued accounts show a '⏳ Waiting' status until their turn. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace manual Map-based grouping with Object.groupBy (ES2024). Bump tsconfig target to es2024 to support it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
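The grouping this commit describes might look like the following sketch. The `Account` shape is illustrative, not the project's real type, and a manual fallback is included so the sketch still runs on runtimes that predate `Object.groupBy`:

```typescript
// Grouping accounts by companyId with Object.groupBy (ES2024 — the reason for
// the tsconfig target bump).
interface Account { companyId: string; username: string; }

type GroupFn = <T>(items: Iterable<T>, keyFn: (item: T) => string) => Record<string, T[]>;

// Prefer the native ES2024 Object.groupBy (Node 21+); fall back to a manual
// implementation on older runtimes so the sketch stays runnable.
const nativeGroupBy = (Object as any).groupBy as GroupFn | undefined;
const groupBy: GroupFn =
  nativeGroupBy ??
  (<T>(items: Iterable<T>, keyFn: (item: T) => string) => {
    const out: Record<string, T[]> = {};
    for (const item of items) (out[keyFn(item)] ??= []).push(item);
    return out;
  });

const accounts: Account[] = [
  { companyId: "visaCal", username: "a" },
  { companyId: "hapoalim", username: "b" },
  { companyId: "visaCal", username: "c" },
];

const byCompany = groupBy(accounts, (account) => account.companyId);
```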
Adds a short delay before starting the next account in a same-company group, giving the previous session time to fully close. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Close each browser context after scraping completes to prevent stale sessions from interfering with subsequent same-company scrapes
- Increase same-company delay from 3s to 10s to give the site time to fully release the session
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t delay
- Add retryFailedScrapes config option (default: true) that retries scrapes failing with GENERIC error after a 5s delay
- Shows '🔄 Retrying' status during retry
- Reduce same-company inter-account delay from 10s to 1s
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from 22104f8 to 3f63c8c
Improve scraping reliability with several changes to the scraper infrastructure.
Changes
- Visa Cal timeout patch (via patch-package)
- Serialize same-company account scraping
- Browser context cleanup
- Retry failed scrapes (`retryFailedScrapes`, default: `true`)
Other
- Document the `retryFailedScrapes` config option in README and config example
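The retry behavior described above could be sketched as follows. The names are hypothetical — `scrapeWithRetry` and the `Result` shape are illustrative, not the project's actual types — and the delay is shortened for the example:

```typescript
// Sketch of retryFailedScrapes: a scrape failing with a GENERIC error is
// retried once after a delay; other error types are returned as-is.
interface Result {
  success: boolean;
  errorType?: "GENERIC" | "INVALID_PASSWORD";
}

const RETRY_DELAY_MS = 10; // the PR uses 5s; shortened for the example
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function scrapeWithRetry(
  scrape: () => Promise<Result>,
  retryFailedScrapes: boolean,
): Promise<Result> {
  const first = await scrape();
  // Only generic failures are retried; e.g. bad credentials would fail again.
  if (!retryFailedScrapes || first.success || first.errorType !== "GENERIC") {
    return first;
  }
  await sleep(RETRY_DELAY_MS); // the UI would show "🔄 Retrying" here
  return scrape();
}

// A scraper that fails once with GENERIC, then succeeds on the retry.
let calls = 0;
const flaky = async (): Promise<Result> =>
  ++calls === 1 ? { success: false, errorType: "GENERIC" } : { success: true };

const retried = await scrapeWithRetry(flaky, true);
```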