feat: improve scraping reliability and same-company account handling #845
daniel-hauser wants to merge 10 commits into main from
Conversation
📝 Walkthrough
Increased session-storage wait timeouts in the Visa CAL scraper and changed scraping scheduling to group accounts by company, processing each company's accounts sequentially while companies run concurrently; added a retry flag and bumped the TS target.
Sequence Diagram(s)
sequenceDiagram
participant Orchestrator as Orchestrator
participant Grouper as GroupingLogic
participant Parallel as parallelLimit
participant CompanyTask as CompanyTask
participant Scraper as scrapeAccount
participant Emitter as scrapeStatusChanged
Orchestrator->>Grouper: group accounts by companyId
Grouper-->>Orchestrator: companyGroups
Orchestrator->>Emitter: emit initial status (grouped waiting)
Orchestrator->>Parallel: run CompanyTask for each company concurrently
Parallel->>CompanyTask: start company task
CompanyTask->>Emitter: update status for current account
CompanyTask->>Scraper: scrape current account
Scraper-->>CompanyTask: result / error
alt error & retry enabled & type Generic
CompanyTask->>Emitter: update status to retrying
CompanyTask->>CompanyTask: wait RETRY_DELAY_MS
CompanyTask->>Scraper: retry scrape
Scraper-->>CompanyTask: result
end
CompanyTask->>Emitter: set next account status to waiting
CompanyTask->>CompanyTask: wait SAME_COMPANY_DELAY_MS
CompanyTask->>Scraper: scrape next account
CompanyTask-->>Parallel: return company results (array)
Parallel-->>Orchestrator: all company results (arrays)
Orchestrator->>Orchestrator: flatten results, finalize, emit final status
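The scheduling flow in the diagram can be sketched in TypeScript. This is a minimal sketch with stand-in names — `ScraperAccount`, `ScrapeResult`, `scrapeOne`, `parallelLimit`, and `scrapeAll` are simplified placeholders rather than the PR's exact implementations, and the retry and delay steps are omitted:

```typescript
// Stand-in types and functions; the real project types differ.
interface ScraperAccount { companyId: string; username: string; }
interface ScrapeResult { companyId: string; username: string; success: boolean; }

// Minimal concurrency limiter: at most `limit` tasks in flight at once.
async function parallelLimit<T>(tasks: Array<() => Promise<T>>, limit: number): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  };
  await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, worker));
  return results;
}

// Stand-in for the real per-account scrape.
const scrapeOne = async (account: ScraperAccount): Promise<ScrapeResult> =>
  ({ ...account, success: true });

async function scrapeAll(accounts: ScraperAccount[], maxParallel: number): Promise<ScrapeResult[]> {
  // Group accounts by company so same-company accounts never run concurrently.
  const groups = new Map<string, ScraperAccount[]>();
  for (const account of accounts) {
    const members = groups.get(account.companyId) ?? [];
    members.push(account);
    groups.set(account.companyId, members);
  }
  // One task per company: sequential inside a group, concurrent across groups.
  const tasks = [...groups.values()].map((members) => async () => {
    const out: ScrapeResult[] = [];
    for (const account of members) out.push(await scrapeOne(account));
    return out;
  });
  return (await parallelLimit(tasks, maxParallel)).flat();
}

const allResults = await scrapeAll(
  [
    { companyId: "visaCal", username: "a" },
    { companyId: "visaCal", username: "b" },
    { companyId: "hapoalim", username: "c" },
  ],
  2,
);
```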
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
🐳 Docker image built and pushed. 📏 Size: 1.14GB
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/scraper/index.ts (1)
56-66: Clarify `parallelScrapers` semantics after scheduler change. This now limits concurrent company groups, not concurrent individual accounts. Please update naming and/or the config description to avoid tuning confusion.
Also applies to: 124-127
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/scraper/index.ts` around lines 56 - 66, The config name/variable parallelScrapers is now misleading because the scheduler limits concurrent companyGroups (accounts grouped by companyId) rather than individual accounts; rename the setting and any variables/usages from parallelScrapers to something like parallelCompanyGroups (and update related references in scheduler setup and any docs/README/config schema), update the config description to explicitly state it limits concurrent company groups (not accounts), and ensure any code/comments referencing companyGroups, scheduler, or account-level concurrency reflect the new name and semantics (also apply the same rename/update to the other occurrence referenced around the scheduler logic where parallelScrapers is used).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/scraper/index.ts`:
- Around line 108-129: The flattened groupResults from groupTasks (built from
companyGroups) currently yields company-group order instead of the original
input order; change the aggregation so each scrapeOne(account, index) result is
placed into a preallocated results array at its original index (use the provided
index from members) rather than pushing/group-order flattening—i.e., create an
output array sized to the total inputs, have each group task write its
AccountScrapeResult into results[index], and after parallelLimit(...) return
results (or map over indices) so callers receive results in the original input
order; adjust references to groupResults/results accordingly.
const groupTasks = [...companyGroups.entries()].map(
  ([, members]) =>
    async () => {
      const groupResults: AccountScrapeResult[] = [];
      for (let j = 0; j < members.length; j++) {
        const { account, index } = members[j];
        if (j > 0) {
          status[index] = `[${account.companyId}] ⏳ Waiting`;
          await scrapeStatusChanged?.(status);
        }
        groupResults.push(await scrapeOne(account, index));
      }
      return groupResults;
    },
);

const groupResults = await parallelLimit<
  (typeof groupTasks)[number],
  AccountScrapeResult[][]
>(groupTasks, Number(parallelScrapers));

const results = groupResults.flat();
Preserve original account order in returned results.
The new flattening returns company-group order, not original input order. For interleaved companies, this can misalign any caller that correlates by index.
💡 Proposed fix (keep grouped execution, restore stable output order)
 const groupTasks = [...companyGroups.entries()].map(
   ([, members]) =>
     async () => {
-      const groupResults: AccountScrapeResult[] = [];
+      const groupResults: Array<{
+        index: number;
+        result: AccountScrapeResult;
+      }> = [];
       for (let j = 0; j < members.length; j++) {
         const { account, index } = members[j];
         if (j > 0) {
           status[index] = `[${account.companyId}] ⏳ Waiting`;
           await scrapeStatusChanged?.(status);
         }
-        groupResults.push(await scrapeOne(account, index));
+        groupResults.push({ index, result: await scrapeOne(account, index) });
       }
       return groupResults;
     },
 );
 const groupResults = await parallelLimit<
   (typeof groupTasks)[number],
-  AccountScrapeResult[][]
+  Array<{ index: number; result: AccountScrapeResult }>[]
 >(groupTasks, Number(parallelScrapers));
-const results = groupResults.flat();
+const results = groupResults
+  .flat()
+  .sort((a, b) => a.index - b.index)
+  .map(({ result }) => result);
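The index-preserving aggregation the review asks for can also be sketched with a preallocated output array. The names here (`Member`, `runGroups`, `scrapeOne`) are illustrative stand-ins, not the project's API, and the concurrency cap is omitted:

```typescript
// Each group task writes results at the account's original index, so the
// returned array always lines up with the caller's input order.
interface Member { index: number; name: string; }

// Stand-in for the real scraper call.
const scrapeOne = async (member: Member): Promise<string> => `scraped:${member.name}`;

async function runGroups(groups: Member[][], total: number): Promise<string[]> {
  const results: string[] = new Array(total);
  const tasks = groups.map((members) => async () => {
    for (const member of members) {
      results[member.index] = await scrapeOne(member); // write at original index
    }
  });
  await Promise.all(tasks.map((task) => task())); // concurrency cap omitted for brevity
  return results;
}

// Interleaved companies: group order differs from the original input order.
const ordered = await runGroups(
  [
    [{ index: 0, name: "calA" }, { index: 2, name: "calB" }],
    [{ index: 1, name: "poalim" }],
  ],
  3,
);
```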
Force-pushed from 09fb6bb to 50bea7e
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/scraper/index.ts (1)
121-126: ⚠️ Potential issue | 🟠 Major
Still unresolved: return `results` in the original `accounts` order. This is still flattening by company group, so the returned array no longer lines up with the caller's input order. Any consumer that correlates `results[i]` with `accounts[i]` will now be wrong; please place each `scrapeOne(account, index)` result back at its original index before returning.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/scraper/index.ts` around lines 121 - 126, The flattened groupResults currently returns results grouped by company rather than matching the original accounts order; change the aggregation so each scrapeOne(account, index) stores its AccountScrapeResult at its original index and then return an array assembled by original indices. Specifically, when building groupTasks and handling parallelLimit(groupTasks, ...), ensure each task calls scrapeOne(account, index) with the original account index and writes its result into a preallocated results array (using the index) instead of relying on groupResults.flat(); return that indexed results array so results[i] corresponds to accounts[i].
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/scraper/index.ts`:
- Around line 106-115: The current groupTasks implementation sleeps with await
setTimeout(SAME_COMPANY_DELAY_MS) inside the worker, occupying one of the
limited global worker slots; change to enforce same-company serialization
without blocking the worker by implementing a per-company chain/queue: create a
companyChains map keyed by account.companyId (or a per-company mutex/Promise
chain) and, for each account, enqueue its work by appending to
companyChains[companyId] a Promise that runs the scrape for that account then
returns after SAME_COMPANY_DELAY_MS; in groupTasks (and functions like the
account processing loop that call scrapeStatusChanged) remove the in-worker
await sleep and instead await the per-company chain Promise so the global worker
is free between chained tasks; reference symbols: groupTasks, companyGroups,
SAME_COMPANY_DELAY_MS, scrapeStatusChanged, and the account processing loop to
locate and replace the blocking await with a per-company chaining mechanism.
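The per-company chaining the prompt describes can be sketched as follows. This is a sketch under the prompt's assumptions — `enqueue` and the recorded payload are illustrative, the delay is shortened, and the real code would run the scrape inside the chained step:

```typescript
// Per-company promise chains: each account appends to its company's chain, so
// same-company work serializes without a sleeping task occupying a global
// worker slot. Names follow the prompt (companyChains, SAME_COMPANY_DELAY_MS)
// but the implementation is illustrative.
const SAME_COMPANY_DELAY_MS = 10; // shortened for the example
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

const companyChains = new Map<string, Promise<void>>();
const startOrder: string[] = [];

function enqueue(companyId: string, name: string): Promise<void> {
  const prev = companyChains.get(companyId) ?? Promise.resolve();
  // Chain this account after the previous same-company account plus a delay.
  const next = prev.then(async () => {
    startOrder.push(`${companyId}:${name}`); // the real code would scrape here
    await sleep(SAME_COMPANY_DELAY_MS);
  });
  companyChains.set(companyId, next);
  return next;
}

// Same-company tasks serialize; different companies overlap freely.
await Promise.all([
  enqueue("visaCal", "a"),
  enqueue("hapoalim", "c"),
  enqueue("visaCal", "b"),
]);
```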
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2b954238-260e-4c51-ad93-0d2f4a1f9742
📒 Files selected for processing (2)
src/scraper/index.ts
tsconfig.json
✅ Files skipped from review due to trivial changes (1)
- tsconfig.json
🧹 Nitpick comments (1)
src/scraper/index.ts (1)
142-146: Redundant status update for "Waiting" accounts. Line 143 sets `status[index]` to "⏳ Waiting" again, but this was already set in the initial loop at lines 72-74 for all non-first members. The status is correct, but this line is duplicative. However, there may be a subtle difference: if `parallelScrapers > 1` and groups run concurrently, by the time we reach a subsequent member in `rest`, its status might have been displayed as "Waiting" initially, but the user may benefit from seeing the update again after the previous member completes. This makes the redundancy acceptable for UX clarity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/scraper/index.ts` around lines 142 - 146, The status assignment status[index] = `[${account.companyId}] ⏳ Waiting`; inside the rest loop is redundant because non-first members were already initialized to "Waiting", so either remove that assignment or guard it so it only runs when it will change UI: update it only if status[index] !== `[${account.companyId}] ⏳ Waiting` or when running with concurrent scrapers (e.g., parallelScrapers > 1) to refresh the UI; keep the subsequent await scrapeStatusChanged?.(status), SAME_COMPANY_DELAY_MS and groupResults.push(await scrapeOne(account, index)) unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 07c4b18f-d9bd-419a-9e0b-0bae515514c1
📒 Files selected for processing (5)
src/bot/storage/sheets.test.ts
src/config.schema.ts
src/config.ts
src/scraper/index.ts
src/types.ts
✅ Files skipped from review due to trivial changes (2)
- src/types.ts
- src/bot/storage/sheets.test.ts
Force-pushed from 802cb97 to 5a5e685
Apply change from eshaham/israeli-bank-scrapers#1066:
- Increase init data timeout from 10s to 20s
- Increase auth header timeout from 10s to 15s
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Accounts sharing the same companyId now run sequentially within their group while different companies still run in parallel (up to maxParallelScrapers). This avoids conflicts when multiple accounts for the same company share a single browser/IP. Queued accounts show a '⏳ Waiting' status until their turn. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace manual Map-based grouping with Object.groupBy (ES2024). Bump tsconfig target to es2024 to support it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
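The grouping this commit describes might look like the following sketch. The `Account` shape is illustrative, not the project's real type, and a manual fallback is included so the sketch still runs on runtimes that predate `Object.groupBy`:

```typescript
// Grouping accounts by companyId with Object.groupBy (ES2024 — the reason for
// the tsconfig target bump).
interface Account { companyId: string; username: string; }

type GroupFn = <T>(items: Iterable<T>, keyFn: (item: T) => string) => Record<string, T[]>;

// Prefer the native ES2024 Object.groupBy (Node 21+); fall back to a manual
// implementation on older runtimes so the sketch stays runnable.
const nativeGroupBy = (Object as any).groupBy as GroupFn | undefined;
const groupBy: GroupFn =
  nativeGroupBy ??
  (<T>(items: Iterable<T>, keyFn: (item: T) => string) => {
    const out: Record<string, T[]> = {};
    for (const item of items) (out[keyFn(item)] ??= []).push(item);
    return out;
  });

const accounts: Account[] = [
  { companyId: "visaCal", username: "a" },
  { companyId: "hapoalim", username: "b" },
  { companyId: "visaCal", username: "c" },
];

const byCompany = groupBy(accounts, (account) => account.companyId);
```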
Adds a short delay before starting the next account in a same-company group, giving the previous session time to fully close. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Close each browser context after scraping completes to prevent stale sessions from interfering with subsequent same-company scrapes
- Increase same-company delay from 3s to 10s to give the site time to fully release the session
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t delay
- Add retryFailedScrapes config option (default: true) that retries scrapes failing with GENERIC error after a 5s delay
- Shows '🔄 Retrying' status during retry
- Reduce same-company inter-account delay from 10s to 1s
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from 22104f8 to 3f63c8c
Improve scraping reliability with several changes to the scraper infrastructure.
Changes
- Visa Cal timeout patch (via patch-package)
- Serialize same-company account scraping
- Browser context cleanup
- Retry failed scrapes (`retryFailedScrapes`, default: `true`)
Other
- Document the `retryFailedScrapes` config option in README and config example
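The retry behavior described above could be sketched as follows. The names are hypothetical — `scrapeWithRetry` and the `Result` shape are illustrative, not the project's actual types — and the delay is shortened for the example:

```typescript
// Sketch of retryFailedScrapes: a scrape failing with a GENERIC error is
// retried once after a delay; other error types are returned as-is.
interface Result {
  success: boolean;
  errorType?: "GENERIC" | "INVALID_PASSWORD";
}

const RETRY_DELAY_MS = 10; // the PR uses 5s; shortened for the example
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function scrapeWithRetry(
  scrape: () => Promise<Result>,
  retryFailedScrapes: boolean,
): Promise<Result> {
  const first = await scrape();
  // Only generic failures are retried; e.g. bad credentials would fail again.
  if (!retryFailedScrapes || first.success || first.errorType !== "GENERIC") {
    return first;
  }
  await sleep(RETRY_DELAY_MS); // the UI would show "🔄 Retrying" here
  return scrape();
}

// A scraper that fails once with GENERIC, then succeeds on the retry.
let calls = 0;
const flaky = async (): Promise<Result> =>
  ++calls === 1 ? { success: false, errorType: "GENERIC" } : { success: true };

const retried = await scrapeWithRetry(flaky, true);
```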