
feat: improve scraping reliability and same-company account handling#845

Open
daniel-hauser wants to merge 10 commits into main from patch/visa-cal-timeout-fix

Conversation

@daniel-hauser
Owner

@daniel-hauser daniel-hauser commented Apr 4, 2026

Improve scraping reliability with several changes to the scraper infrastructure.

Changes

Visa Cal timeout patch (via patch-package)

Serialize same-company account scraping

  • Accounts sharing the same `companyId` now run sequentially within their group (using `Object.groupBy`)
  • Different companies still run in parallel (up to `maxParallelScrapers`)
  • Avoids conflicts when multiple accounts for the same company share a single browser/IP
  • 1s cooldown between same-company accounts
  • Queued accounts show a "⏳ Waiting" status
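The grouping above can be sketched as follows. This is a minimal sketch, not the repo's actual code: `scrapeOne` and `SAME_COMPANY_DELAY_MS` follow the PR description, the grouping here uses a plain `Map` (the PR itself uses `Object.groupBy`, which needs ES2024 / Node 21+), and the concurrency limiter stands in for the repo's `parallelLimit` helper.

```typescript
type Account = { companyId: string; username: string };

const SAME_COMPANY_DELAY_MS = 1_000;
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function scrapeGrouped(
  accounts: Account[],
  scrapeOne: (account: Account, index: number) => Promise<string>,
  maxParallel: number,
  cooldownMs: number = SAME_COMPANY_DELAY_MS,
): Promise<string[]> {
  // Group accounts by companyId, remembering each account's original index.
  const groups = new Map<string, { account: Account; index: number }[]>();
  accounts.forEach((account, index) => {
    const members = groups.get(account.companyId) ?? [];
    members.push({ account, index });
    groups.set(account.companyId, members);
  });

  // One task per company; members inside a group run strictly in sequence.
  const results: string[] = new Array(accounts.length);
  const queue = [...groups.values()].map((members) => async () => {
    for (let j = 0; j < members.length; j++) {
      if (j > 0) await sleep(cooldownMs); // cooldown between same-company accounts
      const { account, index } = members[j];
      results[index] = await scrapeOne(account, index); // write at original index
    }
  });

  // Minimal concurrency limiter: up to maxParallel company tasks at once.
  const workers = Array.from(
    { length: Math.min(maxParallel, queue.length) },
    async () => {
      for (let task = queue.shift(); task; task = queue.shift()) await task();
    },
  );
  await Promise.all(workers);
  return results;
}
```

Writing each result at its original index keeps `results[i]` aligned with `accounts[i]` even when companies are interleaved.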

Browser context cleanup

  • Each browser context is now closed after scraping completes (in a `finally` block)
  • Prevents stale sessions from interfering with subsequent scrapes

Retry failed scrapes (`retryFailedScrapes`, default: `true`)

  • Scrapes that fail with a `GENERIC` error are automatically retried once after a 5s delay
  • Shows a "🔄 Retrying" status during the retry
  • Configurable via `options.scraping.retryFailedScrapes` (can be disabled by setting it to `false`)
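A hedged sketch of this retry behaviour: the `GENERIC` error type, the single retry, and the 5s default delay follow the PR text, while `ScrapeResult` and `scrapeWithRetry` are illustrative shapes, not the repo's actual types.

```typescript
type ScrapeResult = { success: boolean; errorType?: "GENERIC" | "OTHER" };

const RETRY_DELAY_MS = 5_000;
const wait = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function scrapeWithRetry(
  scrape: () => Promise<ScrapeResult>,
  retryFailedScrapes = true,
  delayMs = RETRY_DELAY_MS,
): Promise<ScrapeResult> {
  const first = await scrape();
  // Retry only once, only for GENERIC failures, and only when enabled.
  if (first.success || !retryFailedScrapes || first.errorType !== "GENERIC") {
    return first;
  }
  await wait(delayMs); // a "🔄 Retrying" status would be emitted here
  return scrape();
}
```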

Other

  • Bump tsconfig target to `es2024` (for `Object.groupBy` support)
  • Document the new `retryFailedScrapes` config option in the README and config example

@coderabbitai
Contributor

coderabbitai bot commented Apr 4, 2026

Warning

Rate limit exceeded

@daniel-hauser has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 17 minutes and 22 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 17 minutes and 22 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 86bd3178-9858-41ba-9c45-3622832e1d72

📥 Commits

Reviewing files that changed from the base of the PR and between 802cb97 and 5a5e685.

📒 Files selected for processing (9)
  • README.md
  • config.example.jsonc
  • patches/israeli-bank-scrapers+6.7.1.patch
  • src/bot/storage/sheets.test.ts
  • src/config.schema.ts
  • src/config.ts
  • src/scraper/index.ts
  • src/types.ts
  • tsconfig.json
📝 Walkthrough

Walkthrough

Increased session-storage wait timeouts in the Visa CAL scraper and changed scraping scheduling to group accounts by company, processing each company's accounts sequentially while companies run concurrently; added a retry flag and bumped TS target.

Changes

Visa CAL timeout patch (patches/israeli-bank-scrapers+6.7.1.patch)
In visa-cal.js, increased the session-storage wait for init in getCards() to 20,000ms (from 10,000ms) and for the auth module in getAuthorizationHeader() to 15,000ms (from 10,000ms). Error/auth construction unchanged.

Scrape scheduling & retry (src/scraper/index.ts, src/types.ts, src/config.ts, src/config.schema.ts)
scrapeAccounts now groups accounts by companyId and runs one concurrent task per company, with members processed sequentially; adds per-company status updates, SAME_COMPANY_DELAY_MS pauses, optional retry of generic failures when retryFailedScrapes is true, and flattens the grouped results. Updated the ScraperConfig type and config wiring to include retryFailedScrapes; added a schema default of true.

Tests (src/bot/storage/sheets.test.ts)
Test config updated to include retryFailedScrapes: true for the Google Sheets storage test (behavior/assertions unchanged).

Tooling (tsconfig.json)
Bumped the TypeScript compilerOptions.target from es2022 to es2024.
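The first cohort boils down to longer timed waits for session-storage data. A generic sketch of such a wait follows; the real change only raises timeout arguments inside israeli-bank-scrapers' visa-cal.js, so `waitForKey` and the `read` callback here are illustrative names, not the library's API.

```typescript
async function waitForKey(
  read: () => string | undefined,
  timeoutMs: number,
  intervalMs = 100,
): Promise<string> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const value = read();
    if (value !== undefined) return value; // session data appeared in time
    if (Date.now() >= deadline) {
      throw new Error(`timed out after ${timeoutMs}ms waiting for session data`);
    }
    await new Promise((r) => setTimeout(r, intervalMs));
  }
}
```

Raising the timeout (10s to 20s for init, 10s to 15s for auth) simply gives the slow endpoint more polling iterations before the error is thrown.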

Sequence Diagram(s)

sequenceDiagram
  participant Orchestrator as Orchestrator
  participant Grouper as GroupingLogic
  participant Parallel as parallelLimit
  participant CompanyTask as CompanyTask
  participant Scraper as scrapeAccount
  participant Emitter as scrapeStatusChanged

  Orchestrator->>Grouper: group accounts by companyId
  Grouper-->>Orchestrator: companyGroups
  Orchestrator->>Emitter: emit initial status (grouped waiting)
  Orchestrator->>Parallel: run CompanyTask for each company concurrently
  Parallel->>CompanyTask: start company task
  CompanyTask->>Emitter: update status for current account
  CompanyTask->>Scraper: scrape current account
  Scraper-->>CompanyTask: result / error
  alt error & retry enabled & type Generic
    CompanyTask->>Emitter: update status to retrying
    CompanyTask->>CompanyTask: wait RETRY_DELAY_MS
    CompanyTask->>Scraper: retry scrape
    Scraper-->>CompanyTask: result
  end
  CompanyTask->>Emitter: set next account status to waiting
  CompanyTask->>CompanyTask: wait SAME_COMPANY_DELAY_MS
  CompanyTask->>Scraper: scrape next account
  CompanyTask-->>Parallel: return company results (array)
  Parallel-->>Orchestrator: all company results (arrays)
  Orchestrator->>Orchestrator: flatten results, finalize, emit final status
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐰
I hopped through queues at break of day,
Grouped every company on my way,
I waited patient for tokens to show,
Then scrapes ran steady, row by row,
Carrots for tests — hip, hop, hooray! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the PR title specifically describes the primary change (increasing timeouts for init data and auth header in the visa-cal scraper), aligning directly with the main objective and the patch file changes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch patch/visa-cal-timeout-fix

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@daniel-hauser changed the title from "fix(visa-cal): increase timeouts for init data and auth header" to "patch(visa-cal): increase timeouts for init data and auth header" on Apr 4, 2026
@github-actions

github-actions bot commented Apr 4, 2026

🐳 Docker image built and pushed: ghcr.io/daniel-hauser/moneyman:pr-845

📏 Size: 1.14GB
🔗 View on GitHub Container Registry

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
src/scraper/index.ts (1)

56-66: Clarify parallelScrapers semantics after scheduler change.

This now limits concurrent company groups, not concurrent individual accounts. Please update naming and/or config description to avoid tuning confusion.

Also applies to: 124-127

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/scraper/index.ts` around lines 56 - 66, The config name/variable
parallelScrapers is now misleading because the scheduler limits concurrent
companyGroups (accounts grouped by companyId) rather than individual accounts;
rename the setting and any variables/usages from parallelScrapers to something
like parallelCompanyGroups (and update related references in scheduler setup and
any docs/README/config schema), update the config description to explicitly
state it limits concurrent company groups (not accounts), and ensure any
code/comments referencing companyGroups, scheduler, or account-level concurrency
reflect the new name and semantics (also apply the same rename/update to the
other occurrence referenced around the scheduler logic where parallelScrapers is
used).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/scraper/index.ts`:
- Around line 108-129: The flattened groupResults from groupTasks (built from
companyGroups) currently yields company-group order instead of the original
input order; change the aggregation so each scrapeOne(account, index) result is
placed into a preallocated results array at its original index (use the provided
index from members) rather than pushing/group-order flattening—i.e., create an
output array sized to the total inputs, have each group task write its
AccountScrapeResult into results[index], and after parallelLimit(...) return
results (or map over indices) so callers receive results in the original input
order; adjust references to groupResults/results accordingly.

---

Nitpick comments:
In `@src/scraper/index.ts`:
- Around line 56-66: The config name/variable parallelScrapers is now misleading
because the scheduler limits concurrent companyGroups (accounts grouped by
companyId) rather than individual accounts; rename the setting and any
variables/usages from parallelScrapers to something like parallelCompanyGroups
(and update related references in scheduler setup and any docs/README/config
schema), update the config description to explicitly state it limits concurrent
company groups (not accounts), and ensure any code/comments referencing
companyGroups, scheduler, or account-level concurrency reflect the new name and
semantics (also apply the same rename/update to the other occurrence referenced
around the scheduler logic where parallelScrapers is used).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0dccc702-39d2-4d7f-b773-af85f1192cf4

📥 Commits

Reviewing files that changed from the base of the PR and between a3f0c53 and 8f8ca64.

📒 Files selected for processing (1)
  • src/scraper/index.ts

Comment thread src/scraper/index.ts Outdated
Comment on lines +108 to +129
```typescript
const groupTasks = [...companyGroups.entries()].map(
  ([, members]) =>
    async () => {
      const groupResults: AccountScrapeResult[] = [];
      for (let j = 0; j < members.length; j++) {
        const { account, index } = members[j];
        if (j > 0) {
          status[index] = `[${account.companyId}] ⏳ Waiting`;
          await scrapeStatusChanged?.(status);
        }
        groupResults.push(await scrapeOne(account, index));
      }
      return groupResults;
    },
);

const groupResults = await parallelLimit<
  (typeof groupTasks)[number],
  AccountScrapeResult[][]
>(groupTasks, Number(parallelScrapers));

const results = groupResults.flat();
```
Contributor


⚠️ Potential issue | 🟠 Major

Preserve original account order in returned results.

The new flattening returns company-group order, not original input order. For interleaved companies, this can misalign any caller that correlates by index.

💡 Proposed fix (keep grouped execution, restore stable output order)
```diff
-  const groupTasks = [...companyGroups.entries()].map(
+  const groupTasks = [...companyGroups.entries()].map(
     ([, members]) =>
       async () => {
-        const groupResults: AccountScrapeResult[] = [];
+        const groupResults: Array<{
+          index: number;
+          result: AccountScrapeResult;
+        }> = [];
         for (let j = 0; j < members.length; j++) {
           const { account, index } = members[j];
           if (j > 0) {
             status[index] = `[${account.companyId}] ⏳ Waiting`;
             await scrapeStatusChanged?.(status);
           }
-          groupResults.push(await scrapeOne(account, index));
+          groupResults.push({ index, result: await scrapeOne(account, index) });
         }
         return groupResults;
       },
   );

-  const groupResults = await parallelLimit<
+  const groupResults = await parallelLimit<
     (typeof groupTasks)[number],
-    AccountScrapeResult[][]
+    Array<{ index: number; result: AccountScrapeResult }>[]
   >(groupTasks, Number(parallelScrapers));

-  const results = groupResults.flat();
+  const results = groupResults
+    .flat()
+    .sort((a, b) => a.index - b.index)
+    .map(({ result }) => result);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/scraper/index.ts` around lines 108 - 129, The flattened groupResults from
groupTasks (built from companyGroups) currently yields company-group order
instead of the original input order; change the aggregation so each
scrapeOne(account, index) result is placed into a preallocated results array at
its original index (use the provided index from members) rather than
pushing/group-order flattening—i.e., create an output array sized to the total
inputs, have each group task write its AccountScrapeResult into results[index],
and after parallelLimit(...) return results (or map over indices) so callers
receive results in the original input order; adjust references to
groupResults/results accordingly.

@daniel-hauser daniel-hauser force-pushed the patch/visa-cal-timeout-fix branch from 09fb6bb to 50bea7e Compare April 4, 2026 16:18
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
src/scraper/index.ts (1)

121-126: ⚠️ Potential issue | 🟠 Major

Still unresolved: return results in the original accounts order.

This is still flattening by company group, so the returned array no longer lines up with the caller's input order. Any consumer that correlates results[i] with accounts[i] will now be wrong; please place each scrapeOne(account, index) result back at its original index before returning.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/scraper/index.ts` around lines 121 - 126, The flattened groupResults
currently returns results grouped by company rather than matching the original
accounts order; change the aggregation so each scrapeOne(account, index) stores
its AccountScrapeResult at its original index and then return an array assembled
by original indices. Specifically, when building groupTasks and handling
parallelLimit(groupTasks, ...), ensure each task calls scrapeOne(account, index)
with the original account index and writes its result into a preallocated
results array (using the index) instead of relying on groupResults.flat();
return that indexed results array so results[i] corresponds to accounts[i].
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/scraper/index.ts`:
- Around line 106-115: The current groupTasks implementation sleeps with await
setTimeout(SAME_COMPANY_DELAY_MS) inside the worker, occupying one of the
limited global worker slots; change to enforce same-company serialization
without blocking the worker by implementing a per-company chain/queue: create a
companyChains map keyed by account.companyId (or a per-company mutex/Promise
chain) and, for each account, enqueue its work by appending to
companyChains[companyId] a Promise that runs the scrape for that account then
returns after SAME_COMPANY_DELAY_MS; in groupTasks (and functions like the
account processing loop that call scrapeStatusChanged) remove the in-worker
await sleep and instead await the per-company chain Promise so the global worker
is free between chained tasks; reference symbols: groupTasks, companyGroups,
SAME_COMPANY_DELAY_MS, scrapeStatusChanged, and the account processing loop to
locate and replace the blocking await with a per-company chaining mechanism.
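The per-company chaining the prompt describes can be sketched as below. This is a sketch of the suggested approach, not code from the PR: instead of sleeping inside one of the limited pool slots, each company keeps its own promise chain, so a slot is freed while the cooldown elapses. `enqueueForCompany` is an illustrative name, and the 1s default mirrors SAME_COMPANY_DELAY_MS.

```typescript
const pause = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

const companyChains = new Map<string, Promise<void>>();

function enqueueForCompany<T>(
  companyId: string,
  work: () => Promise<T>,
  cooldownMs = 1_000,
): Promise<T> {
  const prev = companyChains.get(companyId) ?? Promise.resolve();
  const result = prev.then(work);
  // Extend the chain: the next same-company task waits for this one plus
  // the cooldown; a failure must not break the chain for later tasks.
  companyChains.set(
    companyId,
    result.then(
      () => pause(cooldownMs),
      () => pause(cooldownMs),
    ),
  );
  return result;
}
```

Tasks for different companies never share a chain, so they remain free to run concurrently under whatever global limit the pool enforces.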

---

Duplicate comments:
In `@src/scraper/index.ts`:
- Around line 121-126: The flattened groupResults currently returns results
grouped by company rather than matching the original accounts order; change the
aggregation so each scrapeOne(account, index) stores its AccountScrapeResult at
its original index and then return an array assembled by original indices.
Specifically, when building groupTasks and handling parallelLimit(groupTasks,
...), ensure each task calls scrapeOne(account, index) with the original account
index and writes its result into a preallocated results array (using the index)
instead of relying on groupResults.flat(); return that indexed results array so
results[i] corresponds to accounts[i].
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2b954238-260e-4c51-ad93-0d2f4a1f9742

📥 Commits

Reviewing files that changed from the base of the PR and between 8f8ca64 and 09fb6bb.

📒 Files selected for processing (2)
  • src/scraper/index.ts
  • tsconfig.json
✅ Files skipped from review due to trivial changes (1)
  • tsconfig.json

Comment thread src/scraper/index.ts Outdated
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
src/scraper/index.ts (1)

142-146: Redundant status update for "Waiting" accounts.

Line 143 sets status[index] to "⏳ Waiting" again, but this was already set in the initial loop at lines 72-74 for all non-first members. The status is correct, but this line is duplicative.

However, there may be a subtle difference: if parallelScrapers > 1 and groups run concurrently, by the time we reach a subsequent member in rest, its status might have been displayed as "Waiting" initially but the user may benefit from seeing the update again after the previous member completes. This makes the redundancy acceptable for UX clarity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/scraper/index.ts` around lines 142 - 146, The status assignment
status[index] = `[${account.companyId}] ⏳ Waiting`; inside the rest loop is
redundant because non-first members were already initialized to "Waiting", so
either remove that assignment or guard it so it only runs when it will change
UI: update it only if status[index] !== `[${account.companyId}] ⏳ Waiting` or
when running with concurrent scrapers (e.g., parallelScrapers > 1) to refresh
the UI; keep the subsequent await scrapeStatusChanged?.(status),
SAME_COMPANY_DELAY_MS and groupResults.push(await scrapeOne(account, index))
unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/scraper/index.ts`:
- Around line 142-146: The status assignment status[index] =
`[${account.companyId}] ⏳ Waiting`; inside the rest loop is redundant because
non-first members were already initialized to "Waiting", so either remove that
assignment or guard it so it only runs when it will change UI: update it only if
status[index] !== `[${account.companyId}] ⏳ Waiting` or when running with
concurrent scrapers (e.g., parallelScrapers > 1) to refresh the UI; keep the
subsequent await scrapeStatusChanged?.(status), SAME_COMPANY_DELAY_MS and
groupResults.push(await scrapeOne(account, index)) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 07c4b18f-d9bd-419a-9e0b-0bae515514c1

📥 Commits

Reviewing files that changed from the base of the PR and between 09fb6bb and 802cb97.

📒 Files selected for processing (5)
  • src/bot/storage/sheets.test.ts
  • src/config.schema.ts
  • src/config.ts
  • src/scraper/index.ts
  • src/types.ts
✅ Files skipped from review due to trivial changes (2)
  • src/types.ts
  • src/bot/storage/sheets.test.ts

@daniel-hauser daniel-hauser force-pushed the patch/visa-cal-timeout-fix branch from 802cb97 to 5a5e685 Compare April 4, 2026 20:25
@daniel-hauser changed the title from "patch(visa-cal): increase timeouts for init data and auth header" to "feat: improve scraping reliability and same-company account handling" on Apr 4, 2026
daniel-hauser and others added 9 commits April 5, 2026 00:37
Apply change from eshaham/israeli-bank-scrapers#1066:
- Increase init data timeout from 10s to 20s
- Increase auth header timeout from 10s to 15s

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Accounts sharing the same companyId now run sequentially within their
group while different companies still run in parallel (up to
maxParallelScrapers). This avoids conflicts when multiple accounts
for the same company share a single browser/IP.

Queued accounts show a '⏳ Waiting' status until their turn.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace manual Map-based grouping with Object.groupBy (ES2024).
Bump tsconfig target to es2024 to support it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds a short delay before starting the next account in a same-company
group, giving the previous session time to fully close.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Close each browser context after scraping completes to prevent
  stale sessions from interfering with subsequent same-company scrapes
- Increase same-company delay from 3s to 10s to give the site time
  to fully release the session

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t delay

- Add retryFailedScrapes config option (default: true) that retries
  scrapes failing with GENERIC error after a 5s delay
- Shows '🔄 Retrying' status during retry
- Reduce same-company inter-account delay from 10s to 1s

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@daniel-hauser daniel-hauser force-pushed the patch/visa-cal-timeout-fix branch from 22104f8 to 3f63c8c Compare April 4, 2026 21:40