chore/sc-41640/reducing-tokens-of-disambiguator by nsantacruz · Pull Request #3262 · Sefaria/Sefaria-Project

nsantacruz · 2026-05-05T09:59:04Z

Description

A brief description of the PR

Code Changes

The following changes were made to the files below

Notes

Any additional notes go here

…d build section-level base refs

_get_commentary_base_ref previously matched nodes by internal key (e.g. 'OrachChaim'), causing misses when the key diverges from the English display title (e.g. Prisha → Tur). It also did not handle the case where the base index has an extra SchemaNode wrapper around its content leaf.

Changes:
- Match nodes by primary English title instead of schema key
- After a successful match, append non-default node titles to base_title to form a valid complex ref (e.g. 'Tur, Orach Chayim')
- Navigate through any intermediate SchemaNodes to the content leaf, then zip with the leaf's addressTypes[:-1] to produce a section-level (not segment-level) base ref
- Return early from the complex/complex branch so the generic section loop only runs for simple bases

Add disambiguator_helpers_test.py covering None/empty inputs, non-commentary, multiple base titles, book-level citing refs, both-simple Torah and Talmud cases, both-complex with matching and mismatching titles, complex/simple XOR cases, and missing base_text_titles.
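The title-joining and `addressTypes[:-1]` steps can be illustrated with a small standalone sketch. It is not the PR's `_get_commentary_base_ref`: the function name, its parameters, and the plain-integer section formatting are assumptions made for illustration, and real Sefaria address types (Talmud folios, for example) need their own formatting.

```python
from typing import List, Sequence


def section_level_base_ref(
    base_title: str,
    node_titles: Sequence[str],
    citing_sections: Sequence[int],
    leaf_address_types: Sequence[str],
) -> str:
    """Build a section-level ref string (hypothetical helper).

    node_titles: non-default node titles appended to the base title.
    leaf_address_types: address types of the content leaf; the last one
    addresses segments, so it is dropped to stay at section level.
    """
    title = ", ".join([base_title, *node_titles])
    section_types = leaf_address_types[:-1]
    # zip stops at the shorter input, so extra citing sections are ignored.
    sections: List[str] = [str(s) for s, _ in zip(citing_sections, section_types)]
    return f"{title} {':'.join(sections)}" if sections else title


print(section_level_base_ref("Tur", ["Orach Chayim"], [1, 2, 3], ["Integer", "Integer"]))
```

With a depth-2 leaf only the first citing section survives, so the result points at a section and not at a single segment.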

High-score Dicta matches (score >= 8) previously bypassed LLM confirmation, causing false positives when the citation was a vague book/folio reference with no direct quote.

Key changes:
- Add _verify_high_score: a dedicated LLM check for high-score candidates with a focused prompt warning about book-title citations, bare folio references, and incidental topical similarity
- Add VERIFICATION_EXAMPLES: 6 few-shot examples (4 NO, 2 YES) covering book titles, folio refs with/without inline quotes, paraphrases, and whole-book references; placed in SystemMessage with cache_control
- Add _get_candidate_text_for_confirmation: pads short candidates with neighboring segments for better confirmation context
- Update _llm_form_prior to flag whole-book/chapter citations early
- Move examples into SystemMessage with explicit cache_control in both _llm_confirm_candidate and _verify_high_score

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
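As a rough illustration of placing few-shot examples in a cached system prompt, the sketch below builds Anthropic-style content blocks as plain dicts. The example rows are hypothetical placeholders (the PR's VERIFICATION_EXAMPLES are not shown on this page), `build_system_blocks` is an invented name, and passing the resulting list as the content of a LangChain SystemMessage is an assumption about how the PR wires it up.

```python
from typing import Dict, List

# Hypothetical placeholders, not the PR's actual VERIFICATION_EXAMPLES.
EXAMPLES: List[Dict[str, str]] = [
    {"citation": "<bare book-title citation>",
     "candidate": "<topically similar segment>",
     "answer": "NO"},
    {"citation": "<folio reference with inline quote>",
     "candidate": "<segment containing the quote>",
     "answer": "YES"},
]


def build_system_blocks(instructions: str, examples: List[Dict[str, str]]) -> List[dict]:
    """Return system content blocks with a cache breakpoint after the examples."""
    shots = "\n\n".join(
        f"Citation: {e['citation']}\nCandidate: {e['candidate']}\nAnswer: {e['answer']}"
        for e in examples
    )
    return [
        {"type": "text", "text": instructions},
        # The breakpoint on the last block covers everything before it,
        # so the static instructions and examples can be reused across calls.
        {"type": "text", "text": shots, "cache_control": {"type": "ephemeral"}},
    ]


blocks = build_system_blocks("Answer YES or NO.", EXAMPLES)
```

Keeping the examples in the static system prompt, and only the citation and candidate in the per-call message, is what lets repeated calls share the cached prefix.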

Skip non-segment spans that already have `llm_resolved_ref_non_segment` set, and treat ambiguous losers (llm_ambiguous_option_valid=False) as resolved so they are not re-grouped and re-dispatched.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
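A minimal sketch of this skip rule, assuming spans are plain dicts carrying the two fields named in the commit message. The `needs_dispatch` helper is hypothetical and not taken from the PR.

```python
from typing import List


def needs_dispatch(span: dict) -> bool:
    """Return True if the span should still be sent for resolution."""
    # Already resolved at the non-segment level: nothing left to do.
    if span.get("llm_resolved_ref_non_segment"):
        return False
    # An ambiguous option that lost is treated as resolved.
    # `is False` keeps spans where the flag is missing (None) eligible.
    if span.get("llm_ambiguous_option_valid") is False:
        return False
    return True


spans: List[dict] = [
    {"id": 1},
    {"id": 2, "llm_resolved_ref_non_segment": "Tur, Orach Chayim 1"},
    {"id": 3, "llm_ambiguous_option_valid": False},
    {"id": 4, "llm_ambiguous_option_valid": True},
]
pending_ids = [s["id"] for s in spans if needs_dispatch(s)]
```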

False positives occurred when Dicta matched a high-score quote elsewhere in the document body rather than adjacent to the actual citation marker. Now only bypass LLM verification when the matched phrase is within 60 characters of the citation in the windowed text, measuring from the nearest edges of the phrase and citation intervals.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
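The nearest-edge measurement can be sketched as a gap between two character intervals. The half-open interval representation and the function names are assumptions; only the 60-character threshold comes from the commit message.

```python
from typing import Tuple

Interval = Tuple[int, int]  # half-open [start, end) character offsets (assumed)

MAX_GAP_CHARS = 60  # threshold stated in the commit message


def edge_gap(a: Interval, b: Interval) -> int:
    """Characters between the nearest edges; 0 if the intervals touch or overlap."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def phrase_is_adjacent(phrase: Interval, citation: Interval) -> bool:
    """True when the matched phrase is close enough to skip LLM verification."""
    return edge_gap(phrase, citation) <= MAX_GAP_CHARS
```

Because the gap is clamped at zero, a phrase that contains the citation counts as distance 0 regardless of how long the phrase is.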

…ref_but_rejected

When _confirm_candidate returns NO (in both the Dicta and fallback search paths), save the rejected candidate ref on the span under llm_resolved_ref_but_rejected. Adds rejected_ref to NonSegmentResolutionResult and threads it through the apply functions so it is written even when no resolution is ultimately found.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
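One way to picture the threading of the rejected ref is the sketch below. Only the class name, the rejected_ref field, and the two span keys come from the commit messages; the resolved_ref field, the dict-shaped span, and the `apply_result` function are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NonSegmentResolutionResult:
    resolved_ref: Optional[str] = None  # assumed field name
    rejected_ref: Optional[str] = None  # candidate the LLM answered NO to


def apply_result(span: dict, result: NonSegmentResolutionResult) -> dict:
    """Write the outcome onto the span (hypothetical apply function)."""
    if result.resolved_ref is not None:
        span["llm_resolved_ref_non_segment"] = result.resolved_ref
    # Written independently of the resolution, so a rejection is recorded
    # even when no ref is ultimately found.
    if result.rejected_ref is not None:
        span["llm_resolved_ref_but_rejected"] = result.rejected_ref
    return span


span = apply_result({}, NonSegmentResolutionResult(rejected_ref="Tur, Orach Chayim 1"))
```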

…Text to best-matching window

- _dicta_phrase_distance: replace abs-distance formula with proper interval logic so any overlap between the resolution phrase and the citation returns 0 (previously a phrase containing the citation could return a large positive distance)
- resolution_phrase: when Dicta provides both baseMatchedText and compMatchedText, use Levenshtein sliding-window search (normalized to strip nikkud/cantillation) to find the sub-span of baseMatchedText closest to compMatchedText, then map positions back to the original string so the result is always a true substring

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
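The sliding-window search with position mapping might look roughly like the sketch below. It is a simplified stand-in and not the PR's code: it uses a pure-Python edit distance in place of the levenshtein package, strips every Unicode combining mark (category Mn, which includes Hebrew nikkud and cantillation), and assumes a fixed window width equal to the normalized comparison text.

```python
import unicodedata
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def strip_marks(text: str) -> Tuple[str, List[int]]:
    """Drop combining marks; also return each kept char's original index."""
    kept, index_map = [], []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) != "Mn":
            kept.append(ch)
            index_map.append(i)
    return "".join(kept), index_map


def best_window(base: str, comp: str) -> str:
    """Return the substring of `base` whose normalized form is closest to `comp`."""
    norm_base, index_map = strip_marks(base)
    norm_comp, _ = strip_marks(comp)
    width = len(norm_comp)
    if width == 0 or width >= len(norm_base):
        return base
    best = min(
        range(len(norm_base) - width + 1),
        key=lambda s: levenshtein(norm_base[s:s + width], norm_comp),
    )
    start = index_map[best]
    end = index_map[best + width - 1] + 1
    # Keep any marks attached to the last character of the window.
    while end < len(base) and unicodedata.category(base[end]) == "Mn":
        end += 1
    return base[start:end]
```

Mapping through `index_map` is what guarantees the returned text is sliced from the original string, marks included, and never from the normalized copy.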

Replace the disabled `if False` gate with a rule derived from 1,600-row statistical analysis:
- section_level=True AND score>=8 AND dist<=10: 99.3% precision (2 misses/280)
- score>=15 AND dist<=5 (any section level): 100% precision (0 misses/101)

Previous rule (score>=6 / score>=8 with no distance gate) had ~4.3% miss rate and included a risky non-section-level branch at only ~88% precision.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
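The two branches of the rule translate directly into a predicate. The function and parameter names below are illustrative only, and reading `dist` as the character distance between the matched phrase and the citation is an assumption. The precision helper just reproduces the arithmetic behind the quoted figures.

```python
def can_bypass_llm(score: float, dist: int, section_level: bool) -> bool:
    """Return True when a Dicta match may skip LLM verification (sketch)."""
    # Very strong, very close matches pass at any section level.
    if score >= 15 and dist <= 5:
        return True
    # Otherwise require a section-level ref plus a looser score/distance pair.
    return section_level and score >= 8 and dist <= 10


def precision_pct(misses: int, total: int) -> float:
    """Precision as a percentage, rounded to one decimal place."""
    return round(100 * (total - misses) / total, 1)
```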

gitvelocity-reviewer · 2026-05-05T09:59:52Z

📊 Code Quality Score: 27/100

34 × 0.8 (Large ESF) = 27.2, rounded to 27

| Category | Score | Factors |
| --- | --- | --- |
| 🔭 Scope | 8/20 | 3 new analysis scripts, 1 new dependency (levenshtein), gitignore update, CSV data file committed; touches production linker code paths but no new APIs or endpoints |
| 🏗️ Architecture | 5/20 | Scripts reuse production helpers appropriately; duplicates production threshold logic creating maintenance risk; no new architectural patterns |
| ⚙️ Implementation | 10/20 | ThreadPoolExecutor parallelism, multi-stage analysis logic, token counting via API, CSV I/O; moderate complexity; bugs include redundant retry logic and thread-unsafe env var mutation |
| ⚠️ Risk | 6/20 | Analysis scripts only (not production code); committed CSV data file is repository hygiene risk; hardcoded model names may be incorrect; no migrations or auth changes |
| ✅ Quality | 3/15 | No tests added; unused import (asdict); misleading help text (default 100 vs actual 2000); large data file committed; good docstrings partially offset quality concerns |
| 🔒 Perf / Security | 2/5 | 50-thread parallel API calls considered; no retry/backoff for rate limits; no security concerns for analysis scripts |
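The arithmetic behind the headline score can be checked from the category scores: they sum to the 34 used in the formula, and the 0.8 multiplier brings that to 27.2. The sketch below assumes the final step is ordinary rounding to the nearest integer; the variable names are illustrative.

```python
category_scores = {
    "Scope": 8,
    "Architecture": 5,
    "Implementation": 10,
    "Risk": 6,
    "Quality": 3,
    "Perf / Security": 2,
}
LARGE_ESF_MULTIPLIER = 0.8  # the "Large ESF" factor quoted in the comment

raw_total = sum(category_scores.values())
scaled = raw_total * LARGE_ESF_MULTIPLIER
final_score = round(scaled)
print(raw_total, final_score)
```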

Scored by GitVelocity · How are scores calculated?

nsantacruz added 30 commits February 19, 2026 22:41

chore: enable debug mode and adjust sampling parameters

7108cd8

chore: add "anti_talmud_no_book" tag to traceable functions in disambiguator

4f053d1

chore: add CSV test cases for disambiguator functionality

09343a0

test: update disambiguator_test_set.csv

678064f

chore: update disambiguator_test_set.csv to include additional tokens for ambiguous entries

6a0e703

chore: add more rows to "optional"

a339782

chore: update disambiguator.py to clarify keyword spelling guidelines

6f91a4b

chore: add another correct option

e78530e

chore: update parameters of test

4888e38

chore: update disambiguator.py for new model version and debugging tags

31513f8

refactor: major refactor to reduce code duplication by opus 4.6

00d48d9

chore: normalize text before sending to llm to reduce tokens

3d7856b

chore: update LANGSMITH_DEBUG_TAG and remove unused expanded query logic

8f475fc

chore: update LANGSMITH_DEBUG_TAG and add highlight feature to candidates

b0fbb5f

chore: implement caching for _llm_form_prior to improve performance

78ca01f

chore: implement bag-of-words scoring to reduce candidates before LLM processing

c65aa9c

chore: add support for Google Generative AI and update token reduction logic

3b077cc

chore: implement caching for debug samples to improve efficiency

1163d9a

chore: reduce DEBUG_LIMIT for sampling in debug mode to improve performance

996997a

chore: add script to export documents from MongoDB to CSV with enhanced reference handling

53199e3

chore: add .db to gitignore

eb33d5d

chore: disambiguator scripts

7e9b9a6

chore: change debug limit

280506f

refactor: use base text caching and remove base texts for whole books

ff52b55

chore: update disambiguator scripts

2ffb24e

chore: add prior to candidate selection, refactor base text calculation

5d31a68

chore: more disambiguator scripts

5789ed0

chore: update tests

17ef0b7

chore: update scripts

08fa91a

nsantacruz and others added 19 commits April 29, 2026 17:21

chore: add lang param

462d531

chore: strip footnotes if citation is outside footnote

4b6fd79

chore: update scripts

5e333dc

chore: add levenshtein as requirement

3e76d93

chore: tests

34cad14

chore: improve dicta_high_score heuristic

153e323

chore: include base text summary in prior and remove base text from future llm calls

5f408ee

chore: add citing ref to prompts

72a1677

chore: update scripts

00b4482

Merge branch 'master' into chore/sc-41640/reducing-tokens-of-disambiguator

694ddea

fix: move prior to after dicta. remove google

4c36dfd

fix: shufflin

b692f4e

feat: allow specifying debug parent ref

7b76c3a

nsantacruz wants to merge 49 commits into master from chore/sc-41640/reducing-tokens-of-disambiguator