feat(discovery): yield-weighted query prioritization (#3) #205
Merged
PR 3 of 3. Measures which keywords/terms actually produced index-relevant
records and prioritizes them when a query budget caps a run, so spend goes to
proven enforcement-law terms over noisy generic ones.
Signal chain (record -> candidate -> query):
- an index-relevant operational record lists its event_candidate_ids;
- each candidate records the discovery_queries that found it;
- each query text feeding a relevant record earns one yield point (deduped per
record).
- query_yield.py: compute_query_yield (pure) + QueryYieldStore (JSON cache).
- select_run_queries / build_discovery_queries gain a yield_of callback; the cap
now orders by (yield, kind priority, recency, index).
- Engine fns load the cached yield map and pass yield_of to both build and the
$-budget guard.
- CLI `denbust query-yield` computes from the operational + candidate stores,
caches query_yield.json, and prints the top keywords.
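The signal chain above can be sketched as a small pure function. This is a minimal illustration, not the code in `query_yield.py`: the function name, the plain-dict record and candidate shapes, and the assumption that `discovery_queries` holds query texts as strings are all hypothetical.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def compute_query_yield_sketch(
    relevant_records: Iterable[Mapping[str, Any]],
    candidates_by_id: Mapping[str, Mapping[str, Any]],
) -> dict[str, int]:
    """Count, per query text, how many index-relevant records it fed.

    Assumed shapes (hypothetical): each record is a mapping with an
    "event_candidate_ids" list, and each candidate is a mapping with a
    "discovery_queries" list of query-text strings.
    """
    points: Counter = Counter()
    for record in relevant_records:
        # Gather the distinct query texts behind this record so that a
        # query earns at most one point per record (the dedup step).
        texts: set = set()
        for candidate_id in record.get("event_candidate_ids", []):
            candidate = candidates_by_id.get(candidate_id)
            if candidate is None:
                continue  # candidate not in the store; nothing to credit
            texts.update(candidate.get("discovery_queries", []))
        points.update(texts)
    return dict(points)
```

Because the function takes plain inputs and touches no files, it can be tested without the operational or candidate stores, which fits the "pure" label in the description.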
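Validated on the live index (16 records): top-yield terms are the specific
trafficking/prostitution-law taxonomy terms ("הבאת אדם למדינה אחרת לשם העיסוק בזנות"
[bringing a person to another country for the purpose of prostitution],
"המודל הנורדי" [the Nordic model], "סחר בנשים" [trafficking in women]) rather than
generic keywords, which is exactly the spend priority we want.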
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
Adds a yield-based signal to discovery query selection so capped runs spend limited query budget on historically productive keywords (those that have contributed to index-relevant records), with a cached yield map and a CLI to compute it.
Changes:
- Introduce `compute_query_yield` + `QueryYieldStore` to compute and cache per-query yield in `query_yield.json`.
- Thread an optional `yield_of` callback through `build_discovery_queries` / `select_run_queries` and the engine budget guard so ordering becomes (yield, kind priority, recency).
- Add `denbust query-yield` CLI command, unit tests, and protocol docs updates.
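A rough sketch of how a cap could order queries by (yield, kind priority, recency, index) follows. It is not the project's `select_run_queries`: the `QuerySketch` type, its field names, the function name, and the tie-break directions (lower `kind_priority` first, least recently run first) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class QuerySketch:
    """Hypothetical stand-in for a discovery query (not the project's type)."""

    text: str
    kind_priority: int  # assumption: lower value = higher-priority kind
    last_run_at: float  # assumption: epoch seconds; 0.0 means never run


def select_under_cap(
    queries: Sequence[QuerySketch],
    cap: int,
    yield_of: Optional[Callable[[str], int]] = None,
) -> list[QuerySketch]:
    """Return at most `cap` queries, highest yield first."""

    def sort_key(item: tuple[int, QuerySketch]) -> tuple[int, int, float, int]:
        index, query = item
        points = yield_of(query.text) if yield_of is not None else 0
        # Negate yield so larger yields sort first; the remaining fields
        # break ties by kind, then least recently run, then input order.
        return (-points, query.kind_priority, query.last_run_at, index)

    ranked = sorted(enumerate(queries), key=sort_key)
    return [query for _, query in ranked[: max(cap, 0)]]
```

With no `yield_of` callback every query gets zero points, so the order falls back to the remaining keys. A cached yield map can be passed as `yield_of=lambda text: yield_map.get(text, 0)`.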
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/unit/test_query_yield.py | Adds unit coverage for yield computation, store round-trip, and high-yield prioritization under caps. |
| src/denbust/pipeline.py | Loads cached yield map and applies yield-based ordering for engine query building and budget-guard capping. |
| src/denbust/discovery/state_paths.py | Adds a dedicated query_yield_path under discovery state. |
| src/denbust/discovery/query_yield.py | New module implementing yield computation and JSON cache store. |
| src/denbust/discovery/queries.py | Extends query selection/building to accept yield callback and incorporate it into ordering. |
| src/denbust/cli.py | Adds denbust query-yield command to compute/cache yield and print top terms. |
| docs/batch_scraping_protocol.md | Documents yield-weighted prioritization and how to compute/cache yield. |
Comment on lines +61 to +67

    def load(self) -> dict[str, int]:
        if not self.path.exists():
            return {}
        data = json.loads(self.path.read_text(encoding="utf-8"))
        if not isinstance(data, dict):
            return {}
        return {str(k): int(v) for k, v in data.items()}

Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #205 +/- ##
=======================================
Coverage 92.80% 92.81%
=======================================
Files 82 83 +1
Lines 12215 12259 +44
=======================================
+ Hits 11336 11378 +42
- Misses 879 881 +2
pr-agent-context report: This run includes an unresolved review comment and patch coverage gaps on PR #205.
For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.
After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, address the patch coverage gaps below, and push all of these changes in a single
commit.
# Copilot Comments
## COPILOT-1
Location: src/denbust/discovery/query_yield.py:67
URL: https://github.com/DataHackIL/tfht_enforce_idx/pull/205#discussion_r3399222722
Root author: copilot-pull-request-reviewer
Comment:
QueryYieldStore.load() will raise if query_yield.json exists but is corrupted/partially-written (json.JSONDecodeError) or contains non-int values (ValueError/TypeError). Since this cache is read during normal discovery runs, it should fail closed (treat as empty) instead of aborting the run.
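One way the suggested behavior could look is sketched below as a standalone function. This is an illustration under assumptions, not the fix that was merged: `load_yield_map` is a hypothetical name, and treating every read or parse failure as an empty map is one possible policy.

```python
import json
from pathlib import Path


def load_yield_map(path: Path) -> dict[str, int]:
    """Read a cached yield map, treating any unusable cache as empty."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # OSError covers a missing or unreadable file. ValueError covers
        # json.JSONDecodeError (truncated or corrupted JSON) and
        # UnicodeDecodeError, both of which subclass it.
        return {}
    if not isinstance(data, dict):
        return {}
    try:
        return {str(key): int(value) for key, value in data.items()}
    except (TypeError, ValueError, OverflowError):
        # Non-numeric values such as null, "abc", or Infinity.
        return {}
```

Dropping the whole map when a single value is bad keeps the policy simple. Skipping only the offending entries would be an alternative if partial caches are worth keeping.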
# Patch coverage
Patch test coverage is 68.57%; please raise it to 100%. These are the uncovered code lines:
- src/denbust/cli.py: 758, 759, 760, 761, 763, 764, 765, 766, 767, 768, 773, 774, 776, 777, 779, 780, 781, 782, 783, 784
- src/denbust/discovery/query_yield.py: 66
- src/denbust/pipeline.py: 981
Summary
PR 3 of 3 (final). Measures which keywords/terms actually produced index-relevant records and prioritizes them when a query budget caps a run, so spend goes to proven enforcement-law terms over noisy generic ones.
Signal chain (record → candidate → query): an index-relevant record lists its `event_candidate_ids`; each candidate records the `discovery_queries` that found it; each query text feeding a relevant record earns a yield point (deduped per record).
- `query_yield.py`: `compute_query_yield` (pure) + `QueryYieldStore` (JSON cache).
- `select_run_queries` / `build_discovery_queries` gain a `yield_of` callback; the cap now orders by (yield, kind priority, recency).
- Engine fns load the cached yield map and pass `yield_of` to both build and the $-budget guard.
- `denbust query-yield` computes from the operational + candidate stores, caches `query_yield.json`, prints top keywords.
Validated on the live index (16 records):
The proven taxonomy terms top the list, not generic keywords like ליווי ("escort") that produce spam.
Test plan
Completes the 3-PR search prioritization/batching stack (#5 budget ledger → #4 rotation → #3 yield).
🤖 Generated with Claude Code