feat(discovery): yield-weighted query prioritization (#3) by shaypal5 · Pull Request #205 · DataHackIL/tfht_enforce_idx

shaypal5 · 2026-06-11T21:12:16Z

Summary

PR 3 of 3 (final). Measures which keywords/terms actually produced index-relevant records and prioritizes them when a query budget caps a run — spend goes to proven enforcement-law terms over noisy generic ones.

Signal chain (record → candidate → query): an index-relevant record lists its event_candidate_ids; each candidate records the discovery_queries that found it; each query text feeding a relevant record earns a yield point (deduped per record).
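The signal chain can be illustrated with a minimal sketch. It is not the code in query_yield.py: the function name `compute_query_yield_sketch`, the plain-dict record/candidate shapes, and the assumption that `discovery_queries` holds plain query strings are all hypothetical. Only the field names `event_candidate_ids` and `discovery_queries` come from the description above.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def compute_query_yield_sketch(
    relevant_records: Iterable[Mapping[str, Any]],
    candidates_by_id: Mapping[str, Mapping[str, Any]],
) -> dict[str, int]:
    """Count, per query text, how many index-relevant records it fed."""
    yields: Counter[str] = Counter()
    for record in relevant_records:
        # Collect query texts in a set so each query is credited at most
        # once per record, even if several candidates share it.
        credited: set[str] = set()
        for candidate_id in record.get("event_candidate_ids", []):
            candidate = candidates_by_id.get(candidate_id)
            if candidate is None:
                continue  # assumption: dangling candidate ids are skipped
            credited.update(candidate.get("discovery_queries", []))
        yields.update(credited)  # +1 yield point per credited query text
    return dict(yields)
```

Because the function takes its inputs as arguments and touches no files, it stays pure in the same sense the PR describes for compute_query_yield; caching would live in a separate store.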

- query_yield.py: compute_query_yield (pure) + QueryYieldStore (JSON cache).
- select_run_queries / build_discovery_queries gain a yield_of callback; the cap now orders by (yield, kind priority, recency).
- Engine functions load the cached yield map and pass it to both build and the $-budget guard.
- CLI: denbust query-yield computes from the operational + candidate stores, caches query_yield.json, and prints the top keywords.
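A rough sketch of how a yield_of callback could drive the capped ordering follows. Everything here is illustrative: `SketchQuery`, `select_under_cap`, and the field names are hypothetical, and the sort directions (higher yield first, lower kind_priority first, least recently run first) are assumptions rather than the behavior of select_run_queries.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class SketchQuery:
    """Hypothetical stand-in for the project's query object."""

    text: str
    kind_priority: int  # assumption: lower value = higher-priority kind
    last_run_at: float  # assumption: epoch seconds; older runs sort first


def select_under_cap(
    queries: Sequence[SketchQuery],
    cap: int,
    yield_of: Optional[Callable[[str], int]] = None,
) -> list[SketchQuery]:
    """Return at most `cap` queries, ordered by (yield, kind, recency)."""

    def sort_key(item: tuple[int, SketchQuery]) -> tuple[int, int, float, int]:
        index, query = item
        # Without a callback every query has yield 0, so ordering falls
        # back to kind priority and recency alone.
        query_yield = yield_of(query.text) if yield_of is not None else 0
        # Negate yield so that higher-yield queries sort first; the
        # original position is the final tie-breaker.
        return (-query_yield, query.kind_priority, query.last_run_at, index)

    ranked = sorted(enumerate(queries), key=sort_key)
    return [query for _, query in ranked[:cap]]
```

With a cached map such as `{"proven term": 6}`, passing `lambda text: yield_map.get(text, 0)` as yield_of would move that term to the front of a capped run, while unseen queries keep a yield of 0.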

Validated on the live index (16 records):

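  6  הבאת אדם למדינה אחרת לשם העיסוק בזנות   (bringing a person to another country for the purpose of prostitution, i.e. trafficking for prostitution)
  5  זירת זנות (prostitution scene) / המודל הנורדי (the Nordic model)
  4  סחר בנשים (trafficking in women) / החזקת מקום לשם זנות (keeping a place for prostitution) / השכרת מקום לשם זנות (renting out a place for prostitution)

The proven taxonomy terms top the list, not generic keywords like ליווי (escort) that produce spam.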

Test plan

- 5 new unit tests (yield credit + per-record dedup + empty, store round-trip, high-yield selection wins)
- 1364 unit tests pass; ruff + mypy clean

Completes the 3-PR search prioritization/batching stack (#5 budget ledger → #4 rotation → #3 yield).

🤖 Generated with Claude Code

PR 3 of 3. Measures which keywords/terms actually produced index-relevant records and prioritizes them when a query budget caps a run, so spend goes to proven enforcement-law terms over noisy generic ones.

Signal chain (record -> candidate -> query):
- an index-relevant operational record lists its event_candidate_ids;
- each candidate records the discovery_queries that found it;
- each query text feeding a relevant record earns one yield point (deduped per record).

- query_yield.py: compute_query_yield (pure) + QueryYieldStore (JSON cache).
- select_run_queries / build_discovery_queries gain a yield_of callback; the cap now orders by (yield, kind priority, recency, index).
- Engine functions load the cached yield map and pass yield_of to both build and the $-budget guard.
- CLI `denbust query-yield` computes from the operational + candidate stores, caches query_yield.json, and prints the top keywords.

Validated on the live index (16 records): top-yield terms are the specific trafficking/prostitution-law taxonomy terms ("הבאת אדם למדינה אחרת לשם העיסוק בזנות" (bringing a person to another country for the purpose of prostitution), "המודל הנורדי" (the Nordic model), "סחר בנשים" (trafficking in women)), not generic keywords, which is exactly the spend priority we want.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Adds a yield-based signal to discovery query selection so capped runs spend limited query budget on historically productive keywords (those that have contributed to index-relevant records), with a cached yield map and a CLI to compute it.

Changes:

- Introduce compute_query_yield + QueryYieldStore to compute and cache per-query yield in query_yield.json.
- Thread an optional yield_of callback through build_discovery_queries/select_run_queries and the engine budget guard so ordering becomes (yield, kind priority, recency).
- Add denbust query-yield CLI command, unit tests, and protocol docs updates.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/unit/test_query_yield.py | Adds unit coverage for yield computation, store round-trip, and high-yield prioritization under caps. |
| src/denbust/pipeline.py | Loads cached yield map and applies yield-based ordering for engine query building and budget-guard capping. |
| src/denbust/discovery/state_paths.py | Adds a dedicated `query_yield_path` under discovery state. |
| src/denbust/discovery/query_yield.py | New module implementing yield computation and JSON cache store. |
| src/denbust/discovery/queries.py | Extends query selection/building to accept yield callback and incorporate it into ordering. |
| src/denbust/cli.py | Adds `denbust query-yield` command to compute/cache yield and print top terms. |
| docs/batch_scraping_protocol.md | Documents yield-weighted prioritization and how to compute/cache yield. |

+    def load(self) -> dict[str, int]:
+        if not self.path.exists():
+            return {}
+        data = json.loads(self.path.read_text(encoding="utf-8"))
+        if not isinstance(data, dict):
+            return {}
+        return {str(k): int(v) for k, v in data.items()}


codecov · 2026-06-11T21:15:55Z

Codecov Report

❌ Patch coverage is 95.83333% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.81%. Comparing base (6a5fbc7) to head (20e8648).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/denbust/discovery/query_yield.py | 96.96% | 1 Missing ⚠️ |
| src/denbust/pipeline.py | 90.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #205   +/-   ##
=======================================
  Coverage   92.80%   92.81%           
=======================================
  Files          82       83    +1     
  Lines       12215    12259   +44     
=======================================
+ Hits        11336    11378   +42     
- Misses        879      881    +2

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/denbust/discovery/queries.py | `97.01% <100.00%> (+0.02%)` ⬆️ |
| src/denbust/discovery/state_paths.py | `100.00% <100.00%> (ø)` |
| src/denbust/discovery/query_yield.py | `96.96% <96.96%> (ø)` |
| src/denbust/pipeline.py | `95.48% <90.00%> (-0.05%)` ⬇️ |


github-actions · 2026-06-11T21:16:24Z

pr-agent-context report:

This run includes an unresolved review comment and patch coverage gaps on PR #205.

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, address the patch coverage gaps below, and push all of these changes in a single
commit.

# Copilot Comments

## COPILOT-1
Location: src/denbust/discovery/query_yield.py:67
URL: https://github.com/DataHackIL/tfht_enforce_idx/pull/205#discussion_r3399222722
Root author: copilot-pull-request-reviewer

Comment:
    QueryYieldStore.load() will raise if query_yield.json exists but is corrupted/partially-written (json.JSONDecodeError) or contains non-int values (ValueError/TypeError). Since this cache is read during normal discovery runs, it should fail closed (treat as empty) instead of aborting the run.
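One possible shape for the fail-closed behavior requested in this comment is sketched below. It is a standalone function rather than the actual QueryYieldStore method; the name `load_yield_fail_closed` is hypothetical, and treating every unreadable or malformed cache as an empty map is an assumption about what "fail closed" should mean here.

```python
import json
from pathlib import Path


def load_yield_fail_closed(path: Path) -> dict[str, int]:
    """Load a {query text: yield} map, returning {} on any bad cache."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # OSError covers a missing or unreadable file; ValueError covers
        # json.JSONDecodeError and UnicodeDecodeError (both subclass it).
        return {}
    if not isinstance(data, dict):
        return {}
    try:
        return {str(key): int(value) for key, value in data.items()}
    except (TypeError, ValueError, OverflowError):
        # Non-numeric values such as null, "abc", lists, or infinities.
        return {}
```

A truncated file such as `{"a": ` would then yield an empty map, so a discovery run would fall back to the non-yield ordering instead of aborting. Whether to also log a warning in that case is a separate design choice not covered here.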

# Patch coverage

Patch test coverage is 68.57%; please raise it to 100%. These are the uncovered code lines:
- src/denbust/cli.py: 758, 759, 760, 761, 763, 764, 765, 766, 767, 768, 773, 774, 776, 777, 779, 780, 781, 782, 783, 784
- src/denbust/discovery/query_yield.py: 66
- src/denbust/pipeline.py: 981

Run metadata:

Tool ref: v4.0.19
Tool version: 4.0.19
Trigger: pull request opened
Workflow run: 27377750337 attempt 1
Comment timestamp: 2026-06-11T21:15:35.509275+00:00
PR head commit: 20e8648bd28378b3cfee241cbcbf610b19cf88ba

Copilot AI review requested due to automatic review settings June 11, 2026 21:12

shaypal5 merged commit 45892b1 into main Jun 11, 2026
1 check was pending

shaypal5 deleted the codex/yield-weighted-query-priority branch June 11, 2026 21:12

Copilot AI reviewed Jun 11, 2026
