feat(discovery): yield-weighted query prioritization (#3) #205

Merged: shaypal5 merged 1 commit into main from codex/yield-weighted-query-priority on Jun 11, 2026

@shaypal5

Member

Summary

PR 3 of 3 (final). Measures which keywords/terms actually produced index-relevant records and prioritizes them when a query budget caps a run — spend goes to proven enforcement-law terms over noisy generic ones.

Signal chain (record → candidate → query): an index-relevant record lists its event_candidate_ids; each candidate records the discovery_queries that found it; each query text feeding a relevant record earns a yield point (deduped per record).

  • query_yield.py: compute_query_yield (pure) + QueryYieldStore (JSON cache).
  • select_run_queries / build_discovery_queries gain a yield_of callback; the cap now orders by (yield, kind priority, recency).
  • Engine fns load the cached yield map and pass it to both build and the $-budget guard.
  • CLI: denbust query-yield computes from the operational + candidate stores, caches query_yield.json, prints top keywords.
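
The signal chain above can be sketched roughly as follows. This is a minimal illustration, not the code in query_yield.py: the dict-shaped records, the `index_relevant` flag, and the assumption that `discovery_queries` holds plain query strings are hypothetical. Only the names `compute_query_yield`, `event_candidate_ids`, and `discovery_queries` come from this PR.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def compute_query_yield(
    records: Iterable[Mapping[str, Any]],
    candidates_by_id: Mapping[str, Mapping[str, Any]],
) -> dict[str, int]:
    """Count, per query text, how many index-relevant records it fed.

    Assumed shapes (hypothetical, for illustration only):
      record    -> {"index_relevant": bool, "event_candidate_ids": [str, ...]}
      candidate -> {"discovery_queries": [str, ...]}
    """
    yield_counts: Counter[str] = Counter()
    for record in records:
        if not record.get("index_relevant"):
            continue
        # Gather query texts in a set so one record credits each query once,
        # even when several of its candidates were found by the same query.
        texts: set[str] = set()
        for candidate_id in record.get("event_candidate_ids", []):
            candidate = candidates_by_id.get(candidate_id)
            if candidate is None:
                continue  # candidate missing from the store: no credit
            texts.update(candidate.get("discovery_queries", []))
        yield_counts.update(texts)  # +1 yield point per distinct query text
    return dict(yield_counts)
```

The function takes no store or file handles, which is one way to read "pure" in the bullet above: the same records and candidates always give the same yield map.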

Validated on the live index (16 records):

  6  הבאת אדם למדינה אחרת לשם העיסוק בזנות   (trafficking for prostitution)
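  5  זירת זנות / המודל הנורדי   (prostitution scene / the Nordic model)
  4  סחר בנשים / החזקת מקום לשם זנות / השכרת מקום לשם זנות   (trafficking in women / keeping a place for prostitution / renting out a place for prostitution)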

The proven taxonomy terms top the list, not generic keywords like ליווי (escort) that produce spam.

Test plan

  • 5 new unit tests (yield credit + per-record dedup + empty, store round-trip, high-yield selection wins)
  • 1364 unit tests pass; ruff + mypy clean

Completes the 3-PR search prioritization/batching stack (#5 budget ledger → #4 rotation → #3 yield).

🤖 Generated with Claude Code

PR 3 of 3. Measures which keywords/terms actually produced index-relevant
records and prioritizes them when a query budget caps a run, so spend goes to
proven enforcement-law terms over noisy generic ones.

Signal chain (record -> candidate -> query):
- an index-relevant operational record lists its event_candidate_ids;
- each candidate records the discovery_queries that found it;
- each query text feeding a relevant record earns one yield point (deduped per
  record).

- query_yield.py: compute_query_yield (pure) + QueryYieldStore (JSON cache).
- select_run_queries / build_discovery_queries gain a yield_of callback; the cap
  now orders by (yield, kind priority, recency, index).
- Engine fns load the cached yield map and pass yield_of to both build and the
  $-budget guard.
- CLI `denbust query-yield` computes from the operational + candidate stores,
  caches query_yield.json, and prints the top keywords.

Validated on the live index (16 records): top-yield terms are the specific
trafficking/prostitution-law taxonomy terms ("הבאת אדם למדינה אחרת לשם העיסוק
בזנות" (trafficking for prostitution), "המודל הנורדי" (the Nordic model),
"סחר בנשים" (trafficking in women)), not generic keywords, which is exactly
the spend priority we want.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings, June 11, 2026 21:12
shaypal5 merged commit 45892b1 into main, Jun 11, 2026
1 check was pending
shaypal5 deleted the codex/yield-weighted-query-priority branch, June 11, 2026 21:12

Copilot AI left a comment

Contributor

Pull request overview

Adds a yield-based signal to discovery query selection so capped runs spend limited query budget on historically productive keywords (those that have contributed to index-relevant records), with a cached yield map and a CLI to compute it.

Changes:

  • Introduce compute_query_yield + QueryYieldStore to compute and cache per-query yield in query_yield.json.
  • Thread an optional yield_of callback through build_discovery_queries/select_run_queries and the engine budget guard so ordering becomes (yield, kind priority, recency).
  • Add denbust query-yield CLI command, unit tests, and protocol docs updates.
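
The capped ordering described above might look something like the sketch below. It is not the implementation in queries.py: the `Query` dataclass, its fields, and the `select_under_cap` name are hypothetical, and it assumes that "recency" means least recently run first. Only the `yield_of` callback and the (yield, kind priority, recency) ordering come from this PR.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Query:
    """Hypothetical query record, for illustration only."""

    text: str
    kind_priority: int  # assumed: lower number = more important kind
    last_run: float  # assumed: epoch seconds of the last run, 0.0 if never


def select_under_cap(
    queries: Sequence[Query],
    cap: int,
    yield_of: Optional[Callable[[str], int]] = None,
) -> list[Query]:
    """Keep at most `cap` queries, ordered by (yield, kind priority, recency)."""
    # Without a callback every query scores 0, so ordering falls back to
    # kind priority and recency alone.
    score = yield_of if yield_of is not None else (lambda _text: 0)
    ranked = sorted(
        queries,
        # Negate yield so higher-yield queries sort first; the remaining
        # keys sort ascending.
        key=lambda q: (-score(q.text), q.kind_priority, q.last_run),
    )
    return ranked[:cap]
```

Because Python's `sorted` is stable, queries that tie on all three keys keep their input order, so the original index acts as an implicit final tie-breaker.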

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/unit/test_query_yield.py | Adds unit coverage for yield computation, store round-trip, and high-yield prioritization under caps. |
| src/denbust/pipeline.py | Loads cached yield map and applies yield-based ordering for engine query building and budget-guard capping. |
| src/denbust/discovery/state_paths.py | Adds a dedicated query_yield_path under discovery state. |
| src/denbust/discovery/query_yield.py | New module implementing yield computation and JSON cache store. |
| src/denbust/discovery/queries.py | Extends query selection/building to accept yield callback and incorporate it into ordering. |
| src/denbust/cli.py | Adds denbust query-yield command to compute/cache yield and print top terms. |
| docs/batch_scraping_protocol.md | Documents yield-weighted prioritization and how to compute/cache yield. |

Comment on lines +61 to +67

    def load(self) -> dict[str, int]:
        if not self.path.exists():
            return {}
        data = json.loads(self.path.read_text(encoding="utf-8"))
        if not isinstance(data, dict):
            return {}
        return {str(k): int(v) for k, v in data.items()}

@codecov

codecov Bot commented Jun 11, 2026

Codecov Report

❌ Patch coverage is 95.83333% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.81%. Comparing base (6a5fbc7) to head (20e8648).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/denbust/discovery/query_yield.py | 96.96% | 1 Missing ⚠️ |
| src/denbust/pipeline.py | 90.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #205   +/-   ##
=======================================
  Coverage   92.80%   92.81%           
=======================================
  Files          82       83    +1     
  Lines       12215    12259   +44     
=======================================
+ Hits        11336    11378   +42     
- Misses        879      881    +2     

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/denbust/discovery/queries.py | 97.01% <100.00%> (+0.02%) ⬆️ |
| src/denbust/discovery/state_paths.py | 100.00% <100.00%> (ø) |
| src/denbust/discovery/query_yield.py | 96.96% <96.96%> (ø) |
| src/denbust/pipeline.py | 95.48% <90.00%> (-0.05%) ⬇️ |


@github-actions

pr-agent-context report:

This run includes an unresolved review comment and patch coverage gaps on PR #205.

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, address the patch coverage gaps below, and push all of these changes in a single
commit.

# Copilot Comments

## COPILOT-1
Location: src/denbust/discovery/query_yield.py:67
URL: https://github.com/DataHackIL/tfht_enforce_idx/pull/205#discussion_r3399222722
Root author: copilot-pull-request-reviewer

Comment:
    QueryYieldStore.load() will raise if query_yield.json exists but is corrupted/partially-written (json.JSONDecodeError) or contains non-int values (ValueError/TypeError). Since this cache is read during normal discovery runs, it should fail closed (treat as empty) instead of aborting the run.
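
One possible shape for the fail-closed behavior requested in COPILOT-1 is sketched below. It is an illustration only, not the fix that was or will be applied: `load_yield_map` is a hypothetical standalone function standing in for `QueryYieldStore.load()`, and treating every read or parse failure as an empty map is an assumption about the desired policy.

```python
import json
from pathlib import Path


def load_yield_map(path: Path) -> dict[str, int]:
    """Read a cached yield map, returning {} if the cache is unusable."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # OSError covers a missing or unreadable file; ValueError covers
        # json.JSONDecodeError and UnicodeDecodeError (both are subclasses).
        return {}
    if not isinstance(data, dict):
        return {}
    try:
        return {str(k): int(v) for k, v in data.items()}
    except (TypeError, ValueError, OverflowError):
        # A single non-integer value invalidates the whole cache rather
        # than yielding a partially trusted map.
        return {}
```

Dropping the whole map on one bad value is a deliberate simplification here; skipping only the bad entries would be an equally defensible choice.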

# Patch coverage

Patch test coverage is 68.57%; please raise it to 100%. These are the uncovered code lines:
- src/denbust/cli.py: 758, 759, 760, 761, 763, 764, 765, 766, 767, 768, 773, 774, 776, 777, 779, 780, 781, 782, 783, 784
- src/denbust/discovery/query_yield.py: 66
- src/denbust/pipeline.py: 981

Run metadata:

Tool ref: v4.0.19
Tool version: 4.0.19
Trigger: pull request opened
Workflow run: 27377750337 attempt 1
Comment timestamp: 2026-06-11T21:15:35.509275+00:00
PR head commit: 20e8648bd28378b3cfee241cbcbf610b19cf88ba
