feat: add deterministic quality scoring engine with tests #15

Open
yuliuyi717-ux wants to merge 2 commits into Mint-Claw:main from yuliuyi717-ux:codex/quality-scoring-issue-1

Conversation

@yuliuyi717-ux

/claim #1

Implemented a deterministic, multi-dimensional quality scorer for structured submissions, with the required output schema, benchmark coverage, and sample scorecards.

What is included:

  • quality_scorer.py
    • auto-detects json, markdown, code, text
    • scores 5 dimensions with required weights:
      • completeness 0.30
      • format_compliance 0.20
      • coverage 0.25
      • clarity 0.15
      • validity 0.10
    • returns required schema:
      • weighted_score
      • quality_rating
      • scores
      • feedback
      • pass_threshold
  • tests/test_quality_scorer.py
    • format detection tests
    • output schema test
    • weighted-score consistency test
    • 100 submissions performance test (<10s)
  • ground-truth tolerance test (MAE <= 0.05)
    • sample scorecards file integrity test
  • sample_scorecards.json
    • 20 sample scored outputs
  • scripts/generate_sample_scorecards.py
    • regenerates the sample scorecards
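For reviewers skimming the weights, here is a minimal sketch of the weighted aggregation and output schema described above. The function names, rating cutoffs (0.85 / 0.70), and pass threshold are illustrative assumptions, not the actual `quality_scorer.py` API:

```python
# Illustrative sketch of the weighted scoring scheme described in this PR.
# The real quality_scorer.py may use different names and thresholds.
WEIGHTS = {
    "completeness": 0.30,
    "format_compliance": 0.20,
    "coverage": 0.25,
    "clarity": 0.15,
    "validity": 0.10,
}

def weighted_score(scores: dict) -> float:
    # Each dimension score is assumed to lie in [0, 1]; weights sum to 1.0.
    return round(sum(scores[dim] * w for dim, w in WEIGHTS.items()), 4)

def scorecard(scores: dict, threshold: float = 0.70) -> dict:
    # Hypothetical scorecard matching the required output schema fields.
    total = weighted_score(scores)
    rating = "high" if total >= 0.85 else "medium" if total >= threshold else "low"
    return {
        "weighted_score": total,
        "quality_rating": rating,      # cutoff values here are assumed
        "scores": scores,
        "feedback": [],                # per-dimension feedback would go here
        "pass_threshold": total >= threshold,
    }
```

Because the weights sum to 1.0, a submission scoring 1.0 on every dimension yields a weighted score of exactly 1.0, which keeps the scale easy to interpret.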

Validation:

  • python3 -m unittest discover -s tests -p 'test_*.py'

@yuliuyi717-ux
Author

Quick verification note:

  • python3 -m unittest discover -s tests -p 'test_*.py' passes locally.
  • The suite includes format detection, schema validation, weighted-score consistency, and a 100-submission benchmark check under 10s.
  • Included sample_scorecards.json with 20 generated scorecards plus a generator script for reproducibility.

If you want stricter calibration checks against your provided 20-item ground-truth set, I can wire that in directly.

@yuliuyi717-ux
Author

Follow-up update pushed:

  • added ground-truth calibration utility (evaluate_ground_truth_submission_set) to compute MAE + tolerance pass/fail
  • added CLI script (scripts/evaluate_ground_truth.py) to validate against a provided 20-item truth set
  • added test coverage for tolerance gating
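The MAE + tolerance gate can be sketched as follows. This is not the PR's actual `evaluate_ground_truth_submission_set`; the function name, argument shapes, and return keys here are hypothetical:

```python
# Illustrative sketch of the MAE + tolerance pass/fail gate described above.
# Names and signatures are assumptions, not the PR's actual utility.
def evaluate_against_ground_truth(predicted, truth, tolerance=0.05):
    """Compare predicted weighted scores against ground-truth scores.

    predicted, truth: parallel lists of scores in [0, 1].
    Returns the mean absolute error and whether it clears the tolerance.
    """
    if len(predicted) != len(truth):
        raise ValueError("score lists must be the same length")
    mae = sum(abs(p - t) for p, t in zip(predicted, truth)) / len(truth)
    return {"mae": round(mae, 4), "passed": mae <= tolerance}
```

Gating on MAE rather than per-item error means a few slightly-off scores can be absorbed, as long as the average deviation across the 20-item set stays within +/-0.05.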

Validation:

  • PYTHONPATH=. python3 -m unittest discover -s tests -p 'test_*.py'

This directly strengthens the acceptance item around +/-0.05 calibration checks.
