refactor(evaluator): extract LLMBackedMetric base class to remove duplicated _llm init in built-in metrics #175

Description

@qa-bob

Describe the bug

Summary

Every built-in metric in arksim/evaluator/builtin_metrics.py follows the
same pattern:

class HelpfulnessMetric(QuantitativeMetric):
    def __init__(self, llm: LLM) -> None:
        super().__init__(name="helpfulness")
        self._llm = llm          # ← set directly, bypasses mixin

The super().__init__() call omits llm=llm, and then self._llm is
assigned manually, bypassing _LLMMixin's intended initialization path.
This is repeated identically across 7 classes. The file even contains a
# TODO acknowledging the duplication:

# TODO: we can define a shared metric class that inherits from quant and qual
# metric that has the _llm, _system_prompt, and _user_prompt_template...
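
To illustrate why the direct assignment matters, here is a minimal, self-contained sketch. The `_LLMMixin` and `QuantitativeMetric` classes below are hypothetical stand-ins, not the real arksim implementations, and the `llm_registered` flag is an invented placeholder for whatever setup the real mixin initializer performs. The point is only that anything the mixin does in `__init__` is skipped when `llm` is not passed through `super().__init__()`.

```python
class FakeLLM:
    """Hypothetical stand-in for the real LLM client."""


class _LLMMixin:
    """Hypothetical stand-in; the real mixin's initializer may differ."""

    def __init__(self, *, llm=None, **kwargs):
        super().__init__(**kwargs)
        self._llm = llm
        # Placeholder for any setup the mixin performs when it receives an LLM.
        self.llm_registered = llm is not None


class QuantitativeMetric(_LLMMixin):
    """Hypothetical stand-in for the real base class."""

    def __init__(self, name, llm=None):
        super().__init__(llm=llm)
        self.name = name


class BypassingMetric(QuantitativeMetric):
    """Current pattern: llm is not forwarded, then assigned by hand."""

    def __init__(self, llm):
        super().__init__(name="helpfulness")
        self._llm = llm  # the mixin initializer never saw this value


class ForwardingMetric(QuantitativeMetric):
    """Proposed pattern: llm goes through the mixin initializer."""

    def __init__(self, llm):
        super().__init__(name="helpfulness", llm=llm)


llm = FakeLLM()
bypassing = BypassingMetric(llm)
forwarding = ForwardingMetric(llm)
```

Both instances end up with the same `_llm`, but only `forwarding` has gone through the mixin's setup, so any validation or bookkeeping added there later would silently not apply to the built-in metrics as they are written today.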

### Steps to reproduce

Run the code

### Expected behavior

Suggested approach

Introduce an intermediate base class for LLM-backed metrics that stores the
prompt templates and handles _llm initialization correctly:

class _PromptMetric(QuantitativeMetric):
    """Base for built-in metrics that call an LLM with a fixed prompt pair."""

    _system_prompt: str
    _user_prompt_template: str

    def __init__(self, name: str, llm: BaseLLM) -> None:
        super().__init__(name=name, llm=llm)   # correctly passes llm to mixin

    def score(self, score_input: ScoreInput) -> QuantResult:
        response = self.llm.call(
            [
                {"role": "system", "content": self._system_prompt},
                {"role": "user", "content": self._user_prompt_template.format(
                    **score_input.model_dump()
                )},
            ],
            schema=ScoreSchema,
        )
        return QuantResult(name=self.name, value=response.score, reason=response.reason)

Each concrete metric then only needs to declare its prompts:

class HelpfulnessMetric(_PromptMetric):
    _system_prompt = helpfulness_system_prompt
    _user_prompt_template = helpfulness_user_prompt

    def __init__(self, llm: BaseLLM) -> None:
        super().__init__(name="helpfulness", llm=llm)

Benefits

- Resolves the existing # TODO
- _llm is initialized via the mixin, the intended path
- Adding a new built-in metric is a 3-line subclass
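
As a rough check of the "3-line subclass" claim, the sketch below wires the proposed base class to a stub LLM so it can run without arksim installed. Everything except the shape of `_PromptMetric` is an assumption: `StubLLM`, the dataclass versions of `ScoreInput`, `ScoreSchema`, and `QuantResult`, the simplified `QuantitativeMetric`, and the `ConcisenessMetric` example are all hypothetical, and the sketch reads `self._llm` directly rather than assuming the mixin exposes an `llm` attribute.

```python
from dataclasses import asdict, dataclass


@dataclass
class ScoreInput:
    """Hypothetical stand-in; mimics a pydantic model's model_dump()."""
    question: str
    answer: str

    def model_dump(self):
        return asdict(self)


@dataclass
class ScoreSchema:
    score: float
    reason: str


@dataclass
class QuantResult:
    name: str
    value: float
    reason: str


class StubLLM:
    """Records the messages it receives and returns a fixed score."""

    def __init__(self):
        self.calls = []

    def call(self, messages, schema):
        self.calls.append(messages)
        return schema(score=4.0, reason="stub")


class QuantitativeMetric:
    """Simplified stand-in for the real base class and its mixin."""

    def __init__(self, name, llm=None):
        self.name = name
        self._llm = llm


class _PromptMetric(QuantitativeMetric):
    _system_prompt: str
    _user_prompt_template: str

    def __init__(self, name, llm):
        super().__init__(name=name, llm=llm)

    def score(self, score_input):
        user_prompt = self._user_prompt_template.format(**score_input.model_dump())
        response = self._llm.call(
            [
                {"role": "system", "content": self._system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            schema=ScoreSchema,
        )
        return QuantResult(name=self.name, value=response.score, reason=response.reason)


class ConcisenessMetric(_PromptMetric):  # hypothetical new metric
    _system_prompt = "Rate how concise the answer is."
    _user_prompt_template = "Question: {question}\nAnswer: {answer}"

    def __init__(self, llm):
        super().__init__(name="conciseness", llm=llm)


stub = StubLLM()
result = ConcisenessMetric(stub).score(ScoreInput(question="What is 2+2?", answer="4"))
```

One thing to watch with this design: `str.format` ignores extra keyword arguments but raises `KeyError` for a placeholder that has no matching field, so each template may only reference fields that `ScoreInput` actually provides.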

### Error output or logs

```shell
```

ArkSim version

1.0

Python version

3.11

Operating system

macOS
