
T6 M0: Technical plan + analysis notebook for multi-objective vector …#61

Open

carlosrod723 wants to merge 20 commits into AgentOpt:experimental from carlosrod723:t6-multi-objective-m0

Conversation

@carlosrod723

M0 delivery for T6 Multi-Objective Vector Scores.

Deliverables:

  • docs/T6_technical_plan.md — Refined tech plan with API signatures, edge cases, test plan
  • examples/notebooks/t6_m0_analysis.ipynb — Colab notebook (no API keys needed)

Notebook demonstrates current baseline behavior and a working prototype of weighted vs Pareto selection with deterministic tie-break validation.
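The weighted vs Pareto selection mentioned above can be sketched roughly as follows (hypothetical helper names, not the notebook's actual API; assumes every metric is higher-is-better):

```python
import numpy as np

def weighted_select(score_dicts, weights):
    """Index of the candidate with the highest weighted sum of metrics.

    np.argmax returns the first maximum, giving a deterministic tie-break.
    """
    totals = [sum(weights[k] * s[k] for k in weights) for s in score_dicts]
    return int(np.argmax(totals))

def pareto_front(score_dicts):
    """Indices of non-dominated candidates (higher is better on all metrics)."""
    keys = list(score_dicts[0])
    front = []
    for i, si in enumerate(score_dicts):
        # j dominates i if j is >= on every metric and > on at least one
        dominated = any(
            all(sj[k] >= si[k] for k in keys) and any(sj[k] > si[k] for k in keys)
            for j, sj in enumerate(score_dicts) if j != i
        )
        if not dominated:
            front.append(i)
    return front
```

Weighted selection always yields a single index, while the Pareto front keeps every non-dominated candidate, so a downstream deterministic tie-break (e.g. lowest index) is still needed to pick one.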

"""
score, _ = self.get_feedback(query, response, reference, **kwargs)
if isinstance(score, dict):
return float(np.mean(list(score.values())))
Member
We should leave this behavior to be configurable from the Objective side.
It should not be hard coded here.

Member

Also why do we need this method from the Guide to begin with? I guess the question is whether we would require passing objective into Guide?
Or asked differently, should the Guide be the one who creates the Objective and sends them around? @allenanie what do you think?

"""
...

def aggregate_vector_scores(scores: list) -> Union[float, Dict[str, float]]:
Member

As above, the logic should be implemented by Objective.

Isolate all multi-objective logic into one new module (`opto/trainer/objectives.py`) containing **pure functions**:

```
normalize_score() → scalar ↔ dict conversion
```

Member
Let's use a different name. normalize_score implies some sort of scaling or shifting is done.
Let's use something explicit like to_score_dict or some term that is more neutral

@chinganc (Member) left a comment

@doxav (Collaborator) commented Feb 15, 2026

Hi @chinganc

I propose to address your comments by moving all dict → scalar conversion + aggregation policy into opto/trainer/objectives.py (Objective side), and making the behavior configurable via ObjectiveConfig.

Concretely:

  • Rename normalize_score → to_score_dict (with a backwards-compatible alias).
  • Add ObjectiveConfig.scalarize_dict ∈ {"score","mean","weighted"} + score_key so dict → scalar reduction is never silently hard-coded in Guide/Evaluator.
  • Implement dict → scalar reduction in objectives.py (score_dict_to_scalar / to_scalar_score) and use it in select_best/select_top_k for scalar-mode fallbacks.
  • Move mean-per-metric aggregation into objectives.py (aggregate_score_dicts) and make evaluators.aggregate_vector_scores a thin wrapper.
  • Update the M1 notebook + technical plan to demonstrate scalarize_dict explicitly and to recommend overriding Guide.get_score_dict() (not changing the get_feedback() return type).

This keeps Guide responsible for producing raw metrics, and keeps ObjectiveConfig (trainer-side) responsible for aggregation/scalarization/selection, without passing ObjectiveConfig into the Guide.
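As a concrete illustration, the pure functions described above might look like this minimal sketch (signatures and policy names assumed from this comment, not final code):

```python
from typing import Dict, Optional, Union

def to_score_dict(score: Union[float, Dict[str, float]]) -> Dict[str, float]:
    """Neutral scalar <-> dict conversion (the renamed normalize_score)."""
    if isinstance(score, dict):
        return {k: float(v) for k, v in score.items()}
    return {"score": float(score)}

def score_dict_to_scalar(score_dict: Dict[str, float],
                         scalarize_dict: str = "mean",
                         score_key: str = "score",
                         weights: Optional[Dict[str, float]] = None) -> float:
    """Reduce a metric dict to one scalar under an explicit, configured policy."""
    if scalarize_dict == "score":
        # keep only the designated key, ignore the other metrics
        return float(score_dict[score_key])
    if scalarize_dict == "mean":
        return sum(score_dict.values()) / len(score_dict)
    if scalarize_dict == "weighted":
        return sum(weights[k] * score_dict[k] for k in weights)
    raise ValueError(f"Unknown scalarize_dict policy: {scalarize_dict!r}")
```

Because these are pure functions, selection code can call them with whatever ObjectiveConfig specifies, and nothing is silently hard-coded in Guide/Evaluator.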

@chinganc (Member) commented Feb 26, 2026

@doxav In Guide, I think get_score_dict overlaps a bit with get_feedback. For a multi-objective problem, both get_score_dict and get_feedback have to be defined. I think we should require the user only to implement get_feedback and allow it to return a score_dict directly. Any good ideas?

@carlosrod723 (Author)

Good point @chinganc. Here's a proposal:

  • Keep get_score_dict() as an optional override, but make the base Guide class smarter so users only need to implement get_feedback():

```python
def get_feedback(self, query, response, reference=None, **kwargs):
    """Return (score, feedback_str).

    score can be float (scalar) or Dict[str, float] (multi-objective).
    """
    raise NotImplementedError

def get_score_dict(self, query, response, reference=None, **kwargs):
    """Return evaluation score as a dict.

    Default: calls get_feedback() and wraps the result.
    - If get_feedback returns (dict, str): returns the dict directly
    - If get_feedback returns (float, str): returns {"score": float}

    Override only when the scoring path needs different behavior
    (e.g., deepcopy of a stateful env, extra metrics like token counts).
    """
    score, _ = self.get_feedback(query, response, reference, **kwargs)
    if isinstance(score, dict):
        return {k: float(v) for k, v in score.items()}
    return {"score": float(score)}
```

This means:

  • Simple case → user implements get_feedback() returning (float, str) or (dict, str) → works for both scalar and multi-objective automatically
  • Advanced case → user overrides get_score_dict() when scoring needs different behavior (e.g., deepcopy for stateful envs, or augmenting with token metrics)

No breaking changes. The existing guides that return (float, str) still work. Guides that want multi-objective just return (dict, str) from get_feedback().

metric() would also need to handle dict scores from get_feedback(). When it gets a dict, it can use the scalarize_dict policy from ObjectiveConfig, or fall back to mean.

Want me to implement this?

@chinganc (Member)

How about this?

Have the user implement a new _get_feedback, which can return either float or dict per the signature you gave above. We define get_feedback to call _get_feedback and then standardize the result into a dict, so that all downstream usages in Trace (e.g. trainers) can always assume getting a dict from get_feedback. (That is, we move the get_score_dict logic above to here.)

For the new get_score_dict, it just does indexing, and metric further does serialization based on what get_score_dict returns.

@carlosrod723 (Author)

Thanks Ching-An. I like this approach, it's cleaner than what I proposed. To confirm my understanding:

  • User implements _get_feedback(query, response, reference, **kwargs) → (float, str) or (Dict[str, float], str)
  • Base get_feedback() calls _get_feedback() and normalizes: if float, wraps as {"score": float} → always returns (dict, str)
  • get_score_dict() indexes into the dict (e.g., selects objective based on config)
  • metric() serializes from get_score_dict()

This means trainers can always assume a dict from get_feedback(); there's no branching on type. I'll update the PR to follow this pattern. One edge case to flag: our TokenUsageAugmentingGuide for GSM8K augments scores with token metrics not present in _get_feedback(). I'll handle that as an override of get_feedback() that merges the extra metrics after calling super().
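The agreed pattern might look like this standalone sketch (not the actual opto Guide base class; names are taken from the thread, bodies are assumed):

```python
from typing import Dict, Tuple, Union

class Guide:
    def _get_feedback(self, query, response, reference=None, **kwargs
                      ) -> Tuple[Union[float, Dict[str, float]], str]:
        """User-implemented: may return a scalar or a metric dict."""
        raise NotImplementedError

    def get_feedback(self, query, response, reference=None, **kwargs
                     ) -> Tuple[Dict[str, float], str]:
        """Standardize _get_feedback's score into a dict, so trainers can
        always assume (dict, str) with no branching on type."""
        score, feedback = self._get_feedback(query, response, reference, **kwargs)
        if isinstance(score, dict):
            return {k: float(v) for k, v in score.items()}, feedback
        return {"score": float(score)}, feedback

    def get_score_dict(self, query, response, reference=None, **kwargs):
        """Just indexes into the standardized dict; objective selection
        by config would hook in here."""
        score_dict, _ = self.get_feedback(query, response, reference, **kwargs)
        return score_dict

class ScalarGuide(Guide):
    """Legacy-style guide that still returns a plain float."""
    def _get_feedback(self, query, response, reference=None, **kwargs):
        return 0.75, "looks fine"
```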

@chinganc (Member) commented Feb 26, 2026 via email
