
T6 M0: Technical plan + analysis notebook for multi-objective vector …#61

Open

carlosrod723 wants to merge 20 commits into AgentOpt:experimental from carlosrod723:t6-multi-objective-m0

Conversation

@carlosrod723

M0 delivery for T6 Multi-Objective Vector Scores.

Deliverables:

  • docs/T6_technical_plan.md — Refined tech plan with API signatures, edge cases, test plan
  • examples/notebooks/t6_m0_analysis.ipynb — Colab notebook (no API keys needed)

Notebook demonstrates current baseline behavior and a working prototype of weighted vs Pareto selection with deterministic tie-break validation.
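The weighted vs Pareto selection mentioned above can be sketched roughly as follows (hypothetical helper names, not the notebook's actual API; assumes every metric is higher-is-better):

```python
import numpy as np

def weighted_select(score_dicts, weights):
    """Index of the candidate with the highest weighted sum of metrics.

    np.argmax returns the first maximum, giving a deterministic tie-break.
    """
    totals = [sum(weights[k] * s[k] for k in weights) for s in score_dicts]
    return int(np.argmax(totals))

def pareto_front(score_dicts):
    """Indices of non-dominated candidates (higher is better on all metrics)."""
    keys = list(score_dicts[0])
    front = []
    for i, si in enumerate(score_dicts):
        # j dominates i if j is >= on every metric and > on at least one
        dominated = any(
            all(sj[k] >= si[k] for k in keys) and any(sj[k] > si[k] for k in keys)
            for j, sj in enumerate(score_dicts) if j != i
        )
        if not dominated:
            front.append(i)
    return front
```

Weighted selection always yields a single index, while the Pareto front keeps every non-dominated candidate, so a downstream deterministic tie-break (e.g. lowest index) is still needed to pick one.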

"""
score, _ = self.get_feedback(query, response, reference, **kwargs)
if isinstance(score, dict):
return float(np.mean(list(score.values())))
Member
We should leave this behavior to be configurable from the Objective side.
It should not be hard coded here.

Member

Also why do we need this method from the Guide to begin with? I guess the question is whether we would require passing objective into Guide?
Or asked differently, should the Guide be the one who creates the Objective and sends them around? @allenanie what do you think?

"""
...

def aggregate_vector_scores(scores: list) -> Union[float, Dict[str, float]]:
Member

As above, the logic should be implemented by Objective.

Isolate all multi-objective logic into one new module (`opto/trainer/objectives.py`) containing **pure functions**:

```
normalize_score() → scalar ↔ dict conversion
```

Member
Let's use a different name. normalize_score implies some sort of scaling or shifting is done.
Let's use something explicit like to_score_dict or some term that is more neutral

@chinganc (Member) left a comment

@doxav (Collaborator) commented Feb 15, 2026

Hi @chinganc

I propose to address your comments by moving all dict → scalar conversion + aggregation policy into opto/trainer/objectives.py (Objective side), and making the behavior configurable via ObjectiveConfig.

Concretely:

  • Rename normalize_score → to_score_dict (with a backwards-compatible alias).
  • Add ObjectiveConfig.scalarize_dict ∈ {"score","mean","weighted"} + score_key so dict → scalar reduction is never silently hard-coded in Guide/Evaluator.
  • Implement dict → scalar reduction in objectives.py (score_dict_to_scalar / to_scalar_score) and use it in select_best/select_top_k for scalar-mode fallbacks.
  • Move mean-per-metric aggregation into objectives.py (aggregate_score_dicts) and make evaluators.aggregate_vector_scores a thin wrapper.
  • Update the M1 notebook + technical plan to demonstrate scalarize_dict explicitly and to recommend overriding Guide.get_score_dict() (not changing the get_feedback() return type).

This keeps Guide responsible for producing raw metrics, and keeps ObjectiveConfig (trainer-side) responsible for aggregation/scalarization/selection, without passing ObjectiveConfig into the Guide.
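As a concrete illustration, the pure functions described above might look like this minimal sketch (signatures and policy names assumed from this comment, not final code):

```python
from typing import Dict, Optional, Union

def to_score_dict(score: Union[float, Dict[str, float]]) -> Dict[str, float]:
    """Neutral scalar <-> dict conversion (the renamed normalize_score)."""
    if isinstance(score, dict):
        return {k: float(v) for k, v in score.items()}
    return {"score": float(score)}

def score_dict_to_scalar(score_dict: Dict[str, float],
                         scalarize_dict: str = "mean",
                         score_key: str = "score",
                         weights: Optional[Dict[str, float]] = None) -> float:
    """Reduce a metric dict to one scalar under an explicit, configured policy."""
    if scalarize_dict == "score":
        # keep only the designated key, ignore the other metrics
        return float(score_dict[score_key])
    if scalarize_dict == "mean":
        return sum(score_dict.values()) / len(score_dict)
    if scalarize_dict == "weighted":
        return sum(weights[k] * score_dict[k] for k in weights)
    raise ValueError(f"Unknown scalarize_dict policy: {scalarize_dict!r}")
```

Because these are pure functions, selection code can call them with whatever ObjectiveConfig specifies, and nothing is silently hard-coded in Guide/Evaluator.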

@chinganc (Member) commented Feb 26, 2026

@doxav In Guide, I think get_score_dict overlaps a bit with get_feedback. For a multi-objective problem, both get_score_dict and get_feedback have to be defined. I think we should require the user only to implement get_feedback and allow it to return a score_dict directly. Any good ideas?

@carlosrod723 (Author)

Good point @chinganc. Here's a proposal:

  • Keep get_score_dict() as an optional override, but make the base Guide class smarter so users only need to implement get_feedback():

```python
def get_feedback(self, query, response, reference=None, **kwargs):
    """Return (score, feedback_str).

    score can be float (scalar) or Dict[str, float] (multi-objective).
    """
    raise NotImplementedError

def get_score_dict(self, query, response, reference=None, **kwargs):
    """Return evaluation score as a dict.

    Default: calls get_feedback() and wraps the result.
    - If get_feedback returns (dict, str): returns the dict directly
    - If get_feedback returns (float, str): returns {"score": float}

    Override only when the scoring path needs different behavior
    (e.g., deepcopy of a stateful env, extra metrics like token counts).
    """
    score, _ = self.get_feedback(query, response, reference, **kwargs)
    if isinstance(score, dict):
        return {k: float(v) for k, v in score.items()}
    return {"score": float(score)}
```

This means:

  • Simple case → user implements get_feedback() returning (float, str) or (dict, str) → works for both scalar and multi-objective automatically
  • Advanced case → user overrides get_score_dict() when scoring needs different behavior (e.g., deepcopy for stateful envs, or augmenting with token metrics)

No breaking changes. The existing guides that return (float, str) still work. Guides that want multi-objective just return (dict, str) from get_feedback().

metric() would also need to handle dict scores from get_feedback(). When it gets a dict, it can use the scalarize_dict policy from ObjectiveConfig, or fall back to mean.

Want me to implement this?

@chinganc (Member)

How about this?

Have the user implement a new _get_feedback, which can return either float or dict per the signature you gave above. We define get_feedback to call _get_feedback and then standardize the result into a dict, so that all downstream usages in Trace (e.g. trainers) can always assume getting a dict from get_feedback. (That is, we move the get_score_dict logic above to here.)

For the new get_score_dict, it just does indexing, and metric further does serialization based on what get_score_dict returns.

@carlosrod723 (Author)

Thanks Ching-An. I like this approach, it's cleaner than what I proposed. To confirm my understanding:

  • User implements _get_feedback(query, response, reference, **kwargs) → (float, str) or (Dict[str, float], str)
  • Base get_feedback() calls _get_feedback() and normalizes: if float, wraps as {"score": float} → always returns (dict, str)
  • get_score_dict() indexes into the dict (e.g., selects objective based on config)
  • metric() serializes from get_score_dict()

This means trainers can always assume a dict from get_feedback(); there's no branching on type. I'll update the PR to follow this pattern. One edge case to flag: our TokenUsageAugmentingGuide for GSM8K augments scores with token metrics not present in _get_feedback(). I'll handle that as an override of get_feedback() that merges the extra metrics after calling super().
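The agreed pattern might look like this standalone sketch (not the actual opto Guide base class; names are taken from the thread, bodies are assumed):

```python
from typing import Dict, Tuple, Union

class Guide:
    def _get_feedback(self, query, response, reference=None, **kwargs
                      ) -> Tuple[Union[float, Dict[str, float]], str]:
        """User-implemented: may return a scalar or a metric dict."""
        raise NotImplementedError

    def get_feedback(self, query, response, reference=None, **kwargs
                     ) -> Tuple[Dict[str, float], str]:
        """Standardize _get_feedback's score into a dict, so trainers can
        always assume (dict, str) with no branching on type."""
        score, feedback = self._get_feedback(query, response, reference, **kwargs)
        if isinstance(score, dict):
            return {k: float(v) for k, v in score.items()}, feedback
        return {"score": float(score)}, feedback

    def get_score_dict(self, query, response, reference=None, **kwargs):
        """Just indexes into the standardized dict; objective selection
        by config would hook in here."""
        score_dict, _ = self.get_feedback(query, response, reference, **kwargs)
        return score_dict

class ScalarGuide(Guide):
    """Legacy-style guide that still returns a plain float."""
    def _get_feedback(self, query, response, reference=None, **kwargs):
        return 0.75, "looks fine"
```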

@chinganc (Member) commented Feb 26, 2026 via email
