feat: Adds GenRM pairwise comparison resource server to support RLHF training workflows. #679

Open
ffrujeri wants to merge 10 commits into ffrujeri/genrm-model from ffrujeri/genrm-compare-rs
@ffrujeri ffrujeri commented Feb 12, 2026

GenRM Compare Resource Server & Cohort-Based Verify

What does this PR do?

This PR adds a production-ready Resource Server for comparing multiple candidate responses using GenRM models, and moves RLHF-specific reward logic (cohort buffering and comparison) into that server so that rollout collection and consumer libraries (e.g. nemo-rl) stay generic.

Issues

Summary

In RLHF, rewards are relative to other rollouts for the same task (e.g. same prompt), not independent. This PR addresses that by:

  • Cohort-based verify: The genrm_compare server’s /verify endpoint buffers rollouts by prompt (and optional principle). When num_rollouts_per_prompt rollouts have been received for a prompt, it runs pairwise comparison, aggregates scores, and returns the appropriate reward to each of the N callers. Callers naturally “wait” until their cohort is complete via the async verify flow.
  • No RLHF hacks in Gym or NeMo RL: Rollout collection stays a simple “post each row to agent /run”. The agent calls the resources server’s /verify with the response; genrm_compare owns all buffering and comparison. No comparison strategy or prompt buffering in rollout_collection.py.
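
The cohort-based verify flow described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `CohortBuffer`, `compare_fn`, and the `default_score` default are hypothetical names and values, and error handling (e.g. a failed comparison or a cohort that never fills) is left out.

```python
import asyncio
from collections import defaultdict


class CohortBuffer:
    """Hypothetical sketch: hold verify calls until their cohort is full."""

    def __init__(self, cohort_size, compare_fn, default_score=0.0):
        self.cohort_size = cohort_size
        self.compare_fn = compare_fn        # async: list of responses -> list of rewards
        self.default_score = default_score  # assumed value, returned when N <= 1
        self.pending = defaultdict(list)    # prompt_key -> [(response, future), ...]
        self.lock = asyncio.Lock()

    async def verify(self, prompt_key, response):
        if self.cohort_size <= 1:
            return self.default_score
        future = asyncio.get_running_loop().create_future()
        cohort = None
        async with self.lock:
            self.pending[prompt_key].append((response, future))
            if len(self.pending[prompt_key]) == self.cohort_size:
                # This caller completes the cohort and takes ownership of it.
                cohort = self.pending.pop(prompt_key)
        if cohort is not None:
            rewards = await self.compare_fn([resp for resp, _ in cohort])
            for (_, fut), reward in zip(cohort, rewards):
                fut.set_result(reward)
        # Every caller, including the last one, waits on its own future.
        return await future
```

In this sketch the caller that completes the cohort runs the comparison and resolves the futures of the other N-1 callers, which is one way to get the "callers wait until their cohort is complete" behavior without any buffering on the rollout collection side.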

Key features

  • Cohort-based verify: Configurable num_rollouts_per_prompt; verify buffers by prompt (and principle), runs comparison when cohort is full, distributes rewards.
  • Batch /compare API: Direct comparison of N response_objs (e.g. for scripts or tests).
  • Pairwise comparison: Circular and all-pairs strategies; tiebreaker and length-based bonuses; optional principle-based judging.
  • GenRM model alignment: Config aligned with genrm_model (server name genrm_model; custom roles response_1, response_2, principle).
  • Clean boundaries: Zero GenRM-specific code in rollout collection or config types; all RLHF logic in genrm_compare.
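
For reviewers unfamiliar with the two pairing strategies, here is one plausible reading of "circular" and "all-pairs", as a sketch only. The function names are hypothetical and the strategies implemented in this PR may order or select pairs differently.

```python
from itertools import combinations


def circular_pairs(n):
    """Compare each response i with its successor (i + 1) mod n: n comparisons."""
    if n < 2:
        return []
    return [(i, (i + 1) % n) for i in range(n)]


def all_pairs(n):
    """Compare every unordered pair once: n * (n - 1) / 2 comparisons."""
    return list(combinations(range(n), 2))


print(circular_pairs(4))  # [(0, 1), (1, 2), (2, 3), (3, 0)]
print(all_pairs(4))       # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```

Under this reading, circular grows linearly with the cohort size while all-pairs grows quadratically, so the choice trades judge calls against how many direct comparisons each response receives.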

Architecture

Rollout collection
    └── For each row: POST to agent /run  (unchanged; no strategy or buffering)

Agent (e.g. simple_agent)
    └── /run: generate response → POST to resources server /verify with (params, response, optional principle)

GenRM Compare Resource Server
    ├── /verify (per-rollout)
    │   ├── num_rollouts_per_prompt <= 1 → return default_score
    │   └── num_rollouts_per_prompt > 1:
    │       ├── Buffer by prompt_key (input + principle)
    │       ├── When cohort size == num_rollouts_per_prompt:
    │       │   ├── Run pairwise comparison (GenRM model)
    │       │   ├── Aggregate scores (tiebreaker, length bonuses)
    │       │   └── Resolve all N pending verify callers with their rewards
    │       └── Return this rollout’s reward
    └── /compare (batch)
        └── Compare N response_objs; return rewards + metrics (for scripts/tests)
  • Config: genrm_compare config includes num_rollouts_per_prompt, genrm_model_server (name genrm_model), and comparison/aggregation options. No comparison_strategy in global config for rollout.
  • Data: For RLHF, provide num_rollouts_per_prompt rows per prompt (e.g. via num_repeats when loading data).
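
To make the "buffer by prompt_key (input + principle)" and "num_rollouts_per_prompt rows per prompt" points concrete, here is a small sketch. Both helpers are hypothetical: the actual key derivation lives in the server's utils and may differ, and `repeat_rows` only imitates what a `num_repeats` option would do at data-loading time.

```python
import hashlib
import json


def prompt_key(conversation, principle=None):
    """Hypothetical key: stable hash of the input messages plus optional principle."""
    payload = json.dumps(
        {"input": conversation, "principle": principle}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def repeat_rows(rows, num_repeats):
    """Emit each row num_repeats times so that every prompt fills one cohort."""
    return [dict(row) for row in rows for _ in range(num_repeats)]


conversation = [{"role": "user", "content": "What is SKILL?"}]
rows = repeat_rows([{"input": conversation}], num_repeats=4)
keys = {prompt_key(row["input"]) for row in rows}
print(len(rows), len(keys))  # 4 1
```

All four repeated rows map to the same key, so they land in the same cohort; the same prompt with a different principle would get a different key and therefore a separate cohort.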

Testing

curl -s -X POST http://127.0.0.1:17795/compare \
  -H "Content-Type: application/json" \
  -d '{
    "conversation_history": [{"role": "user", "content": "What is SKILL?"}],
    "response_objs": [
      {"output": [{"type": "message", "content": [{"type": "output_text", "text": "SKILL is a verb meaning to kill."}]}]},
      {"output": [{"type": "message", "content": [{"type": "output_text", "text": "Skill refers to the ability to perform a task well."}]}]}
    ]
  }' | jq .

GenRM returns a response with reasoning and a final message containing JSON scores. The /compare endpoint parses these and returns rewards, per-pair comparison results, and metrics, e.g.:

{
  "rewards": [
    1.025,
    4.475
  ],
  "comparison_results": [
    {
      "response_i": 0,
      "response_j": 1,
      "judge_idx": 0,
      "score_1": 1.0,
      "score_2": 5.0,
      "ranking": 6.0
    },
    {
      "response_i": 1,
      "response_j": 0,
      "judge_idx": 0,
      "score_1": 4.0,
      "score_2": 1.0,
      "ranking": 1.0
    }
  ],
  "metrics": {
    "mean_individual_score": 2.75,
    "std_individual_score": 1.7853571071357126,
    "tiebreak_usage_rate": 0.0
  }
}
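
The metrics in the sample output can be checked by hand. The sketch below assumes that score_1 belongs to response_i and score_2 to response_j (an assumption, though it is consistent with the reported metrics) and uses the population standard deviation. `base_scores` is a hypothetical helper, not a function from this PR.

```python
from statistics import mean, pstdev

# Scores copied from the sample output above.
comparison_results = [
    {"response_i": 0, "response_j": 1, "score_1": 1.0, "score_2": 5.0},
    {"response_i": 1, "response_j": 0, "score_1": 4.0, "score_2": 1.0},
]


def base_scores(results, num_responses):
    """Average every score a response received across its comparisons."""
    per_response = [[] for _ in range(num_responses)]
    for result in results:
        per_response[result["response_i"]].append(result["score_1"])
        per_response[result["response_j"]].append(result["score_2"])
    return [mean(scores) for scores in per_response]


all_scores = [s for r in comparison_results for s in (r["score_1"], r["score_2"])]

print(base_scores(comparison_results, 2))  # [1.0, 4.5]
print(mean(all_scores))                    # 2.75
print(round(pstdev(all_scores), 4))        # 1.7854
```

The averaged scores (1.0 and 4.5) differ from the reported rewards (1.025 and 4.475) by 0.025 each. That gap presumably comes from the aggregation adjustments listed under Key features, such as the length-based bonuses; the exact formula is not shown in this description.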

Unit tests cover genrm_compare (verify stub when N≤1, cohort logic, compare), utils (prompt key, parsing, aggregation), and comparison_strategies (batch client and helpers).

copy-pr-bot bot commented Feb 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ffrujeri ffrujeri changed the title feat: Adds GenRM pairwise comparison resource servert to support RLHF training workflows. feat: Adds GenRM pairwise comparison resource server to support RLHF training workflows. Feb 13, 2026
@ffrujeri ffrujeri force-pushed the ffrujeri/genrm-compare-rs branch from 43da6c4 to 85f39dd Compare February 14, 2026 23:48
@ffrujeri ffrujeri force-pushed the ffrujeri/genrm-model branch from c22967f to dd172a0 Compare February 18, 2026 02:33
@bxyu-nvidia bxyu-nvidia linked an issue Feb 18, 2026 that may be closed by this pull request
@ffrujeri ffrujeri force-pushed the ffrujeri/genrm-model branch from dd172a0 to 7458914 Compare February 18, 2026 16:58
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
@ffrujeri ffrujeri force-pushed the ffrujeri/genrm-compare-rs branch from 1d043b6 to 3000055 Compare February 18, 2026 16:59
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
…e server.

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
@ffrujeri ffrujeri marked this pull request as ready for review February 18, 2026 21:43
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Development

Successfully merging this pull request may close these issues.

feat: Reward model support
