feat: Adds GenRM pairwise comparison resource server to support RLHF training workflows. by ffrujeri · Pull Request #679 · NVIDIA-NeMo/Gym

ffrujeri · 2026-02-12T00:42:01Z

GenRM Compare Resource Server & Cohort-Based Verify

What does this PR do?

This PR adds a production-ready Resource Server for comparing multiple candidate responses using GenRM models, and moves RLHF-specific reward logic (cohort buffering and comparison) into that server so that rollout collection and consumer libraries (e.g. nemo-rl) stay generic.

Issues

Related to PR add genrm rlhf #523 (reference).
Part of feat: Reward model support #516.

Summary

In RLHF, a rollout's reward is relative to the other rollouts for the same task (e.g. the same prompt) rather than independent of them. This PR addresses that by:

Cohort-based verify: The genrm_compare server’s /verify endpoint buffers rollouts by prompt (and optional principle). When num_rollouts_per_prompt rollouts have been received for a prompt, it runs pairwise comparison, aggregates scores, and returns the appropriate reward to each of the N callers. Callers naturally “wait” until their cohort is complete via the async verify flow.
No RLHF hacks in Gym or NeMo RL: Rollout collection stays a simple “post each row to agent /run”. The agent calls the resources server’s /verify with the response; genrm_compare owns all buffering and comparison. No comparison strategy or prompt buffering in rollout_collection.py.
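
To illustrate the cohort-based verify flow described above, here is a minimal sketch of per-prompt buffering with asyncio futures. It is not the code in this PR: CohortBuffer, score_fn, and length_score are hypothetical names, and the scoring function is a stand-in for the GenRM pairwise comparison.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, Dict, List, Tuple

# Hypothetical type: scores a complete cohort of responses in one call.
ScoreFn = Callable[[List[str]], Awaitable[List[float]]]


class CohortBuffer:
    """Buffers verify calls per prompt key until a cohort is complete."""

    def __init__(self, cohort_size: int, score_fn: ScoreFn) -> None:
        self.cohort_size = cohort_size
        self.score_fn = score_fn
        # prompt_key -> list of (response, future that will hold its reward)
        self._pending: Dict[str, List[Tuple[str, asyncio.Future]]] = defaultdict(list)
        self._lock = asyncio.Lock()

    async def verify(self, prompt_key: str, response: str) -> float:
        future: asyncio.Future = asyncio.get_running_loop().create_future()
        cohort = None
        async with self._lock:
            self._pending[prompt_key].append((response, future))
            if len(self._pending[prompt_key]) == self.cohort_size:
                # The caller that completes the cohort takes ownership of it.
                cohort = self._pending.pop(prompt_key)
        if cohort is not None:
            try:
                rewards = await self.score_fn([resp for resp, _ in cohort])
            except Exception as exc:
                # Propagate the failure to every waiting caller.
                for _, fut in cohort:
                    if not fut.done():
                        fut.set_exception(exc)
            else:
                for (_, fut), reward in zip(cohort, rewards):
                    fut.set_result(reward)
        # Every caller, including the last one, waits on its own future.
        return await future


async def length_score(responses: List[str]) -> List[float]:
    # Stand-in for the GenRM comparison: longer responses score higher.
    return [float(len(r)) for r in responses]


async def demo() -> List[float]:
    buf = CohortBuffer(cohort_size=3, score_fn=length_score)
    return await asyncio.gather(
        buf.verify("p1", "a"),
        buf.verify("p1", "bbb"),
        buf.verify("p1", "cc"),
    )


demo_rewards = asyncio.run(demo())
```

In this sketch the first two callers simply await their futures, and the third caller triggers scoring and resolves all three, which is one way the "wait until the cohort is complete" behaviour can fall out of an async handler.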

Key features

Cohort-based verify: Configurable num_rollouts_per_prompt; verify buffers by prompt (and principle), runs comparison when cohort is full, distributes rewards.
Batch /compare API: Direct comparison of N response_objs (e.g. for scripts or tests).
Pairwise comparison: Circular and all-pairs strategies; tiebreaker and length-based bonuses; optional principle-based judging.
GenRM model alignment: Config aligned with genrm_model (server name genrm_model; custom roles response_1, response_2, principle).
Clean boundaries: Zero GenRM-specific code in rollout collection or config types; all RLHF logic in genrm_compare.
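
The PR names circular and all-pairs strategies without defining them here. The sketch below shows one common reading, assuming that circular pairs each response with its successor (wrapping around) and all-pairs enumerates every unordered pair; the function names are hypothetical and the actual comparison_strategies module may differ.

```python
from itertools import combinations
from typing import List, Tuple


def circular_pairs(n: int) -> List[Tuple[int, int]]:
    """Pair response i with response (i + 1) % n: n comparisons in total."""
    if n < 2:
        return []
    return [(i, (i + 1) % n) for i in range(n)]


def all_pairs(n: int) -> List[Tuple[int, int]]:
    """Every unordered pair (i, j) with i < j: n * (n - 1) / 2 comparisons."""
    return list(combinations(range(n), 2))
```

Under this reading, circular_pairs(2) yields (0, 1) and (1, 0), which matches the two comparison_results entries in the Testing example, while all-pairs grows quadratically with the cohort size.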

Architecture

Rollout collection
    └── For each row: POST to agent /run  (unchanged; no strategy or buffering)

Agent (e.g. simple_agent)
    └── /run: generate response → POST to resources server /verify with (params, response, optional principle)

GenRM Compare Resource Server
    ├── /verify (per-rollout)
    │   ├── num_rollouts_per_prompt <= 1 → return default_score
    │   └── num_rollouts_per_prompt > 1:
    │       ├── Buffer by prompt_key (input + principle)
    │       ├── When cohort size == num_rollouts_per_prompt:
    │       │   ├── Run pairwise comparison (GenRM model)
    │       │   ├── Aggregate scores (tiebreaker, length bonuses)
    │       │   └── Resolve all N pending verify callers with their rewards
    │       └── Return this rollout’s reward
    └── /compare (batch)
        └── Compare N response_objs; return rewards + metrics (for scripts/tests)

Config: genrm_compare config includes num_rollouts_per_prompt, genrm_model_server (name genrm_model), and comparison/aggregation options. No comparison_strategy in global config for rollout.
Data: For RLHF, provide num_rollouts_per_prompt rows per prompt (e.g. via num_repeats when loading data).
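
The architecture above buffers by a prompt_key built from the input and the optional principle. As a sketch of how such a key could be derived (an assumption for illustration; the helper in this PR's utils may use a different scheme), one option is to hash a canonical JSON encoding of both fields:

```python
import hashlib
import json
from typing import Any, List, Mapping, Optional


def make_prompt_key(
    conversation: List[Mapping[str, Any]],
    principle: Optional[str] = None,
) -> str:
    """Hypothetical helper: stable key for (input, principle)."""
    # sort_keys makes the encoding independent of dict key order.
    payload = json.dumps(
        {"input": conversation, "principle": principle},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With a key like this, the num_rollouts_per_prompt repeated rows for one prompt land in the same cohort, and the same prompt judged under a different principle lands in a separate one.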

Testing

curl -s -X POST http://127.0.0.1:17795/compare \
  -H "Content-Type: application/json" \
  -d '{
    "conversation_history": [{"role": "user", "content": "What is SKILL?"}],
    "response_objs": [
      {"output": [{"type": "message", "content": [{"type": "output_text", "text": "SKILL is a verb meaning to kill."}]}]},
      {"output": [{"type": "message", "content": [{"type": "output_text", "text": "Skill refers to the ability to perform a task well."}]}]}
    ]
  }' | jq .

GenRM returns a response with reasoning and a final message containing JSON scores. The /compare call above returns rewards, per-pair comparison results, and metrics, e.g.:

{
  "rewards": [
    1.025,
    4.475
  ],
  "comparison_results": [
    {
      "response_i": 0,
      "response_j": 1,
      "judge_idx": 0,
      "score_1": 1.0,
      "score_2": 5.0,
      "ranking": 6.0
    },
    {
      "response_i": 1,
      "response_j": 0,
      "judge_idx": 0,
      "score_1": 4.0,
      "score_2": 1.0,
      "ranking": 1.0
    }
  ],
  "metrics": {
    "mean_individual_score": 2.75,
    "std_individual_score": 1.7853571071357126,
    "tiebreak_usage_rate": 0.0
  }
}
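
As a sanity check on the example response, the metrics can be recomputed from comparison_results. This sketch assumes that score_1 belongs to response_i and score_2 to response_j, which the example does not state explicitly; the reported standard deviation matches the population (not sample) formula.

```python
import statistics
from collections import defaultdict

# comparison_results copied from the example response above.
comparison_results = [
    {"response_i": 0, "response_j": 1, "score_1": 1.0, "score_2": 5.0},
    {"response_i": 1, "response_j": 0, "score_1": 4.0, "score_2": 1.0},
]

all_scores = []
scores_by_response = defaultdict(list)
for result in comparison_results:
    # Assumption: score_1 belongs to response_i and score_2 to response_j.
    for index_key, score_key in (("response_i", "score_1"), ("response_j", "score_2")):
        score = result[score_key]
        all_scores.append(score)
        scores_by_response[result[index_key]].append(score)

mean_individual_score = statistics.fmean(all_scores)
std_individual_score = statistics.pstdev(all_scores)  # population std
mean_by_response = {
    i: statistics.fmean(s) for i, s in sorted(scores_by_response.items())
}

print(mean_individual_score)  # 2.75
print(mean_by_response)       # {0: 1.0, 1: 4.5}
```

The per-response means (1.0 and 4.5) differ from the reported rewards (1.025 and 4.475) by 0.025 each. That gap is presumably the length-based adjustment listed under Key features, though the example does not say so.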

Unit tests cover genrm_compare (verify stub when N≤1, cohort logic, compare), utils (prompt key, parsing, aggregation), and comparison_strategies (batch client and helpers).

copy-pr-bot · 2026-02-12T00:42:05Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

ffrujeri changed the title ~~feat: Adds GenRM pairwise comparison resource servert to support RLHF training workflows.~~ feat: Adds GenRM pairwise comparison resource server to support RLHF training workflows. Feb 13, 2026

ffrujeri force-pushed the ffrujeri/genrm-compare-rs branch from 43da6c4 to 85f39dd Compare February 14, 2026 23:48

ffrujeri force-pushed the ffrujeri/genrm-model branch from c22967f to dd172a0 Compare February 18, 2026 02:33

bxyu-nvidia linked an issue Feb 18, 2026 that may be closed by this pull request

feat: Reward model support #516

Open

ffrujeri force-pushed the ffrujeri/genrm-model branch from dd172a0 to 7458914 Compare February 18, 2026 16:58

ffrujeri added 4 commits February 18, 2026 16:59

Add genrm_compare resource server.

e237462

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

Add comparison_strategies.py module as part of the resources server.

4782822

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

Add RLHF to config_type.py.

cf16fbb

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

Refactor genrm_compare to use local_genrm.

3000055

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

ffrujeri force-pushed the ffrujeri/genrm-compare-rs branch from 1d043b6 to 3000055 Compare February 18, 2026 16:59

ffrujeri added 2 commits February 18, 2026 17:19

Update genrm_compare for the latest genrm_model contract.

4f12f84

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

Refactor cohort-based verify logic to be in the genrm_compare resource server.

6a6079a

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

ffrujeri marked this pull request as ready for review February 18, 2026 21:43

ffrujeri added 4 commits February 19, 2026 18:51

Fix linting issues.

acfeeb7

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

Fix unit tests.

a017594

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

Add two more examples for data validation.

5993cc4

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

Update example_data.

ec2a508

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

ffrujeri wants to merge 10 commits into ffrujeri/genrm-model from ffrujeri/genrm-compare-rs
