feat: Adds GenRM pairwise comparison resource server to support RLHF training workflows (#679)
ffrujeri wants to merge 10 commits into ffrujeri/genrm-model
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
GenRM Compare Resource Server & Cohort-Based Verify
What does this PR do?
This PR adds a production-ready Resource Server for comparing multiple candidate responses using GenRM models, and moves RLHF-specific reward logic (cohort buffering and comparison) into that server so that rollout collection and consumer libraries (e.g. nemo-rl) stay generic.
Issues
Summary
In RLHF, rewards are relative to other rollouts for the same task (e.g. same prompt), not independent. This PR addresses that by:
- The `/verify` endpoint buffers rollouts by prompt (and optional principle). When `num_rollouts_per_prompt` rollouts have been received for a prompt, it runs pairwise comparison, aggregates scores, and returns the appropriate reward to each of the N callers. Callers naturally "wait" until their cohort is complete via the async verify flow.
- Rollout collection only calls the agent's `/run`. The agent calls the resource server's `/verify` with the response; genrm_compare owns all buffering and comparison. No comparison strategy or prompt buffering in `rollout_collection.py`.

Key features
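The cohort-based verify flow can be sketched with an in-memory buffer and asyncio futures. This is a minimal illustration, not the PR's implementation; `CohortBuffer` and the toy `compare_fn` are hypothetical names:

```python
import asyncio
from collections import defaultdict

class CohortBuffer:
    """Toy sketch of cohort-based /verify: callers for the same prompt
    await a future that resolves once the cohort is complete."""

    def __init__(self, num_rollouts_per_prompt, compare_fn):
        self.n = num_rollouts_per_prompt
        self.compare_fn = compare_fn          # list[str] -> list[float]
        self.pending = defaultdict(list)      # prompt key -> [(response, future)]

    async def verify(self, prompt_key, response):
        fut = asyncio.get_running_loop().create_future()
        cohort = self.pending[prompt_key]
        cohort.append((response, fut))
        if len(cohort) == self.n:             # cohort full: compare and fan out
            del self.pending[prompt_key]
            rewards = self.compare_fn([r for r, _ in cohort])
            for (_, f), reward in zip(cohort, rewards):
                f.set_result(reward)
        return await fut                      # earlier callers wait here

async def main():
    # Toy compare_fn: reward is just the response length
    buf = CohortBuffer(2, lambda rs: [float(len(r)) for r in rs])
    return await asyncio.gather(
        buf.verify("p1", "short"),
        buf.verify("p1", "a longer response"),
    )

print(asyncio.run(main()))  # -> [5.0, 17.0]
```

The key property this sketch shares with the description above is that each caller awaits only its own future, so N callers for the same prompt block until the Nth rollout triggers comparison.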
- Cohort-based verify: configured via `num_rollouts_per_prompt`; verify buffers by prompt (and principle), runs comparison when the cohort is full, and distributes rewards.
- `/compare` API: direct comparison of N `response_objs` (e.g. for scripts or tests).
- GenRM model integration: `genrm_model` (server name `genrm_model`; custom roles `response_1`, `response_2`, `principle`).

Architecture
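On the data side, forming full cohorts requires `num_rollouts_per_prompt` rows per prompt. A sketch of the kind of repeat helper implied by `num_repeats` (the function name here is illustrative, not the PR's API):

```python
def repeat_prompts(rows, num_repeats):
    """Repeat each prompt row so every prompt yields num_repeats rollouts.
    Illustrative helper; the PR's actual data loading may differ."""
    return [dict(row) for row in rows for _ in range(num_repeats)]

rows = [{"prompt": "p1"}, {"prompt": "p2"}]
print(len(repeat_prompts(rows, 3)))  # -> 6
```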
- Config: `num_rollouts_per_prompt`, `genrm_model_server` (name `genrm_model`), and comparison/aggregation options. No `comparison_strategy` in the global rollout config.
- Data: the dataset provides `num_rollouts_per_prompt` rows per prompt (e.g. via `num_repeats` when loading data).

Testing
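Since GenRM replies mix free-form reasoning with a final message containing JSON scores, a tolerant extraction could grab the last JSON object in the text. This is a sketch under assumptions: the `score_1`/`score_2` field names follow the example output in this section, but the parsing strategy itself is hypothetical:

```python
import json
import re

def parse_genrm_scores(text):
    """Pull the last {...} JSON object out of a GenRM reply and return the
    two judge scores. Sketch only; real parsing may be stricter."""
    matches = re.findall(r"\{[^{}]*\}", text)
    if not matches:
        raise ValueError("no JSON object found in GenRM output")
    obj = json.loads(matches[-1])
    return obj["score_1"], obj["score_2"]

reply = 'Response 2 is clearer and more complete. {"score_1": 1.0, "score_2": 5.0}'
print(parse_genrm_scores(reply))  # -> (1.0, 5.0)
```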
GenRM returns a response with reasoning and a final message containing JSON scores, e.g.:
```json
{
  "rewards": [1.025, 4.475],
  "comparison_results": [
    {"response_i": 0, "response_j": 1, "judge_idx": 0, "score_1": 1.0, "score_2": 5.0, "ranking": 6.0},
    {"response_i": 1, "response_j": 0, "judge_idx": 0, "score_1": 4.0, "score_2": 1.0, "ranking": 1.0}
  ],
  "metrics": {
    "mean_individual_score": 2.75,
    "std_individual_score": 1.7853571071357126,
    "tiebreak_usage_rate": 0.0
  }
}
```

Unit tests cover genrm_compare (verify stub when N≤1, cohort logic, compare), utils (prompt key, parsing, aggregation), and comparison_strategies (batch client and helpers).
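To illustrate how pairwise results like `comparison_results` above could reduce to per-response rewards, here is a mean-of-received-scores scheme. This is one plausible aggregation, not necessarily the PR's (which evidently also tracks `ranking` and tiebreaks):

```python
from itertools import permutations
from statistics import mean

def aggregate_pairwise(scores_fn, n_responses):
    """Judge every ordered pair (i, j); each response's reward is the mean
    of the scores it received across all pairings. Illustrative only."""
    received = {i: [] for i in range(n_responses)}
    results = []
    for i, j in permutations(range(n_responses), 2):
        s_i, s_j = scores_fn(i, j)   # judge scores for positions 1 and 2
        received[i].append(s_i)
        received[j].append(s_j)
        results.append({"response_i": i, "response_j": j,
                        "score_1": s_i, "score_2": s_j})
    return {"rewards": [mean(received[i]) for i in range(n_responses)],
            "comparison_results": results}

# Toy judge that always prefers the lower-indexed response
out = aggregate_pairwise(lambda i, j: (5.0, 1.0) if i < j else (1.0, 5.0), 2)
print(out["rewards"])  # -> [5.0, 1.0]
```

Judging both orderings (i, j) and (j, i), as in the example output, helps cancel the position bias of a pairwise judge.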