
[MISC] Speed up rigid constraint solver init.#2521

Merged
duburcqa merged 13 commits into Genesis-Embodied-AI:main from
erizmr:mingrui/260309/solver_opt_func_init
Apr 9, 2026

Conversation

@erizmr
Contributor

@erizmr erizmr commented Mar 9, 2026

No description provided.

@erizmr erizmr marked this pull request as draft March 9, 2026 21:22
@erizmr erizmr marked this pull request as ready for review March 10, 2026 00:17
@erizmr erizmr force-pushed the mingrui/260309/solver_opt_func_init branch from 961f9c3 to 8ae30ab Compare March 11, 2026 16:47
duburcqa
duburcqa previously approved these changes Mar 12, 2026
@hughperkins
Collaborator

This is missing a benchmark comparison, right?

@duburcqa duburcqa self-requested a review March 13, 2026 12:56
@duburcqa duburcqa dismissed their stale review March 13, 2026 12:58

Waiting for more information about the performance benchmark.

@github-actions

🔴 Benchmark Regression Detected ➡️ Report

@hughperkins hughperkins marked this pull request as draft March 19, 2026 17:45
@erizmr erizmr force-pushed the mingrui/260309/solver_opt_func_init branch from 8c9b368 to 28dc6fc Compare March 28, 2026 17:36
erizmr and others added 4 commits March 31, 2026 17:28
Convert func_solve_init from a plain @qd.kernel to a @qd.perf_dispatch,
and register func_solve_init_decomposed for the CUDA backend. This breaks
the monolithic init into 8 separate kernel launches:
1. _kernel_init_warmstart (warmstart selection, ndrange dofs)
2. _kernel_init_Ma (Ma = M @ qacc, ndrange dofs)
3. _kernel_init_Jaref (Jaref = -aref + J @ qacc, ndrange constraints)
4. _kernel_init_improved (set improved flags)
5. _kernel_init_update_constraint (wraps monolith for FP match)
6. Newton Hessian (conditional, reuses existing kernel)
7. _kernel_init_update_gradient (wraps monolith tiled gradient)
8. _kernel_init_search (search = -Mgrad, ndrange dofs)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
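The quantities computed by the first three launches can be sketched in plain NumPy. This is only an illustration of the math for a single env (shapes, variable names, and the single-env layout are assumptions for the sketch, not the Genesis or quadrants API; the real kernels batch over envs):

```python
import numpy as np

rng = np.random.default_rng(0)
n_dofs, n_con = 6, 4

# Illustrative single-env data (hypothetical sizes).
M = rng.standard_normal((n_dofs, n_dofs))    # mass matrix
J = rng.standard_normal((n_con, n_dofs))     # constraint Jacobian
aref = rng.standard_normal(n_con)            # reference accelerations
qacc_ws = rng.standard_normal(n_dofs)        # warmstart accelerations
acc_smooth = rng.standard_normal(n_dofs)     # unconstrained accelerations
use_warmstart = True

# Launch 1: warmstart selection.
qacc = qacc_ws if use_warmstart else acc_smooth
# Launch 2: Ma = M @ qacc.
Ma = M @ qacc
# Launch 3: Jaref = -aref + J @ qacc.
Jaref = -aref + J @ qacc

assert Ma.shape == (n_dofs,) and Jaref.shape == (n_con,)
```

Each launch writes one independent array, which is what makes it possible to split them into separate kernels with different ndranges.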
@erizmr erizmr force-pushed the mingrui/260309/solver_opt_func_init branch from 7a527cc to 889024d Compare March 31, 2026 21:24
@github-actions

github-actions Bot commented Apr 1, 2026

🔴 Benchmark Regression Detected ➡️ Report

@github-actions

github-actions Bot commented Apr 2, 2026

⚠️ Abnormal Benchmark Result Detected ➡️ Report

@hughperkins
Collaborator

+9.1% for dex_hand field 🙌

@erizmr erizmr marked this pull request as ready for review April 2, 2026 03:51

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a4f2a89feb


Comment thread genesis/engine/solvers/rigid/constraint/solver.py Outdated
@duburcqa
Collaborator

duburcqa commented Apr 2, 2026

Please explain why this PR is supposed to be useful. The original implementation is basically the same but with one single monolithic kernel. This modification significantly increases the maintenance burden and code duplication, so it has to be extremely well motivated beyond pure performance benchmarks; otherwise it is not going to be merged.

@hughperkins
Collaborator

Please explain why this PR is supposed to be useful. The original implementation is basically the same but with one single monolithic kernel. This modification significantly increases the maintenance burden and code duplication, so it has to be extremely well motivated beyond pure performance benchmarks; otherwise it is not going to be merged.

This PR increases dex_hand FPS by 9.1%, which boosts our delta with mjwarp from +2% to +11%, nearly a 6-fold increase.

You can argue details about how we achieve the contents of this PR, but I disagree strongly with refusing to merge a PR that improves performance to such an extent on the basis that "a performance improvement is not a sufficient reason to merge a PR".

@duburcqa
Collaborator

duburcqa commented Apr 2, 2026

I disagree strongly with refusing to merge a PR that improves the performance to such an extent on the basis of "performance improvement is not a sufficient reason to merge a PR".

This has always been my policy. I value maintainability and long-term improvement far more than a marginal speedup. Here we are not talking about a +50% speedup globally, but +9% on a single benchmark. Fragmenting the codebase creates significant technical debt, and it would be nice to think about a longer-term approach that reconciles the CPU and GPU kernel implementations rather than letting them diverge indefinitely. This is not sustainable. Typically, I have already made it clear that I consider it beneficial to remove a specialisation that only brings a 5% speedup. This is basically how hibernation went dead, along with differentiable simulation. Improving performance is important, but it must be done in a way that is consistent with our long-term goal of building a high-quality, low-maintenance codebase given the limited manpower we have.

@hughperkins
Collaborator

hughperkins commented Apr 2, 2026

@erizmr please factorize this PR so that common code goes into a @qd.func shared between the two approaches.

You can see that I did the same thing for the solver body decomp PR.

Concretely, for example:

    """Select qacc from warmstart or acc_smooth, parallelized over (dof, env)."""
    n_dofs = dofs_state.acc_smooth.shape[0]
    _B = dofs_state.acc_smooth.shape[1]

    qd.loop_config(serialize=static_rigid_sim_config.para_level < gs.PARA_LEVEL.ALL)
    for i_d, i_b in qd.ndrange(n_dofs, _B):
        if constraint_state.n_constraints[i_b] > 0 and constraint_state.is_warmstart[i_b]:
            constraint_state.qacc[i_d, i_b] = constraint_state.qacc_ws[i_d, i_b]
        else:
            constraint_state.qacc[i_d, i_b] = dofs_state.acc_smooth[i_d, i_b]

... should all be inside a qd.func, shared between both approaches.

Similar for:

    """Compute Jaref = -aref + J @ qacc, parallelized over (constraint, env)."""
    len_constraints = constraint_state.Jaref.shape[0]
    n_dofs = constraint_state.jac.shape[1]
    _B = constraint_state.grad.shape[1]

    qd.loop_config(serialize=static_rigid_sim_config.para_level < gs.PARA_LEVEL.ALL)
    for i_c, i_b in qd.ndrange(len_constraints, _B):
        if i_c < constraint_state.n_constraints[i_b]:
            Jaref = -constraint_state.aref[i_c, i_b]
            if qd.static(static_rigid_sim_config.sparse_solve):
                for i_d_ in range(constraint_state.jac_n_relevant_dofs[i_c, i_b]):
                    i_d = constraint_state.jac_relevant_dofs[i_c, i_d_, i_b]
                    Jaref += constraint_state.jac[i_c, i_d, i_b] * constraint_state.qacc[i_d, i_b]
            else:
                for i_d in range(n_dofs):
                    Jaref += constraint_state.jac[i_c, i_d, i_b] * constraint_state.qacc[i_d, i_b]
            constraint_state.Jaref[i_c, i_b] = Jaref

(You can call a qd.func from a qd.kernel. See how we did this for solver body).
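The sparse and dense branches in the Jaref snippet above compute the same values, which is why factoring them into one shared function is safe. A NumPy sketch of the equivalence (the sparsity pattern and array names here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_con, n_dofs = 3, 8

# Sparse structure: each constraint touches only a few dofs.
relevant = [np.array([0, 2, 5]), np.array([1, 4]), np.array([3, 6, 7])]
jac = np.zeros((n_con, n_dofs))
for i_c, idx in enumerate(relevant):
    jac[i_c, idx] = rng.standard_normal(idx.size)

aref = rng.standard_normal(n_con)
qacc = rng.standard_normal(n_dofs)

# Dense branch: iterate over every dof.
Jaref_dense = -aref + jac @ qacc

# Sparse branch: iterate only over the relevant dofs per constraint.
Jaref_sparse = np.array(
    [-aref[i_c] + jac[i_c, idx] @ qacc[idx] for i_c, idx in enumerate(relevant)]
)

assert np.allclose(Jaref_dense, Jaref_sparse)
```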

@duburcqa please consider making constructive suggestions on how to make the PR mergeable, rather than just announcing you won't merge it.

@duburcqa
Collaborator

duburcqa commented Apr 2, 2026

@duburcqa please consider making constructive suggestions on how to make the PR mergeable, rather than just announcing you won't merge it.

Let's first determine whether it makes sense to merge this at all. You have not answered my one and only question: "Please explain why this PR is supposed to be useful. The original implementation is basically the same but with one single monolithic kernel." It was not rhetorical; it is a real question for which I'm expecting an answer. Then, based on that, we can determine how to move forward.

@hughperkins
Collaborator

@duburcqa please consider making constructive suggestions on how to make the PR mergeable, rather than just announcing you won't merge it.

Let's first determine whether it makes sense to merge this at all. You have not answered my one and only question: "Please explain why this PR is supposed to be useful. The original implementation is basically the same but with one single monolithic kernel." It was not rhetorical; it is a real question for which I'm expecting an answer. Then, based on that, we can determine how to move forward.

It is useful because it improves FPS. This is our goal in optimizing kernels.

As stated, we can debate how to make this PR acceptably clean and deduplicated, but I do not agree with the argument that an FPS improvement is not a sufficient reason.

@hughperkins
Collaborator

Basically, if I understand correctly, we are applying the same process to solver init that we did on solver body: adding a decomposed version alongside the monolithic version.

I too would like to reduce code duplication. I already found this PR hard to review because it copies and pastes tons of code.

We don't disagree on the code duplication observation.

@hughperkins
Collaborator

By the way, unlike solver body, we are using a heuristic to choose between decomp and mono, because

  1. such a heuristic appears to successfully exclude all regressions
  2. it avoids perf-dispatch pain

It's still the same architecture though, i.e. monolith vs decomp.

@duburcqa
Collaborator

duburcqa commented Apr 2, 2026

It's still the same architecture though, i.e. monolith vs decomp.

The reason for monolith vs decomposition was clear for the solver body, but it is not clear for init. We should already be able to fully leverage parallelisation, since all qd.func calls are top-level offloaded tasks. That is why I don't understand; I must be missing something.

@duburcqa
Copy link
Copy Markdown
Collaborator

duburcqa commented Apr 2, 2026

It is useful because it improves FPS. This is our goal in optimizing kernels.

I'm not asking about the motivation for this PR, but the motivation for this specific implementation. From my understanding, it should be possible to just improve the existing implementation without adding a new one.

@hughperkins
Collaborator

It's still the same architecture though, i.e. monolith vs decomp.

The reason for monolith vs decomposition was clear for the solver body, but it is not clear for init. We should already be able to fully leverage parallelisation, since all qd.func calls are top-level offloaded tasks. That is why I don't understand; I must be missing something.

Ah, I was actually wondering the same thing tbh.

@duburcqa
Copy link
Copy Markdown
Collaborator

duburcqa commented Apr 2, 2026

Ah, I was actually wondering the same thing tbh.

I see. I can help to figure this out if needed.

@erizmr
Contributor Author

erizmr commented Apr 2, 2026

It's still the same architecture though, i.e. monolith vs decomp.

The reason for monolith vs decomposition was clear for the solver body, but it is not clear for init. We should already be able to fully leverage parallelisation, since all qd.func calls are top-level offloaded tasks. That is why I don't understand; I must be missing something.

More context on the motivation here: the monolithic func_init only parallelizes across envs, but for steps such as the Jaref initialization, each thread needs to serially traverse all constraints, which could themselves be parallelized. This PR targets these opportunities.
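The parallelization opportunity described above can be sketched in NumPy: with env-only parallelism, each env-thread loops serially over constraints, whereas each (constraint, env) pair is actually independent and could be its own GPU thread. The shapes and names below are illustrative, not the actual solver state layout:

```python
import numpy as np

rng = np.random.default_rng(2)
n_con, n_dofs, n_envs = 5, 7, 3
jac = rng.standard_normal((n_con, n_dofs, n_envs))   # per-env Jacobians
aref = rng.standard_normal((n_con, n_envs))
qacc = rng.standard_normal((n_dofs, n_envs))

# Monolithic layout: one thread per env, serial loop over constraints.
Jaref_serial = np.empty((n_con, n_envs))
for i_b in range(n_envs):
    for i_c in range(n_con):
        Jaref_serial[i_c, i_b] = -aref[i_c, i_b] + jac[i_c, :, i_b] @ qacc[:, i_b]

# Decomposed layout: every (constraint, env) pair is independent, so it can
# be one GPU thread; here expressed as a single batched contraction.
Jaref_parallel = -aref + np.einsum("cdb,db->cb", jac, qacc)

assert np.allclose(Jaref_serial, Jaref_parallel)
```

Since both layouts produce identical values, the choice between them is purely about how the work is mapped onto threads.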

@duburcqa
Collaborator

duburcqa commented Apr 2, 2026

More context on the motivation here: the monolithic func_init only parallelizes across envs, but for steps such as the Jaref initialization, each thread needs to serially traverse all constraints, which could themselves be parallelized. This PR targets these opportunities.

Couldn't these opportunities be implemented in the original monolithic init kernel directly? Or am I wrong?

@erizmr
Contributor Author

erizmr commented Apr 2, 2026

More context on the motivation here: the monolithic func_init only parallelizes across envs, but for steps such as the Jaref initialization, each thread needs to serially traverse all constraints, which could themselves be parallelized. This PR targets these opportunities.

Couldn't these opportunities be implemented in the original monolithic init kernel directly? Or am I wrong?

I get your point. I think it is possible to do that in the monolith directly, after a second check. Will make an updated version :)

@hughperkins
Collaborator

More context on the motivation here: the monolithic func_init only parallelizes across envs, but for steps such as the Jaref initialization, each thread needs to serially traverse all constraints, which could themselves be parallelized. This PR targets these opportunities.

Couldn't these opportunities be implemented in the original monolithic init kernel directly? Or am I wrong?

I get your point. I think it is possible to do that in the monolith directly, after a second check. Will make an updated version :)

Just to check, because it's not actually a monolith, right? It's a single quadrants kernel, but it contains multiple GPU kernels, right?

By comparison, the solver kernel has a single top-level for loop, AFAIK?

@duburcqa
Collaborator

duburcqa commented Apr 2, 2026

Just to check, because it's not actually a monolith, right? It's a single quadrants kernel, but it contains multiple GPU kernels, right? By comparison, the solver kernel has a single top-level for loop, AFAIK?

Exactly.

duburcqa
duburcqa previously approved these changes Apr 6, 2026
Comment thread genesis/engine/solvers/rigid/rigid_solver.py Outdated
@duburcqa duburcqa self-requested a review April 6, 2026 21:07
@duburcqa duburcqa dismissed their stale review April 6, 2026 21:07

CUDA-only

@duburcqa duburcqa changed the title [MISC] Constraint solver func init optimization [MISC] Constraint solver func init optimization. Apr 8, 2026
@duburcqa duburcqa changed the title [MISC] Constraint solver func init optimization. [MISC] Speed up rigid constraint solver init. Apr 8, 2026
@duburcqa duburcqa merged commit ad7671a into Genesis-Embodied-AI:main Apr 9, 2026
24 of 26 checks passed
