
[MISC] Speed up rigid constraint solver init.#2521

Merged
duburcqa merged 13 commits into Genesis-Embodied-AI:main from
erizmr:mingrui/260309/solver_opt_func_init
Apr 9, 2026

Conversation

@erizmr
Contributor

@erizmr erizmr commented Mar 9, 2026

No description provided.

@erizmr erizmr marked this pull request as draft March 9, 2026 21:22
@erizmr erizmr marked this pull request as ready for review March 10, 2026 00:17
@erizmr erizmr force-pushed the mingrui/260309/solver_opt_func_init branch from 961f9c3 to 8ae30ab Compare March 11, 2026 16:47
duburcqa
duburcqa previously approved these changes Mar 12, 2026
@hughperkins
Collaborator

This is missing a benchmark comparison, right?

@duburcqa duburcqa self-requested a review March 13, 2026 12:56
@duburcqa duburcqa dismissed their stale review March 13, 2026 12:58

Waiting for more information about the performance benchmark.

@github-actions

🔴 Benchmark Regression Detected ➡️ Report

@hughperkins hughperkins marked this pull request as draft March 19, 2026 17:45
@erizmr erizmr force-pushed the mingrui/260309/solver_opt_func_init branch from 8c9b368 to 28dc6fc Compare March 28, 2026 17:36
erizmr and others added 4 commits March 31, 2026 17:28
Convert func_solve_init from a plain @qd.kernel to a @qd.perf_dispatch,
and register func_solve_init_decomposed for the CUDA backend. This breaks
the monolithic init into 8 separate kernel launches:
1. _kernel_init_warmstart (warmstart selection, ndrange dofs)
2. _kernel_init_Ma (Ma = M @ qacc, ndrange dofs)
3. _kernel_init_Jaref (Jaref = -aref + J @ qacc, ndrange constraints)
4. _kernel_init_improved (set improved flags)
5. _kernel_init_update_constraint (wraps monolith for FP match)
6. Newton Hessian (conditional, reuses existing kernel)
7. _kernel_init_update_gradient (wraps monolith tiled gradient)
8. _kernel_init_search (search = -Mgrad, ndrange dofs)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
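The quantities computed by the first three launches can be sketched in plain NumPy. This is only an illustration of the math for a single env (shapes, variable names, and the single-env layout are assumptions for the sketch, not the Genesis or quadrants API; the real kernels batch over envs):

```python
import numpy as np

rng = np.random.default_rng(0)
n_dofs, n_con = 6, 4

# Illustrative single-env data (hypothetical sizes).
M = rng.standard_normal((n_dofs, n_dofs))    # mass matrix
J = rng.standard_normal((n_con, n_dofs))     # constraint Jacobian
aref = rng.standard_normal(n_con)            # reference accelerations
qacc_ws = rng.standard_normal(n_dofs)        # warmstart accelerations
acc_smooth = rng.standard_normal(n_dofs)     # unconstrained accelerations
use_warmstart = True

# Launch 1: warmstart selection.
qacc = qacc_ws if use_warmstart else acc_smooth
# Launch 2: Ma = M @ qacc.
Ma = M @ qacc
# Launch 3: Jaref = -aref + J @ qacc.
Jaref = -aref + J @ qacc

assert Ma.shape == (n_dofs,) and Jaref.shape == (n_con,)
```

Each launch writes one independent array, which is what makes it possible to split them into separate kernels with different ndranges.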
@erizmr erizmr force-pushed the mingrui/260309/solver_opt_func_init branch from 7a527cc to 889024d Compare March 31, 2026 21:24
@github-actions

github-actions Bot commented Apr 1, 2026

🔴 Benchmark Regression Detected ➡️ Report

@github-actions

github-actions Bot commented Apr 2, 2026

⚠️ Abnormal Benchmark Result Detected ➡️ Report

@hughperkins
Collaborator

+9.1% for dex_hand field 🙌

@erizmr erizmr marked this pull request as ready for review April 2, 2026 03:51

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a4f2a89feb


Comment thread genesis/engine/solvers/rigid/constraint/solver.py Outdated
@duburcqa
Collaborator

duburcqa commented Apr 2, 2026

Please explain why this PR is supposed to be useful. The original implementation is basically the same but with one single monolithic kernel. This modification significantly increases the maintenance burden and code duplication, so it has to be extremely well motivated beyond pure performance benchmarks; otherwise it is not going to be merged.

@hughperkins
Collaborator

Please explain why this PR is supposed to be useful. The original implementation is basically the same but with one single monolithic kernel. This modification significantly increases the maintenance burden and code duplication, so it has to be extremely well motivated beyond pure performance benchmarks; otherwise it is not going to be merged.

This PR increases dex_hand FPS by 9.1%, which boosts our delta with mjwarp from +2% to +11%, nearly a 6-fold increase.

You can argue details about how we achieve the contents of this PR, but I disagree strongly with refusing to merge a PR that improves performance to such an extent on the basis that "a performance improvement is not a sufficient reason to merge a PR".

@duburcqa
Collaborator

duburcqa commented Apr 2, 2026

I disagree strongly with refusing to merge a PR that improves the performance to such an extent on the basis of "performance improvement is not a sufficient reason to merge a PR".

This has always been my policy. I value maintainability and long-term improvement far more than a marginal speedup. Here we are not talking about a +50% speedup globally, but +9% on a single benchmark. Fragmenting the codebase creates significant technical debt, and it would be nice to think about a longer-term approach that reconciles the CPU and GPU kernel implementations rather than letting them diverge indefinitely. This is not sustainable. Typically, I have already made it clear that I consider it beneficial to remove a specialisation that only brings a 5% speedup. This is basically how hibernation went dead, along with differentiable simulation. Improving performance is important, but it must be done in a way that is consistent with our long-term goal of building a high-quality, low-maintenance codebase given the limited manpower we have.

@hughperkins
Collaborator

hughperkins commented Apr 2, 2026

@erizmr please factorize this PR so that common code goes into a @qd.func shared between the two approaches.

You can see that I did the same thing for the solver body decomp PR.

Concretely, for example:

    """Select qacc from warmstart or acc_smooth, parallelized over (dof, env)."""
    n_dofs = dofs_state.acc_smooth.shape[0]
    _B = dofs_state.acc_smooth.shape[1]

    qd.loop_config(serialize=static_rigid_sim_config.para_level < gs.PARA_LEVEL.ALL)
    for i_d, i_b in qd.ndrange(n_dofs, _B):
        if constraint_state.n_constraints[i_b] > 0 and constraint_state.is_warmstart[i_b]:
            constraint_state.qacc[i_d, i_b] = constraint_state.qacc_ws[i_d, i_b]
        else:
            constraint_state.qacc[i_d, i_b] = dofs_state.acc_smooth[i_d, i_b]

... should all be inside a qd.func, shared between both approaches.

Similar for:

    """Compute Jaref = -aref + J @ qacc, parallelized over (constraint, env)."""
    len_constraints = constraint_state.Jaref.shape[0]
    n_dofs = constraint_state.jac.shape[1]
    _B = constraint_state.grad.shape[1]

    qd.loop_config(serialize=static_rigid_sim_config.para_level < gs.PARA_LEVEL.ALL)
    for i_c, i_b in qd.ndrange(len_constraints, _B):
        if i_c < constraint_state.n_constraints[i_b]:
            Jaref = -constraint_state.aref[i_c, i_b]
            if qd.static(static_rigid_sim_config.sparse_solve):
                for i_d_ in range(constraint_state.jac_n_relevant_dofs[i_c, i_b]):
                    i_d = constraint_state.jac_relevant_dofs[i_c, i_d_, i_b]
                    Jaref += constraint_state.jac[i_c, i_d, i_b] * constraint_state.qacc[i_d, i_b]
            else:
                for i_d in range(n_dofs):
                    Jaref += constraint_state.jac[i_c, i_d, i_b] * constraint_state.qacc[i_d, i_b]
            constraint_state.Jaref[i_c, i_b] = Jaref

(You can call a qd.func from a qd.kernel. See how we did this for solver body).
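The sparse and dense branches in the Jaref snippet above compute the same values, which is why factoring them into one shared function is safe. A NumPy sketch of the equivalence (the sparsity pattern and array names here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_con, n_dofs = 3, 8

# Sparse structure: each constraint touches only a few dofs.
relevant = [np.array([0, 2, 5]), np.array([1, 4]), np.array([3, 6, 7])]
jac = np.zeros((n_con, n_dofs))
for i_c, idx in enumerate(relevant):
    jac[i_c, idx] = rng.standard_normal(idx.size)

aref = rng.standard_normal(n_con)
qacc = rng.standard_normal(n_dofs)

# Dense branch: iterate over every dof.
Jaref_dense = -aref + jac @ qacc

# Sparse branch: iterate only over the relevant dofs per constraint.
Jaref_sparse = np.array(
    [-aref[i_c] + jac[i_c, idx] @ qacc[idx] for i_c, idx in enumerate(relevant)]
)

assert np.allclose(Jaref_dense, Jaref_sparse)
```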

@duburcqa please consider making constructive suggestions on how to make the PR mergeable, rather than just announcing you won't merge it.

@duburcqa
Collaborator

duburcqa commented Apr 2, 2026

@duburcqa please consider making constructive suggestions on how to make the PR mergeable, rather than just announcing you won't merge it.

Let's first determine whether it makes sense to merge this at all. You have not answered my one and only question: "Please explain why this PR is supposed to be useful. The original implementation is basically the same but with one single monolithic kernel." It was not rhetorical; it is a real question for which I'm expecting an answer. Then, based on that, we can determine how to move forward.

@hughperkins
Collaborator

@duburcqa please consider making constructive suggestions on how to make the PR mergeable, rather than just announcing you won't merge it.

Let's first determine whether it makes sense to merge this at all. You have not answered my one and only question: "Please explain why this PR is supposed to be useful. The original implementation is basically the same but with one single monolithic kernel." It was not rhetorical; it is a real question for which I'm expecting an answer. Then, based on that, we can determine how to move forward.

It is useful because it improves FPS. This is our goal in optimizing kernels.

As stated, we can debate how to make this PR acceptably clean and deduplicated, but I do not agree with the argument that an FPS improvement is not a sufficient reason.

@hughperkins
Collaborator

Basically, if I understand correctly, we are applying the same process to solver init that we did on solver body: adding a decomposed version alongside the monolithic version.

I too would like to reduce code duplication. I already found this PR hard to review because it copies and pastes tons of code.

We don't disagree on the code duplication observation.

@hughperkins
Collaborator

By the way, unlike solver body, we are using a heuristic to choose between decomp and mono, because

  1. such a heuristic appears to successfully exclude all regressions
  2. it avoids perf-dispatch pain

It's still the same architecture though, i.e. monolith vs decomp.

@duburcqa
Collaborator

duburcqa commented Apr 2, 2026

It's still the same architecture though, i.e. monolith vs decomp.

The reason for monolith vs decomposition was clear for the solver body, but it is not clear for init. We should already be able to fully leverage parallelisation, since all qd.func calls are top-level offloaded tasks. That is why I don't understand; I must be missing something.

@duburcqa
Copy link
Copy Markdown
Collaborator

duburcqa commented Apr 2, 2026

It is useful because it improves FPS. This is our goal in optimizing kernels.

I'm not asking about the motivation for this PR, but the motivation for this specific implementation. From my understanding, it should be possible to just improve the existing implementation without adding a new one.

@hughperkins
Collaborator

It's still the same architecture though, i.e. monolith vs decomp.

The reason for monolith vs decomposition was clear for the solver body, but it is not clear for init. We should already be able to fully leverage parallelisation, since all qd.func calls are top-level offloaded tasks. That is why I don't understand; I must be missing something.

Ah, I was actually wondering the same thing tbh.

@duburcqa
Copy link
Copy Markdown
Collaborator

duburcqa commented Apr 2, 2026

Ah, I was actually wondering the same thing tbh.

I see. I can help to figure this out if needed.

@erizmr
Contributor Author

erizmr commented Apr 2, 2026

It's still the same architecture though, i.e. monolith vs decomp.

The reason for monolith vs decomposition was clear for the solver body, but it is not clear for init. We should already be able to fully leverage parallelisation, since all qd.func calls are top-level offloaded tasks. That is why I don't understand; I must be missing something.

More context on the motivation here: the monolithic func_init only parallelizes across envs, but for steps such as the Jaref initialization, each thread needs to serially traverse all constraints, which could themselves be parallelized. This PR targets these opportunities.
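The parallelization opportunity described above can be sketched in NumPy: with env-only parallelism, each env-thread loops serially over constraints, whereas each (constraint, env) pair is actually independent and could be its own GPU thread. The shapes and names below are illustrative, not the actual solver state layout:

```python
import numpy as np

rng = np.random.default_rng(2)
n_con, n_dofs, n_envs = 5, 7, 3
jac = rng.standard_normal((n_con, n_dofs, n_envs))   # per-env Jacobians
aref = rng.standard_normal((n_con, n_envs))
qacc = rng.standard_normal((n_dofs, n_envs))

# Monolithic layout: one thread per env, serial loop over constraints.
Jaref_serial = np.empty((n_con, n_envs))
for i_b in range(n_envs):
    for i_c in range(n_con):
        Jaref_serial[i_c, i_b] = -aref[i_c, i_b] + jac[i_c, :, i_b] @ qacc[:, i_b]

# Decomposed layout: every (constraint, env) pair is independent, so it can
# be one GPU thread; here expressed as a single batched contraction.
Jaref_parallel = -aref + np.einsum("cdb,db->cb", jac, qacc)

assert np.allclose(Jaref_serial, Jaref_parallel)
```

Since both layouts produce identical values, the choice between them is purely about how the work is mapped onto threads.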

@duburcqa
Collaborator

duburcqa commented Apr 2, 2026

More context on the motivation here: the monolithic func_init only parallelizes across envs, but for steps such as the Jaref initialization, each thread needs to serially traverse all constraints, which could themselves be parallelized. This PR targets these opportunities.

Couldn't these opportunities be implemented in the original monolithic init kernel directly? Or am I wrong?

@erizmr
Contributor Author

erizmr commented Apr 2, 2026

More context on the motivation here: the monolithic func_init only parallelizes across envs, but for steps such as the Jaref initialization, each thread needs to serially traverse all constraints, which could themselves be parallelized. This PR targets these opportunities.

Couldn't these opportunities be implemented in the original monolithic init kernel directly? Or am I wrong?

I get your point. I think it is possible to do that in the monolith directly, after a second check. Will make an updated version :)

@hughperkins
Collaborator

More context on the motivation here: the monolithic func_init only parallelizes across envs, but for steps such as the Jaref initialization, each thread needs to serially traverse all constraints, which could themselves be parallelized. This PR targets these opportunities.

Couldn't these opportunities be implemented in the original monolithic init kernel directly? Or am I wrong?

I get your point. I think it is possible to do that in the monolith directly, after a second check. Will make an updated version :)

Just to check, because it's not actually a monolith, right? It's a single quadrants kernel, but it contains multiple GPU kernels, right?

By comparison, the solver kernel has a single top-level for loop, AFAIK?

@duburcqa
Collaborator

duburcqa commented Apr 2, 2026

Just to check, because it's not actually a monolith, right? It's a single quadrants kernel, but it contains multiple GPU kernels, right? By comparison, the solver kernel has a single top-level for loop, AFAIK?

Exactly.

duburcqa
duburcqa previously approved these changes Apr 6, 2026
Comment thread genesis/engine/solvers/rigid/rigid_solver.py Outdated
@duburcqa duburcqa self-requested a review April 6, 2026 21:07
@duburcqa duburcqa dismissed their stale review April 6, 2026 21:07

CUDA-only

@duburcqa duburcqa changed the title [MISC] Constraint solver func init optimization [MISC] Constraint solver func init optimization. Apr 8, 2026
@duburcqa duburcqa changed the title [MISC] Constraint solver func init optimization. [MISC] Speed up rigid constraint solver init. Apr 8, 2026
@duburcqa duburcqa merged commit ad7671a into Genesis-Embodied-AI:main Apr 9, 2026
24 of 26 checks passed
