[MISC] Speed up rigid constraint solver init.#2521
duburcqa merged 13 commits into Genesis-Embodied-AI:main
Conversation
Force-pushed from 961f9c3 to 8ae30ab
This is missing a benchmark comparison, right?
Waiting for more information about the performance benchmark.
🔴 Benchmark Regression Detected ➡️ Report
Force-pushed from 8c9b368 to 28dc6fc
Convert func_solve_init from a plain @qd.kernel to a @qd.perf_dispatch, and register func_solve_init_decomposed for the CUDA backend. This breaks the monolithic init into 8 separate kernel launches:
1. _kernel_init_warmstart (warmstart selection, ndrange dofs)
2. _kernel_init_Ma (Ma = M @ qacc, ndrange dofs)
3. _kernel_init_Jaref (Jaref = -aref + J @ qacc, ndrange constraints)
4. _kernel_init_improved (set improved flags)
5. _kernel_init_update_constraint (wraps monolith for FP match)
6. Newton hessian (conditional, reuses existing kernel)
7. _kernel_init_update_gradient (wraps monolith tiled gradient)
8. _kernel_init_search (search = -Mgrad, ndrange dofs)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
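For reference, the arithmetic performed by launches 2, 3, and 8 above can be mirrored in a small NumPy sketch. The shapes and values here are hypothetical placeholders; the `qd` kernel machinery and the actual solver state are omitted, so this only illustrates the per-launch math, not the PR's implementation:

```python
import numpy as np

# Hypothetical problem sizes, purely for illustration.
n_dofs, n_con = 4, 3
rng = np.random.default_rng(0)

M = np.eye(n_dofs)                        # mass matrix (identity here for simplicity)
qacc = rng.standard_normal(n_dofs)        # joint accelerations
J = rng.standard_normal((n_con, n_dofs))  # constraint Jacobian
aref = rng.standard_normal(n_con)         # reference acceleration
Mgrad = rng.standard_normal(n_dofs)       # preconditioned gradient

Ma = M @ qacc              # launch 2: Ma = M @ qacc        (parallel over dofs)
Jaref = -aref + J @ qacc   # launch 3: Jaref = -aref + J @ qacc (parallel over constraints)
search = -Mgrad            # launch 8: search = -Mgrad      (parallel over dofs)
```

Each line corresponds to an independent elementwise or matrix-vector operation, which is why each can become its own kernel launch with its own `ndrange`.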
Force-pushed from 7a527cc to 889024d
🔴 Benchmark Regression Detected ➡️ Report
+9.1% for the dex_hand field 🙌
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a4f2a89feb
Please explain why this PR is supposed to be useful. The original implementation is basically the same but with one single monolithic kernel. This modification significantly increases the maintenance burden and code duplication, so it has to be extremely well motivated beyond pure performance benchmarks, otherwise it is not going to be merged.
This PR increases dex_hand FPS by 9.1%, which boosts our delta with mjwarp from +2% to +11%, a six-fold increase. You can argue details about how we achieve the contents of this PR, but I disagree strongly with refusing to merge a PR that improves performance to such an extent on the basis that "performance improvement is not a sufficient reason to merge a PR".
This has always been my policy. I value maintainability and long-term improvement way more than a marginal speedup. Here we are not talking about a +50% speedup globally but +9% on a single benchmark. Fragmentation of the codebase creates significant technical debt, and it would be nice to think about a longer-term approach to reconcile the CPU and GPU kernel implementations rather than letting them diverge indefinitely. This is not sustainable. Typically, I already made it clear that I consider it beneficial to remove a specialisation that only brings a 5% speedup. This is basically how hibernation went dead, along with differentiable simulation. Improving performance is important, but it must be done in a way that is consistent with our long-term goal of building a high-quality, low-maintenance codebase given the limited manpower we have.
@erizmr please factorize this PR so that common code goes into a @qd.func shared between the two approaches. You can see that I did the same thing for the solver body decomp PR. Concretely, for example: ... should all be inside a qd.func, shared between both approaches. Similar for: (You can call a qd.func from a qd.kernel. See how we did this for solver body.) @duburcqa please consider making constructive suggestions on how to make the PR mergeable, rather than just announcing you won't merge it.
Let's determine first if it makes sense to merge this in the first place. You did not answer my one and single question: "Please explain why this PR is supposed to be useful. The original implementation is basically the same but with one single monolithic kernel." It was not rhetorical; it is a real question for which I'm expecting an answer. Then, based on this, we can determine how to move forward.
It is useful because it improves FPS. This is our goal in optimizing kernels. As stated, we can debate how to make this PR acceptably clean and deduplicated, but I do not agree with the argument that an FPS improvement is not a sufficient reason to merge.
Basically, if I understand correctly, we are applying the same process to solver init that we did on solver body: adding a decomp version alongside the monolith version. I too agree that I'd like to reduce code duplication. I already found this PR hard to review because it copies and pastes tons of code. We don't disagree on the code duplication observation.
By the way, unlike solver body, we are using a heuristic to choose between decomp and mono, because
It's still the same architecture though, ie monolith vs decomp. |
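The heuristic dispatch mentioned above could, in principle, look like the sketch below. The function name, the criterion, and the threshold are all hypothetical and purely for illustration; the PR's actual heuristic is not spelled out in this thread:

```python
def choose_init_impl(n_constraints: int, backend: str, threshold: int = 256) -> str:
    # Hypothetical heuristic: prefer the decomposed kernels on CUDA only when
    # the problem is large enough to amortize the extra launch overhead;
    # otherwise fall back to the single monolithic kernel.
    if backend == "cuda" and n_constraints >= threshold:
        return "decomposed"
    return "monolithic"
```

The trade-off it encodes is the standard one: multiple small launches pay per-launch overhead but expose more parallelism, so they only win past some problem size.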
The reason for monolith vs decomposition was clear for the solver body, but it is not clear for init. We should already be able to fully leverage parallelisation since all qd.func calls are top level offloaded tasks. That is why I don't understand. I must be missing something. |
I'm not asking about the motivation of this PR, but the motivation for this specific implementation. From my understanding it should be possible to just improve the existing implementation without adding a new one. |
Ah, I was actually wondering the same thing tbh. |
I see. I can help to figure this out if needed. |
More context about the motivation here, the monolith |
These opportunities could be implemented in the original monolith init kernel directly? Or am I wrong?
I got your point. I think it is possible to do that in monolith directly after second check. Will make an updated version :) |
Just to check, because it's not actually a monolith, right? It's a single quadrants kernel, but it contains multiple GPU kernels, right? By comparison, the solver kernel has a single top-level for loop AFAIK?
Exactly. |