What feature or enhancement are you proposing?
Enable the tiled Cholesky kernels (func_cholesky_factor_direct_tiled, func_cholesky_solve_tiled in constraint/solver.py) to work with systems that have more than 96 DOFs by opting in to extended GPU shared memory via cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemoryBytes, size).
Currently, rigid_solver.py calculates max_n_threads based on a hardcoded max_shared_mem = 48.0 KB, which limits tiled Cholesky to n_dofs ≤ 96 (f32). When n_dofs exceeds this threshold, enable_tiled_cholesky_hessian is set to False and the solver silently falls back to the single-thread-per-environment batch path.
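For concreteness, below is a minimal sketch of the driver-level opt-in this proposal asks for, written against NVIDIA's cuda-python bindings (an assumption: cuda-python is not a Genesis dependency today, and hfunc stands in for a compiled-kernel handle that Quadrants does not currently expose):

```python
# Sketch only: raise a kernel's dynamic shared memory cap above the 48 KB default.
# hfunc is a hypothetical CUfunction handle; obtaining one for the Quadrants-compiled
# tiled Cholesky kernels is exactly what this feature request asks for.
from cuda import cuda  # NVIDIA cuda-python driver bindings (assumed available)


def raise_dynamic_smem_limit(hfunc, needed_bytes: int) -> None:
    # Driver-API counterpart of cudaFuncSetAttribute(kernel,
    # cudaFuncAttributeMaxDynamicSharedMemoryBytes, needed_bytes).
    (err,) = cuda.cuFuncSetAttribute(
        hfunc,
        cuda.CUfunction_attribute.CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
        needed_bytes,
    )
    if err != cuda.CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"cuFuncSetAttribute failed with {err}")
```

The same call could equally live inside Quadrants' own runtime; the sketch only shows which attribute has to be set before launch.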
Motivation
Modern NVIDIA GPUs support substantially more shared memory per block than the 48KB default. Unlocking this requires calling cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemoryBytes, needed_size) before kernel launch, which Quadrants does not currently do. As a result, the tiled Cholesky optimization — which is specifically designed to accelerate the Newton solver — is unavailable for any articulated system exceeding 96 DOFs, precisely the systems where this optimization matters most.
Potential Benefit
1. Broader applicability: soft robots, humanoids, legged robots, and multi-body systems commonly exceed 96 DOFs. These are the scenarios where Newton solver performance is critical.
2. No algorithm changes required: the existing func_cholesky_factor_direct_tiled and func_cholesky_solve_tiled implementations are already correct and well-optimized; they only need the runtime shared-memory limit raised.
3. Scales with hardware: querying CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN at runtime instead of hardcoding 48KB lets Genesis automatically take advantage of whatever shared memory the GPU actually provides (see the query sketch after this list).
4. Also benefits mass matrix tiling: enable_tiled_cholesky_mass_matrix in forward_dynamics.py has the same 96-DOF limitation and would benefit equally.
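As a sketch of the runtime query mentioned in item 3 (again via cuda-python, which is an assumption; any CUDA binding that exposes device attributes would work):

```python
# Sketch only: ask the device how much shared memory per block it allows with opt-in,
# instead of assuming the 48 KB default.
from cuda import cuda  # NVIDIA cuda-python driver bindings (assumed available)


def query_optin_shared_mem(device_ordinal: int = 0) -> int:
    """Return the per-block opt-in shared memory limit in bytes."""
    (err,) = cuda.cuInit(0)
    assert err == cuda.CUresult.CUDA_SUCCESS
    err, dev = cuda.cuDeviceGet(device_ordinal)
    assert err == cuda.CUresult.CUDA_SUCCESS
    err, optin_bytes = cuda.cuDeviceGetAttribute(
        cuda.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN,
        dev,
    )
    assert err == cuda.CUresult.CUDA_SUCCESS
    return optin_bytes  # substantially larger than 48 * 1024 on recent GPUs
```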
What is the expected outcome of the implementation work?
1. Quadrants exposes an API (or Genesis adds a workaround) to call cudaFuncSetAttribute before launching kernels that require >48KB shared memory.
2. rigid_solver.py queries CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN at initialization and uses the result to compute max_n_threads and max_shared_mem, rather than hardcoding 48KB (see the sizing sketch after this list).
3. Systems with DOF counts between 96 and the hardware limit automatically benefit from tiled Cholesky acceleration on capable GPUs.
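To make item 2 concrete, here is a rough sizing sketch: the variable names mirror the ones above, but the formula is an illustrative assumption rather than the actual rigid_solver.py logic (which yields the 96-DOF ceiling quoted earlier); the point is only that the budget should come from a device query instead of a constant.

```python
# Hypothetical sizing logic: derive a tiled-Cholesky DOF ceiling from a shared memory
# budget. The n_dofs^2 * 4-byte model below is an assumption for illustration, not the
# exact formula used by rigid_solver.py.
import math

F32_BYTES = 4


def max_tiled_n_dofs(shared_mem_budget_bytes: int) -> int:
    # Assume the factor kernel keeps an n_dofs x n_dofs f32 tile in shared memory,
    # so the DOF ceiling scales with sqrt(budget / 4).
    return math.isqrt(shared_mem_budget_bytes // F32_BYTES)


print(max_tiled_n_dofs(48 * 1024))    # ceiling under the current hardcoded budget
print(max_tiled_n_dofs(160 * 1024))   # e.g. a queried opt-in budget on a data-center GPU
```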
Additional information
No response