
[Feature]: Support tiled Cholesky for systems with >96 DOFs by opting in to extended GPU shared memory (>48KB) #2626

@Huan-qun

Description


What feature or enhancement are you proposing?

Enable the tiled Cholesky kernels (func_cholesky_factor_direct_tiled, func_cholesky_solve_tiled in constraint/solver.py) to work with systems that have more than 96 DOFs by opting in to extended GPU shared memory via cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemoryBytes, size).

Currently, rigid_solver.py calculates max_n_threads based on a hardcoded max_shared_mem = 48.0 KB, which limits tiled Cholesky to n_dofs ≤ 96 (f32). When n_dofs exceeds this threshold, enable_tiled_cholesky_hessian is set to False and the solver silently falls back to the single-thread-per-environment batch path.
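The budget arithmetic behind the current cap can be sketched as follows. This is illustrative only: the real accounting lives in rigid_solver.py, and the assumption here that the dominant shared buffer is one dense n_dofs × n_dofs f32 factor per environment understates the true footprint (the actual kernels also need workspace, which is why the practical cap is 96 rather than the raw n² bound):

```python
# Illustrative shared-memory arithmetic (assumed layout: the tiled factor
# kernel stages one dense n_dofs x n_dofs f32 matrix per environment; the
# real kernels also need extra workspace, so treat these as lower bounds).

F32_BYTES = 4
DEFAULT_SHARED_MEM = 48 * 1024          # hardcoded default budget (48 KB)
OPTIN_SHARED_MEM_EXAMPLE = 163 * 1024   # e.g. A100 opt-in per-block limit

def matrix_bytes(n_dofs: int, elem_bytes: int = F32_BYTES) -> int:
    """Bytes needed to hold one dense n x n factor in shared memory."""
    return n_dofs * n_dofs * elem_bytes

for n in (96, 128, 200):
    need = matrix_bytes(n)
    print(n, need, need <= DEFAULT_SHARED_MEM, need <= OPTIN_SHARED_MEM_EXAMPLE)
# 96  ->  36864 bytes: fits in 48 KB
# 128 ->  65536 bytes: needs the opt-in budget
# 200 -> 160000 bytes: needs the opt-in budget
```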

Motivation

Modern NVIDIA GPUs support substantially more shared memory per block than the 48KB default. Unlocking this requires calling cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemoryBytes, needed_size) before kernel launch, which Quadrants does not currently do. As a result, the tiled Cholesky optimization — which is specifically designed to accelerate the Newton solver — is unavailable for any articulated system exceeding 96 DOFs, precisely the systems where this optimization matters most.

Potential Benefit

1. Broader applicability: soft robots, humanoids, legged robots, and multi-body systems commonly exceed 96 DOFs. These are the scenarios where Newton solver performance is critical.

2. No algorithm changes required: the existing func_cholesky_factor_direct_tiled and func_cholesky_solve_tiled implementations are already correct and well-optimized; they only need the runtime shared-memory limit raised.

3. Scales with hardware: querying CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN at runtime instead of hardcoding 48KB lets Genesis automatically take advantage of whatever shared memory the GPU provides.

4. Also benefits mass-matrix tiling: enable_tiled_cholesky_mass_matrix in forward_dynamics.py has the same 96-DOF limitation and would benefit equally.
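The runtime query in point 3 can be done today from Python via ctypes against the CUDA driver API, without waiting for a Quadrants-level API. A minimal sketch (the attribute id 97 is the documented value of CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN in cuda.h; the library name assumes Linux, and the function falls back to the current 48 KB default when no CUDA driver is available):

```python
import ctypes

# Documented driver-API attribute id from cuda.h:
CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN = 97
DEFAULT_SHARED_MEM = 48 * 1024  # current hardcoded budget

def query_optin_shared_mem(device: int = 0) -> int:
    """Opt-in shared-memory-per-block limit in bytes for `device`,
    falling back to 48 KB when no CUDA driver is present."""
    libcuda = None
    for name in ("libcuda.so", "libcuda.so.1"):  # Linux names; assumption
        try:
            libcuda = ctypes.CDLL(name)
            break
        except OSError:
            continue
    if libcuda is None or libcuda.cuInit(0) != 0:
        return DEFAULT_SHARED_MEM
    dev = ctypes.c_int()
    if libcuda.cuDeviceGet(ctypes.byref(dev), device) != 0:
        return DEFAULT_SHARED_MEM
    val = ctypes.c_int()
    rc = libcuda.cuDeviceGetAttribute(
        ctypes.byref(val),
        CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN,
        dev,
    )
    return val.value if rc == 0 else DEFAULT_SHARED_MEM

print(query_optin_shared_mem())
```

On pre-Volta GPUs the attribute simply reports 48 KB, so using its value never regresses the current behavior.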

What is the expected outcome of the implementation work?

1. Quadrants exposes an API (or Genesis adds a workaround) to call cudaFuncSetAttribute before launching kernels that require more than 48KB of shared memory.

2. rigid_solver.py queries CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN at initialization and uses the result to compute max_n_threads and max_shared_mem, rather than hardcoding 48KB.

3. Systems with DOF counts between 96 and the hardware limit automatically benefit from tiled Cholesky acceleration on capable GPUs.
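Outcome 2 amounts to replacing a hardcoded threshold with one derived from the queried budget. A rough sketch of that init-time decision, with hypothetical helper names (the real check compares n_dofs against a cap computed in rigid_solver.py, and a production version must subtract the kernels' extra workspace from the budget):

```python
import math

F32_BYTES = 4

def max_tiled_dofs(shared_mem_bytes: int, elem_bytes: int = F32_BYTES) -> int:
    """Largest n such that a dense n x n factor fits in the budget.
    Upper bound only: the real kernels also need per-thread workspace."""
    return math.isqrt(shared_mem_bytes // elem_bytes)

def should_enable_tiled_cholesky(n_dofs: int, shared_mem_bytes: int) -> bool:
    # Hypothetical decision helper mirroring the check that currently
    # disables enable_tiled_cholesky_hessian above a 48 KB-derived cap.
    return n_dofs <= max_tiled_dofs(shared_mem_bytes)

print(should_enable_tiled_cholesky(96, 48 * 1024))    # True: fits today
print(should_enable_tiled_cholesky(200, 48 * 1024))   # False: rejected today
print(should_enable_tiled_cholesky(200, 163 * 1024))  # True: fits with opt-in
```

Wiring the queried opt-in limit into this check is what lets the same code path enable tiling for 96 < n_dofs ≤ hardware limit without touching the kernels themselves.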

Additional information

No response
