
[MISC] Add support of opt-in shared memory for tiled hessian to improve performance. #2629

Merged
duburcqa merged 1 commit into Genesis-Embodied-AI:main from duburcqa:optin_shared_memory
Apr 4, 2026

Conversation

@duburcqa
Collaborator

@duburcqa duburcqa commented Mar 31, 2026

Related Issue

Resolves #2626

Checklist:

  • I read the CONTRIBUTING document.
  • I followed the Submitting Code Changes section of CONTRIBUTING document.
  • I tagged the title correctly (including BUG FIX/FEATURE/MISC/BREAKING)
  • I updated the documentation accordingly or no change is needed.
  • I tested my changes and added instructions on how to test it for reviewers.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@duburcqa duburcqa requested a review from YilingQiao as a code owner March 31, 2026 10:25
@duburcqa
Collaborator Author

duburcqa commented Mar 31, 2026

This snippet crashes on CUDA for now, preventing this PR from passing.

import quadrants as qd

qd.init(arch=qd.cuda, debug=False, cfg_optimization=False)

@qd.kernel
def func_solve_init(
    nt_H: qd.types.ndarray,
):
    BLOCK_DIM = qd.static(64)
    MAX_DOFS = qd.static(111)  # Slightly over 48 KB; 110 would pass

    n_dofs = nt_H.shape[1]
    n_dofs_2 = n_dofs**2
    n_lower_tri = n_dofs * (n_dofs + 1) // 2

    qd.loop_config(block_dim=BLOCK_DIM)
    for tid in range(BLOCK_DIM):
        H = qd.simt.block.SharedArray((MAX_DOFS, MAX_DOFS + 1), qd.f32)

        i_pair = tid
        while i_pair < n_lower_tri:
            i_d1 = qd.cast(qd.floor((qd.sqrt(qd.cast(8 * i_pair + 1, qd.f32)) - 1.0) / 2.0), qd.i32)
            if (i_d1 + 1) * (i_d1 + 2) // 2 <= i_pair:
                i_d1 = i_d1 + 1
            i_d2 = i_pair - i_d1 * (i_d1 + 1) // 2
            H[i_d1, i_d2] = nt_H[0, i_d1, i_d2]
            i_pair = i_pair + BLOCK_DIM

    qd.loop_config(block_dim=BLOCK_DIM)
    for tid in range(BLOCK_DIM):
        H = qd.simt.block.SharedArray((MAX_DOFS, MAX_DOFS + 1), qd.f32)

        i_flat = tid
        while i_flat < n_dofs_2:
            i_d1 = i_flat // n_dofs
            i_d2 = i_flat % n_dofs
            if i_d2 <= i_d1:
                H[i_d1, i_d2] = nt_H[0, i_d1, i_d2]
            i_flat = i_flat + BLOCK_DIM


nt_H = qd.ndarray(dtype=qd.f32, shape=(1, 102, 102))

func_solve_init(nt_H)
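For reviewers, the two non-obvious parts of the snippet can be checked in plain Python without `quadrants`: the closed-form unranking of a flat lower-triangle index into `(i_d1, i_d2)`, and the shared-memory budget that makes `MAX_DOFS = 111` exceed the default 48 KB per-block limit. A minimal sketch:

```python
import math

def tri_unrank(i_pair: int) -> tuple[int, int]:
    """Map a flat lower-triangle rank to (row, col), as in the kernel above."""
    i_d1 = int(math.floor((math.sqrt(8 * i_pair + 1) - 1.0) / 2.0))
    # Guard against float rounding pushing i_d1 one row too low.
    if (i_d1 + 1) * (i_d1 + 2) // 2 <= i_pair:
        i_d1 += 1
    i_d2 = i_pair - i_d1 * (i_d1 + 1) // 2
    return i_d1, i_d2

n_dofs = 102
pairs = [tri_unrank(i) for i in range(n_dofs * (n_dofs + 1) // 2)]
assert pairs == [(r, c) for r in range(n_dofs) for c in range(r + 1)]

# Shared-memory budget: one (MAX_DOFS, MAX_DOFS + 1) f32 array per block.
assert 111 * 112 * 4 > 48 * 1024   # 49,728 B: over the 48 KB default -> crash
assert 110 * 111 * 4 <= 48 * 1024  # 48,840 B: fits -> passes
```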

Fixed by Genesis-Embodied-AI/quadrants#442

@duburcqa duburcqa force-pushed the optin_shared_memory branch from c9baa9d to a5c8a92 Compare April 3, 2026 12:18
@duburcqa duburcqa closed this Apr 3, 2026
@duburcqa duburcqa reopened this Apr 3, 2026
@github-actions

github-actions Bot commented Apr 3, 2026

⚠️ Abnormal Benchmark Result Detected ➡️ Report

@duburcqa duburcqa merged commit 76764bb into Genesis-Embodied-AI:main Apr 4, 2026
24 of 44 checks passed
@duburcqa duburcqa deleted the optin_shared_memory branch April 5, 2026 06:09
@v01dXYZ

v01dXYZ commented Apr 5, 2026

This PR's tests should have failed on amdgpu. Do you know how they didn't?

@duburcqa
Collaborator Author

duburcqa commented Apr 5, 2026

This PR's tests should have failed on amdgpu. Do you know how they didn't?

Why are you saying it should fail? Quadrants' unit tests are passing on amdgpu. We don't have amdgpu CI on Genesis for now, so it is hard to make sure everything works fine.

@v01dXYZ

v01dXYZ commented Apr 5, 2026

@duburcqa Sorry, I thought there was a ROCm pipeline to catch it if something doesn't work. There is none, so you could not have known. I didn't mean to be rude or press you over that.

Are you interested in adding a CI pipeline that would be triggered at release time (to minimize costs)?

I tried on a Hot Aisle MI300 (2 USD/hr), since DigitalOcean no longer has any availability for MI300. If you are interested, the OS is Ubuntu 22.04.5 LTS - Linux 5.15.0 - ROCm 7.2.0. You'll also have to install libegl1 to get rendering working. The instance should be explicitly deleted after usage so it doesn't keep consuming credits (there is an API, but I haven't tried it yet).

@duburcqa
Collaborator Author

duburcqa commented Apr 5, 2026

That sounds neat! I will try that. How hard is it to set up a GitHub runner for it?

@v01dXYZ

v01dXYZ commented Apr 5, 2026

I'll try it now, and I'll tell you whether I was able to register a GitHub runner easily.

@duburcqa
Collaborator Author

duburcqa commented Apr 5, 2026

By the way, cost is not a blocker. What matters is having a datacenter-grade AMD GPU that is easy to use in CI and to play with interactively via SSH. If we have that, I'm keen to enable it on every pipeline.

@v01dXYZ

v01dXYZ commented Apr 5, 2026

I'll document below how to set up an ephemeral runner using Hot Aisle.

Hot Aisle Provisioning

#!/bin/env bash

## /!\ you have to provide the env var TEAM
## $ TEAM=<your-team> ./script.sh

## Before anything, you have to:
##
## * create a Hot Aisle account
## * create a team
## * provide an SSH key to log into instances
## * create an API token

# this script is only to document the process
set -e

## Getting the CLI tool and configuring it
## ---------------------------------------

# Download Hot Aisle cli https://github.com/hotaisle/hotaisle-cli/
HOTAISLE_CLI=./hotaisle-cli-v0.8.17-linux-amd64
if [[ ! -f ${HOTAISLE_CLI} ]]; then
    curl -LO 'https://github.com/hotaisle/hotaisle-cli/releases/download/v0.8.17/hotaisle-cli-v0.8.17-linux-amd64.tar.gz'
    tar xvf hotaisle-cli-v0.8.17-linux-amd64.tar.gz
fi

if ! (${HOTAISLE_CLI} user get > /dev/null); then
    # Set API_TOKEN
    if [[ -z "${HOTAISLE_API_TOKEN}" ]]; then
        read -sp "api token: " HOTAISLE_API_TOKEN
    fi

    HOTAISLE_API_TOKEN=${HOTAISLE_API_TOKEN} ${HOTAISLE_CLI} config set token

    # sanity check
    ${HOTAISLE_CLI} user get > /dev/null
fi

## Provisioning
## ------------

# To check if there are available GPUs
${HOTAISLE_CLI} vm available --team ${TEAM}

${HOTAISLE_CLI} vm provision \
                --team ${TEAM} \
                --gpu-count 1 \
                --gpu-model "MI300X" \
                --user-data-url "url-to-cloud-init" \
    | tee hotaisle_provisioned_instance.json

VM_NAME=$(python3 -c 'import json; d = json.load(open("hotaisle_provisioned_instance.json")); print(d["name"]);')

# Deleting as soon as possible
${HOTAISLE_CLI} vm delete --team ${TEAM} --vm ${VM_NAME}

Spinning up a GA Runner

Still documenting myself on that; I plan to mimic https://github.com/Cyclenerd/hcloud-github-runner
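In the meantime, a hedged sketch of what the first-boot step could look like, based on GitHub's documented self-hosted runner flow. `REPO`, `GITHUB_TOKEN`, and the runner version below are placeholders, not values from this thread; `--ephemeral` makes the runner deregister itself after one job, which fits a delete-after-use VM:

```bash
#!/bin/env bash
set -e

# Placeholders -- supply your own values.
REPO="Genesis-Embodied-AI/Genesis"
RUNNER_VERSION="2.321.0"   # example only; check the actions/runner releases page

# Exchange a PAT for a short-lived runner registration token (documented GitHub REST API).
REG_TOKEN=$(curl -s -X POST \
    -H "Authorization: Bearer ${GITHUB_TOKEN}" \
    -H "Accept: application/vnd.github+json" \
    "https://api.github.com/repos/${REPO}/actions/runners/registration-token" \
    | python3 -c 'import json, sys; print(json.load(sys.stdin)["token"])')

# Download and configure the runner as ephemeral: it unregisters after one job.
curl -LO "https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz"
tar xzf "actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz"
./config.sh --url "https://github.com/${REPO}" --token "${REG_TOKEN}" \
    --ephemeral --unattended --labels amdgpu,mi300x
./run.sh
```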

@v01dXYZ

v01dXYZ commented Apr 5, 2026

@duburcqa Registering the instance as a runner is actually pretty easy. But I've got to find a way to run the Slurm job.

@v01dXYZ

v01dXYZ commented Apr 5, 2026

OK, so I'm running the script manually without Slurm. Many things are meant for NVIDIA only... Trying to get the test suite to run just to see how many tests fail.

@duburcqa
Collaborator Author

duburcqa commented Apr 5, 2026

@v01dXYZ the test suite is running on Apple Metal. It should run on AMDGPU just the same. Otherwise it is a bug, because it is supposed to work.

@v01dXYZ

v01dXYZ commented Apr 6, 2026

@duburcqa

I'm trying to run the benchmark suite from the production workflow. Some things are NVIDIA-only:

  • nvidia-smi
  • Usage of /proc/driver/nvidia/gpus

But it is not a big deal and I was able to get past that.

Now the big question is whether the MI300X has raster capabilities, since it is a card without display connectors. I'm looking at the Mesa code but I don't see any.

@duburcqa
Collaborator Author

duburcqa commented Apr 6, 2026

Some things are NVIDIA-only:

It should not be the case. nvidia-smi is not mandatory, and usage of /proc/driver/nvidia/gpus is not mandatory either. You may be doing something wrong, or not be using the Genesis main branch.

the big question is whether the MI300X has raster capabilities, since it is a card without display connectors

Anything can be used for rasterizer rendering. You don't need a GPU for this, and if you have one, all you need is OpenGL 4.1, which is like 10 years old and even supported natively on Apple Metal.

@v01dXYZ

v01dXYZ commented Apr 6, 2026

I am surely doing something wrong. I'm trying to run the benchmark tests that monitor memory and speed.

If you take a look at tests/conftest.py, memory monitoring is actually only available on NVIDIA: https://github.com/Genesis-Embodied-AI/Genesis/blob/main/tests/conftest.py#L137C1-L149C8

Concerning /proc/driver/..., it's here: https://github.com/Genesis-Embodied-AI/Genesis/blob/main/tests/conftest.py#L221-L237

But it's small stuff (and we already have rocm-smi somewhere else). No biggie; right now I'm trying to get the benchmark running without falling back to the slow path of software rasterization. We'll sort out the little things needed to better support amdgpu.

Concerning rasterization, I failed to enable hardware-accelerated rasterization (so it uses llvmpipe).
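If it helps, an AMD-side memory probe could plausibly mirror the NVIDIA one by shelling out to rocm-smi and parsing its JSON output. The key names below are assumptions checked only against a hardcoded illustrative payload, not against this machine — a sketch:

```python
import json

# Illustrative `rocm-smi --showmeminfo vram --json` payload; the key names here
# are assumptions for this sketch, not verified output from the MI300X above.
SAMPLE = (
    '{"card0": {"VRAM Total Memory (B)": "206158430208",'
    ' "VRAM Total Used Memory (B)": "1073741824"}}'
)

def vram_used_bytes(payload: str, card: str = "card0") -> int:
    """Extract used VRAM in bytes from a rocm-smi JSON dump."""
    info = json.loads(payload)[card]
    return int(info["VRAM Total Used Memory (B)"])

print(vram_used_bytes(SAMPLE))  # 1073741824
```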

@duburcqa
Collaborator Author

duburcqa commented Apr 6, 2026

I am surely doing something wrong. I'm trying to run the benchmark tests that monitor memory and speed.

You cannot monitor memory on non-CUDA devices, but this is not part of the standard workflow. Usually, you should just skip memory profiling:

pytest --print -m "benchmarks" ./tests

Concerning /proc/driver/..., it's here: https://github.com/Genesis-Embodied-AI/Genesis/blob/main/tests/conftest.py#L221-L237

It is guarded by a try-except, so it is not blocking. But yes, multi-GPU support is only complete on NVIDIA GPUs.

Concerning rasterization, I failed to enable hardware-accelerated rasterization (so it uses llvmpipe).

This is strange. If you can run commands interactively, you should run a simple simulation with one offscreen camera and enable debug logging. It will print which rendering backends were tried before falling back to Mesa. Not much information, though. Forcing the desired backend via PYOPENGL_PLATFORM may help. Alternatively, you should check that OpenGL hardware acceleration works fine outside Genesis.
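Concretely, forcing the backend is just an environment variable that must be set before PyOpenGL (or anything importing it) loads; `egl` is the usual choice for headless hardware acceleration, `osmesa` for pure software. A minimal sketch (the values are standard PyOpenGL platform names, not Genesis-specific):

```python
import os

# Must happen before the first `import OpenGL` anywhere in the process:
# PyOpenGL reads PYOPENGL_PLATFORM once at import time.
os.environ["PYOPENGL_PLATFORM"] = "egl"  # or "osmesa" to force software rendering

# ... only now import Genesis / pyrender and create the offscreen camera.
print(os.environ["PYOPENGL_PLATFORM"])  # egl
```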

@v01dXYZ

v01dXYZ commented Apr 6, 2026

So after fighting with Mesa all day long, I came to understand that the MI300 is considered a compute-only chip and has no support for graphics contexts with either radeonsi or radv (info->has_graphics).

  • vk -> amd/vulkan/radv_physical_device.c: radv_is_gpu_supported() returns false for MI300.
  • gl -> gallium/drivers/radeonsi/si_pipe.c: si_create_context() fails if the requested GL context is not compute-only.

Here is the gpu_info of the device (with Mesa commit d4646cd4449b26e9481):

Device info:
    name = GFX940
    marketing_name = AMD Instinct MI300X VF
    num_se = 4
    num_rb = 0
    num_cu = 38
    max_gpu_freq = 2100 MHz
    max_gflops = 10214 GFLOPS
    sqc_inst_cache_size = 64 KB (0 per WGP)
    sqc_scalar_cache_size = 16 KB (0 per WGP)
    tcp_cache_size = 16 KB
    l2_cache_size = 4096 KB
    memory_channels = 16 (TCC blocks)
    memory_size = 192 GB (196352 MB)
    memory_freq = 3 GHz
    memory_bus_width = 8192 bits
    memory_bandwidth = 2663 GB/s
    pcie_gen = 1
    pcie_num_lanes = 1
    pcie_bandwidth = 0.2 GB/s
    clock_crystal_freq = 100000 KHz
    IP COMPUTE  9.4     queues:2        align:256       pad_dw:0x7
    IP SDMA     4.4     queues:2        align:256       pad_dw:0xf
    IP VCN      4.0     queues:1        align:256       pad_dw:0x3f
    IP VCN_JPEG  4.0    queues:1        align:256       pad_dw:0xf
Identification:
    pci (domain:bus:dev.func): 6d61:64:67.70
    pci_id = 0x74b5
    pci_rev_id = 0x0
    family = 75
    gfx_level = 11
    family_id = 141
    chip_external_rev = 71
    chip_rev = 1
Flags:
    family_overridden = 0
    has_graphics = 0
    has_clear_state = 1
    has_distributed_tess = 1
    has_dcc_constant_encode = 0
    has_rbplus = 1
    rbplus_allowed = 0
    has_load_ctx_reg_pkt = 1
    has_out_of_order_rast = 1
    cpdma_prefetch_writes_memory = 0
    has_gfx9_scissor_bug = 0
    has_htile_stencil_mipmap_bug = 0
    has_htile_tc_z_clear_bug_without_stencil = 0
    has_htile_tc_z_clear_bug_with_stencil = 1
    has_small_prim_filter_sample_loc_bug = 0
    has_pops_missed_overlap_bug = 0
    has_32bit_predication = 0
    has_image_opcodes = 0
    never_stop_sq_perf_counters = 0
    has_sqtt_rb_harvest_bug = 0
    has_sqtt_auto_flush_mode_bug = 0
    never_send_perfcounter_stop = 0
    discardable_allows_big_page = 0
    has_taskmesh_indirect0_bug = 0
    has_set_context_pairs = 0
    has_set_context_pairs_packed = 0
    has_set_sh_pairs = 0
    has_set_sh_pairs_packed = 0
    has_set_uconfig_pairs = 0
    mesh_fast_launch_2 = 0
Display features:
    use_display_dcc_unaligned = 0
    use_display_dcc_with_retile_blit = 0
Memory info:
    pte_fragment_size = 2097152
    gart_page_size = 4096
    gart_size = 112741 MB
    vram_size = 196352 MB
    vram_vis_size = 196288 MB
    vram_type = 6
    max_heap_size_kb = 196352 MB
    min_alloc_size = 0
    address32_hi = 0xffff8000
    has_dedicated_vram = 1
    all_vram_visible = 1
    virtual_address_max = 800000000000
    max_tcc_blocks = 16
    tcc_cache_line_size = 128
    tcc_rb_non_coherent = 0
    cp_sdma_ge_use_system_memory_scope = 0
    pc_lines = 0
    lds_size_per_workgroup = 65536
    lds_alloc_granularity = 512
    max_memory_clock = 1300 MHz
CP info:
    gfx_ib_pad_with_type2 = 0
    can_chain_ib2 = 1
    has_cp_dma = 0
    me_fw_version = 0
    me_fw_feature = 0
    mec_fw_version = 32960
    mec_fw_feature = 50
    pfp_fw_version = 0
    pfp_fw_feature = 0
Multimedia info:
    vcn_unified = 4
    vcn_enc_major_version = 1
    vcn_enc_minor_version = 24
    vcn_dec_version = 9
    jpeg_decode = 32
    codec    dec  max_resolution   enc  max_resolution
    mpeg2    -    -                -    -
    mpeg4    -    -                -    -
    vc1      -    -                -    -
    h264     *    4096x4096        -    -
    hevc     *    8192x4352        -    -
    jpeg     *    16384x16384      -    -
    vp9      *    8192x4352        -    -
    av1      *    8192x4352        -    -
Kernel & winsys capabilities:
    drm = 3.64.0
    has_userptr = 1
    has_timeline_syncobj = 1
    has_vm_always_valid = 1
    has_bo_metadata = 1
    has_eqaa_surface_allocator = 1
    has_sparse = 1
    has_gpuvm_fault_query = 1
    has_kernelq_reg_shadowing = 1
    has_default_zerovram_support = 1
    has_tmz_support = 0
    has_trap_handler_support = 0
    IP GFX     max_submitted_ibs = 192
    IP COMPUTE max_submitted_ibs = 125
    IP SDMA    max_submitted_ibs = 49
    IP UVD     max_submitted_ibs = 49
    IP VCE     max_submitted_ibs = 49
    IP UVD_ENC max_submitted_ibs = 49
    IP VCN_DEC max_submitted_ibs = 49
    IP VCN     max_submitted_ibs = 49
    IP VCN_JPEG max_submitted_ibs = 16
    IP VPE     max_submitted_ibs = 49
    kernel_has_modifiers = 0
    uses_kernel_cu_mask = 0
Shader core info:
    cu_mask[SE0][SA0] = 0x1ff   (9)     CU_EN = 0x1ff
    cu_mask[SE1][SA0] = 0x3ff   (10)    CU_EN = 0x3ff
    cu_mask[SE2][SA0] = 0x3ff   (10)    CU_EN = 0x3ff
    cu_mask[SE3][SA0] = 0x1ff   (9)     CU_EN = 0x1ff
    spi_cu_en_has_effect = 0
    max_good_cu_per_sa = 10
    min_good_cu_per_sa = 9
    max_se = 4
    max_sa_per_se = 1
    num_cu_per_sh = 10
    max_scratch_waves = 1280
    has_scratch_base_registers = 1
Compiler info:
    max_waves_per_simd = 10
    num_physical_sgprs_per_simd = 800
    num_physical_wave64_vgprs_per_simd = 256
    num_simd_per_compute_unit = 4
    min_sgpr_alloc = 16
    max_sgpr_alloc = 102
    sgpr_alloc_granularity = 16
    min_wave64_vgpr_alloc = 8
    max_vgpr_alloc = 256
    wave64_vgpr_alloc_granularity = 8
    has_lds_bank_count_16 = 0
    has_sram_ecc_enabled = 1
    has_point_sample_accel = 0
    has_fast_fma32 = 1
    has_fma_mix = 1
    has_mad32 = 0
    has_packed_math_16bit = 1
    has_accelerated_dot_product = 1
    has_image_bvh_intersect_ray = 0
    has_ngg_passthru_no_msg = 0
    local_invocation_ids_packed = 1
    has_3d_cube_border_color_mipmap = 0
    conformant_trunc_coord = 0
    has_attr_ring = 0
    smaller_tcs_workgroups = 0
    has_gfx6_mrt_export_bug = 0
    has_vtx_format_alpha_adjust_bug = 0
    has_smem_oob_access_bug = 0
    has_image_load_dcc_bug = 0
    has_ls_vgpr_init_bug = 0
    has_cb_lt16bit_int_clamp_bug = 0
    has_vrs_frag_pos_z_bug = 0
    has_ngg_fully_culled_bug = 0
    has_attr_ring_wait_bug = 0
    has_primid_instancing_bug = 0
Ring info:
    hs_offchip_workgroup_size = 32768 B
    tess_factor_ring_size = 120 KB
    tess_offchip_ring_size = 5120 KB
Render backend info:
    pa_sc_tile_steering_override = 0x0
    max_render_backends = 16
    num_tile_pipes = 1
    enabled_rb_mask = 0x0
    max_alignment = 0
    pbb_max_alloc_count = 0
GB_ADDR_CONFIG: 0x00000000
    num_pipes = 1
    pipe_interleave_size = 256
    max_compressed_frags = 1
    bank_interleave_size = 1
    num_banks = 1
    shader_engine_tile_size = 16
    num_shader_engines = 1
    num_gpus = 0 (raw)
    multi_gpu_tile_size = 0 (raw)
    num_rb_per_se = 1
    row_size = 1024
    num_lower_pipes = 0 (raw)
    se_enable = 0 (raw)
Modifiers (32bpp):
    GFX9,64KB_D_X,PIPE_XOR_BITS=0,BANK_XOR_BITS=0
    GFX9,64KB_S_X,PIPE_XOR_BITS=0,BANK_XOR_BITS=0
    GFX9,4KB_D_X,PIPE_XOR_BITS=0,BANK_XOR_BITS=0
    GFX9,64KB_D
    GFX9,64KB_S
    LINEAR

Conclusion: Disabling visualisation would be great in the case of MI300.

@duburcqa
Collaborator Author

duburcqa commented Apr 6, 2026

Conclusion: Disabling visualisation would be great in the case of MI300.

What do you mean exactly by this?

@v01dXYZ

v01dXYZ commented Apr 6, 2026

I mean that if it were possible to disable rendering (so physics only), it would be great, since rendering can be offloaded to cheaper GPUs.

@duburcqa
Collaborator Author

duburcqa commented Apr 6, 2026

I mean that if it were possible to disable rendering (so physics only), it would be great, since rendering can be offloaded to cheaper GPUs.

Still not clear. In which context? When running the unit tests? Example scripts? Systematically? How would you offload rendering? Would this be managed by Genesis, or be the responsibility of the user?

@v01dXYZ

v01dXYZ commented Apr 6, 2026

Right now, only for the benchmark suite (I don't think the unit tests run both physics + rendering).

The root problem is that the GPU stands idle for quite a long time during benchmarks. I think it is because of rendering, but I could be wrong.

About offloading rendering, I think it should be the user replaying the session, this time with rendering only and no physics. So managed by the user.

@duburcqa
Collaborator Author

duburcqa commented Apr 6, 2026

Right now, only for the benchmark suite (I don't think the unit tests run both physics + rendering).

It is the opposite: the benchmarks should not run rendering, but the unit tests do.

The root problem is that the GPU stands idle for quite a long time during benchmarks. I think it is because of rendering, but I could be wrong.

It is because of compilation. The benchmarks monitor compilation time from scratch, assuming the cache is completely empty. It can take a while.

About offloading rendering, I think it should be the user replaying the session, this time with rendering only and no physics. So managed by the user.

I see. At this point this is not supported because there is no way to export the result of a simulation in Genesis. Hopefully this feature will be coming soon.


Development

Successfully merging this pull request may close these issues.

[Feature]: Support tiled Cholesky for systems with >96 DOFs by opting in to extended GPU shared memory (>48KB)

2 participants