[MISC] Add opt-in shared-memory support for the tiled Hessian to improve performance. #2629
Conversation
|
This snippet is crashing on CUDA for now, preventing this PR from passing:

```python
import quadrants as qd

qd.init(arch=qd.cuda, debug=False, cfg_optimization=False)

@qd.kernel
def func_solve_init(
    nt_H: qd.types.ndarray,
):
    BLOCK_DIM = qd.static(64)
    MAX_DOFS = qd.static(111)  # Slightly over 48KB; 110 would pass

    n_dofs = nt_H.shape[1]
    n_dofs_2 = n_dofs**2
    n_lower_tri = n_dofs * (n_dofs + 1) // 2

    qd.loop_config(block_dim=BLOCK_DIM)
    for tid in range(BLOCK_DIM):
        H = qd.simt.block.SharedArray((MAX_DOFS, MAX_DOFS + 1), qd.f32)
        i_pair = tid
        while i_pair < n_lower_tri:
            i_d1 = qd.cast(qd.floor((qd.sqrt(qd.cast(8 * i_pair + 1, qd.f32)) - 1.0) / 2.0), qd.i32)
            if (i_d1 + 1) * (i_d1 + 2) // 2 <= i_pair:
                i_d1 = i_d1 + 1
            i_d2 = i_pair - i_d1 * (i_d1 + 1) // 2
            H[i_d1, i_d2] = nt_H[0, i_d1, i_d2]
            i_pair = i_pair + BLOCK_DIM

    qd.loop_config(block_dim=BLOCK_DIM)
    for tid in range(BLOCK_DIM):
        H = qd.simt.block.SharedArray((MAX_DOFS, MAX_DOFS + 1), qd.f32)
        i_flat = tid
        while i_flat < n_dofs_2:
            i_d1 = i_flat // n_dofs
            i_d2 = i_flat % n_dofs
            if i_d2 <= i_d1:
                H[i_d1, i_d2] = nt_H[0, i_d1, i_d2]
            i_flat = i_flat + BLOCK_DIM

nt_H = qd.ndarray(dtype=qd.f32, shape=(1, 102, 102))
func_solve_init(nt_H)
```

Fixed by Genesis-Embodied-AI/quadrants#442 |
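For reference, the crash is consistent with the static shared-memory budget: a minimal plain-Python sketch (assuming the usual 48 KiB per-block default and 4-byte `f32`) of the arithmetic behind the `MAX_DOFS` comment, plus a check of the lower-triangular index inversion used by the first loop:

```python
import math

# Footprint of SharedArray((MAX_DOFS, MAX_DOFS + 1), qd.f32), assuming
# 4 bytes per f32 and a 48 KiB per-block budget (the typical default
# without opting in to a larger carve-out).
def shared_bytes(max_dofs: int) -> int:
    return max_dofs * (max_dofs + 1) * 4

LIMIT = 48 * 1024
assert shared_bytes(111) > LIMIT   # 49,728 bytes: over budget
assert shared_bytes(110) <= LIMIT  # 48,840 bytes: fits

# Plain-Python version of the index inversion from the first loop:
# i_pair = i_d1 * (i_d1 + 1) // 2 + i_d2, with i_d2 <= i_d1.
def pair_to_indices(i_pair: int) -> tuple:
    i_d1 = int(math.floor((math.sqrt(8 * i_pair + 1) - 1.0) / 2.0))
    if (i_d1 + 1) * (i_d1 + 2) // 2 <= i_pair:  # float-rounding guard
        i_d1 += 1
    i_d2 = i_pair - i_d1 * (i_d1 + 1) // 2
    return i_d1, i_d2

n_dofs = 102
n_lower_tri = n_dofs * (n_dofs + 1) // 2
pairs = [pair_to_indices(k) for k in range(n_lower_tri)]
assert all(d2 <= d1 < n_dofs for d1, d2 in pairs)
assert len(set(pairs)) == n_lower_tri  # each entry visited exactly once
```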
Force-pushed from c9baa9d to a5c8a92
|
This PR's tests should have failed on |
Why are you saying it should fail? Quadrants' unit tests are passing on |
|
@duburcqa Sorry, I thought there was a ROCm pipeline to check whether something doesn't work. There is none, so you could not have known. I didn't want to be rude or press you over that. Are you interested in adding a CI pipeline triggered at release (to minimize costs)? I tried on |
|
That sounds neat! I will try that. How hard is it to set up a GitHub runner for it? |
|
I'll try it now, and I'll tell you if I was able to register a GitHub runner easily. |
|
By the way, cost is not a blocker. What matters is having a datacenter-grade AMD GPU that is easy to use in CI and to play with interactively via SSH. If we have that, I'm keen to enable it on every pipeline. |
|
I'll document below how to set up an ephemeral runner using Hot Aisle.

Provisioning

```bash
#!/usr/bin/env bash
## /!\ you have to provide the env var TEAM
## $ TEAM=<your-team> ./script.sh
## Before anything, you have to:
##
## * create a Hot Aisle account
## * create a team
## * provide an SSH key to log into instances
## * create an API token
# This script is only meant to document the process.
set -e

## Getting the CLI tool and configuring it
## ---------------------------------------
# Download the Hot Aisle CLI: https://github.com/hotaisle/hotaisle-cli/
HOTAISLE_CLI=./hotaisle-cli-v0.8.17-linux-amd64
if [[ ! -f ${HOTAISLE_CLI} ]]; then
    curl -LO 'https://github.com/hotaisle/hotaisle-cli/releases/download/v0.8.17/hotaisle-cli-v0.8.17-linux-amd64.tar.gz'
    tar xvf hotaisle-cli-v0.8.17-linux-amd64.tar.gz
fi
if ! (${HOTAISLE_CLI} user get > /dev/null); then
    # Prompt for the API token if it is not already set
    if [[ -z ${HOTAISLE_API_TOKEN} ]]; then
        read -sp "api token: " HOTAISLE_API_TOKEN
    fi
    HOTAISLE_API_TOKEN=${HOTAISLE_API_TOKEN} ${HOTAISLE_CLI} config set token
    # Sanity check
    ${HOTAISLE_CLI} user get > /dev/null
fi

## Provisioning
## ------------
# Check whether GPUs are available
${HOTAISLE_CLI} vm available --team ${TEAM}
${HOTAISLE_CLI} vm provision \
    --team ${TEAM} \
    --gpu-count 1 \
    --gpu-model "MI300X" \
    --user-data-url "url-to-cloud-init" \
    | tee hotaisle_provisioned_instance.json
VM_NAME=$(python3 -c 'import json; d = json.load(open("hotaisle_provisioned_instance.json")); print(d["name"]);')
# Delete the instance as soon as possible to limit costs
${HOTAISLE_CLI} vm delete --team ${TEAM} --vm ${VM_NAME}
```

Spinning up a GA Runner

Still documenting myself on that. I plan to mimic https://github.com/Cyclenerd/hcloud-github-runner |
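The inline `python3 -c` call above can be lifted into a small helper with clearer error reporting; a sketch that assumes only the `"name"` field the script itself already reads:

```python
import json

def vm_name_from_provision(path: str) -> str:
    # Read the JSON saved by `hotaisle-cli vm provision | tee ...` and
    # return the VM name; only the "name" field is assumed to exist.
    with open(path) as f:
        data = json.load(f)
    name = data.get("name")
    if not name:
        raise ValueError(f"no 'name' field in provision response: {path}")
    return name
```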
|
@duburcqa Registering the instance as a runner is actually pretty easy. But I've got to find a way to run the Slurm job. |
|
OK, so I'm running the script manually without Slurm. Many things are meant for NVIDIA only... Trying to get the test suite to run just to see how many tests fail. |
|
@v01dXYZ the test suite runs on Apple Metal. It should run on an AMD GPU just the same. Otherwise it is a bug, because it is supposed to work. |
|
I tried to run the benchmark suite from the
But it is not a big deal and I was able to get past that. Now the new big deal is to understand whether |
That should not be the case.
Anything can be used for rasterizer rendering. You don't need a GPU for this, and if you have one, all you need is OpenGL 4.1, which is about 10 years old and even supported natively on Apple Metal. |
|
I am surely doing something wrong. I'm trying to run the benchmark tests that monitor memory and speed. If you take a look at Concerning But it's small stuff (and we already have Concerning rasterization, I failed to enable hardware-accelerated rasterization (so it uses llvmpipe). |
You cannot monitor memory on non-CUDA devices, but this is not part of the standard workflow. Usually, you should just skip memory profiling: `pytest --print -m "benchmarks" ./tests`
It is guarded by try-except, so it is not blocking. But yes, multi-GPU support is only complete on NVIDIA GPUs.
This is strange. If you can run commands interactively, you should run a simple simulation with one offscreen camera and enable debug logging. It will print which rendering backends were tried before falling back to Mesa. Not much information though. Forcing the desired backend via |
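The try-except guard mentioned above can be sketched generically (hypothetical names, not the actual benchmark harness): memory profiling is attempted and silently skipped on backends that raise.

```python
def run_with_optional_memory_profiling(benchmark, profiler=None):
    # Run the benchmark body; memory profiling is best-effort and skipped
    # when the backend (e.g. a non-CUDA device) does not support it.
    result = benchmark()
    peak_bytes = None
    if profiler is not None:
        try:
            peak_bytes = profiler()
        except (RuntimeError, NotImplementedError):
            pass  # non-CUDA device: no memory stats available
    return result, peak_bytes
```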
|
So after fighting all day long with Mesa, I was able to understand
Here is the
Conclusion: disabling visualisation would be great in the case of |
What do you mean exactly by this? |
|
I mean that if it were possible to disable rendering (physics only), it would be great, as rendering can be offloaded to cheaper GPUs. |
Still not clear. In which context? When running the unit tests? Example scripts? Systematically? How would you offload rendering? Would this be managed by Genesis, or be the responsibility of the user? |
|
Right now, only for the benchmark suite (I don't think unit tests run both physics and rendering). The root problem is that the GPU stands idle for quite a long time during benchmarks. I think it is because of rendering, but I could be wrong. About offloading rendering, I think it should be the user replaying the session, this time without physics, only rendering. So managed by the user. |
It is the opposite: the benchmarks should not run rendering, but the unit tests do.
It is because of compilation. The benchmarks monitor compilation time from scratch, assuming the cache is completely empty. It can take a while.
I see. At this point this is not supported because there is no way to export the result of a simulation in Genesis. Hopefully this feature will be coming soon. |
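The record-then-replay workflow discussed above can be sketched generically (this is not Genesis API; exporting simulation state is the missing piece):

```python
class SimulationRecorder:
    """Record per-step states during a headless physics run, then replay
    them through a renderer on a different (cheaper) machine."""

    def __init__(self):
        self.frames = []

    def record(self, state):
        # Called once per physics step with an exported state snapshot.
        self.frames.append(state)

    def replay(self, render):
        # Rendering-only pass: no physics is re-executed.
        return [render(state) for state in self.frames]
```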
Related Issue
Resolves #2626
Checklist:
Submitting Code Changes section of the CONTRIBUTING document.