Out of a toy interest, I am trying to run OpenCL and the tree likelihood computation library BEAGLE to run on a PI. BEAGLE assumes that work sizes are divisible by 16, because that's handy for nucleotide substitution matrices, and it fails to run on the 12×12×12 work size limit of VC4CL on the Pi.
Unfortutately, I don't know much about low-level programming and hardware (and I really don't understand any of OpenCL, the Pi's GPU architecture, or what what the work size actually means, sorry), so the question I ask may be a bit dumb:
Would it be possible to change the work size?
I have been looking for the source of the magic number here in the repository and found this comment
|
* "The work-items in a given work-group execute concurrently on the processing elements of a single compute |
|
* unit." (page 24) Since there is no limitation, that work-groups need to be executed in parallel, we set 1 |
|
* compute unit with all 12 QPUs, allowing us to run 12 work-items in a single work-group in parallel and run |
|
* the work-groups sequentially. |
If work items can in part be executed sequentially – could I be taught to set some of the work size limits to 48 (the lcm of 12 and 16) for a small performance hit, or is that number embedded too deeply in the code and would require a lot of changes in other places? Like
|
V3D::instance()->getSystemInfo(SystemInfo::QPU_COUNT), param_value_size, param_value, param_value_size_ret); |
Out of a toy interest, I am trying to run OpenCL and the tree likelihood computation library BEAGLE to run on a PI. BEAGLE assumes that work sizes are divisible by 16, because that's handy for nucleotide substitution matrices, and it fails to run on the 12×12×12 work size limit of VC4CL on the Pi.
Unfortutately, I don't know much about low-level programming and hardware (and I really don't understand any of OpenCL, the Pi's GPU architecture, or what what the work size actually means, sorry), so the question I ask may be a bit dumb:
Would it be possible to change the work size?
I have been looking for the source of the magic number here in the repository and found this comment
VC4CL/src/vc4cl_config.h
Lines 140 to 143 in 842d444
If work items can in part be executed sequentially – could I be taught to set some of the work size limits to 48 (the lcm of 12 and 16) for a small performance hit, or is that number embedded too deeply in the code and would require a lot of changes in other places? Like
VC4CL/src/Kernel.cpp
Line 339 in a00572f