cs_dispatch_queue.h implements a SYCL-like device task management system based on 3 objects:
- cs_dispatch_queue: the main object to interact with; allows running tasks on the host and the device with dependency management.
- cs_task/cs_host_task: allow interacting with tasks launched by cs_dispatch_queue, by waiting for the end of a task or extracting an event to synchronize with.
- cs_event: represents an event triggered at the end of a task for other tasks to synchronize with. It is used as an input of cs_dispatch_queue methods to express dependencies between tasks.
cs_dispatch_queue is meant to be used in a way similar to sycl::queue. It can be used to declare parallel tasks on a device that depend on (i.e. wait for) each other. It holds a context, accessible as a member named initializer_context, which is used to initialize each task created by the cs_dispatch_queue.
- In device mode, each task is spawned from the initializer_context and its own CUDA stream, for asynchronous, parallel execution.
- In host mode, tasks are run synchronously.
Tasks are waited upon at destruction. This behavior ensures that resources held for a task are not destroyed before the task terminates.
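As a minimal sketch of this scope-based behavior, using the parallel_for API shown in the example at the end of this file (here `n` and the device-accessible array `a` are hypothetical, and their allocation is elided):

```cpp
{
  cs_dispatch_queue Q;
  // `a` is assumed to be a device-accessible array of size `n`.
  auto task = Q.parallel_for(n, [=] CS_F_HOST_DEVICE(std::size_t id) {
    a[id] = 0;
  });
  // No explicit task.wait() is needed: the task (and, in device mode, its
  // CUDA stream) is waited upon when `task` goes out of scope here.
}
// At this point the task has completed and its resources are released.
```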
cs_dispatch_queue wraps all of its dispatch algorithms to run them as separate tasks represented by cs_task. An additional single_task method can be used to declare tasks run on the host, represented by cs_host_task.
- cs_task holds resources related to the execution of a task started with cs_dispatch_queue (i.e. a CUDA stream in device mode).
- cs_host_task is a specialization of cs_task that holds additional data necessary for the execution of a host task.
cs_event represents a time point in a task to synchronize with.
- In device mode, events are implemented on top of cudaEvent_t.
- In host mode, they are only used for time measurement, as execution is synchronous.
Both cs_task and cs_host_task provide an operator to generate a cs_event
for synchronization or time measurement.
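As a sketch of how such an event can feed a dependency, the generated cs_event can be passed to a cs_dispatch_queue method in place of the task itself (the implicit-conversion syntax below is an assumption about the operator mentioned above, and `n`, `out` are hypothetical):

```cpp
// `out` is assumed to be a device-accessible array of size `n`.
auto task_a = Q.parallel_for(n, [=] CS_F_HOST_DEVICE(std::size_t id) {
  out[id] = 1;
});
cs_event ev = task_a;  // assumed: the task's event-generating operator
// task_b waits on task_a's event before running.
auto task_b = Q.parallel_for(n,
                             { ev },
                             [=] CS_F_HOST_DEVICE(std::size_t id) {
                               out[id] += 1;
                             });
```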
The system can be used like SYCL (without memory-handle-based dependency management) on top of CUDA:
```cpp
inline void
_cs_sycl_like_example()
{
  // Inspired by:
  // https://enccs.github.io/sycl-workshop/task-graphs-synchronization/#how-to-specify-dependencies
  cs_dispatch_queue Q;
  Q.initializer_context.set_use_gpu(true);

  constexpr std::size_t N = 16 * 1024 * 1024;
  int *a, *b;
  CS_MALLOC_HD(a, N, int, CS_ALLOC_HOST_DEVICE_SHARED);
  CS_MALLOC_HD(b, N, int, CS_ALLOC_DEVICE);

  {
    // Start task A asynchronously
    auto task_a =
      Q.parallel_for(N, [=] CS_F_HOST_DEVICE(std::size_t id) { a[id] = 1; });

    // Start task B asynchronously
    auto task_b =
      Q.parallel_for(N, [=] CS_F_HOST_DEVICE(std::size_t id) { b[id] = 2; });

    // Start task C asynchronously, which depends on tasks A and B
    auto task_c =
      Q.parallel_for(N,
                     { task_a, task_b },
                     [=] CS_F_HOST_DEVICE(std::size_t id) { a[id] += b[id]; });

    // Start task D asynchronously, which depends on task C.
    // NB: a single_task runs on the host, not the device.
    int value = 3;
    auto task_d = Q.single_task(
      { task_c },
      [=](int *data) -> void {
        for (int i = 1; i < N; i++) {
          data[0] += data[i];
        }
        data[0] /= value;
      },
      a);

    // NB: tasks are waited upon at destruction, so this is not necessary:
    // task_d.wait();
  }

  CS_FREE_HD(a);
  CS_FREE_HD(b);
}
```