Sparse heaps #392
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dbc3b55b61
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
@codex review
💡 Codex Review
Reviewed commit: b6d5015647
CC-Yeh left a comment:
What do you think about keeping a warm pool of heaps to reduce TTFT? We could trade some memory for latency. Might be useful for short inputs.
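A warm pool along those lines could be sketched as below. This is only an illustration of the idea: `Heap`, `WarmHeapPool`, and their methods are placeholder names, not types from this PR; the real pool would hold Metal heaps and pre-allocate them at startup so the first request never pays the allocation cost.

```rust
// Hypothetical sketch: `Heap` stands in for a Metal heap; only the
// pooling logic is real here. Pre-allocating trades resident memory
// for lower time-to-first-token.
struct Heap {
    capacity: usize,
}

struct WarmHeapPool {
    free: Vec<Heap>,
}

impl WarmHeapPool {
    /// Eagerly allocate `count` heaps up front.
    fn prewarm(count: usize, capacity: usize) -> Self {
        let free = (0..count).map(|_| Heap { capacity }).collect();
        WarmHeapPool { free }
    }

    /// Hand out a warm heap if one is available; fall back to a
    /// cold allocation otherwise.
    fn acquire(&mut self, capacity: usize) -> Heap {
        self.free.pop().unwrap_or(Heap { capacity })
    }

    /// Return a heap to the pool for reuse.
    fn release(&mut self, heap: Heap) {
        self.free.push(heap);
    }
}
```

The memory/latency trade-off is the `prewarm` count: each pooled heap stays resident even when idle.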
```rust
let device_capabilities = MetalDeviceCapabilities::from_device(&device);

let page_size = MTLSparsePageSize::KB256;
let heap_capacity = 64 * 4 * page_size.byte_size().as_u64() as usize;
```
Maybe we could make this configurable: smaller for iPhones, bigger for Macs. Or we could even have different pools for different use cases.
Sizes will be set up together with the first usage of sparse buffers. For now I've just added a placeholder value.
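For reference, a hedged sketch of what "configurable per device class" might look like. The `DeviceClass` enum and the page multipliers are invented for illustration; only the 256 KB page size and the `64 * 4`-page Mac-side total come from the snippet above.

```rust
// Hypothetical sketch: pick a sparse-heap capacity per device class
// instead of a single hard-coded value. Multipliers are placeholders.
const PAGE_SIZE_BYTES: usize = 256 * 1024; // MTLSparsePageSize::KB256

#[derive(Clone, Copy)]
enum DeviceClass {
    Phone,
    Mac,
}

fn heap_capacity(class: DeviceClass) -> usize {
    // Fewer pages on memory-constrained phones, more on Macs.
    let pages = match class {
        DeviceClass::Phone => 64,
        DeviceClass::Mac => 64 * 4,
    };
    pages * PAGE_SIZE_BYTES
}
```

Separate pools per use case would then just be separate `heap_capacity`-style policies keyed by workload instead of device.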
```rust
    })
    .collect();

cmd_queue.update_buffer_mappings(buffer, Some(&self.heap), &mtl_operations);
```
Potential race? Do we need a fence/barrier for the intra-queue and inter-queue cases?
The caller must be responsible for synchronization, because with many maps, for example, it's not a good idea to synchronize on every map.
Thanks. How can we enforce this? Maybe tests plus documentation can help.
```rust
#[derive(Clone, PartialEq)]
pub(super) struct MetalSparseHeapBufferMapping {
    gpu_address: u64,
```
Could gpu_address be brittle if Drop fails or finishes only partially? Maybe store a pointer to the buffer? Or we could store the mapping on the buffer side?
> Could gpu_address be brittle if Drop fails or finishes only partially?

How would that be possible? In any case, after drop() is called the object is no longer usable. Or maybe I misunderstood what you mean.
The pool's bookkeeping is keyed by gpu_address, a value Metal can reassign to a new buffer later, so any orphan entry left behind by a partial Drop becomes indistinguishable from a live entry.
But in the case of duplicate entries, the first unmap would probably release the heap and the second would be a no-op.
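A sketch of the aliasing concern and one possible mitigation: key the bookkeeping by a monotonically increasing mapping id rather than by `gpu_address`, so an entry orphaned by a partial Drop can never collide with a later buffer that happens to receive the same GPU address. `MappingPool` and its methods are placeholders, not this PR's types.

```rust
use std::collections::HashMap;

// Hypothetical sketch: unique ids make orphaned entries harmless and
// make double-unmap an explicit no-op.
struct MappingPool {
    next_id: u64,
    live: HashMap<u64, u64>, // mapping id -> gpu_address
}

impl MappingPool {
    fn new() -> Self {
        MappingPool { next_id: 0, live: HashMap::new() }
    }

    /// Register a mapping and return its unique id. Two mappings may
    /// share a gpu_address over time but never an id.
    fn insert(&mut self, gpu_address: u64) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.live.insert(id, gpu_address);
        id
    }

    /// Idempotent removal: the first unmap releases the entry, a
    /// second unmap of the same id is a no-op.
    fn remove(&mut self, id: u64) -> bool {
        self.live.remove(&id).is_some()
    }
}
```

Storing the mapping on the buffer side, as suggested above, achieves the same thing by tying the entry's lifetime to the buffer instead of to a reusable address.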
```rust
    }
}

heap.execute(buffer, &context.command_queue4, &mappings, true);
```
Early return when there is no overlap, to skip the expensive op?
There is nothing expensive inside if mappings are empty.
My understanding is that the CPU will still talk to the GPU even if mappings is empty; maybe that's cheap, I'm not sure.
Planned to do in the future.
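The suggested guard is tiny either way; a sketch with a stub standing in for the real encode/commit path (names are illustrative, not this PR's API):

```rust
// Hypothetical sketch: skip the CPU->GPU round trip entirely when
// there is nothing to remap.
struct Mapping;

fn submit_to_gpu(submissions: &mut usize, _mappings: &[Mapping]) {
    *submissions += 1; // stands in for a command-buffer commit
}

/// Returns true when work was actually submitted.
fn execute(submissions: &mut usize, mappings: &[Mapping]) -> bool {
    if mappings.is_empty() {
        return false; // early return: no command buffer is even encoded
    }
    submit_to_gpu(submissions, mappings);
    true
}
```

Even if an empty submission is cheap on the GPU side, the guard also skips the CPU-side encode, so it costs nothing to add.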
No description provided.