Skip to content

perf(generator): use memcpy paths for primitive sequence conversion#26

Open
azerupi wants to merge 1 commit into
ros2-rust:mainfrom
azerupi:perf/fast-copy-basic-sequence-from-rmw
Open

perf(generator): use memcpy paths for primitive sequence conversion#26
azerupi wants to merge 1 commit into
ros2-rust:mainfrom
azerupi:perf/fast-copy-basic-sequence-from-rmw

Conversation

@azerupi
Copy link
Copy Markdown

@azerupi azerupi commented May 17, 2026

This is the second of a set of 2 PRs that address issue number 2 described in ros2-rust/ros2_rust#628

ros2-rust/rosidl_runtime_rs#20 has to be merged first.

Problem

2. Element-by-element sequence conversion in from_rmw_message

The generated from_rmw_message code converts Sequence<T> to Vec<T> via .into_iter().collect(). The SequenceIterator::next() implementation (rosidl_runtime_rs/src/sequence.rs) does this per element:

let elem = ptr.read();
ptr.write(std::mem::zeroed::<T>());  // writes zero back for EVERY element

For a 64 KB Sequence<u8>, this is 65,536 individual read + zero-write + insert cycles instead of a single memcpy.

Solution

Generated from_rmw_message and into_rmw_message impls for UnboundedSequence<BasicType> fields now route through memcpy-based conversion paths instead of per-element iteration / writes.

  • Receive side (from_rmw_message) — emits msg.field.into(), which routes through From<Sequence<T>> for Vec<T> (added in ros2-rust/rosidl_runtime_rs#20) → as_slice().to_vec()memcpy. Replaces into_iter().collect(), which paid a per-element read+zero-write via SequenceIterator::next().
  • Publish side (into_rmw_message, Owned arm) — emits msg.field.as_slice().into(), which routes through the existing From<&[T]> for Sequence<T>Sequence::new + clone_from_slicememcpy (via stdlib's Copy specialization on clone_from_slice). Replaces msg.field.into(), which routed through From<Vec<T>> for Sequence<T>extend, paying a per-element write through bounds-checked IndexMut.

The Borrowed arm of into_rmw_message already used msg.field.as_slice().into() and is unchanged. Non-primitive sequence branches (String, WString, nested Message-typed sequences) are unchanged — they need real per-element conversion.

Depends on: ros2-rust/rosidl_runtime_rs#20) for the new From<Sequence<T>> for Vec<T> impl. That PR must be released first before this PR's generated code will compile against a published rosidl_runtime_rs.

Benchmarks

For benchmarks, see ros2-rust/rosidl_runtime_rs#20)

For UnboundedSequence<BasicType> fields, route both conversion
directions through single-memcpy paths instead of element-by-element
loops.

- from_rmw_message: emit msg.field.into() instead of
  msg.field.into_iter().collect(). Routes through rosidl_runtime_rs'
  Copy-bounded From<Sequence<T>> for Vec<T> -> as_slice().to_vec() ->
  memcpy. The old path paid a per-element read + zero-write through
  SequenceIterator::next().

- into_rmw_message Owned arm: emit msg.field.as_slice().into() instead
  of msg.field.into(). Routes through From<&[T]> for Sequence<T> ->
  Sequence::new + clone_from_slice -> memcpy (stdlib specializes
  clone_from_slice for T: Copy). The old path went through
  From<Vec<T>> -> from_iter -> extend, paying a per-element IndexMut
  write.

The Borrowed arm of into_rmw_message already used as_slice().into() and
is unchanged. Non-primitive element types (String, WString, nested
messages) still use .into_iter().map(...).collect() -- they need real
per-element conversion regardless.

Also restructures the from_rmw_message UnboundedSequence branch so each
element-type arm emits a single complete expression rather than sharing
a chained .into_iter()...collect() across arms.

Microbenchmarks (Sequence<u8> 65536 <-> Vec<u8>):
  receive:  23.3 us -> 513 ns (45.4x)
  publish:  28.3 us -> 726 ns (39.0x)

End-to-end (UInt8MultiArray, single-process, reliable QoS,
rmw_fastrtps_cpp, 64 KiB, median of 3 x 5s):
  rclrs (main):            30,578 msg/s
  rclrs (this branch):    135,560 msg/s  (4.4x over main, ~90% of rclcpp)
  rclcpp baseline:        151,279 msg/s
  rclrs + Worker sub:     224,576 msg/s  (~1.48x of rclcpp single-process)

Requires rosidl_runtime_rs with the new Copy-bounded
From<Sequence<T>> for Vec<T> impl (companion PR).

Refs ros2-rust/ros2_rust#628 improvement (1).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant