AZP: Add multi HCAs iodemo test#261
AZP: Add multi HCAs iodemo test#261Alexey-Rivkin wants to merge 1 commit intoyosefe:integration3from
Conversation
| initial_delay: 9999 # No interference | ||
| test_name: am_cx6_rc | ||
| extra_run_args: "" | ||
| "tag match on CX6/DC": |
There was a problem hiding this comment.
need to remove DC and AM tests from this branch. keep only tag match on CX4/RC
There was a problem hiding this comment.
but this PR was going to fix tag match on CX6/RC
There was a problem hiding this comment.
right, keep cx6 but remove AM and DC below
| tls: "dc_x" | ||
| test_name: tag_cx6_dc | ||
| extra_run_args: "" | ||
| "active messages on CX6/DC with data validation": |
| initial_delay: 9999 # No interference | ||
| test_name: tag_umemh_cx4_rc | ||
| extra_run_args: "" | ||
| "active messages on CX6/DC with user memh": |
4115bc9 to
1d56eeb
Compare
|
agent-review |
|
agent-review |
NirWolfer
left a comment
There was a problem hiding this comment.
Code Review
8 findings — 🚨 3 blockers,
| test_ucx_tls: ${{ test.Value.tls }} | ||
| test_extra_run_args: ${{ test.Value.extra_run_args }} | ||
| initial_delay: ${{ coalesce(test.Value.initial_delay, parameters.initial_delay) }} | ||
| test_interference: ${{ test.Value.interference }} |
There was a problem hiding this comment.
🚨 BLOCKER
The matrix entries use interference as an optional key, but coalesce is not applied here (unlike initial_delay on line 122). If a test entry omits interference, ${{ test.Value.interference }} expands to an empty string and $(test_interference) will be empty when passed to az-stage-io-demo.yaml, causing the eq(variables['interference'], 'Yes') condition to silently skip the corrupter. Add a coalesce(test.Value.interference, 'Yes') default, matching the pattern used for initial_delay.
| net_device=$(ibdev2netdev | sed -ne 's/\(\w*\) port \([0-9]\) ==> '${interface}' .*/\1:\2/p') | ||
| net_devices="${net_devices} ${net_device}" | ||
| done | ||
| export UCX_NET_DEVICES=$(echo ${net_devices##*( )} | tr " ", ",") |
There was a problem hiding this comment.
🚨 BLOCKER
tr " " "," passes a three-character second argument ("," includes the comma and quote-artefact). The correct invocation is tr ' ' ','. As written, the UCX_NET_DEVICES value will be malformed and UCX will fail to find the device.
| addr=$(echo ${host_ip} | cut -d: -f2) | ||
| if [ -n "${host}" ] && [ -n "${addr}" ] | ||
| then | ||
| ip_address_per_host[${host}${net_if}]=${addr} |
There was a problem hiding this comment.
🚨 BLOCKER
The associative-array key ${host}${net_if} has no delimiter. A host named node1 with interface eth0 produces the same key as host node + interface 1eth0. Use a separator that cannot appear in either value, e.g. ${host}:${net_if}, and apply the same change at every lookup site (lines 465, 676).
| $(workspace)/buildlib/az-network-corrupter.sh reset=yes interface=${{ parameters.roce_iface }} | ||
| echo "##vso[task.setvariable variable=interference]${{ parameters.interference }}" | ||
| for interface in ${{ parameters.roce_iface }} ; do | ||
| $(workspace)/buildlib/az-network-corrupter.sh reset=yes interfaces=${interface} |
There was a problem hiding this comment.
The reset loop calls az-network-corrupter.sh once per interface sequentially. If one interface fails, subsequent interfaces are not reset. The original script already handles multi-interface iteration internally after this PR; passing the full space-separated list as a single interfaces= value would let the script handle it atomically and be consistent with how the corrupter loop is invoked.
| duration: 60 | ||
| interface: $(roce_iface_cx4) $(roce_iface_cx6) | ||
| tls: "rc_x" | ||
| test_name: multy_cx6_cx4_rc |
There was a problem hiding this comment.
Test name multy_cx6_cx4_rc is a typo; should be multi_cx6_cx4_rc. Artifact names and analyzer output will inherit the misspelling.
| initial_delay: 9999 # No interference | ||
| test_name: tag_umemh_cx4_rc | ||
| extra_run_args: "" | ||
| "Multi HCAs: tag match on CX4 and CX6 RC ": |
There was a problem hiding this comment.
The matrix entry name 'Multi HCAs: tag match on CX4 and CX6 RC ' has a trailing space. YAML allows it but it can cause surprising behaviour if the key is referenced or displayed.
| test_name: multy_cx6_cx4_rc | ||
| extra_run_args: "" | ||
| "different UCX_IB_NUM_PATH": | ||
| args: "-A" |
There was a problem hiding this comment.
args: "-A" for the 'different UCX_IB_NUM_PATH' test case — the args field is not a declared parameter in az-stage-io-demo.yaml and is not forwarded anywhere in the matrix expansion. If -A is an intended io_demo flag it must be placed in extra_run_args; as written it is silently ignored.
| for net_if in $(split_list ${net_ifs}) | ||
| do | ||
| port_num=$((base_port_num + i)) | ||
| client_connect_list+=" ${ip_address_per_host[${host}${net_if}]}:${port_num}" |
There was a problem hiding this comment.
The server-port list now adds one host:port entry per (server, interface) pair. If the intent is that each server listens on all interfaces at the same port, this multiplies the connect list incorrectly. Confirm whether io_demo expects one entry per server or one per (server × interface).
|
agent-review |
|
agent-review |
NirWolfer
left a comment
There was a problem hiding this comment.
Code Review
buildlib/pr/io_demo/az-stage-io-demo.yaml:64
tr " ", "," passes " ," (space + comma) as tr's source set instead of the intended " " (space only). The result is that both spaces and commas are mapped to commas. For the current input (net_devices is built with space separators and will never contain commas) this is harmless, but the extra comma is a typo and should be tr " " ",".
|
agent-review |
NirWolfer
left a comment
There was a problem hiding this comment.
Code Review
2 findings — 🚨 1 blocker,
2 findings posted as inline comments.
|
agent-review |
NirWolfer
left a comment
There was a problem hiding this comment.
Code Review
8 findings — 🚨 4 blockers,
8 findings posted as inline comments.
| do | ||
| port_num=$((base_port_num + i)) | ||
| client_connect_list+=" ${ip_address_per_host[${host}]}:${port_num}" | ||
| for net_if in $(split_list ${net_ifs}) |
There was a problem hiding this comment.
🚨 BLOCKER
The inner loop over interfaces inside the server-slot loop means each server slot i (listening on a single port base_port_num + i) gets one client_connect_list entry per interface, all pointing to the same port but different IPs. With num_tasks=2 (two NICs) and num_servers=2, the list becomes ip_cx4:20000 ip_cx6:20000 ip_cx4:20001 ip_cx6:20001 — four entries for two server processes. Clients are told to connect to four addresses but only two servers exist. The correct model for multi-NIC is one server per NIC, each bound to its own IP, so the connect list should have exactly one entry per server (matching its bound interface). The nesting order should be inverted: outer loop over interfaces, inner loop over server slots per interface, with each interface's server using a distinct port range offset by interface index.
| test_ucx_tls: ${{ test.Value.tls }} | ||
| test_extra_run_args: ${{ test.Value.extra_run_args }} | ||
| initial_delay: ${{ coalesce(test.Value.initial_delay, parameters.initial_delay) }} | ||
| test_interference: ${{ test.Value.interference }} |
There was a problem hiding this comment.
🚨 BLOCKER
Tests that omit the interference key (e.g. 'tag match on CX4', 'tag match on CX6/RC') will expand ${{ test.Value.interference }} to an empty string. The template parameter default of 'Yes' is never applied because an explicit (empty) value is passed, so $(test_interference) is empty and eq(variables['interference'], 'Yes') evaluates to false — the network corrupter silently never starts for those tests. Fix: test_interference: ${{ coalesce(test.Value.interference, 'Yes') }}, mirroring the initial_delay pattern on the line above.
| net_device=$(ibdev2netdev | sed -ne 's/\(\w*\) port \([0-9]\) ==> '${interface}' .*/\1:\2/p') | ||
| net_devices="${net_devices} ${net_device}" | ||
| done | ||
| export UCX_NET_DEVICES=$(echo ${net_devices##*( )} | tr " ", ",") |
There was a problem hiding this comment.
🚨 BLOCKER
${net_devices##*( )} uses an extglob pattern to strip leading spaces, but extglob is not enabled by default in bash. Without shopt -s extglob the literal string *( ) is matched instead of 'zero or more spaces', so the stripping silently fails and UCX_NET_DEVICES gets a leading space. Use a POSIX-portable alternative such as echo ${net_devices} | xargs or echo ${net_devices} | sed 's/^ *//'.
| - bash: | | ||
| set -eEx | ||
| $(workspace)/buildlib/az-network-corrupter.sh reset=yes interface=${{ parameters.roce_iface }} | ||
| echo "##vso[task.setvariable variable=interference]${{ parameters.interference }}" |
There was a problem hiding this comment.
🚨 BLOCKER
The interference pipeline variable is set via ##vso[task.setvariable] inside the first 'Restore port state' step, which runs under set -eEx and may exit early if a per-interface corrupter-reset call fails before the echo. If that happens, $(test_interference) is never set and all downstream eq(variables['interference'], 'Yes') conditions evaluate to false, silently suppressing the corrupter for all subsequent steps. The variable should be set in a dedicated, unconditional step that precedes any fallible commands.
| net_device=$(ibdev2netdev | sed -ne 's/\(\w*\) port \([0-9]\) ==> '${interface}' .*/\1:\2/p') | ||
| net_devices="${net_devices} ${net_device}" | ||
| done | ||
| export UCX_NET_DEVICES=$(echo ${net_devices##*( )} | tr " ", ",") |
There was a problem hiding this comment.
tr " ", "," passes the two-character set " ," (space and comma) as tr's source, mapping both spaces and commas to commas. The intended translation is space→comma only: tr " " ",". With the current input this is benign (no commas in net_devices), but it is a typo.
| set -eEx | ||
| ./autogen.sh | ||
| ./contrib/configure-devel --prefix=$(Build.Repository.LocalPath)/install --without-java | ||
| ./contrib/configure-devel --prefix=$(Build.Repository.LocalPath)/install |
There was a problem hiding this comment.
--without-java was removed from the configure line. If the centos7 build container does not have a JDK installed, the build will fail trying to compile Java bindings. This change should be intentional and verified against the container image, or --without-java should be restored.
| duration: 60 | ||
| interface: $(roce_iface_cx4) $(roce_iface_cx6) | ||
| tls: "rc_x" | ||
| test_name: multy_cx6_cx4_rc |
There was a problem hiding this comment.
test_name: multy_cx6_cx4_rc contains a typo ('multy' should be 'multi'). Log directories and published artifact names will inherit this misspelling.
| interface: $(roce_iface_cx6) | ||
| tls: "rc_x" | ||
| extra_run_args: "--env server UCX_IB_NUM_PATHS=2 --env client UCX_IB_NUM_PATHS=1" | ||
| test_name: tag_cx4_rc_client_one_path |
There was a problem hiding this comment.
test_name: tag_cx4_rc_client_one_path is carried over from the old 'client one path' test but the entry is now named 'different UCX_IB_NUM_PATH'. The log directory and artifact will be misleadingly labelled. Align the test_name with the new entry's purpose.
What
add test for multi HPCs test for iodemo
(cherry-pick from openucx#7971)