Skip to content

AZP: Add multi HCAs iodemo test#261

Closed
Alexey-Rivkin wants to merge 1 commit intoyosefe:integration3from
Alexey-Rivkin:integration3
Closed

AZP: Add multi HCAs iodemo test#261
Alexey-Rivkin wants to merge 1 commit intoyosefe:integration3from
Alexey-Rivkin:integration3

Conversation

@Alexey-Rivkin
Copy link
Copy Markdown

What

add test for multi HPCs test for iodemo
(cherry-pick from openucx#7971)

Comment thread buildlib/pr/io_demo/io-demo.yml Outdated
initial_delay: 9999 # No interference
test_name: am_cx6_rc
extra_run_args: ""
"tag match on CX6/DC":
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to remove DC and AM tests from this branch. keep only tag match on CX4/RC

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but this PR was going to fix tag match on CX6/RC

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, keep cx6 but remove AM and DC below

Comment thread buildlib/pr/io_demo/io-demo.yml
Comment thread buildlib/pr/io_demo/io-demo.yml Outdated
tls: "dc_x"
test_name: tag_cx6_dc
extra_run_args: ""
"active messages on CX6/DC with data validation":
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this case

Comment thread buildlib/pr/io_demo/io-demo.yml Outdated
initial_delay: 9999 # No interference
test_name: tag_umemh_cx4_rc
extra_run_args: ""
"active messages on CX6/DC with user memh":
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this case

@NirWolfer
Copy link
Copy Markdown

NirWolfer commented May 5, 2026

agent-review

NirWolfer

This comment was marked as outdated.

NirWolfer

This comment was marked as outdated.

NirWolfer

This comment was marked as outdated.

NirWolfer

This comment was marked as outdated.

NirWolfer

This comment was marked as outdated.

@NirWolfer
Copy link
Copy Markdown

agent-review

Copy link
Copy Markdown

@NirWolfer NirWolfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

8 findings — 🚨 3 blockers, ⚠️ 5 minors

test_ucx_tls: ${{ test.Value.tls }}
test_extra_run_args: ${{ test.Value.extra_run_args }}
initial_delay: ${{ coalesce(test.Value.initial_delay, parameters.initial_delay) }}
test_interference: ${{ test.Value.interference }}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚨 BLOCKER

The matrix entries use interference as an optional key, but coalesce is not applied here (unlike initial_delay on line 122). If a test entry omits interference, ${{ test.Value.interference }} expands to an empty string and $(test_interference) will be empty when passed to az-stage-io-demo.yaml, causing the eq(variables['interference'], 'Yes') condition to silently skip the corrupter. Add a coalesce(test.Value.interference, 'Yes') default, matching the pattern used for initial_delay.

net_device=$(ibdev2netdev | sed -ne 's/\(\w*\) port \([0-9]\) ==> '${interface}' .*/\1:\2/p')
net_devices="${net_devices} ${net_device}"
done
export UCX_NET_DEVICES=$(echo ${net_devices##*( )} | tr " ", ",")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚨 BLOCKER

tr " " "," passes a three-character second argument ("," includes the comma and quote-artefact). The correct invocation is tr ' ' ','. As written, the UCX_NET_DEVICES value will be malformed and UCX will fail to find the device.

addr=$(echo ${host_ip} | cut -d: -f2)
if [ -n "${host}" ] && [ -n "${addr}" ]
then
ip_address_per_host[${host}${net_if}]=${addr}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚨 BLOCKER

The associative-array key ${host}${net_if} has no delimiter. A host named node1 with interface eth0 produces the same key as host node + interface 1eth0. Use a separator that cannot appear in either value, e.g. ${host}:${net_if}, and apply the same change at every lookup site (lines 465, 676).

$(workspace)/buildlib/az-network-corrupter.sh reset=yes interface=${{ parameters.roce_iface }}
echo "##vso[task.setvariable variable=interference]${{ parameters.interference }}"
for interface in ${{ parameters.roce_iface }} ; do
$(workspace)/buildlib/az-network-corrupter.sh reset=yes interfaces=${interface}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ MINOR

The reset loop calls az-network-corrupter.sh once per interface sequentially. If one interface fails, subsequent interfaces are not reset. The original script already handles multi-interface iteration internally after this PR; passing the full space-separated list as a single interfaces= value would let the script handle it atomically and be consistent with how the corrupter loop is invoked.

duration: 60
interface: $(roce_iface_cx4) $(roce_iface_cx6)
tls: "rc_x"
test_name: multy_cx6_cx4_rc
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ MINOR

Test name multy_cx6_cx4_rc is a typo; should be multi_cx6_cx4_rc. Artifact names and analyzer output will inherit the misspelling.

initial_delay: 9999 # No interference
test_name: tag_umemh_cx4_rc
extra_run_args: ""
"Multi HCAs: tag match on CX4 and CX6 RC ":
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ MINOR

The matrix entry name 'Multi HCAs: tag match on CX4 and CX6 RC ' has a trailing space. YAML allows it but it can cause surprising behaviour if the key is referenced or displayed.

test_name: multy_cx6_cx4_rc
extra_run_args: ""
"different UCX_IB_NUM_PATH":
args: "-A"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ MINOR

args: "-A" for the 'different UCX_IB_NUM_PATH' test case — the args field is not a declared parameter in az-stage-io-demo.yaml and is not forwarded anywhere in the matrix expansion. If -A is an intended io_demo flag it must be placed in extra_run_args; as written it is silently ignored.

for net_if in $(split_list ${net_ifs})
do
port_num=$((base_port_num + i))
client_connect_list+=" ${ip_address_per_host[${host}${net_if}]}:${port_num}"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ MINOR

The server-port list now adds one host:port entry per (server, interface) pair. If the intent is that each server listens on all interfaces at the same port, this multiplies the connect list incorrectly. Confirm whether io_demo expects one entry per server or one per (server × interface).

@NirWolfer
Copy link
Copy Markdown

agent-review

NirWolfer

This comment was marked as outdated.

@NirWolfer
Copy link
Copy Markdown

agent-review

Copy link
Copy Markdown

@NirWolfer NirWolfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

⚠️ MINOR buildlib/pr/io_demo/az-stage-io-demo.yaml:64

tr " ", "," passes " ," (space + comma) as tr's source set instead of the intended " " (space only). The result is that both spaces and commas are mapped to commas. For the current input (net_devices is built with space separators and will never contain commas) this is harmless, but the extra comma is a typo and should be tr " " ",".

@NirWolfer
Copy link
Copy Markdown

agent-review

Copy link
Copy Markdown

@NirWolfer NirWolfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

2 findings — 🚨 1 blocker, ⚠️ 1 minor

2 findings posted as inline comments.

@NirWolfer
Copy link
Copy Markdown

agent-review

Copy link
Copy Markdown

@NirWolfer NirWolfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

8 findings — 🚨 4 blockers, ⚠️ 4 minors

8 findings posted as inline comments.

do
port_num=$((base_port_num + i))
client_connect_list+=" ${ip_address_per_host[${host}]}:${port_num}"
for net_if in $(split_list ${net_ifs})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚨 BLOCKER

The inner loop over interfaces inside the server-slot loop means each server slot i (listening on a single port base_port_num + i) gets one client_connect_list entry per interface, all pointing to the same port but different IPs. With num_tasks=2 (two NICs) and num_servers=2, the list becomes ip_cx4:20000 ip_cx6:20000 ip_cx4:20001 ip_cx6:20001 — four entries for two server processes. Clients are told to connect to four addresses but only two servers exist. The correct model for multi-NIC is one server per NIC, each bound to its own IP, so the connect list should have exactly one entry per server (matching its bound interface). The nesting order should be inverted: outer loop over interfaces, inner loop over server slots per interface, with each interface's server using a distinct port range offset by interface index.

test_ucx_tls: ${{ test.Value.tls }}
test_extra_run_args: ${{ test.Value.extra_run_args }}
initial_delay: ${{ coalesce(test.Value.initial_delay, parameters.initial_delay) }}
test_interference: ${{ test.Value.interference }}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚨 BLOCKER

Tests that omit the interference key (e.g. 'tag match on CX4', 'tag match on CX6/RC') will expand ${{ test.Value.interference }} to an empty string. The template parameter default of 'Yes' is never applied because an explicit (empty) value is passed, so $(test_interference) is empty and eq(variables['interference'], 'Yes') evaluates to false — the network corrupter silently never starts for those tests. Fix: test_interference: ${{ coalesce(test.Value.interference, 'Yes') }}, mirroring the initial_delay pattern on the line above.

net_device=$(ibdev2netdev | sed -ne 's/\(\w*\) port \([0-9]\) ==> '${interface}' .*/\1:\2/p')
net_devices="${net_devices} ${net_device}"
done
export UCX_NET_DEVICES=$(echo ${net_devices##*( )} | tr " ", ",")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚨 BLOCKER

${net_devices##*( )} uses an extglob pattern to strip leading spaces, but extglob is not enabled by default in bash. Without shopt -s extglob the literal string *( ) is matched instead of 'zero or more spaces', so the stripping silently fails and UCX_NET_DEVICES gets a leading space. Use a POSIX-portable alternative such as echo ${net_devices} | xargs or echo ${net_devices} | sed 's/^ *//'.

- bash: |
set -eEx
$(workspace)/buildlib/az-network-corrupter.sh reset=yes interface=${{ parameters.roce_iface }}
echo "##vso[task.setvariable variable=interference]${{ parameters.interference }}"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚨 BLOCKER

The interference pipeline variable is set via ##vso[task.setvariable] inside the first 'Restore port state' step, which runs under set -eEx and may exit early if a per-interface corrupter-reset call fails before the echo. If that happens, $(test_interference) is never set and all downstream eq(variables['interference'], 'Yes') conditions evaluate to false, silently suppressing the corrupter for all subsequent steps. The variable should be set in a dedicated, unconditional step that precedes any fallible commands.

net_device=$(ibdev2netdev | sed -ne 's/\(\w*\) port \([0-9]\) ==> '${interface}' .*/\1:\2/p')
net_devices="${net_devices} ${net_device}"
done
export UCX_NET_DEVICES=$(echo ${net_devices##*( )} | tr " ", ",")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ MINOR

tr " ", "," passes the two-character set " ," (space and comma) as tr's source, mapping both spaces and commas to commas. The intended translation is space→comma only: tr " " ",". With the current input this is benign (no commas in net_devices), but it is a typo.

set -eEx
./autogen.sh
./contrib/configure-devel --prefix=$(Build.Repository.LocalPath)/install --without-java
./contrib/configure-devel --prefix=$(Build.Repository.LocalPath)/install
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ MINOR

--without-java was removed from the configure line. If the centos7 build container does not have a JDK installed, the build will fail trying to compile Java bindings. This change should be intentional and verified against the container image, or --without-java should be restored.

duration: 60
interface: $(roce_iface_cx4) $(roce_iface_cx6)
tls: "rc_x"
test_name: multy_cx6_cx4_rc
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ MINOR

test_name: multy_cx6_cx4_rc contains a typo ('multy' should be 'multi'). Log directories and published artifact names will inherit this misspelling.

interface: $(roce_iface_cx6)
tls: "rc_x"
extra_run_args: "--env server UCX_IB_NUM_PATHS=2 --env client UCX_IB_NUM_PATHS=1"
test_name: tag_cx4_rc_client_one_path
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ MINOR

test_name: tag_cx4_rc_client_one_path is carried over from the old 'client one path' test but the entry is now named 'different UCX_IB_NUM_PATH'. The log directory and artifact will be misleadingly labelled. Align the test_name with the new entry's purpose.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants