{"id":1066,"date":"2026-05-14T16:59:50","date_gmt":"2026-05-14T06:59:50","guid":{"rendered":"https:\/\/catharsis.net.au\/blog\/?p=1066"},"modified":"2026-05-14T17:35:02","modified_gmt":"2026-05-14T07:35:02","slug":"nvidea-dgx-spark-benchmarking","status":"publish","type":"post","link":"https:\/\/catharsis.net.au\/blog\/nvidea-dgx-spark-benchmarking\/","title":{"rendered":"From 96 to 340 Gbit\/s: Benchmarking a Two-Node DGX Spark Cluster"},"content":{"rendered":"\n<p class=\"has-pale-pink-background-color has-background\">How we wired two NVIDIA DGX Spark workstations through a MikroTik CRS804-4DDQ switch, discovered four hardware ceilings, and ended up moving 340 Gbit\/s full-duplex between GPUs with the CPU sitting at idle.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The setup<\/h2>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"We Hit 340 Gbit\/s Between Two NVIDIA DGX Sparks - Demo\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/xQrUVveIdEQ?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<p>Two NVIDIA DGX Spark workstations sitting next to each other in the lab, each with an NVIDIA GB10 superchip (Grace ARM CPU + Blackwell GPU, unified LPDDR5X memory). Each Spark has two 200 GbE QSFP56 ports on a ConnectX-7 SmartNIC for clustering, plus a 10 GbE RJ45 for management.<\/p>\n\n\n\n<p>Sitting between them: a MikroTik CRS804-4DDQ \u2014 a four-port 400 G QSFP-DD switch built on a Marvell 98DX7335 ASIC. Each physical port can carry a single 400 G connection or be split into 2\u00d7 200 G or 4\u00d7 100 G. 
It runs RouterOS 7.22.<\/p>\n\n\n\n<p>The question was simple: <em>what&#8217;s the best inter-node bandwidth we can actually get?<\/em><\/p>\n\n\n\n<p>The answer turned out to be layered, because TCP, RDMA, and the silicon all have different ceilings.<\/p>\n\n\n\n<p>Special thanks to Wireless Professional Solutions for supplying NVIDIA DGX Spark hardware \u2014 see specs, pricing, and availability: <a href=\"https:\/\/wisp.net.au\/nvidia-dgx-spark.html\" target=\"_blank\" rel=\"noreferrer noopener\">Link<\/a><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Stage 1 \u2014 Just plug it in and ping<\/h2>\n\n\n\n<p>Two 200 G DACs, one from each DGX, into the same physical QSFP-DD port on the switch using a 2\u00d7 200 G breakout. Both DGXes get DHCP addresses on <code>192.168.88.0\/24<\/code> from RouterOS. Management interface on the laptop comes up on the same subnet. Pings work. Done.<\/p>\n\n\n\n<p>This was the easy part. Then we ran our first benchmark.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>iperf3 -c &lt;peer&gt; -P 16 -t 20<\/code><\/pre>\n\n\n\n<p>Result: <strong>96.2 Gbit\/s.<\/strong> Half of the 200 G line rate. The expected number on an untuned setup.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Stage 2 \u2014 Why is the TCP number so low?<\/h2>\n\n\n\n<p>At 200 Gbit\/s, the CPU spends most of its budget on <strong>packets per second<\/strong>, not bytes. With the default Ethernet MTU of 1500, every gigabit of throughput costs around 80,000 packets&#8217; worth of header processing, copies, and interrupts. 
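<\/p>\n\n\n\n<p>The 80,000 figure is just framing arithmetic. A back-of-envelope sketch (assuming the standard ~38 bytes of per-frame Ethernet overhead \u2014 preamble, header, FCS, inter-frame gap):<\/p>

```python
# Back-of-envelope packet-rate cost of TCP at a given MTU.
# The 38-byte per-frame overhead assumes standard Ethernet framing:
# preamble/SFD (8) + header (14) + FCS (4) + inter-frame gap (12).

def packets_per_gbit(mtu: int, overhead: int = 38) -> float:
    """Packets per second needed to carry 1 Gbit/s at a given MTU."""
    bits_per_frame = (mtu + overhead) * 8
    return 1e9 / bits_per_frame

pps_1500 = packets_per_gbit(1500)   # ~81,000 packets/s per Gbit/s
pps_9000 = packets_per_gbit(9000)   # ~13,800 packets/s per Gbit/s
```

<p>At 200 Gbit\/s that is roughly 16 million packets per second at MTU 1500, versus about 2.8 million with jumbo frames \u2014 which is why MTU 9000 is the first knob to turn.<\/p>\n\n\n\n<p>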
On a Grace CPU this caps out fast.<\/p>\n\n\n\n<p>Three knobs needed turning:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>MTU 9000 (jumbo frames)<\/strong> on every hop \u2014 host NICs and the switch ports.<\/li>\n\n\n\n<li><strong>TCP socket buffers<\/strong> large enough to accommodate the bandwidth-delay product (200 Gbit\/s \u00d7 1 ms RTT \u2248 25 MB per stream).<\/li>\n\n\n\n<li><strong>NIC offloads<\/strong> (TSO, GSO, GRO) on, which they already were.<\/li>\n<\/ol>\n\n\n\n<p><strong>MikroTik side<\/strong> turned out to be more involved than just <code>set mtu=9000<\/code>. RouterOS 7 clamps MTU to the configured L2MTU, and the L2MTU on a 200 G port was sitting at 1584 by default. The working ports on the same switch had explicit overrides for everything \u2014 auto-negotiation off, speed forced to <code>200G-baseCR4<\/code>, FEC mode <code>fec91<\/code>. We replicated that config on the new ports:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/interface\/ethernet\/set qsfp56-dd-2-1 \\\n    auto-negotiation=no \\\n    speed=200G-baseCR4 \\\n    fec-mode=fec91 \\\n    l2mtu=9216 mtu=9000<\/code><\/pre>\n\n\n\n<p><strong>Host side<\/strong> got a sysctl bump:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>sysctl -w net.core.rmem_max=536870912\nsysctl -w net.core.wmem_max=536870912\nsysctl -w net.ipv4.tcp_rmem=\"4096 87380 134217728\"\nsysctl -w net.ipv4.tcp_wmem=\"4096 65536 134217728\"\nsysctl -w net.core.netdev_max_backlog=300000\nip link set &lt;iface&gt; mtu 9000<\/code><\/pre>\n\n\n\n<p>End-to-end verification with <code>ping -M do -s 8972<\/code> \u2014 full-size jumbo, don&#8217;t-fragment flag set \u2014 went through cleanly with sub-millisecond RTT.<\/p>\n\n\n\n<p>We re-ran iperf3 with the same <code>-P 16<\/code> and got&#8230; <strong>98 Gbit\/s<\/strong>.<\/p>\n\n\n\n<p>Tuning helped exactly 2%. 
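<\/p>\n\n\n\n<p>Before blaming the network, it&#8217;s worth checking that window space wasn&#8217;t the limit. The 25 MB figure from knob 2 is just the bandwidth-delay product, and the configured buffer maxima dwarf it \u2014 a quick sketch of the arithmetic (using the pessimistic 1 ms RTT from above):<\/p>

```python
def bdp_bytes(rate_gbit_s: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to keep the pipe full."""
    return rate_gbit_s * 1e9 * rtt_s / 8

# 200 Gbit/s at a pessimistic 1 ms RTT: 25 MB per stream.
per_stream = bdp_bytes(200, 1e-3)            # 25_000_000 bytes

# The configured tcp_rmem/tcp_wmem max (128 MiB) covers it ~5x over.
headroom = 134_217_728 / per_stream          # ~5.4
```

<p>So socket buffers were not the limit \u2014 consistent with tuning buying only 2%.<\/p>\n\n\n\n<p>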
Something else was the bottleneck.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Stage 3 \u2014 Discovering the multi-host architecture<\/h2>\n\n\n\n<p>Each DGX Spark reported two active 200 G interfaces (<code>enp1s0f0np0<\/code> and <code>enP2p1s0f0np0<\/code>), both at 200 Gbit\/s, both with separate MAC addresses. The switch FDB confirmed both MACs appeared on the <em>same<\/em> switch sub-port:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>4C:BB:47:83:75:FD  qsfp56-dd-1-5\n4C:BB:47:83:76:01  qsfp56-dd-1-5    \u2190 same port!<\/code><\/pre>\n\n\n\n<p>So the DGX Spark presents its single physical 200 G port as <strong>two logical PCIe interfaces<\/strong> with different MACs. This is ConnectX-7&#8217;s &#8220;Multi-Host&#8221; feature \u2014 the same NIC silicon is accessible via two PCIe paths (PCIe domains <code>0000:01<\/code> and <code>0002:01<\/code>), and each path gets a fraction of the bandwidth.<\/p>\n\n\n\n<p>Empirically we confirmed each logical interface caps at roughly <strong>100 Gbit\/s<\/strong> of TCP throughput. The two halves of the 200 G physical pipe are usable independently:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Test<\/th><th>Throughput<\/th><\/tr><\/thead><tbody><tr><td>Single TCP stream, single logical NIC<\/td><td>21 Gbit\/s<\/td><\/tr><tr><td>16-stream TCP, single logical NIC<\/td><td>98 Gbit\/s<\/td><\/tr><tr><td>16-stream TCP, both logical NICs in parallel<\/td><td><strong>177 Gbit\/s<\/strong><\/td><\/tr><tr><td>RDMA <code>ib_send_bw -q 8<\/code>, single logical NIC<\/td><td>99 Gbit\/s<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>That parallel-NIC number \u2014 177 Gbit\/s \u2014 is 88% of the 200 G physical wire on a single cable, which is what production-tuned 200 GbE TCP looks like on a Grace CPU. Good. 
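<\/p>\n\n\n\n<p>To see the same multi-host split on your own box, the PCIe path behind each interface is visible in sysfs. A minimal stdlib sketch (the <code>0000<\/code>\/<code>0002<\/code> domains are what our Sparks showed; paths follow the standard Linux sysfs layout):<\/p>

```python
import glob
import os

def pcie_domain(pci_addr: str) -> str:
    """Extract the PCIe domain from an address like '0002:01:00.0' -> '0002'."""
    return pci_addr.split(":")[0]

def interface_pcie_map(sysfs_root: str = "/sys/class/net") -> dict:
    """Map interface name -> PCIe address by resolving the sysfs 'device' symlink."""
    result = {}
    for link in glob.glob(os.path.join(sysfs_root, "*", "device")):
        iface = os.path.basename(os.path.dirname(link))
        result[iface] = os.path.basename(os.path.realpath(link))
    return result
```

<p>Two interfaces whose addresses resolve under different domains but sit behind one physical port \u2014 as <code>enp1s0f0np0<\/code> and <code>enP2p1s0f0np0<\/code> do here \u2014 are two PCIe paths into the same NIC silicon.<\/p>\n\n\n\n<p>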
But we were still leaving 23 Gbit\/s on the table, and there was nothing left to tune host-side without going to RDMA.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Stage 4 \u2014 Adding a second cable<\/h2>\n\n\n\n<div class=\"wp-block-group is-layout-grid wp-container-core-group-is-layout-9649a0d9 wp-block-group-is-layout-grid\">\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"576\" height=\"1024\" src=\"https:\/\/catharsis.net.au\/blog\/wp-content\/uploads\/2026\/05\/asd-576x1024.jpg\" alt=\"\" class=\"wp-image-1072\" srcset=\"https:\/\/catharsis.net.au\/blog\/wp-content\/uploads\/2026\/05\/asd-576x1024.jpg 576w, https:\/\/catharsis.net.au\/blog\/wp-content\/uploads\/2026\/05\/asd-169x300.jpg 169w, https:\/\/catharsis.net.au\/blog\/wp-content\/uploads\/2026\/05\/asd-768x1365.jpg 768w, https:\/\/catharsis.net.au\/blog\/wp-content\/uploads\/2026\/05\/asd-865x1536.jpg 865w, https:\/\/catharsis.net.au\/blog\/wp-content\/uploads\/2026\/05\/asd-1153x2048.jpg 1153w, https:\/\/catharsis.net.au\/blog\/wp-content\/uploads\/2026\/05\/asd-scaled.jpg 1441w\" sizes=\"auto, (max-width: 576px) 100vw, 576px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"577\" height=\"1024\" src=\"https:\/\/catharsis.net.au\/blog\/wp-content\/uploads\/2026\/05\/cables-577x1024.jpg\" alt=\"\" class=\"wp-image-1073\" srcset=\"https:\/\/catharsis.net.au\/blog\/wp-content\/uploads\/2026\/05\/cables-577x1024.jpg 577w, https:\/\/catharsis.net.au\/blog\/wp-content\/uploads\/2026\/05\/cables-169x300.jpg 169w, https:\/\/catharsis.net.au\/blog\/wp-content\/uploads\/2026\/05\/cables-768x1363.jpg 768w, https:\/\/catharsis.net.au\/blog\/wp-content\/uploads\/2026\/05\/cables-865x1536.jpg 865w, https:\/\/catharsis.net.au\/blog\/wp-content\/uploads\/2026\/05\/cables.jpg 888w\" sizes=\"auto, (max-width: 577px) 100vw, 577px\" \/><\/figure>\n<\/div>\n\n\n\n<p>The plan was to plug a second cable from each 
DGX into the switch&#8217;s free 400 G ports and see whether we could hit 400 Gbit\/s aggregate. Each DGX has two physical 200 G ports total; only one was wired up. Plug the other one in, configure the new switch ports to match the working ones, and you&#8217;ve doubled the available wire bandwidth.<\/p>\n\n\n\n<p>After plugging, the switch saw new transceivers but no link. Same problem as before: the new ports had RouterOS defaults (auto-neg yes, FEC auto), and the ConnectX-7 DACs need the manual overrides. After applying the same config:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>status=link-ok    qsfp56-dd-2-1\nstatus=link-ok    qsfp56-dd-2-5<\/code><\/pre>\n\n\n\n<p>Both new cables came up at 200 G. Total wire capacity now 400 Gbit\/s per DGX, 800 Gbit\/s aggregate full-duplex on the cluster.<\/p>\n\n\n\n<p>Then we re-ran the TCP test with <strong>four parallel iperf3 streams<\/strong>, one per logical NIC, each pinned to disjoint CPU sets:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Streams<\/th><th>Aggregate TCP<\/th><\/tr><\/thead><tbody><tr><td>2 (single cable, both logical NICs)<\/td><td>177 Gbit\/s<\/td><\/tr><tr><td>4 (both cables, all four logical NICs)<\/td><td><strong>171 Gbit\/s<\/strong><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Adding the second cable made TCP <em>slightly worse<\/em>. Each stream got a smaller slice of CPU; the cables themselves never came close to saturating.<\/p>\n\n\n\n<p>The Grace CPU&#8217;s TCP stack tops out around 170\u2013180 Gbit\/s of aggregate throughput, regardless of how many physical cables are available. 
That&#8217;s the <strong>kernel TCP ceiling<\/strong>, not a network ceiling.<\/p>\n\n\n\n<p>To go past it we had to get rid of TCP entirely.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Stage 5 \u2014 NCCL and RDMA: the real measurement<\/h2>\n\n\n\n<p>The &#8220;right&#8221; way to measure inter-node bandwidth on a DGX is the way DGX clusters actually communicate: <strong>NCCL<\/strong> (NVIDIA Collective Communications Library) running over <strong>RoCEv2<\/strong> (RDMA over Converged Ethernet). This is the protocol stack NCCL uses for multi-GPU all-reduce, all-gather, sendrecv, etc. \u2014 every multi-node ML training job in the world is built on it.<\/p>\n\n\n\n<p>Setup on both DGXes:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>sudo apt install -y libnccl2 libnccl-dev libopenmpi-dev openmpi-bin\ngit clone https:\/\/github.com\/NVIDIA\/nccl-tests\ncd nccl-tests &amp;&amp; make -j8 MPI=1 \\\n    CUDA_HOME=\/usr\/local\/cuda \\\n    MPI_HOME=\/usr\/lib\/aarch64-linux-gnu\/openmpi \\\n    NCCL_HOME=\/usr<\/code><\/pre>\n\n\n\n<p>Plus passwordless SSH between the two hosts (mpirun&#8217;s launch agent), and the four RoCE devices listed explicitly:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>mpirun -np 2 -H 192.168.88.248:1,192.168.88.252:1 \\\n    --mca plm_rsh_agent ssh \\\n    -x NCCL_IB_HCA=rocep1s0f0,rocep1s0f1,roceP2p1s0f0,roceP2p1s0f1 \\\n    -x LD_LIBRARY_PATH=\/usr\/local\/cuda\/lib64:\/usr\/lib\/aarch64-linux-gnu \\\n    .\/build\/sendrecv_perf -b 256M -e 4G -f 2 -g 1 -n 30 -w 5<\/code><\/pre>\n\n\n\n<p>Result at 4 GB messages:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#       size         busbw (GB\/s)\n   268435456            20.14\n   536870912            20.26\n  1073741824            20.28\n  2147483648            20.25\n  4294967296            20.38<\/code><\/pre>\n\n\n\n<p><strong>20.38 GB\/s busbw<\/strong>. 
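<\/p>\n\n\n\n<p>Converting NCCL&#8217;s <code>busbw<\/code> into wire numbers is simple arithmetic \u2014 for a 2-rank sendrecv, busbw equals the per-direction data rate:<\/p>

```python
def busbw_to_gbit(busbw_gb_s: float) -> float:
    """NCCL reports busbw in GB/s; x8 gives Gbit/s per direction
    (for a 2-rank sendrecv, busbw equals the per-direction rate)."""
    return busbw_gb_s * 8

per_direction = busbw_to_gbit(20.38)   # ~163 Gbit/s each way
full_duplex = 2 * per_direction        # ~326 Gbit/s aggregate
```

<p>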
In NCCL&#8217;s accounting for a 2-rank sendrecv, that means <strong>163 Gbit\/s of data crossing the network in each direction simultaneously<\/strong>. Full-duplex aggregate: <strong>326 Gbit\/s.<\/strong><\/p>\n\n\n\n<p>Live PHY-layer counters (read straight from the ConnectX-7 silicon via <code>ethtool -S | grep _bytes_phy<\/code>) confirmed it: ~84 Gbit\/s TX + ~84 Gbit\/s RX on each of the two cables, peaking at <strong>340 Gbit\/s of bytes physically moving across the wire<\/strong>.<\/p>\n\n\n\n<p>The CPU was idle the entire time.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Why CPU is idle while 340 Gbit\/s flies through the box<\/h2>\n\n\n\n<p>This is the whole point of RDMA, and it&#8217;s worth understanding because it explains why you can&#8217;t iperf your way to honest DGX-class bandwidth numbers.<\/p>\n\n\n\n<p><strong>TCP path<\/strong> (what iperf3 measures):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>GPU memory \u2192 copy to CPU buffer \u2192 kernel TCP stack \u2192 syscall \u2192\ndriver \u2192 NIC \u2192 wire<\/code><\/pre>\n\n\n\n<p>Every byte traverses the kernel. Header processing, checksums, segmentation, retransmit queues, copy_from_user \u2014 all of it eats CPU cycles. At 200 Gbit\/s the CPU is fully consumed packetizing.<\/p>\n\n\n\n<p><strong>RDMA path<\/strong> (what NCCL uses):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>GPU memory \u2500\u2500PCIe DMA\u2500\u2500\u2192 NIC \u2500\u2500wire\u2500\u2500\u2192 NIC \u2500\u2500PCIe DMA\u2500\u2500\u2192 GPU memory<\/code><\/pre>\n\n\n\n<p>NCCL posts a single work-queue entry into the ConnectX-7&#8217;s hardware: &#8220;read 4 GB from address X, deliver to remote address Y.&#8221; The NIC&#8217;s DMA engine fetches the data straight out of memory, packetizes in silicon, pushes to the wire, and signals completion via an interrupt. <strong>The kernel doesn&#8217;t see the bytes. No copies. 
No syscalls per packet.<\/strong><\/p>\n\n\n\n<p>On a DGX Spark this is even more elegant than on a discrete-GPU system, because the GB10 has <strong>unified LPDDR5X memory<\/strong> \u2014 the GPU and CPU share the same physical RAM. So &#8220;GPU memory&#8221; is just an address range, and the NIC DMAs straight into and out of the unified pool. There&#8217;s no PCIe-CPU-PCIe-GPU bounce buffer.<\/p>\n\n\n\n<p>What the GPU is &#8220;doing&#8221; during the test isn&#8217;t compute \u2014 it&#8217;s orchestration. NCCL runs tiny CUDA kernels that build the RDMA work requests and poll completion queues. The actual data movement is silicon.<\/p>\n\n\n\n<p>The CPU&#8217;s contribution is microseconds of work to post the initial work request. After that, it goes idle until the NIC raises a completion interrupt some seconds later.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Final results<\/h2>\n\n\n\n<p>Same hardware, same cables, three protocols:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>Protocol<\/th><th>Per-direction<\/th><th>Full-duplex aggregate<\/th><th>CPU usage<\/th><th>What&#8217;s limiting<\/th><\/tr><\/thead><tbody><tr><td>TCP \u2014 single stream<\/td><td>21 Gbit\/s<\/td><td>42 Gbit\/s<\/td><td>One core saturated<\/td><td>Single-stream TCP window\/CPU<\/td><\/tr><tr><td>TCP \u2014 16 parallel streams, 1 cable<\/td><td>98 Gbit\/s<\/td><td>196 Gbit\/s<\/td><td>Heavy<\/td><td>iperf3 single-thread data path<\/td><\/tr><tr><td>TCP \u2014 2 logical NICs parallel, 1 cable<\/td><td>177 Gbit\/s<\/td><td>354 Gbit\/s<\/td><td>Heavy<\/td><td>Grace CPU TCP stack<\/td><\/tr><tr><td>TCP \u2014 4 logical NICs, 2 cables<\/td><td>172 Gbit\/s<\/td><td>344 Gbit\/s<\/td><td>Heavy<\/td><td>Same \u2014 adding cables doesn&#8217;t help<\/td><\/tr><tr><td>NCCL\/RDMA \u2014 2 cables<\/td><td><strong>163 Gbit\/s<\/strong><\/td><td><strong>326 
Gbit\/s<\/strong><\/td><td><strong>Idle<\/strong><\/td><td>Approaching wire speed<\/td><\/tr><tr><td>PHY-layer observed peak<\/td><td>\u2014<\/td><td><strong>340 Gbit\/s<\/strong><\/td><td>Idle<\/td><td>The cables themselves<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>A few takeaways:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>TCP is the wrong measurement tool above ~150 Gbit\/s on ARM.<\/strong> The Grace CPU&#8217;s TCP stack tops out around 170\u2013180 Gbit\/s regardless of how many cables you have. This isn&#8217;t a tuning issue \u2014 it&#8217;s the cost of running TCP through the kernel.<\/li>\n\n\n\n<li><strong>Adding cables doesn&#8217;t help TCP.<\/strong> Going from one cable to two doubled wire capacity but TCP throughput stayed flat. The bottleneck moved from the wire to the kernel, and you can&#8217;t tune past it.<\/li>\n\n\n\n<li><strong>NCCL\/RDMA delivers what the silicon promises.<\/strong> 326 Gbit\/s of full-duplex aggregate \u2014 about 41% of the two cables&#8217; 800 G theoretical full-duplex capacity, moved in both directions simultaneously \u2014 with the CPU asleep. That&#8217;s the protocol DGX clusters were built for.<\/li>\n\n\n\n<li><strong>The second cable doubles redundancy and full-duplex capacity, not single-direction throughput.<\/strong> Each cable is 200 G in each direction (400 G full-duplex per cable). The pair gives 800 G full-duplex of theoretical capacity. NCCL was using about 40% of that with default settings. There&#8217;s almost certainly more available with PFC tuning for RoCEv2 lossless mode, enabling GDR, and channel\/QP increases.<\/li>\n\n\n\n<li><strong>ConnectX-7 Multi-Host is a real architectural choice.<\/strong> Each 200 G physical port presents as two ~100 G logical interfaces to the OS. 
This is great for parallelism but means single-flow apps see half the wire.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Reproducing this<\/h2>\n\n\n\n<p>If you&#8217;ve got two ConnectX-7-equipped machines and a CRS804-4DDQ (or any 400 G switch), the recipe:<\/p>\n\n\n\n<p><strong>Switch:<\/strong> match working port config exactly. Auto-neg off, speed forced, FEC explicit, L2MTU large enough for jumbo:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/interface\/ethernet\/set &lt;port&gt; \\\n    auto-negotiation=no speed=200G-baseCR4 fec-mode=fec91 \\\n    l2mtu=9216 mtu=9000<\/code><\/pre>\n\n\n\n<p><strong>Host:<\/strong> sysctls and MTU:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>sudo ip link set &lt;iface&gt; mtu 9000\nsudo sysctl -w net.core.{r,w}mem_max=536870912 \\\n    net.ipv4.tcp_rmem=\"4096 87380 134217728\" \\\n    net.ipv4.tcp_wmem=\"4096 65536 134217728\" \\\n    net.core.netdev_max_backlog=300000<\/code><\/pre>\n\n\n\n<p>For multi-IP same-subnet on multiple interfaces:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>sudo sysctl -w \\\n    net.ipv4.conf.all.arp_ignore=1 \\\n    net.ipv4.conf.all.arp_announce=2 \\\n    net.ipv4.conf.all.rp_filter=2<\/code><\/pre>\n\n\n\n<p><strong>Benchmark:<\/strong> NCCL is the canonical measurement. Build nccl-tests, set up passwordless SSH between hosts, and:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>mpirun -np 2 -H host1:1,host2:1 \\\n    -x NCCL_IB_HCA=&lt;list of RoCE devices&gt; \\\n    .\/build\/sendrecv_perf -b 256M -e 4G -f 2 -g 1 -n 30 -w 5<\/code><\/pre>\n\n\n\n<p>The reported <code>busbw<\/code> \u00d7 8 is your per-direction wire bandwidth in Gbit\/s.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">The live view<\/h2>\n\n\n\n<p>We also built a small Python tool that runs the NCCL benchmark and renders live PHY-layer throughput bars next to it. 
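<\/p>\n\n\n\n<p>The core of that tool is nothing exotic: sample the PHY counters, difference, divide by the interval. A minimal sketch (the parsing assumes <code>ethtool -S<\/code>&#8217;s usual <code>name: value<\/code> lines; wiring it to <code>subprocess<\/code> and a poll loop is left out):<\/p>

```python
import re

def parse_phy_bytes(ethtool_output: str) -> dict:
    """Pull tx_bytes_phy / rx_bytes_phy out of `ethtool -S <iface>` text."""
    counters = {}
    for m in re.finditer(r"(tx_bytes_phy|rx_bytes_phy):\s*(\d+)", ethtool_output):
        counters[m.group(1)] = int(m.group(2))
    return counters

def gbit_per_s(prev: dict, curr: dict, interval_s: float) -> dict:
    """Per-direction Gbit/s from two counter samples taken interval_s apart."""
    return {k: (curr[k] - prev[k]) * 8 / interval_s / 1e9 for k in curr}
```

<p>Feed it two samples per cable, a second apart, and the TX\/RX bars fall out directly.<\/p>\n\n\n\n<p>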
It reads <code>tx_bytes_phy<\/code> and <code>rx_bytes_phy<\/code> from <code>ethtool -S<\/code> (those are the only counters that see RDMA bytes \u2014 kernel <code>\/sys\/class\/net\/&lt;iface&gt;\/statistics\/tx_bytes<\/code> misses everything that doesn&#8217;t touch the IP stack).<\/p>\n\n\n\n<p>The display shows TX\/RX bars per cable filling from gray to cyan as the test ramps from 256 MB messages up to 4 GB, with a running NOW\/PEAK aggregate at the bottom. Peak observed in our runs: <strong>340 Gbit\/s full-duplex<\/strong> while the CPU sat at single-digit utilization.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">\u2554\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2557\n\u2551   DGX Spark Pair \u2014 Live Inter-Node Bandwidth   (NCCL sendrecv \u00b7 RoCEv2 RDMA)         \u2551\n\u2551   Node A \u21c4 CRS804-4DDQ \u21c4 Node B   |   2\u00d7 200G ConnectX-7 cables                      \u2551\n\u2560\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2563\n\u2551  Cable 1  (200G)   TX 
\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591  82.4 Gb\/s              \u2551\n\u2551                    RX \u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591  85.6 Gb\/s              \u2551\n\u2551  Cable 2  (200G)   TX \u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591  83.1 Gb\/s              \u2551\n\u2551                    RX \u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591\u2591  84.8 Gb\/s              \u2551\n\u2560\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2563\n\u2551   NOW      TX  165.5 Gb\/s  RX  170.4 Gb\/s  Full-duplex  335.9 Gb\/s                   \u2551\n\u2551   PEAK     TX  168.0 Gb\/s  RX  171.2 Gb\/s  Full-duplex  338.8 Gb\/s                   
\u2551\n\u255a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u255d<\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">What&#8217;s next<\/h2>\n\n\n\n<p>There&#8217;s more to extract. Things on the list:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>PFC + ECN for lossless RoCEv2.<\/strong> RouterOS supports priority flow control. With lossless transport NCCL can push closer to line rate without ECN-driven backoff.<\/li>\n\n\n\n<li><strong>GDR (GPUDirect RDMA) wiring.<\/strong> Current runs report <code>GDR 0<\/code> \u2014 the GPU is using a host-side staging step. Enabling true GDR (where the NIC reads directly from GPU memory regions without unified-memory indirection) should help on workloads with non-contiguous tensor layouts.<\/li>\n\n\n\n<li><strong>NCCL_NCHANNELS_PER_NET_PEER tuning.<\/strong> We set this to 4 explicitly; sweeping it may show a sweet spot.<\/li>\n\n\n\n<li><strong>Higher rank counts.<\/strong> Two ranks is the simplest test. 
Three+ ranks would test ring vs tree algorithm choices and reveal collective scaling.<\/li>\n\n\n\n<li><strong><code>ib_write_bw<\/code> with proper QP fanout.<\/strong> As a sanity check against NCCL, raw RDMA write benchmarks should hit similar or slightly higher numbers.<\/li>\n<\/ul>\n\n\n\n<p>For our purposes \u2014 measuring what two DGX Sparks can actually deliver between each other over a small switch \u2014 <strong>326 Gbit\/s of useful workload bandwidth, scaling linearly with two cables, with the CPU idle<\/strong>, is what the platform is designed to do. That&#8217;s the number worth quoting.<\/p>\n\n\n\n<p>The platform delivers exactly what NVIDIA says it does, <em>as long as you measure it with the right protocol.<\/em><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><em>Get this set up \u2014 or build any AI project \u2014 with us<\/em><\/h2>\n\n\n\n<p class=\"has-light-green-cyan-background-color has-background\">The journey above touched every layer of the stack: RouterOS sub-port configuration, ConnectX-7 multi-host quirks, kernel TCP tuning, RoCEv2 setup, NCCL parameter selection, persistent configuration, benchmark methodology. None of it is rocket science. All of it is fiddly enough to silently cap your cluster at half its capability if you skip a step. <\/p>\n\n\n\n<p class=\"has-light-green-cyan-background-color has-background\">We help teams across the full spectrum of AI work \u2014 infrastructure setup, RDMA tuning, NCCL benchmarking, cluster commissioning, model deployment and serving, fine-tuning, retrieval pipelines, and custom AI integrations end-to-end. 
If you&#8217;re standing up DGX hardware, scoping an AI project, or want a sanity check on something already in flight, we&#8217;d be glad to help.<\/p>\n\n\n\n<p><em>Reach us: <a href=\"mailto:info@catharsis.net.au\">info@catharsis.net.au<\/a><\/em><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>How we wired two NVIDIA DGX Spark workstations through a MikroTik CRS804-4DDQ switch, discovered four hardware ceilings, and ended up moving 340 Gbit\/s full-duplex between GPUs with the CPU sitting at idle. The setup Two NVIDIA DGX Spark workstations sitting next to each other in the lab, each with an NVIDIA GB10 superchip (Grace ARM [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":1067,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[69],"tags":[73,74,71,72,70],"class_list":["post-1066","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","tag-ai","tag-artificial-intelligence","tag-dgx","tag-dgx-spark","tag-nvidea"],"_links":{"self":[{"href":"https:\/\/catharsis.net.au\/blog\/wp-json\/wp\/v2\/posts\/1066","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/catharsis.net.au\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/catharsis.net.au\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/catharsis.net.au\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/catharsis.net.au\/blog\/wp-json\/wp\/v2\/comments?post=1066"}],"version-history":[{"count":12,"href":"https:\/\/catharsis.net.au\/blog\/wp-json\/wp\/v2\/posts\/1066\/revisions"}],"predecessor-version":[{"id":1084,"href":"https:\/\/catharsis.net.au\/blog\/wp-json\/wp\/v2\/posts\/1066\/revisions\/1084"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/catharsis.net.au\/blog\/wp-json\/wp\/v2\/media\/1067"}],"wp:attachment":[{"href":"https:\/\/catharsis.net.au
\/blog\/wp-json\/wp\/v2\/media?parent=1066"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/catharsis.net.au\/blog\/wp-json\/wp\/v2\/categories?post=1066"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/catharsis.net.au\/blog\/wp-json\/wp\/v2\/tags?post=1066"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}