Research

InfiniBand, Ethernet, and NVLink: Networking for AI Explained

Three networking technologies. One cluster. Very different tradeoffs.

[01]

Why Network Bandwidth Determines Training Speed

Distributed AI training (running a training job across multiple GPUs or multiple servers) requires GPUs to exchange data constantly. In data-parallel training, each GPU processes a different batch of data and computes gradients; those gradients must be aggregated across all GPUs before the next step begins. In tensor-parallel training, the model itself is split across GPUs, requiring activations to flow between them on every forward and backward pass.

The communication volume scales with model size. For a 70-billion-parameter model, gradient exchange at each step can require hundreds of gigabytes of data movement. If the network cannot move this data quickly enough, GPUs sit idle waiting for communication to complete.
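As a rough illustration, the per-step exchange time can be estimated. The sketch below assumes a ring all-reduce over bf16 gradients and that the per-GPU network link is the bottleneck (ignoring latency terms and overlap with compute); the figures are illustrative, not benchmarks.

```python
def allreduce_time_s(param_count, bytes_per_param, n_gpus, link_gbps):
    """First-order estimate of one ring all-reduce over the gradients.

    A ring all-reduce moves 2 * (N - 1) / N * S bytes through each
    GPU's link, where S is the gradient buffer size in bytes.
    Assumes the per-GPU network link is the bottleneck and ignores
    latency and compute/communication overlap.
    """
    grad_bytes = param_count * bytes_per_param
    per_gpu_bytes = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    link_bytes_per_s = link_gbps * 1e9 / 8  # Gbps -> bytes/s
    return per_gpu_bytes / link_bytes_per_s

# 70B parameters, bf16 gradients (2 bytes), 64 GPUs, 400Gbps links
t = allreduce_time_s(70e9, 2, 64, 400)
print(f"{t:.1f} s per gradient exchange")  # → 5.5 s under these assumptions
```

Without gradient compression or overlap, several seconds of pure communication per step is why link speed translates directly into training throughput.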

This idle time is called communication overhead, and it directly reduces cluster efficiency. A cluster achieving 40% compute utilisation because 60% of its time is spent waiting for network transfers is wasting 60% of its capital cost. Network is not plumbing. It is a primary determinant of whether your cluster runs at 40% or 90% efficiency.

[02]

InfiniBand: The Historical Standard

InfiniBand is a networking technology developed specifically for high-performance computing in the late 1990s. It offers two properties that matter for AI: low latency and Remote Direct Memory Access (RDMA).

RDMA allows one server to read from or write to another server's memory without involving the destination server's CPU. This reduces communication overhead dramatically: instead of CPU-mediated transfers, data moves directly between GPU memories via the network interface.

Current InfiniBand generations: HDR at 200Gbps per port, NDR at 400Gbps per port, and the emerging XDR at 800Gbps. End-to-end latency is approximately 600 nanoseconds in a properly configured cluster. NVIDIA acquired Mellanox, the dominant InfiniBand vendor, in 2020 for $6.9 billion, giving NVIDIA control of the dominant AI-cluster interconnect alongside its GPU dominance. InfiniBand's weaknesses: it is proprietary, expensive (NDR adapters run $2,000-$4,000 per port), and requires specialised switching infrastructure.

[03]

High-Speed Ethernet: The Contender

Ethernet at 400Gbps and 800Gbps has closed much of the latency gap with InfiniBand through two mechanisms. First, RDMA over Converged Ethernet (RoCE) implements RDMA semantics over standard Ethernet, providing the low-latency, CPU-bypass data transfer that made InfiniBand attractive.

Second, improved switch silicon from Broadcom, Marvell, and Intel has reduced switching latency significantly. NVIDIA's Spectrum-X platform, released in 2023 and targeted at AI workloads, delivers 800Gbps per port with congestion-control algorithms designed specifically for all-to-all AI communication patterns.

Spectrum-X800 switches cost approximately $191,000 per unit at current pricing (Q1 2026). Ethernet's advantage is ecosystem breadth: more vendors, more competition, more operational expertise, and compatibility with standard network monitoring tooling.

Its remaining weakness versus InfiniBand is higher tail latency. In large clusters with many concurrent flows, InfiniBand's deterministic latency is difficult to match. For clusters below 512 GPUs, the practical difference is marginal. For 1,000+ GPU clusters doing synchronous training, InfiniBand retains a measurable efficiency advantage.

[04]

NVLink and NVSwitch: Inside the Server

NVLink is NVIDIA's proprietary interconnect that operates entirely within a single server. It is not a replacement for InfiniBand or Ethernet but a complement, operating at a different tier.

Current fifth-generation NVLink, used in Blackwell systems, provides 1.8TB/s of aggregate bandwidth per GPU on a single HGX baseboard. NVSwitch, the routing fabric that connects NVLink ports, allows simultaneous all-to-all communication between all eight GPUs on a node.

The NVL72 rack (72 B200 GPUs in a single rack connected via NVSwitch) is effectively a single 72-GPU supercomputer operating at NVLink bandwidth rather than network bandwidth. Communication within the NVL72 rack happens at 130TB/s aggregate bandwidth; communication leaving the rack via InfiniBand or Ethernet happens at one to two orders of magnitude less. This tiered architecture, with NVLink inside the server and InfiniBand or Ethernet between servers, is standard for current AI clusters.
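The gap between tiers can be made concrete with a back-of-envelope comparison. The payload size below is a hypothetical figure, and the calculation assumes bandwidth alone governs transfer time (ignoring latency and protocol overhead):

```python
# Time to move the same payload at each tier of the fabric.
payload_bytes = 10e9                 # 10 GB of activations (illustrative)

nvlink_bytes_per_s = 1.8e12          # ~1.8 TB/s per-GPU NVLink bandwidth
network_bytes_per_s = 400e9 / 8      # 400Gbps NDR link = 50 GB/s

t_nvlink = payload_bytes / nvlink_bytes_per_s
t_network = payload_bytes / network_bytes_per_s

print(f"NVLink:  {t_nvlink * 1e3:.1f} ms")   # → 5.6 ms
print(f"Network: {t_network * 1e3:.1f} ms")  # → 200.0 ms
print(f"Ratio:   {t_network / t_nvlink:.0f}x")  # → 36x
```

This is why parallelism strategies that communicate heavily (tensor parallelism) are kept inside the NVLink domain, while lighter-weight strategies (data parallelism) span the network tier.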

[05]

Choosing the Right Topology

The practical decision for an operator deploying an AI cluster today involves three variables: cluster scale, workload type, and capital budget. For clusters below 256 GPUs used primarily for inference or fine-tuning: 400Gbps Ethernet with RoCE is cost-effective and operationally simpler. For clusters of 256-1,024 GPUs running large-scale training: InfiniBand NDR or NVIDIA Spectrum-X 800Gbps are both reasonable choices.

For clusters above 1,024 GPUs: InfiniBand NDR or XDR becomes strongly preferred for training, as latency variance at scale compounds into significant efficiency losses. Topology matters too. Fat-tree and Dragonfly topologies minimise the number of hops between any two GPUs in large clusters.
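The scale guidance above can be sketched as a simple lookup. The `recommend_fabric` helper is a hypothetical illustration of this article's thresholds, not a vendor tool; real procurement decisions also weigh budget, availability, and operational expertise.

```python
def recommend_fabric(gpu_count, workload):
    """Map cluster scale and workload to the fabric guidance above.

    Thresholds mirror the article's recommendations; this is a
    decision sketch, not a substitute for a full network design.
    """
    if gpu_count < 256 and workload in ("inference", "fine-tuning"):
        return "400Gbps Ethernet with RoCE"
    if gpu_count <= 1024:
        return "InfiniBand NDR or Spectrum-X 800Gbps Ethernet"
    return "InfiniBand NDR/XDR"

print(recommend_fabric(128, "inference"))
print(recommend_fabric(512, "training"))
print(recommend_fabric(4096, "training"))
```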

Rail-optimised topologies, where each GPU in a server connects to a different spine switch, ensure that node failures do not concentrate network traffic. An incorrectly designed fabric can reduce training throughput by 20-40% on an otherwise well-specified cluster. For detailed network architecture assessment and procurement support, contact Disintermediate at disintermediate.global/contact.

Key Takeaways
01

Network communication overhead directly reduces GPU utilisation: a poorly networked cluster can waste 60% of its compute capital

02

InfiniBand provides ~600ns latency with RDMA; optimal for large-scale synchronous training above 512 GPUs

03

RoCE (RDMA over Ethernet) at 400-800Gbps closes the latency gap with InfiniBand for most sub-1,000-GPU workloads

04

NVLink (1.8TB/s within a server) and cluster networking (InfiniBand/Ethernet) operate at different tiers; both are required

05

Topology design (fat-tree, Dragonfly, rail-optimised) affects training throughput by 20-40% independently of raw bandwidth

Next Steps

This analysis is produced by Disintermediate, drawing on data from The GPU intelligence platform - tracking 2,800+ companies across 72 categories, real-time GPU pricing from 70+ providers, and advisory engagement experience across the GPU infrastructure value chain.