Research

Inside a GPU Cluster: Servers, Networking, Storage, and Cooling

Four layers of infrastructure. One integrated system.

[01]

The Compute Layer: GPU Servers

The foundation of a GPU cluster is the compute node: a server containing GPUs, a host CPU, system RAM, and local NVMe storage. Current-generation compute nodes typically hold 8 GPUs in a 4U to 10U chassis.

NVIDIA's HGX platform, used by Supermicro, Dell, and HPE among others, places 8 B200 GPUs on a baseboard that connects them via NVSwitch, NVIDIA's proprietary all-to-all interconnect. Within a single HGX node, all 8 GPUs share a pooled NVLink fabric delivering 1.8TB/s of bandwidth per GPU.

This internal bandwidth is what makes single-node training of models up to roughly 1TB in size practical. The host CPU (typically dual Intel Xeon or AMD EPYC) handles orchestration, data loading, and network communication.

It is not the computational workhorse; that role belongs entirely to the GPUs. System RAM (typically 1-2TB per node) provides working memory for the CPU and staging for data moving to GPU VRAM. Local NVMe, often four to eight drives at 3.84TB each, provides fast scratch space for model checkpoints and dataset caching. The compute node is not a standalone system: it requires the other three layers to function at full capacity.
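The single-node capacity claim above can be checked with simple arithmetic. A minimal sketch, assuming 8 GPUs per node and 192GB of HBM per Blackwell-class GPU, with an illustrative 20% headroom for activations and optimizer state (all figures are assumptions; check vendor specs for your hardware):

```python
# Hypothetical sketch: does a model fit in one HGX node's pooled VRAM?
# GPU count, VRAM, and overhead figures are illustrative assumptions.

GPUS_PER_NODE = 8
VRAM_PER_GPU_GB = 192          # assumed B200-class HBM3e capacity
OVERHEAD_FRACTION = 0.2        # assumed headroom for activations, optimizer state

def fits_on_one_node(model_size_gb: float) -> bool:
    """True if model weights fit in pooled node VRAM with headroom."""
    usable = GPUS_PER_NODE * VRAM_PER_GPU_GB * (1 - OVERHEAD_FRACTION)
    return model_size_gb <= usable

print(fits_on_one_node(1000))  # a ~1TB model fits in ~1228GB usable: True
print(fits_on_one_node(1300))  # False
```

Under these assumptions, pooled usable VRAM is about 1.2TB, which is why "roughly 1TB" is the practical single-node ceiling.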

[02]

The Network Layer: Cluster Interconnect

A cluster of 1,000 GPUs distributed across 125 servers must communicate as if they are one machine. This requires two distinct networks: a compute fabric for GPU-to-GPU communication, and a storage fabric for data access.

The compute fabric uses either InfiniBand (NDR at 400Gbps per port, HDR at 200Gbps) or high-density Ethernet (400Gbps or 800Gbps). InfiniBand provides lower latency (roughly 600 nanoseconds versus 1-2 microseconds for Ethernet) and has historically dominated AI training clusters.

NVIDIA's acquisition of Mellanox in 2020 consolidated InfiniBand supply under a single vendor. NVIDIA's Spectrum-X Ethernet platform is a newer alternative, designed specifically for AI workloads at 800Gbps. A 128-GPU cluster using NDR InfiniBand requires a spine-leaf topology with at least two spine switches and four to eight leaf switches. Switch costs run to $100,000-$200,000 per unit for high-density AI fabric, and networking capex for a serious cluster typically runs 15-25% of total compute capex.
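The spine-leaf counts above follow from port arithmetic. A hedged sketch, assuming 64-port 400Gbps switches with half the ports facing hosts and half facing spines (a 1:1 non-blocking design; real fabrics vary in port count and oversubscription ratio):

```python
import math

# Illustrative spine-leaf sizing for a non-blocking fabric.
# 64 ports per switch and a 1:1 host/uplink split are assumptions.

def fabric_size(num_gpus: int, ports_per_switch: int = 64):
    host_ports_per_leaf = ports_per_switch // 2   # 1:1 oversubscription
    leaves = math.ceil(num_gpus / host_ports_per_leaf)
    # Each leaf's remaining ports become uplinks, spread across spines.
    spines = math.ceil(leaves * host_ports_per_leaf / ports_per_switch)
    return leaves, spines

print(fabric_size(128))  # (4, 2): four leaves, two spines
```

This reproduces the lower bound in the text: 128 GPUs need at least four leaf switches and two spines; rail-optimised designs push the leaf count higher.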

[03]

The Storage Layer: Parallel Filesystems

AI training reads data at speeds that would overwhelm conventional storage. A 1,000-GPU cluster training a large model may require sustained throughput of 500GB/s to 2TB/s from storage to GPU VRAM. Parallel filesystems (Lustre, IBM Spectrum Scale, WekaFS, or VAST Data) are designed specifically for this access pattern.

They stripe data across tens to hundreds of storage nodes, allowing all nodes to serve data simultaneously. A WekaFS deployment for a 128-GPU cluster might consist of 16-32 storage nodes, each with 8 NVMe drives, delivering aggregate throughput of 200-400GB/s. Persistent storage (the dataset, model artefacts, long-term checkpoints) typically uses a cost-optimised layer, such as object storage via Ceph, MinIO, or AWS S3-compatible interfaces, with capacity running to petabytes.
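The 200-400GB/s figure is back-of-envelope arithmetic over drive counts. A minimal sketch, where the per-drive read rate and the filesystem/network efficiency factor are both assumptions (deliverable throughput depends heavily on the filesystem, network, and workload):

```python
# Back-of-envelope aggregate parallel-filesystem throughput.
# Per-drive rate (5GB/s) and efficiency (30%) are illustrative assumptions.

def aggregate_throughput_gb_s(storage_nodes: int,
                              drives_per_node: int = 8,
                              per_drive_gb_s: float = 5.0,
                              efficiency: float = 0.3) -> float:
    """Deliverable GB/s after assumed filesystem and network overhead."""
    raw = storage_nodes * drives_per_node * per_drive_gb_s
    return raw * efficiency

for nodes in (16, 32):
    print(nodes, "nodes:", round(aggregate_throughput_gb_s(nodes)), "GB/s")
```

With these assumed figures, 16 nodes land near 190GB/s and 32 nodes near 380GB/s, roughly matching the deployment range quoted above.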

Storage capex for a production AI cluster typically runs 10-20% of compute capex. Under-provisioning storage is a common and expensive mistake: GPU utilisation drops when the storage subsystem cannot feed data fast enough. For a detailed capex and opex model tailored to your cluster design, speak to our advisory team at disintermediate.global/services.

[04]

The Cooling Layer: Managing Kilowatts Per Rack

Current-generation GPU servers generate heat that conventional data centre cooling cannot handle. An NVIDIA GB200 NVL72 rack, packing 72 Blackwell GPUs, consumes up to 120kW, roughly 10 times what a standard enterprise rack draws. Air cooling works at power densities up to 20-30kW per rack without extreme engineering.

Above that threshold, which current GPU racks exceed by a factor of four or more, air cooling becomes impractical. Direct liquid cooling (DLC) circulates water or glycol coolant through cold plates mounted directly on CPUs and GPUs. Warm-water cooling operates at 35-45°C supply temperature, which means heat can often be rejected without chillers, using dry coolers or cooling towers.

Immersion cooling submerges entire servers in dielectric fluid, removing heat very efficiently at the cost of hardware accessibility. Current NVIDIA NVL72 racks are liquid-cooled by design; liquid cooling is a requirement, not an option. Operators cannot deploy these systems in air-cooled facilities. Fewer than 400 facilities worldwide currently meet the cooling specifications required for current-generation GPU clusters.
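The thresholds above can be expressed as a simple decision rule. A hedged sketch using the figures quoted in this section (air practical to roughly 20-30kW per rack, current GPU racks at ~120kW); the cutoffs are rules of thumb, not standards:

```python
# Illustrative cooling selection by rack power density.
# Thresholds follow the article's figures and are assumptions, not standards.

def cooling_for(rack_kw: float) -> str:
    if rack_kw <= 20:
        return "air"
    if rack_kw <= 30:
        return "air (aggressive containment)"
    if rack_kw <= 150:
        return "direct liquid cooling"
    return "immersion or custom design"

print(cooling_for(12))    # typical enterprise rack
print(cooling_for(120))   # NVL72-class rack
```

A 12kW enterprise rack stays on air; a 120kW NVL72-class rack lands squarely in direct liquid cooling territory.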

[05]

How the Layers Interact

Each layer must be sized so that it does not become the bottleneck for any other. A cluster with excellent GPUs but inadequate storage throughput will show high GPU idle time: expensive hardware waiting for data.

A cluster with excellent storage but an inadequate network will throttle gradient exchange during distributed training. A cluster with excellent compute and network but inadequate cooling will experience thermal throttling (GPUs reducing clock speed to manage temperature), degrading performance silently.
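The weakest-link argument above can be sketched in a few lines: effective GPU utilisation is gated by whichever layer can sustain the smallest fraction of GPU demand. The layer names and utilisation fractions below are illustrative assumptions, not measurements:

```python
# Minimal weakest-link sketch: the layer with the lowest sustainable
# fraction of GPU demand caps the whole cluster. Figures are illustrative.

def effective_utilisation(layers):
    """Return (bottleneck_layer, utilisation), capped at 1.0."""
    name, frac = min(layers.items(), key=lambda kv: kv[1])
    return name, min(frac, 1.0)

cluster = {
    "compute": 1.00,   # the GPUs themselves
    "network": 0.95,   # gradient exchange keeps up 95% of the time
    "storage": 0.60,   # under-provisioned filesystem (assumed)
    "cooling": 0.90,   # occasional thermal throttling
}
print(effective_utilisation(cluster))  # ('storage', 0.6)
```

In this example the under-provisioned storage layer caps the whole cluster at 60% utilisation, regardless of how good the other three layers are.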

This systems-design challenge is why serious GPU cluster deployment requires domain expertise across all four layers simultaneously. It is not sufficient to buy good GPUs. Disintermediate provides procurement support, financial modelling, and due diligence for GPU infrastructure deployments across all four layers. Get in touch at disintermediate.global/contact.

Key Takeaways
01

An HGX compute node holds 8 GPUs connected via NVSwitch with 1.8TB/s of NVLink bandwidth per GPU; the CPU orchestrates but does not compute

02

Networking costs 15-25% of total compute capex; InfiniBand provides roughly 600ns latency versus 1-2μs for Ethernet, and latency directly affects training efficiency

03

Parallel filesystems (WekaFS, Lustre) deliver 200-400GB/s aggregate throughput; conventional NAS creates GPU idle time and wasted capex

04

Current GPU racks consume up to 120kW, roughly four times the practical air-cooling limit, making direct liquid cooling mandatory, not optional

05

Under-provisioning any layer (storage, network, cooling) creates a bottleneck that degrades expensive GPU utilisation

Next Steps

This analysis is produced by Disintermediate, drawing on data from its GPU intelligence platform (tracking 2,800+ companies across 72 categories and real-time GPU pricing from 70+ providers) and on advisory engagement experience across the GPU infrastructure value chain.