Research

How to Evaluate a GPU Cloud Provider

Hardware. Network. Storage. Support. Commercial terms. Five dimensions that determine provider quality.

[01]

Hardware Generation and Configuration

The first question is which GPU you are actually getting. Marketing materials routinely list hardware models without specifying memory configuration, interconnect type, or chassis generation.

An H100 80GB SXM5 (NVSwitch-connected, built for dense multi-GPU servers) is meaningfully different from an H100 80GB PCIe (standard PCIe add-in card): the SXM5 variant has 3.35TB/s memory bandwidth versus 2TB/s for the PCIe card, and NVSwitch interconnect versus slower PCIe peer-to-peer transfers. Verify hardware specifications in writing before committing capacity.

Ask providers for their exact server model, BIOS settings, and NVIDIA driver and CUDA versions. A provider running outdated drivers will constrain access to new features and optimisations. Also confirm the GPU memory oversubscription policy: some providers configure vGPU instances that share a physical GPU between multiple tenants. vGPU instances are appropriate for light workloads but unacceptable for training or serious inference.
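A minimal sketch of this verification step: compare what a provider's nodes actually report against the spec you contracted for. The field names, thresholds, and driver branch below are illustrative assumptions, not a standard; in practice the reported values would come from `nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader` on a trial node.

```python
# Hypothetical spec check: contracted values are illustrative assumptions.
EXPECTED = {
    "name": "NVIDIA H100 80GB HBM3",  # SXM5 parts report HBM3; PCIe parts report "H100 PCIe"
    "memory_mib": 80000,              # 80GB-class card, as reported in MiB
    "min_driver": (550, 54),          # minimum acceptable driver branch (illustrative)
}

def check_gpu(reported_name: str, memory_mib: int, driver_version: str) -> list:
    """Return a list of discrepancies; an empty list means the node matches the contract."""
    problems = []
    if reported_name != EXPECTED["name"]:
        problems.append(f"GPU model mismatch: got {reported_name!r}")
    if memory_mib < EXPECTED["memory_mib"]:
        problems.append(f"Less memory than contracted: {memory_mib} MiB")
    major, minor = (int(x) for x in driver_version.split(".")[:2])
    if (major, minor) < EXPECTED["min_driver"]:
        problems.append(f"Driver {driver_version} older than required")
    return problems

# A PCIe card on an old driver branch fails two of the three checks:
print(check_gpu("NVIDIA H100 PCIe", 81000, "535.129.03"))
```

Running a check like this on every delivered node, not just one sample, catches mixed fleets before a training job does.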

[02]

Network Architecture and Performance

Inter-GPU network performance determines whether distributed training will run efficiently. Scaling beyond 8 GPUs requires servers to communicate over a cluster network, typically InfiniBand or RDMA-capable Ethernet.

Ask providers what interconnect technology they use between servers, at what bandwidth per GPU, and what latency they guarantee. Request benchmark results: all-reduce bandwidth, point-to-point latency between nodes, and any published MLPerf benchmark scores.

Providers who cannot supply these figures are either hiding underperformance or have not measured it. Also ask about network topology: how many hops between any two nodes, whether a spine-leaf or fat-tree topology is used, and whether the network fabric is shared with other tenants. A provider with a 4:1 oversubscribed network fabric will show dramatically lower distributed training throughput than the advertised bandwidth suggests.

[03]

Storage Performance

Storage throughput is frequently the hidden bottleneck in GPU infrastructure. A provider with excellent GPUs and network but slow storage will show high GPU idle time, an expensive failure mode that is difficult to diagnose without experience. Ask providers what storage system they offer, what throughput they have measured per node and in aggregate, and whether storage is local NVMe, network-attached, or a shared parallel filesystem.

For training workloads, request sustained sequential read throughput benchmarks. Anything below 10GB/s aggregate per 8-GPU node is likely to create storage bottlenecks for large-scale training. For inference workloads, random read latency matters more: model loading speed significantly affects cold-start latency.
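A quick way to sanity-check a provider's storage figures against your own workload is to compute the read rate your dataloader actually demands. The workload numbers below are illustrative assumptions, not benchmarks:

```python
def required_read_gbps(samples_per_step: int, bytes_per_sample: int,
                       steps_per_second: float) -> float:
    """Sustained sequential read rate (GB/s) the dataloader must supply
    to keep GPUs fed without input stalls."""
    return samples_per_step * bytes_per_sample * steps_per_second / 1e9

# Illustrative: a global batch of 2048 images at ~0.5 MB each, 4 steps/s
demand = required_read_gbps(2048, 500_000, 4.0)
print(round(demand, 2))
```

If the demand approaches the provider's measured per-node throughput, expect GPU idle time; caching, prefetching, or a faster storage tier becomes necessary.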

[04]

Support Quality and Operational Track Record

Hardware specifications are table stakes. Support quality differentiates providers at the margin. Ask for incident response time commitments in writing: how quickly will a technician respond to a hardware failure, and what SLA applies to restoration?

For a training job running on 64 GPUs, a 4-hour hardware failure causes meaningful loss. Check the provider's uptime history: most serious providers publish status pages with historical incident records. Review the last 12 months of incidents.

A provider with three multi-hour outages in 12 months is not suitable for production training workloads regardless of price. Also assess financial stability. Several GPU cloud providers have failed in 2023-2025 as market dynamics shifted. A provider with poor unit economics, high customer concentration, or limited funding runway may not survive a market downturn.
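To put a number on the "meaningful loss" above, a rough cost model for an outage on a synchronous training job: every GPU idles for the outage, and on average half a checkpoint interval of compute must be redone. The $2.50/GPU-hour rate below is a hypothetical figure for illustration, not a market quote:

```python
def outage_cost(gpus: int, hourly_rate: float, outage_hours: float,
                checkpoint_interval_hours: float) -> float:
    """Rough direct cost of a hardware failure on a synchronous training job:
    all GPUs sit idle for the outage, plus on average half a checkpoint
    interval of already-paid compute is lost and must be redone."""
    idle = gpus * hourly_rate * outage_hours
    redo = gpus * hourly_rate * checkpoint_interval_hours / 2
    return idle + redo

# Illustrative: 64 GPUs at a hypothetical $2.50/GPU-hour, a 4-hour outage,
# checkpointing once per hour
print(outage_cost(64, 2.50, 4.0, 1.0))
```

The model also shows why checkpoint frequency matters: the redo term scales with the checkpoint interval, independent of how fast the provider restores hardware.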

[05]

Commercial Terms and Exit Risk

The commercial terms of GPU cloud contracts carry more risk than most buyers recognise. Key clauses to scrutinise: termination rights (can the provider terminate your contract with 30 days' notice?), data portability (what happens to your data and checkpoints if the provider fails?), price change provisions (can they increase prices on reserved capacity?), and force majeure definitions. Data lock-in is a real risk in GPU cloud.

If your model artefacts, training datasets, and experiment logs live entirely on a provider's proprietary storage, migrating means moving potentially petabytes of data. Maintain your own object storage layer in a multi-cloud configuration, using provider storage only as a fast cache. Disintermediate evaluates GPU cloud providers across all five dimensions and supports contract negotiation as part of advisory engagements; get in touch at disintermediate.global/services.
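The petabyte-scale migration risk is easy to quantify. A sketch of transfer time over a given egress link, assuming (as a loose rule of thumb, not a measured figure) that sustained transfers achieve about 70% of line rate:

```python
def migration_days(petabytes: float, gbit_per_s: float,
                   efficiency: float = 0.7) -> float:
    """Days to move `petabytes` of data over a link of `gbit_per_s`,
    assuming only `efficiency` of line rate is achieved in practice
    (the 0.7 default is an illustrative assumption)."""
    bits = petabytes * 1e15 * 8
    seconds = bits / (gbit_per_s * 1e9 * efficiency)
    return seconds / 86400

# Illustrative: 2 PB over a 10 Gbit/s egress path
print(round(migration_days(2, 10), 1))
```

Weeks-long transfer windows, before egress fees, are why keeping artefacts in provider-agnostic storage from day one is cheaper than migrating under duress.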

Key Takeaways
01

H100 SXM5 (NVSwitch, 3.35TB/s memory bandwidth) versus H100 PCIe (2TB/s): verify the exact server model, not just the chip name, before committing

02

Distributed training performance depends on inter-node interconnect; request all-reduce benchmarks and ask about network oversubscription ratio

03

Storage throughput below 10GB/s per 8-GPU node creates training bottlenecks, an expensive failure mode that marketing materials rarely mention

04

Review 12 months of incident history; assess provider financial stability before signing long-term reserved capacity contracts

05

Data lock-in risk is real: maintain model artefacts in provider-agnostic storage and use provider storage only as a fast cache

Next Steps

This analysis is produced by Disintermediate, drawing on its GPU intelligence platform (tracking 2,800+ companies across 72 categories, with real-time GPU pricing from 70+ providers) and on advisory engagement experience across the GPU infrastructure value chain.