All-Reduce Operation

Definition

All-reduce is a collective communication operation that aggregates data from all participating processes and distributes the result back to all processes. In distributed training, all-reduce is the primary mechanism for synchronising gradients — each GPU computes gradients on its local data batch, and all-reduce sums these gradients across all GPUs so every GPU applies the identical aggregated gradient and the model replicas stay in sync after each optimiser step. The efficiency of all-reduce operations determines the scaling efficiency of distributed training and is directly constrained by network bandwidth.
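The semantics can be illustrated with a minimal in-process simulation (a hypothetical sketch, not the NCCL API): every worker contributes a local gradient vector, and each receives the element-wise sum.

```python
def all_reduce_sum(worker_grads):
    """Sum gradient vectors element-wise and give every worker a copy.

    worker_grads: one gradient list per worker, all the same length.
    """
    summed = [sum(vals) for vals in zip(*worker_grads)]
    # After all-reduce, every worker holds the identical aggregate.
    return [list(summed) for _ in worker_grads]

# Three workers, each with gradients from its own local batch.
grads = [[1, 2], [3, 4], [5, 6]]
result = all_reduce_sum(grads)
# Every worker now holds the same summed gradient: [9, 12]
```

In practice the sum is usually divided by the world size to average gradients, but the communication pattern is identical.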

Technical Context

NVIDIA's NCCL (NVIDIA Collective Communications Library) implements optimised all-reduce algorithms — ring all-reduce, tree all-reduce, and hybrid approaches — that minimise network traffic and overlap communication with computation. Ring all-reduce divides the data into chunks and passes them around a logical ring of GPUs, achieving bandwidth-optimal transfers. For a message of size M across N GPUs, each GPU sends (and receives) 2*(N-1)/N * M bytes in total — approaching 2M for large N, so per-GPU traffic is nearly independent of GPU count.

Advisory Relevance

All-reduce performance is a useful proxy for cluster quality. Operators that can demonstrate high all-reduce bandwidth utilisation (85%+ of theoretical InfiniBand throughput) are typically running well-engineered deployments. We use all-reduce benchmarks as a quality signal in our technical assessments.
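A utilisation figure like the one above is typically derived from an all-reduce benchmark by converting measured algorithm bandwidth into "bus bandwidth" via the ring traffic factor. A hypothetical worked example (the link speed, message size, and timing are illustrative, not measured data):

```python
def bus_bandwidth_gbps(message_bytes, elapsed_s, n_gpus):
    """Scale raw algorithm bandwidth by the ring factor 2*(N-1)/N (GB/s)."""
    algbw = message_bytes / elapsed_s / 1e9
    return algbw * 2 * (n_gpus - 1) / n_gpus

link_gbps = 400 / 8   # e.g. a 400 Gb/s InfiniBand link ≈ 50 GB/s
busbw = bus_bandwidth_gbps(message_bytes=1 << 30,  # 1 GiB message
                           elapsed_s=0.044,        # hypothetical timing
                           n_gpus=8)
utilisation = busbw / link_gbps
# busbw ≈ 42.7 GB/s, utilisation ≈ 0.85 — the threshold a
# well-engineered deployment would be expected to clear.
```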

This glossary is maintained by Disintermediate as a reference for GPU infrastructure professionals, investors, and operators. Each entry reflects terminology as used in active advisory engagements and market intelligence work.
