Compute

Tensor Parallelism

Definition

Tensor parallelism is a distributed computing strategy that splits individual neural network layers across multiple GPUs, enabling models too large for a single GPU's memory to be computed in parallel. Each GPU holds a slice of every layer's weight matrix and computes its portion of the operation simultaneously. Tensor parallelism operates within a single NVLink domain (typically 8 GPUs in one node) because it requires extremely high bandwidth and low latency between GPUs — the communication-to-computation ratio makes it impractical across network-connected nodes.

Technical Context

In a transformer model, the attention and feed-forward layers contain large matrix multiplications that can be partitioned along the rows or columns of their weight matrices. With 8-way tensor parallelism, each GPU stores 1/8 of the weight matrices and performs 1/8 of the computation, synchronising results after each operation. The Megatron-LM framework pioneered efficient tensor parallelism for transformer models, splitting the first linear layer of each block by columns and the second by rows so that only a single all-reduce is needed per block in the forward pass. The key constraint is that tensor parallelism requires collective communication (an all-reduce) at every layer, consuming NVLink bandwidth proportional to the size of the intermediate activations.
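The column-then-row partitioning can be illustrated with a minimal NumPy sketch. This is not real multi-GPU code: the "GPUs" are just list entries, the all-reduce is modelled as a plain sum, and all names (X, A, B, n_gpus) are illustrative rather than taken from any framework.

```python
import numpy as np

# Two-layer block: Z = (X @ A) @ B, computed with 2-way tensor parallelism.
rng = np.random.default_rng(0)
n_gpus = 2
X = rng.standard_normal((4, 8))    # activations: batch x hidden
A = rng.standard_normal((8, 16))   # first weight matrix
B = rng.standard_normal((16, 8))   # second weight matrix

# Column-parallel: each "GPU" holds a column slice of A.
A_shards = np.split(A, n_gpus, axis=1)
# Row-parallel: each "GPU" holds the matching row slice of B.
B_shards = np.split(B, n_gpus, axis=0)

# Each GPU computes its partial output independently...
partials = [(X @ A_i) @ B_i for A_i, B_i in zip(A_shards, B_shards)]
# ...then one all-reduce (here, a sum) recovers the full result.
Z_parallel = sum(partials)

Z_reference = (X @ A) @ B
assert np.allclose(Z_parallel, Z_reference)
```

Because the column split of A lines up with the row split of B, no communication is needed between the two matrix multiplies; the single sum at the end is the per-block all-reduce described above.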

Advisory Relevance

Understanding parallelism strategies helps us evaluate whether operator pricing and capacity planning assumptions are realistic. A company claiming to serve large-model training workloads needs tensor-parallel-capable nodes with NVLink; PCIe-connected GPUs are insufficient, regardless of what the sales deck claims.

This glossary is maintained by Disintermediate as a reference for GPU infrastructure professionals, investors, and operators. Each entry reflects terminology as used in active advisory engagements and market intelligence work.
