Compute

Inference Endpoint

Definition

An inference endpoint is a deployed model serving layer that accepts input data and returns predictions or generated content via an API. Unlike training, which runs as a batch process on reserved clusters, inference is a real-time service with latency requirements, typically under 100 ms for interactive applications. Inference endpoints can run on a wide range of hardware, from fractional GPU instances for small models to multi-GPU configurations for large language models. The shift from training to inference as the dominant GPU workload is reshaping the market: Gartner estimates inference will represent 55%+ of AI IaaS spend.
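The request/response contract described above can be sketched minimally as follows. This is an illustrative assumption, not any particular framework's API: the handler name, JSON field names, and the placeholder "model" are all hypothetical.

```python
import json
import time

def handle_inference_request(body: str) -> str:
    """Hypothetical endpoint handler: parse a JSON request, produce a
    prediction, and report serving latency alongside the result."""
    request = json.loads(body)
    start = time.perf_counter()
    # Placeholder "model": a real deployment would invoke the model here.
    prediction = {"output": f"echoed {len(request['inputs'].split())} tokens"}
    prediction["latency_ms"] = round((time.perf_counter() - start) * 1000, 3)
    return json.dumps(prediction)

# One request/response cycle, as a serving layer would run per API call:
response = json.loads(handle_inference_request(json.dumps({"inputs": "hello world"})))
```

In a real deployment this handler would sit behind an HTTP server and a batching layer; the point here is only the shape of the contract: structured input in, prediction plus latency out.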

Technical Context

Inference serving frameworks include NVIDIA Triton, vLLM, TensorRT-LLM, and text-generation-inference (TGI). Key metrics are latency (time to first token, inter-token latency), throughput (tokens per second), and cost per token. Inference workloads are typically less bandwidth-sensitive than training, reducing the need for InfiniBand and enabling deployment on Ethernet-connected nodes or even single GPUs. Quantisation (reducing model precision from FP16 to INT8 or INT4) is widely used to reduce inference costs.
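The key metrics above can be derived directly from per-token completion timestamps, which serving frameworks typically expose. A minimal sketch, assuming illustrative inputs (the function name, argument names, and the $2.00/hour GPU price in the example are assumptions, not vendor figures):

```python
def inference_metrics(request_start, token_times, gpu_price_per_hour, num_gpus=1):
    """Derive serving metrics from per-token completion timestamps (seconds)."""
    ttft = token_times[0] - request_start             # time to first token
    # Mean gap between consecutive tokens after the first:
    itl = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    total = token_times[-1] - request_start
    tokens_per_s = len(token_times) / total           # throughput
    # Cost per token at a given hourly GPU price:
    cost_per_token = (gpu_price_per_hour * num_gpus / 3600) / tokens_per_s
    return {"ttft_s": ttft, "inter_token_latency_s": itl,
            "tokens_per_s": tokens_per_s, "cost_per_token_usd": cost_per_token}

# Four tokens completing 0.2 s, 0.3 s, 0.4 s, and 0.5 s after the request,
# on a single GPU at a hypothetical $2.00/hour:
m = inference_metrics(0.0, [0.2, 0.3, 0.4, 0.5], gpu_price_per_hour=2.0)
```

For the example request this yields a 0.2 s time to first token, 0.1 s inter-token latency, and 8 tokens per second; quantisation improves the economics by raising tokens_per_s (and often allowing a cheaper GPU), which lowers cost_per_token directly.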

Advisory Relevance

The training-to-inference shift is a structural change affecting GPU infrastructure economics. Inference workloads favour different hardware configurations, pricing models, and facility locations than training. We evaluate how operators are positioning for this transition in our due diligence and strategy work.

This glossary is maintained by Disintermediate as a reference for GPU infrastructure professionals, investors, and operators. Each entry reflects terminology as used in active advisory engagements and market intelligence work.
