Research

Training vs Inference: Two Very Different Infrastructure Problems

Same hardware. Completely different workloads. Different optimisation strategies.

[01]

What Training Is

Training is the process of teaching a model. You start with random or pre-trained weights, feed the model a dataset, measure how wrong its predictions are (the loss), and adjust the weights in the direction that reduces that loss.

Repeat billions of times. The result is a model whose weights encode knowledge about the patterns in the training data.
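The loop described above can be sketched in a few lines of plain Python. This is an illustrative toy, not a production recipe: a one-parameter model `y = w * x` trained by gradient descent on squared error, with a made-up dataset, learning rate, and step count.

```python
# Minimal sketch of the training loop: forward pass, loss gradient,
# weight update, repeated until the weight encodes the data's pattern.

def train(data, lr=0.01, steps=500):
    w = 0.0  # start from an arbitrary initial weight
    for _ in range(steps):
        for x, y in data:
            pred = w * x                     # forward pass
            grad = 2 * (pred - y) * x        # backward pass: d(loss)/dw for squared error
            w -= lr * grad                   # adjust the weight to reduce the loss
    return w

# Toy dataset generated by y = 2x; training should recover w close to 2.
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0]]
w = train(data)
```

Real training runs do exactly this shape of loop, just with billions of parameters, batched data, and the update distributed across many GPUs.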

Training has several key properties from an infrastructure perspective. It is compute-intensive: you run the same operations (forward pass, loss calculation, backward pass, weight update) repeatedly for days or weeks.

It is memory-intensive: the model weights, the gradients, and the optimiser state must all fit in GPU VRAM simultaneously. For large models, this requires multiple GPUs. It is fault-intolerant: a hardware failure midway through a long training run can lose days of work. It is not latency-sensitive: no user is waiting; the job simply runs until it converges. And it is highly parallelisable: you can distribute training across thousands of GPUs by splitting the data, the model, or both.

[02]

What Inference Is

Inference is using a trained model to generate output. A user sends a prompt, the model produces a response. From an infrastructure perspective, inference is almost the opposite of training in its requirements.

It is latency-sensitive: a user is waiting, and responses that take more than a few seconds feel slow. Acceptable latency budgets for production LLM inference are typically under 2 seconds to first token, with 50-100 tokens per second for subsequent generation. It is throughput-sensitive: a production system serving thousands of users simultaneously must batch requests efficiently.

It is memory-bound rather than compute-bound: the model weights must be loaded into VRAM, and most inference time is spent on memory reads rather than arithmetic. It has variable demand: usage spikes during business hours and drops at night, requiring either excess capacity or dynamic scaling. And it is highly multi-tenant: multiple users share the same model and the same GPUs, making isolation and scheduling critical.
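The memory-bound claim can be made concrete with a back-of-envelope estimate: generating each new token requires streaming all model weights from VRAM once, so single-stream decode speed is roughly memory bandwidth divided by model size. The specific figures below (70B parameters, FP8 at 1 byte per parameter, ~3,350 GB/s of HBM bandwidth for an H100-class GPU) are illustrative assumptions, not vendor specifications.

```python
# Rough upper bound on memory-bound decode speed: each token requires
# reading the full weight set from VRAM, so throughput is approximately
# bandwidth / model size for a single request stream.

def decode_tokens_per_sec(params_billion, bytes_per_param, hbm_bandwidth_gb_s):
    model_bytes_gb = params_billion * bytes_per_param  # weight footprint in GB
    return hbm_bandwidth_gb_s / model_bytes_gb         # tokens/sec, single stream

# 70B model at FP8 (1 byte/param) with ~3,350 GB/s of HBM bandwidth:
single_stream = decode_tokens_per_sec(70, 1, 3350)
```

Batching many concurrent requests lets one pass over the weights serve every request in the batch, which is why aggregate tokens-per-second can be orders of magnitude above the single-stream figure.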

[03]

Why the Same GPU Serves Both, Differently

The same H100 or B200 GPU can run training and inference, but the configuration, software stack, and economic model differ significantly. For training, you want maximum GPU utilisation, maximum VRAM usage, full batch size, mixed-precision arithmetic (BF16 or FP8 for speed), and synchronous gradient exchange via fast interconnects. For inference, you want minimum latency per token, efficient batching of concurrent requests, and high tokens-per-second per GPU.

You use quantisation (FP8, INT4, INT8) to fit larger models in less VRAM or fit more users per GPU. The software stacks also differ: training typically uses PyTorch or JAX with distributed training frameworks (DeepSpeed, Megatron-LM, FSDP). Inference uses specialised serving systems: vLLM (optimised for LLM serving with PagedAttention memory management), TensorRT-LLM (NVIDIA's optimised inference engine), or Triton Inference Server.
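The core idea behind quantisation can be shown in a few lines. This is a simplified sketch of symmetric per-tensor INT8 quantisation; production systems (e.g. the quantised formats supported by inference engines) use per-channel scales, calibration, and activation handling, none of which is shown here.

```python
# Sketch of symmetric INT8 quantisation: map floats into [-127, 127]
# with a single per-tensor scale, shrinking storage 4x vs FP32.

def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127.0
    q = [round(v / scale) for v in values]   # integers in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.42, -1.3, 0.07, 0.99]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Round-trip error is bounded by half a quantisation step (scale / 2).
```

The trade is precision for footprint: the same VRAM holds a larger model, or more concurrent user contexts for the same model.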

[04]

The Economics Are Different Too

Training economics favour high-GPU-count, long-duration jobs. The revenue model is simple: a customer pays for reserved or on-demand GPU time, runs their training job, and stops. Margins depend on utilisation: keeping the GPUs busy between training jobs is the challenge.

Inference economics are more complex. The revenue unit is not a GPU-hour but an API call, typically priced per 1,000 tokens generated. A B200 running optimised vLLM inference can serve 3,000-5,000 tokens per second for a 70B-parameter model, generating $0.60-$1.50/minute in revenue at standard market pricing.

GPU hardware cost for that capacity is approximately $22-$28/hour at cloud rates, or $0.37-$0.47/minute. The gross margin on inference, before infrastructure overhead, can reach 50-70% for well-optimised deployments. The catch: inference demand is variable. A cluster at 30% utilisation during off-peak hours has identical fixed costs and much lower revenue.

[05]

Choosing the Right Architecture

Most serious AI deployments use separate infrastructure for training and inference, not because you must, but because optimising the same cluster for both is an operational compromise. Training clusters prioritise high-speed interconnects (InfiniBand NDR, high-density NVSwitch), large VRAM per GPU, and synchronous job scheduling. Inference clusters prioritise low-latency networking, high GPU density per server, and asynchronous request routing.

A common architecture: train on a reserved cluster of 64-256 GPUs with NVLink and InfiniBand; serve inference on a separate pool of 8-32 GPUs behind an API gateway, with spot capacity for peak demand. Operators building a single cluster and trying to serve both training and inference simultaneously typically find utilisation below 50% for training and latency above target for inference. For a detailed capex and opex model covering training and inference cluster architecture, speak to our advisory team at disintermediate.global/services.
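Sizing the inference pool in that architecture reduces to simple capacity arithmetic. The sketch below uses the 3,000-5,000 tokens/sec per-GPU figure from this analysis; the peak demand number and 20% headroom are illustrative assumptions.

```python
import math

# Rough capacity sizing for an inference pool: divide peak aggregate
# token demand by per-GPU throughput, then add headroom for spikes
# (the role spot capacity plays in the architecture described above).

def gpus_needed(peak_tokens_per_sec, per_gpu_tokens_per_sec, headroom=0.2):
    required = peak_tokens_per_sec / per_gpu_tokens_per_sec
    return math.ceil(required * (1 + headroom))

# e.g. 100,000 tokens/sec peak demand at 4,000 tokens/sec per GPU:
pool_size = gpus_needed(100_000, 4_000)   # 30 GPUs with 20% headroom
```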

Key Takeaways
01

Training: compute-intensive, memory-intensive, fault-intolerant, not latency-sensitive, and highly parallelisable. Optimised for throughput

02

Inference: latency-sensitive (under 2s to first token), memory-bound, variable demand, and multi-tenant. Optimised for tokens/second and concurrent users

03

Different software stacks: training uses DeepSpeed/FSDP/Megatron-LM; inference uses vLLM, TensorRT-LLM, or Triton. Same GPU, different optimisation

04

Inference gross margins can reach 50-70% for optimised deployments; training margins are tighter and depend heavily on utilisation

05

Separate training and inference clusters, each optimised for its workload, outperform combined clusters that compromise both

Next Steps

This analysis is produced by Disintermediate, drawing on data from the GPU intelligence platform (tracking 2,800+ companies across 72 categories and real-time GPU pricing from 70+ providers) and on advisory engagement experience across the GPU infrastructure value chain.