The Traditional Computing Model
Enterprise computing is transactional. A database query, a web request, a payroll calculation: each is a discrete, bounded operation with a defined input and a defined output. The system processes many such operations concurrently, each completing in milliseconds.
Infrastructure is designed around this pattern: lots of CPU cores handling short tasks, modest memory per core, fast storage for random access, and network optimised for modest data volumes to many endpoints. Failure is handled by redundancy and retry. If a server fails mid-request, the request fails and is retried on another server.
Consistency and availability dominate the design criteria. Cost scales with the number of concurrent users, not the size of individual operations. A bank running 100,000 transactions per second needs different infrastructure from a bank running 1,000, but the shape of the infrastructure is broadly similar. AI training breaks all of these assumptions.
Training: Sustained, Synchronised, Enormous
Training a large language model is a single, monolithic computation that runs for weeks or months. There is no request-response cycle. There is no user waiting for an answer. The training job reads the entire dataset repeatedly, adjusts model weights based on prediction errors, and iterates until the model reaches acceptable performance.
This has several unusual properties. First, the computation is sustained, not bursty. A 1,000-GPU training job runs every GPU at near-100% utilisation for weeks. Enterprise compute rarely sustains 100% utilisation beyond a few minutes.
Second, the computation is synchronised. In distributed training, all GPUs must complete each iteration before any GPU can proceed to the next. One slow GPU stalls the entire cluster. This makes fault tolerance radically different: a single GPU failure can halt a training job that has been running for days.
Checkpoint frequency (how often model state is saved) determines how much work is lost when a failure occurs. Third, data volumes are massive. Training datasets for frontier models run to petabytes. Checkpoint files for large models run to terabytes. Storage throughput becomes a significant infrastructure concern.
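The trade-off between checkpoint frequency and lost work can be sketched with simple arithmetic. The figures below (failure rate, job length) are illustrative assumptions, not measured values; the point is that expected recomputation scales linearly with the checkpoint interval.

```python
# Rough model of work lost to failures under periodic checkpointing.
# On average, a failure loses half a checkpoint interval of progress.
# All numbers are hypothetical for illustration.

def expected_lost_hours(checkpoint_interval_h: float,
                        failures_per_week: float,
                        job_weeks: float) -> float:
    total_failures = failures_per_week * job_weeks
    return total_failures * checkpoint_interval_h / 2

# Hypothetical 4-week job averaging 2 cluster-halting failures per week:
hourly_ckpt = expected_lost_hours(1.0, 2, 4)   # 4.0 hours recomputed
daily_ckpt = expected_lost_hours(24.0, 2, 4)   # 96.0 hours recomputed
```

Checkpointing more often reduces lost work but costs storage bandwidth each time terabytes of state are written, which is one reason checkpoint throughput matters as much as capacity.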
Inference: Fast, Latency-Sensitive, Variable
Inference (using a trained model to generate responses) is structurally different from training. It looks more like traditional computing: a user sends a request, the model generates a response, the user receives it.
But the internal mechanics are unlike any traditional enterprise workload. Inference on a large language model involves loading the model into GPU VRAM, processing the input tokens through dozens of transformer layers, and generating output tokens one at a time.
Each generated token depends on all previous tokens, which means inference is inherently sequential at the token level: you cannot parallelise across the output. You can parallelise across users (batching multiple requests together), but this requires careful scheduling to manage latency. A model serving 1,000 concurrent users must batch those requests efficiently to maximise GPU utilisation while maintaining acceptable response times. The latency requirements are strict: users expect responses within seconds, not minutes.
Memory and Storage Patterns
Traditional enterprise servers carry 256GB to 2TB of RAM, generous for transactional workloads where each user session requires modest memory. AI inference for a 70-billion-parameter model requires roughly 140GB of VRAM in FP16 format, before accounting for any batching overhead.
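The 140GB figure is simple arithmetic: parameter count times bytes per parameter. The sketch below covers weights only; KV cache, activations, and batching overhead come on top.

```python
# Back-of-envelope VRAM for model weights alone:
# parameters (billions) x bytes per parameter = gigabytes of VRAM.
# Excludes KV cache, activations, and batching overhead.

def weight_vram_gb(params_billion: float, bytes_per_param: int) -> float:
    return params_billion * bytes_per_param

fp16_gb = weight_vram_gb(70, 2)  # 140.0 GB, the figure cited above
int8_gb = weight_vram_gb(70, 1)  # 70.0 GB if quantised to 8-bit
```

The same arithmetic explains why quantisation (fewer bytes per parameter) is the main lever for fitting large models onto fewer GPUs.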
Serving a single model requires one or more high-end GPUs. Storage access patterns are also unusual.
Training reads data sequentially at enormous throughput: a 1,000-GPU cluster training on a large dataset may demand 1TB/s of sustained storage throughput. Enterprise SAN arrays are designed for random IOPS, not streaming throughput at this scale. Parallel filesystems like Lustre, WekaFS, or IBM Spectrum Scale are designed specifically for this pattern. Using a conventional NAS for AI training throughput will bottleneck the cluster; the GPUs will wait idle for data.
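Dividing the aggregate figure across the cluster shows what each GPU's data pipeline must sustain. The numbers below simply restate the 1TB/s example; real requirements depend on model and data loader design.

```python
# Per-GPU streaming read rate implied by an aggregate throughput target.
# 1 TB/s across 1,000 GPUs works out to 1 GB/s of sequential reads each,
# sustained for the entire multi-week job.

def per_gpu_read_gbs(aggregate_tb_s: float, n_gpus: int) -> float:
    return aggregate_tb_s * 1000 / n_gpus  # TB/s -> GB/s per GPU

rate = per_gpu_read_gbs(1.0, 1000)  # 1.0 GB/s per GPU
```

A filesystem tuned for random IOPS can deliver high operation counts yet still fall far short of this kind of sustained sequential bandwidth, which is why parallel filesystems are the standard choice.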
Network Requirements: Orders of Magnitude Different
Enterprise networking is typically 10Gbps or 25Gbps per server, with 100Gbps uplinks for high-density deployments. AI training is completely different.
Distributed training using techniques like data parallelism or tensor parallelism requires GPUs to exchange gradient updates or activations after every training step. At model scale, this can mean 100GB or more of data transferred between GPUs per second, per GPU.
A 1,000-GPU cluster running tensor-parallel training may require 800Gbps per GPU, 80 times the bandwidth of a well-provisioned enterprise server. This is why AI clusters use InfiniBand or high-density Ethernet fabrics specifically engineered for GPU-to-GPU communication, with latencies in the microsecond range. Enterprise switches are not merely inadequate: they would throttle the cluster so severely that distributed training becomes impractical.
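For data parallelism specifically, the per-step traffic can be estimated from the standard ring all-reduce cost: each GPU sends and receives roughly 2(N-1)/N times the gradient size per step. The sketch below assumes FP16 gradients for a 70B-parameter model; real stacks vary (gradient precision, compression, overlap with compute), so treat it as illustrative.

```python
# Per-GPU communication volume for one data-parallel training step using
# a ring all-reduce: each GPU moves ~2*(N-1)/N times the gradient size.
# Assumes FP16 gradients (2 bytes/param); illustrative only.

def allreduce_gb_per_gpu(grad_gb: float, n_gpus: int) -> float:
    return 2 * (n_gpus - 1) / n_gpus * grad_gb

# 70B parameters x 2 bytes = 140 GB of gradients, across 1,000 GPUs:
traffic = allreduce_gb_per_gpu(140.0, 1000)  # ~279.7 GB per GPU per step
```

Moving hundreds of gigabytes per GPU on every step is why per-GPU bandwidth, not just aggregate cluster bandwidth, is the binding constraint.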
Training is sustained, synchronised, and monolithic: one job consuming all resources for weeks; enterprise compute is bursty and concurrent
A single GPU failure can halt a multi-week training job; checkpoint strategy determines data loss, a fundamentally different model from enterprise fault tolerance
Serving a 70B-parameter model requires 140GB+ VRAM before batching; enterprise RAM provisioning models do not apply
Parallel filesystems (Lustre, WekaFS) are required for training data throughput; conventional NAS becomes a bottleneck
GPU clusters require 400-800Gbps per-GPU network bandwidth, 30-80x what enterprise infrastructure provisions