Power: The Dominant Cost
Power typically represents 30-40% of ongoing opex. A single H100 (700W nameplate) draws 16.8 kWh/day at full utilisation (0.7 kW × 24 h); a B200 (up to 1,000W under sustained load) draws up to 24 kWh/day. These are GPU-only figures: host servers, networking, and cooling overhead in a dense cluster add to the per-GPU total.
At $0.08/kWh (US average), that's about $1.34/day per H100 and $1.92/day per B200 in GPU power alone. Power costs vary 3-5x geographically: $0.12/kWh in Virginia, $0.04/kWh in Iowa, $0.03/kWh in Iceland.
For a 1,000-GPU H100 cluster at 70% average utilisation, annual GPU power cost ranges from roughly $170k (Iowa) to $515k (Virginia), before host and facility overhead. Newer GPUs with higher power draw (B200 versus H100) increase this baseline in proportion to their wattage. Power efficiency therefore matters enormously. Liquid cooling reduces cooling costs by 10-15% versus air cooling, but adds maintenance overhead.
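The per-GPU arithmetic is nameplate watts times hours times price per kWh. A minimal Python sketch, counting GPU draw only and assuming draw scales linearly with utilisation (a simplification that ignores idle power and any host or cooling overhead); the function names are illustrative:

```python
# Sketch: GPU-only electricity cost from nameplate wattage.
# Assumption: draw scales linearly with utilisation (ignores idle power).

HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365

# Prices per kWh quoted in the text.
PRICES = {"Iowa": 0.04, "Virginia": 0.12, "Iceland": 0.03}


def daily_kwh(watts: float) -> float:
    """Energy used by one GPU in one day at full load, in kWh."""
    return watts / 1000 * HOURS_PER_DAY


def annual_power_cost(watts: float, gpus: int, utilisation: float,
                      price_per_kwh: float) -> float:
    """Annual GPU-only electricity cost in dollars for a whole cluster."""
    kwh = daily_kwh(watts) * DAYS_PER_YEAR * utilisation * gpus
    return kwh * price_per_kwh


if __name__ == "__main__":
    for name, watts in (("H100", 700), ("B200", 1000)):
        print(f"{name}: {daily_kwh(watts):.1f} kWh/day at full load")
        for region, price in PRICES.items():
            cost = annual_power_cost(watts, 1000, 0.70, price)
            print(f"  {region}: ${cost:,.0f}/year for 1,000 GPUs at 70%")
```

Swapping in a measured all-in wattage per GPU (hosts, networking, cooling) in place of the nameplate figure would give a facility-level estimate.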
Staffing and Operations
Staffing and operations typically account for 10-15% of opex. A 1,000-GPU cluster requires 4-6 FTE technicians ($300k-450k annual cost, fully loaded), plus infrastructure engineers, SREs, and management.
Headcount scales sublinearly (a 10,000-GPU cluster doesn't need 40-60 technicians), but the fixed-cost floor is surprisingly high. Remote operations centres can reduce staffing costs by 15-20% but require investment in automation and tooling. Staff attrition during supply constraints (when everyone is hiring) can spike costs temporarily.
Networking, Cooling, and Maintenance
Networking (interconnects, switches, redundancy) typically costs 5-10% of opex. High-bandwidth clusters (multi-node training) demand faster interconnects (InfiniBand, 400G Ethernet) than inference-focused deployments.
Cooling typically costs 3-8% of opex, depending on climate and cooling technology, with outliers on both sides. Air cooling in cool climates (Ireland, Iceland) costs next to nothing.
Liquid cooling in warm climates (southern US) can hit 8-10%. Maintenance (spare parts, component replacement, warranty management) adds 5-8%. Modern GPUs have a 3-5 year useful life, and older clusters accumulate failures. Facilities costs (rack rental, power infrastructure, physical security) account for a further 3-5%.
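The share ranges quoted in these sections can be turned into dollar bands for a given budget. The sketch below does that in Python; the $3M annual opex is a hypothetical figure chosen only for illustration, and the ranges are treated as typical bands, not an exhaustive split, so they need not sum to 100%.

```python
# Sketch: convert the opex share ranges from the text into dollar bands.
# Each entry is (low, high) as a fraction of total opex.
SHARES = {
    "power": (0.30, 0.40),
    "staffing": (0.10, 0.15),
    "networking": (0.05, 0.10),
    "cooling": (0.03, 0.08),
    "maintenance": (0.05, 0.08),
    "facilities": (0.03, 0.05),
}


def dollar_bands(total_opex: float) -> dict[str, tuple[float, float]]:
    """Low and high dollar estimate per category for a given total opex."""
    return {name: (lo * total_opex, hi * total_opex)
            for name, (lo, hi) in SHARES.items()}


def covered_share() -> tuple[float, float]:
    """Sum of the low ends and of the high ends across all categories."""
    lows = sum(lo for lo, _ in SHARES.values())
    highs = sum(hi for _, hi in SHARES.values())
    return lows, highs


if __name__ == "__main__":
    total = 3_000_000  # hypothetical annual opex in dollars
    for name, (lo, hi) in dollar_bands(total).items():
        print(f"{name:<12} ${lo:>10,.0f} to ${hi:>10,.0f}")
    low, high = covered_share()
    print(f"categories cover {low:.0%} to {high:.0%} of opex")
```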
The Operational Leverage Inflection
Opex structure creates operational leverage, because a large part of the cost base (staffing, facilities, maintenance) is fixed regardless of how many GPU-hours are sold. At 40% utilisation, the opex-to-revenue ratio is unsustainable.
At 75%+ utilisation, opex shrinks as a share of revenue, because revenue grows without proportional cost increases. The break-even point is roughly 50-60% utilisation for most clusters.
Below that, you're losing money on every GPU-hour sold after accounting for power, staffing, and maintenance. Many operators failed in 2023-2024 because they built capacity during peak demand, watched demand soften, and couldn't cover opex at 30-40% utilisation. Scale to 75%+ utilisation or collapse.
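The leverage effect can be sketched with a simple fixed-plus-variable cost model. Everything numeric here is a hypothetical assumption, not a figure from any operator: $9,000 of fixed opex per GPU per year, $2.00 revenue per GPU-hour, and $0.10 of variable cost (mainly power) per GPU-hour, chosen so that break-even lands inside the 50-60% band.

```python
# Sketch: break-even utilisation under a fixed + variable cost model.
# All dollar inputs used below are hypothetical illustrations.

HOURS_PER_YEAR = 8760


def break_even_utilisation(fixed_opex_per_gpu: float,
                           price_per_gpu_hour: float,
                           variable_cost_per_gpu_hour: float) -> float:
    """Utilisation at which annual revenue per GPU equals annual opex."""
    margin = price_per_gpu_hour - variable_cost_per_gpu_hour
    if margin <= 0:
        raise ValueError("price must exceed variable cost per GPU-hour")
    return fixed_opex_per_gpu / (HOURS_PER_YEAR * margin)


def opex_to_revenue(utilisation: float,
                    fixed_opex_per_gpu: float,
                    price_per_gpu_hour: float,
                    variable_cost_per_gpu_hour: float) -> float:
    """Opex divided by revenue at a given utilisation (above 1.0 is a loss)."""
    sold_hours = utilisation * HOURS_PER_YEAR
    opex = fixed_opex_per_gpu + sold_hours * variable_cost_per_gpu_hour
    return opex / (sold_hours * price_per_gpu_hour)


if __name__ == "__main__":
    fixed, price, variable = 9000.0, 2.00, 0.10  # hypothetical inputs
    u_star = break_even_utilisation(fixed, price, variable)
    print(f"break-even utilisation: {u_star:.1%}")
    for u in (0.40, 0.60, 0.75, 0.90):
        ratio = opex_to_revenue(u, fixed, price, variable)
        print(f"utilisation {u:.0%}: opex/revenue = {ratio:.2f}")
```

Under these assumptions the ratio is above 1.0 at 40% utilisation and below 1.0 at 75%, which is the compression the section describes; real clusters will differ in both the fixed floor and the achievable price.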
Power costs dominate opex (30-40%), varying 3-5x by geography; location choice is a strategic cost lever
Staffing represents 10-15% of opex with a high fixed-cost floor; remote operations help but require automation investment
Cooling technology choice (air vs liquid) affects both capex and opex; cold climates offer 3-5% ongoing savings
Break-even sits around 50-60% utilisation and operational leverage kicks in hard at 75%+; below break-even, unit economics collapse regardless of hardware cost
Many 2023-2024 casualties failed not on capex but on opex; building capacity into softening demand creates unsustainable cost structures