Self-Hosted LLM Costs 2026 | Pricing Comparison – SitePoint

Self-hosted LLM costs have shifted substantially since 2024. Next-generation GPUs became available, secondary hardware markets drove prices down, and electricity rates changed across major hosting regions. For engineers and technical decision-makers evaluating whether to run open-weight models on infrastructure they control, the calculus depends on a set of measurable variables that this guide breaks down.
Three cost tiers define the decision. Personal or hobbyist hardware suits experimentation and low-throughput workloads. Dedicated servers and colocation arrangements serve production inference at predictable cost. Cloud GPU instances offer flexibility but carry premium hourly rates. Each tier has a distinct total cost of ownership (TCO) profile, and the right choice depends on token volume, latency requirements, data residency constraints, and available engineering capacity.
The goal here is straightforward: provide real dollar figures for each tier, a framework for calculating TCO, and a clear break-even analysis against API providers. The TCO formula and usage tables below allow teams to plug in their own variables and model outcomes before committing budget.
The 2026 cloud GPU market reflects NVIDIA shipping Blackwell-generation hardware alongside still-available Hopper-class instances. On-demand pricing for NVIDIA H200 instances (141 GB HBM3e) runs approximately $4.50 to $6.00 per hour on AWS, GCP, and Azure, with Lambda Labs and CoreWeave offering rates between $3.50 and $4.50 per hour for equivalent configurations. NVIDIA B200 instances, now available in limited regions, command $6.00 to $8.50 per hour on-demand from major hyperscalers, while GB200 NVL2 configurations (2-GPU module; confirm 384 GB combined HBM3e per NVIDIA product page before procurement) price between $10.00 and $14.00 per hour on-demand.
Note: All cloud GPU prices cited here reflect estimates as of mid-2026. Rates change frequently—consult AWS, GCP, Azure, CoreWeave, and Lambda Labs pricing pages directly before making procurement decisions. On-demand H200 and B200 instance availability is capacity-constrained; production workloads should use reserved instances or maintain fallback capacity.
Reserved pricing changes the picture. One-year commitments typically reduce rates by 30% to 40%, and three-year reservations by 50% to 60%. For running a 70B-parameter model at moderate throughput (~30 to 50 tokens per second for concurrent users), an H200 instance suffices. At a reserved one-year rate of around $2.80 per hour on CoreWeave, monthly cost lands at $2,016 for continuous operation.
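The reserved-rate arithmetic above is easy to rerun for other quotes. A minimal sketch, assuming a 720-hour (30-day) month; the function names and the $4.00 on-demand quote are illustrative and do not come from any provider's SDK or price list:

```python
HOURS_PER_MONTH = 24 * 30  # 720 hours: the 30-day month used throughout this guide


def reserved_rate(on_demand_rate: float, discount: float) -> float:
    """Hourly rate after a commitment discount expressed as a fraction (0.35 = 35%)."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in the range [0, 1)")
    return on_demand_rate * (1.0 - discount)


def monthly_gpu_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Monthly cost in USD of one instance running continuously."""
    return hourly_rate * hours


if __name__ == "__main__":
    # The $2.80/hour one-year reserved H200 rate quoted above.
    print(f"reserved H200: ${monthly_gpu_cost(2.80):,.2f}/month")
    # Hypothetical $4.00/hour on-demand quote at 30% and 40% one-year discounts.
    for discount in (0.30, 0.40):
        rate = reserved_rate(4.00, discount)
        print(f"{discount:.0%} off: ${rate:.2f}/h, ${monthly_gpu_cost(rate):,.2f}/month")
```

Instances that run only part of the day can be modeled by passing a smaller `hours` value, although reserved commitments are normally billed whether or not the instance is busy.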
Bare-metal GPU server rentals from providers like Hetzner, OVH, and specialized GPU hosts offer a middle path. Monthly pricing for a server with a single NVIDIA A6000 Ada (48 GB VRAM) runs $400 to $700, while configurations with dual or quad H100 SXM GPUs range from $4,000 to $10,000 per month depending on provider and contract length.
Colocation introduces different math. Rack space in Tier 3 data centers costs $200 to $500 per month per 1U to 4U of space, plus metered power and network fees. You purchase GPUs outright on the capital expenditure path: NVIDIA RTX 5090 cards sit around $2,000 to $2,500 retail, A6000 Ada cards around $4,500 to $5,500, and used H100 SXM GPUs on the secondary market between $15,000 and $20,000 (down significantly from $25,000+ in early 2025). Secondary-market GPUs carry no manufacturer warranty; budget for burn-in testing and potential early failure replacement. Amortized over three years, a dual-H100 setup’s hardware cost alone ($30,000 to $40,000 for the pair) breaks down to ~$830 to $1,110 per month before power and colocation fees.
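The amortization step can be checked with a few lines of Python. This is a sketch assuming straight-line amortization with no resale value and no financing cost; `monthly_amortization` is an illustrative name:

```python
def monthly_amortization(unit_price: float, units: int = 1,
                         lifespan_months: int = 36) -> float:
    """Straight-line monthly hardware cost in USD (no resale value, no financing)."""
    if lifespan_months <= 0:
        raise ValueError("lifespan_months must be positive")
    return unit_price * units / lifespan_months


if __name__ == "__main__":
    # Used H100 SXM price points from the text, two cards, three-year lifespan.
    for price in (15_000, 18_000, 20_000):
        print(f"2 x ${price:,}: ${monthly_amortization(price, units=2):,.0f}/month")
```

A shorter lifespan (for example 24 months, if you expect to replace cards sooner) raises the monthly figure proportionally.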
Consumer-tier self-hosting now handles models that were impractical on local hardware in 2024. The NVIDIA RTX 5090 (32 GB GDDR7) runs quantized models up to approximately 34B parameters at 4-bit quantization, producing around 30 to 50 tokens per second at that model size. Dual-GPU setups using PCIe (NVLink is unavailable on consumer RTX GPUs) push that to around 70B with 4-bit quantization, using software-level model sharding (e.g., llama.cpp layer splitting across GPUs). The Apple Mac Studio with M4 Ultra and 192 GB unified memory can run quantized 70B models and even 120B+ parameter models at 4-bit quantization (~65 GB), leaving headroom; expect 2 to 5 tokens per second throughput at this size. Discrete dual-GPU setups outperform the Mac Studio on raw throughput at 70B by roughly 5 to 10x, but the Mac Studio handles larger models that simply don’t fit in 64 GB of GDDR7.
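A rough way to sanity-check which models fit on which hardware is to estimate weight memory as parameters times bits per weight. The sketch below assumes exactly 4 bits per weight and a flat 20% overhead for KV cache and runtime buffers; both are simplifying assumptions (real 4-bit formats carry extra metadata, and overhead grows with context length and batch size), and the function names are illustrative:

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8.0


def fits_in_memory(params_billions: float, memory_gb: float,
                   bits_per_weight: float = 4.0, overhead: float = 1.2) -> bool:
    """True if weights plus an assumed runtime overhead fit in the given memory.

    overhead=1.2 is a placeholder for KV cache, activations, and runtime
    buffers; measure your own workload before relying on it.
    """
    return weight_memory_gb(params_billions, bits_per_weight) * overhead <= memory_gb


if __name__ == "__main__":
    # Memory sizes taken from the hardware discussed above.
    devices = {"RTX 5090": 32, "2x RTX 5090": 64, "Mac Studio 192 GB": 192}
    for params in (34, 70, 120):
        fits = [name for name, gb in devices.items() if fits_in_memory(params, gb)]
        print(f"{params}B @ 4-bit ~ {weight_memory_gb(params):.0f} GB weights: {fits}")
```

Under these assumptions the estimate agrees with the tiers described above: 34B fits a single 32 GB card, 70B needs the dual-GPU setup, and 120B needs the larger unified-memory machine.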
Upfront costs run ~$2,000 for a single RTX 5090 build, $5,000 to $8,000 for a dual-GPU workstation, and $8,000 to $15,000+ for a maxed-out Mac Studio configuration.
Power consumption under inference load varies significantly by GPU. An H100 SXM has a TDP of 700 W at full load; actual draw ranges from around 400 to 500 W at partial utilization up to 700 W under sustained maximum throughput. For worst-case TCO planning, use 700 W per GPU. A B200 ranges from 600 to 1,000 watts depending on workload density. Consumer RTX 5090 cards draw ~300 to 450 watts under typical inference load, below the card's 575 W rated board power; use the rated figure for worst-case planning.
At the U.S. average commercial electricity rate of approximately $0.13 per kWh (and $0.25 to $0.35 per kWh in Western Europe), running two H100 GPUs at an average 500 watts each yields 720 kWh per month for the pair (360 kWh per GPU), costing about $47 per GPU or $93.60 per month for both at U.S. rates. Applying a PUE of 1.4 increases that to around $131 per month. Home cooling overhead effectively increases electricity cost by 50 to 80% during warm months, analogous to a PUE of 1.5 to 1.8.
Electricity math check: 2 GPUs x 500 W average x 720 hours/month (24 hr x 30 days) = 720,000 Wh = 720 kWh total for both GPUs. At $0.13/kWh, that is $93.60 for both, or $46.80 per GPU. With a PUE of 1.4: 720 kWh x 1.4 = 1,008 kWh x $0.13 = $131.04. When modeling your own TCO, calculate from your measured GPU wattage and local electricity rate rather than using these approximate figures directly.
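The same check can be parameterized so that measured wattage and a local rate replace the example figures. This is a minimal sketch; `monthly_electricity_cost` is an illustrative name, and the $0.30/kWh rate in the second call is just a value inside the Western Europe range quoted above:

```python
def monthly_electricity_cost(gpu_count: int, avg_watts_per_gpu: float,
                             rate_per_kwh: float, pue: float = 1.0,
                             hours: float = 720.0) -> float:
    """Monthly electricity cost in USD for GPUs drawing a constant average power.

    pue scales IT power up to facility power (cooling, distribution losses).
    """
    kwh = gpu_count * avg_watts_per_gpu * hours / 1000.0
    return kwh * pue * rate_per_kwh


if __name__ == "__main__":
    # Two H100s at 500 W average and $0.13/kWh, with and without PUE 1.4.
    print(f"${monthly_electricity_cost(2, 500, 0.13):.2f}")           # $93.60
    print(f"${monthly_electricity_cost(2, 500, 0.13, pue=1.4):.2f}")  # $131.04
    # Same hardware at an assumed $0.30/kWh.
    print(f"${monthly_electricity_cost(2, 500, 0.30, pue=1.4):.2f}")  # $302.40
```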
The inference software stack itself is free and open source. vLLM, Hugging Face Text Generation Inference (TGI), and llama.cpp carry no licensing costs but demand engineering expertise to deploy, optimize, and maintain. Observability tooling (Prometheus, Grafana, or hosted alternatives) adds $0 to $200 per month depending on scale and chosen tools.
The real hidden cost is human time. Deploying, monitoring, patching, updating models, and responding to incidents require DevOps or MLOps attention. For a production system, allocating 20% to 30% of a senior engineer’s time translates to roughly $3,000 to $6,000 per month in staffing cost. The low end reflects Eastern Europe or APAC compensation levels; the high end reflects U.S. tier-1 metro fully loaded rates. Teams that underestimate this line item consistently blow their TCO projections.
API pricing as of mid-2026 spans a wide range. OpenAI GPT-4.1 charges approximately $2.00 per 1M input tokens and $8.00 per 1M output tokens. Anthropic Claude Sonnet 4 sits at ~$3.00/$15.00 (input/output per 1M tokens). Google Gemini 2.5 Pro prices at around $1.25/$10.00. Open-model API providers undercut these substantially: Together AI and Fireworks serve Llama 4 70B at ~$0.20 to $0.60 per 1M tokens (blended), and Groq delivers inference on optimized hardware at similar rates with notably lower latency.
Note: All API prices cited here are point-in-time estimates. Verify at each provider’s pricing page before making decisions: platform.openai.com/docs/pricing, anthropic.com/pricing, ai.google.dev/pricing, together.ai/pricing.
Translating these into monthly budgets at defined usage levels:
*Assumes approximately 70% input / 30% output token ratio. Adjust API cost estimates based on your actual token mix, as most providers charge different rates for input and output tokens.
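Monthly API budgets at a given usage level follow directly from the per-token prices and the 70/30 mix. The sketch below uses the point-in-time prices quoted above; the three daily volumes are illustrative usage levels chosen for this example, and the function names are not from any provider SDK:

```python
# (input, output) USD per 1M tokens: the point-in-time figures quoted above.
API_PRICES = {
    "GPT-4.1": (2.00, 8.00),
    "Claude Sonnet 4": (3.00, 15.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
}
DAILY_VOLUMES_M = (1, 10, 50)  # illustrative usage levels, millions of tokens/day


def blended_rate(input_price: float, output_price: float,
                 input_share: float = 0.70) -> float:
    """Blended USD per 1M tokens for a given input/output token mix."""
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    return input_share * input_price + (1.0 - input_share) * output_price


def monthly_api_cost(daily_tokens_millions: float, input_price: float,
                     output_price: float, input_share: float = 0.70,
                     days: int = 30) -> float:
    """Monthly API spend in USD at a constant daily token volume."""
    rate = blended_rate(input_price, output_price, input_share)
    return daily_tokens_millions * days * rate


if __name__ == "__main__":
    for name, (inp, out) in API_PRICES.items():
        costs = ", ".join(
            f"{v}M/day: ${monthly_api_cost(v, inp, out):,.0f}" for v in DAILY_VOLUMES_M
        )
        print(f"{name} (${blended_rate(inp, out):.2f}/1M blended): {costs}")
```

Changing `input_share` shows how sensitive the budget is to the token mix: output-heavy workloads cost noticeably more at the same volume.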
The crossover point depends heavily on which API serves as the comparison. Against frontier closed-source models (GPT-4.1, Claude 4), compare the fixed monthly cost of the GPU with the API's blended per-token rate. Using the $2,016 reserved H200 figure above and GPT-4.1 at about $3.80 per 1M blended tokens (70/30 input/output), GPU rental alone breaks even near 530M tokens per month, or roughly 18M tokens per day; higher-priced frontier models lower that threshold, and adding staffing cost raises it. That range widens or narrows based on your input/output token ratio and whether you use reserved or on-demand instances; verify against your actual mix using the TCO formula below. Against open-model API providers like Together AI, the break-even shifts dramatically higher, to well over 100M tokens per day on the same GPU-only basis, because those providers already operate optimized infrastructure at scale with thin margins.
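The crossover volume is the point where a fixed monthly self-hosting cost equals the API bill. A minimal sketch, assuming the fixed cost is GPU rental only (no staffing or tooling) and a 30-day month; the function name is illustrative, and the rates passed in are the article's own figures:

```python
def break_even_daily_tokens_millions(monthly_fixed_cost: float,
                                     api_rate_per_million: float,
                                     days: int = 30) -> float:
    """Daily volume (millions of tokens) at which API spend equals the fixed cost."""
    if api_rate_per_million <= 0:
        raise ValueError("api_rate_per_million must be positive")
    return monthly_fixed_cost / api_rate_per_million / days


if __name__ == "__main__":
    reserved_h200 = 2016.0  # $2.80/hour x 720 hours, from the reserved-pricing example
    comparisons = {
        "GPT-4.1, 70/30 blend ($3.80/1M)": 3.80,
        "Open-model API ($0.40/1M blended)": 0.40,
    }
    for label, rate in comparisons.items():
        volume = break_even_daily_tokens_millions(reserved_h200, rate)
        print(f"{label}: {volume:.1f}M tokens/day")
```

Passing a full monthly TCO instead of the GPU rental figure gives the more conservative threshold that includes staffing and tooling.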
Self-hosting has advantages you won’t see in per-token cost alone: full data residency control, eliminated wide-area network latency for on-premises deployments, and the ability to fine-tune or run custom model variants without per-query markup. These factors can justify the investment below the pure cost break-even point. Conversely, for bursty or unpredictable workloads, rapidly evolving model requirements, or teams without spare MLOps capacity, APIs remain more cost-effective and operationally simpler.
TCO per month equals the sum of hardware amortization (purchase price divided by lifespan in months), electricity, cooling overhead, networking fees, fractional staffing allocation, and software or tooling costs. Expressed as a formula:
TCO_monthly = (P_hardware / lifespan_months) + electricity + cooling + networking + staffing + software
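The sum described above can be sketched as a small Python data class. The class and field names are illustrative; the example values are the dual-H100 colocation figures used elsewhere in this guide, with the $131 electricity figure split into $93.60 of GPU power and $37.44 of cooling overhead (the PUE 1.4 uplift), and colocation fees placed in the networking term:

```python
from dataclasses import dataclass


@dataclass
class MonthlyCosts:
    """Inputs to the monthly TCO sum, all in USD (field names are illustrative)."""
    hardware_price: float   # total purchase price of owned hardware
    lifespan_months: int    # amortization period
    electricity: float      # metered GPU power before cooling overhead
    cooling: float          # extra power attributable to cooling (PUE uplift)
    networking: float       # colocation, bandwidth, and related fees
    staffing: float         # fractional engineer allocation
    software: float         # tooling and observability

    def hardware_amortization(self) -> float:
        return self.hardware_price / self.lifespan_months

    def total(self) -> float:
        return (self.hardware_amortization() + self.electricity + self.cooling
                + self.networking + self.staffing + self.software)


if __name__ == "__main__":
    colo = MonthlyCosts(hardware_price=36_000, lifespan_months=36,
                        electricity=93.60, cooling=37.44, networking=700,
                        staffing=4_000, software=100)
    print(f"amortization: ${colo.hardware_amortization():,.2f}")  # $1,000.00
    print(f"monthly TCO:  ${colo.total():,.2f}")                  # $5,931.04
```

For rented hardware, set `hardware_price` to zero and fold the rental fee into one of the monthly terms.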
Consider a concrete scenario: a mid-size SaaS company running Llama 4 70B for automated customer support, processing around 50M tokens per day (about 1.5B tokens per month). Using a colocation setup with two owned H100 SXM GPUs purchased at $18,000 each on the secondary market (an estimate within the $15,000 to $20,000 per-card range, which yields hardware amortization of $833 to $1,111/month for the pair), amortized over 36 months, the hardware component is $1,000 per month. Electricity for both GPUs at 500 W average, 720 hours/month, PUE of 1.4, and $0.13/kWh comes to about $131 per month. Colocation and power fees add ~$700 per month. Staffing at 25% of a senior engineer ($192,000/year fully loaded) comes to $4,000 per month. Software and observability tooling adds $100. Monthly TCO: ~$5,931, or about $3.95 per 1M tokens at this volume.
Against frontier APIs at that volume, the monthly bill depends on the blended rate. At the 70/30 input/output mix assumed earlier, GPT-4.1 blends to about $3.80 per 1M tokens, or roughly $5,700 per month, which is slightly below the self-hosted TCO. At blended rates of $5 to $10 per 1M tokens (a more output-heavy mix or a higher-priced frontier model), the API bill is $7,500 to $15,000 per month; against the $4,931 in monthly operating costs, the $36,000 hardware investment then pays back in roughly 4 to 14 months, or about 5.7 months at the $11,250 midpoint. Against Together AI serving the same model at $0.40 per 1M blended tokens, the API cost would be ~$600 per month, making self-hosting roughly ten times more expensive at this volume. In that case, self-hosting never breaks even on cost alone.
The TCO formula above accepts inputs for GPU choice, local electricity rate, expected utilization percentage, staffing cost allocation, and projected daily token volume. Teams should model at least three scenarios (optimistic utilization, realistic utilization, and low utilization) to stress-test assumptions before procurement decisions.
To calculate cost per 1M tokens: divide monthly TCO by total monthly tokens (in millions). For example: cost_per_million_tokens(tco=5931, daily_tokens_millions=50.0) yields ~$3.95, assuming a 30-day month. To find the break-even month versus a specified API alternative, divide the hardware price by the monthly saving (API cost minus operating costs): break_even_month(P_hardware=36000, monthly_opex=4931, monthly_api_cost=11250) yields approximately 5.7 months at the midpoint of the $7,500 to $15,000 frontier-API range.
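The two helper calls quoted above are not defined anywhere in the text. The following is a minimal sketch of what they might look like, assuming a 30-day month and a simple payback model (hardware price divided by monthly saving); the parameter names follow the calls as written, and returning infinity when the API is cheaper is a choice made for this example:

```python
import math


def cost_per_million_tokens(tco: float, daily_tokens_millions: float,
                            days: int = 30) -> float:
    """Self-hosted cost in USD per 1M tokens at a constant daily volume."""
    if daily_tokens_millions <= 0:
        raise ValueError("daily_tokens_millions must be positive")
    return tco / (daily_tokens_millions * days)


def break_even_month(P_hardware: float, monthly_opex: float,
                     monthly_api_cost: float) -> float:
    """Months until API savings repay the hardware; math.inf if they never do."""
    monthly_saving = monthly_api_cost - monthly_opex
    if monthly_saving <= 0:
        return math.inf  # the API is cheaper than running costs alone
    return P_hardware / monthly_saving


if __name__ == "__main__":
    # Illustrative round numbers: $6,000 TCO at 10M tokens/day.
    print(f"${cost_per_million_tokens(tco=6000, daily_tokens_millions=10.0):.2f} per 1M")
    # Figures quoted in the text: $36,000 hardware, $4,931 opex.
    print(f"{break_even_month(36000, 4931, 11250):.1f} months")  # 5.7 months
    print(break_even_month(36000, 4931, 600))                    # inf
```

The infinite result for the $600 API bill matches the case described above in which self-hosting never breaks even on cost alone.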
Self-hosting delivers clear economic advantage at sustained volumes above 10M tokens per day when compared to frontier closed-source APIs. It becomes the obvious choice when strict data residency or privacy regulations apply. Fine-tuned or custom models that must run without per-query markup push the decision further toward owned infrastructure, as do latency requirements that demand on-premises or single-tenant deployment.
Variable or low usage (under 2M to 5M tokens per day against frontier APIs) rarely justifies the fixed costs of self-hosting. Teams that need access to the latest frontier closed-source models have no self-hosting path for those weights, and limited DevOps or MLOps capacity turns the operational burden into a genuine risk to uptime and security posture.
Self-hosting economics improved between 2024 and 2026: GPU prices fell, open-weight model quality closed the gap for many production use cases (particularly text-based reasoning and instruction following), and inference tooling matured. But the economics only work above a clear volume threshold, and that threshold shifts depending on whether the comparison is frontier APIs or open-model API providers already running the same weights. The TCO formula above exists so you can find your own crossover point before committing capital.