If you're serving LLM inference in production, the choice of GPU is the single biggest variable affecting cost and performance. NVIDIA's current data center lineup gives you three relevant options for VPS-scale deployments: L4, L40S, and H100. This post is the practitioner's comparison — what each is good for, what they cost, and how to think about the tradeoff.
The contenders, briefly
| | L4 | L40S | H100 SXM5 |
|---|---|---|---|
| Memory | 24 GB GDDR6 | 48 GB GDDR6 | 80 GB HBM3 |
| FP16 Tensor TFLOPS (dense) | 121 | 362 | 989 |
| FP8 Tensor TFLOPS (dense) | 242 | 733 | 1,979 |
| Memory bandwidth | 300 GB/s | 864 GB/s | 3,000 GB/s |
| Power | 72 W | 350 W | 700 W |
| FranceVPS hourly price | €0.40 | €1.40 | €3.20 |
L4: cost-efficient inference
The L4 is the smallest of the three and the cheapest. 24 GB is enough to host:
- Llama 3 8B in FP16 (16 GB) with room for KV cache
- Mistral 7B Instruct in FP16
- Quantized 13B models (8-bit GPTQ, 4-bit AWQ)
- Stable Diffusion XL (10 GB model + workspace)
- Embeddings models (BGE, GTE, multilingual-e5) — these fit in 4-6 GB easily
What it can't do: serve large models without aggressive quantization, fine-tune (memory pressure during gradient computation overwhelms 24 GB), or process very long contexts (the KV cache for 32K+ tokens eats memory fast).
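To put a number on how fast the KV cache grows, here is a back-of-the-envelope sketch. It assumes Llama 3 8B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and an FP16 cache. The function name is illustrative, and a real server adds overhead for paging and activations on top of this.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Approximate KV cache size for `tokens` cached tokens.

    `tokens` is the total across all sequences held in memory.
    Defaults are assumed values for Llama 3 8B with an FP16 cache.
    """
    # Factor of 2: one key vector and one value vector per layer and KV head.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token


GIB = 1024 ** 3
print(kv_cache_bytes(32_768) / GIB)  # 4.0
```

Under these assumptions a single 32K-token sequence needs about 4 GiB of cache, which leaves room for only one or two such sequences next to 16 GB of FP16 weights on a 24 GB card. A model without grouped-query attention (for example `kv_heads=32`) would need four times as much.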
Throughput for Llama 3 8B with vLLM on a single L4: roughly 1,400 tokens/sec aggregate across concurrent requests. At our €0.40/hour price, that's €0.40 / (1,400 × 3,600) × 1,000 ≈ €0.00008 per 1K output tokens. Hard to beat for pure inference.
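The per-token arithmetic is easy to get wrong by a factor of 1,000, so here it is as a small helper. The function name is made up for this post; the prices come from the table and the throughput figures are the ones quoted in each GPU's section. It assumes the GPU is fully saturated for the whole hour.

```python
def eur_per_1k_tokens(hourly_eur, tokens_per_sec):
    """Cost of 1,000 output tokens at a given hourly price and aggregate throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_eur / tokens_per_hour * 1000


# (label, hourly price in EUR, aggregate tokens/sec) as quoted in this post
configs = [
    ("L4, Llama 3 8B", 0.40, 1400),
    ("L40S, Llama 3 70B Q4", 1.40, 620),
    ("H100, Llama 3 70B FP8", 3.20, 2400),
]
for label, price, tps in configs:
    print(f"{label}: €{eur_per_1k_tokens(price, tps):.5f} per 1K tokens")
```

This prints €0.00008, €0.00063, and €0.00037 respectively. Real costs will be higher whenever the GPU runs below its peak throughput.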
Use L4 when: you're serving small-to-medium models, you care about cost per token, you don't need ultra-low first-token latency, you're building embeddings infrastructure.
L40S: the workhorse
The L40S is a 350W GPU with 48 GB of GDDR6 memory and substantially more compute than the L4. It's our default recommendation for most production LLM workloads.
What 48 GB enables:
- Llama 3 70B in 4-bit quantization (35-40 GB) with usable KV cache
- Mixtral 8×7B in 4-bit (24 GB) with comfortable margin
- FP16 inference of 13-22B models with full KV cache
- Mid-scale fine-tuning (LoRA on 13B base models)
- SDXL with multiple ControlNets loaded simultaneously
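The sizes in this list follow from a simple rule of thumb: parameters times bits per weight, divided by eight. A rough sketch, with hypothetical helper names, that counts weights only:

```python
def weight_gb(params_billions, bits_per_weight):
    """Weight footprint in GB (10^9 bytes). Ignores quantization scales and runtime overhead."""
    return params_billions * bits_per_weight / 8


def headroom_gb(params_billions, bits_per_weight, vram_gb):
    """VRAM left for KV cache and activations after loading the weights."""
    return vram_gb - weight_gb(params_billions, bits_per_weight)


print(weight_gb(70, 4))         # 35.0  (70B at 4-bit)
print(headroom_gb(70, 4, 48))   # 13.0  (left over on a 48 GB card)
print(headroom_gb(22, 16, 48))  # 4.0   (22B at FP16 on a 48 GB card)
```

Real 4-bit checkpoints also store per-group scales and zero points, so they come out somewhat larger than this estimate, which is why the 70B figure above is quoted as a 35-40 GB range.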
Throughput for Llama 3 70B Q4, vLLM, single L40S: ~620 tokens/sec aggregate. At €1.40/hour, that's roughly €0.00063 per 1K output tokens for a 70B model — competitive with API pricing for most providers, and you don't pay per-input-token.
The L40S is also the right choice if you want to do continuous fine-tuning of mid-size models. With 48 GB you can LoRA-train a 13B model with batch size 4-8 and reasonable context lengths.
Use L40S when: you need 30-70B model inference, you want headroom for batching, you're doing fine-tuning of mid-size models, you need flexibility across multiple model sizes.
H100: serious training and high-throughput inference
The H100 is in a different league: 80 GB of HBM3 memory with 3 TB/sec of bandwidth (about 3.5× the L40S), plus the Transformer Engine, which uses FP8 to accelerate transformer layers by up to ~6×.
Where the H100 shines:
- FP8 inference of 70B models on a single card (about 70 GB of weights; FP16 needs about 140 GB, so it takes two or more H100s)
- Multi-LoRA serving (loading dozens of LoRA adapters and routing per-request)
- Continuous training and fine-tuning of large models
- Throughput-critical inference where you'd otherwise need 4-8× L40S
Throughput for Llama 3 70B FP8 with vLLM on a single H100: ~2,400 tokens/sec aggregate. At €3.20/hour, that's €0.00037 per 1K output tokens for FP8 70B, significantly cheaper per token than L40S Q4 if you can saturate the GPU.
The "if you can saturate" caveat matters. An H100 sitting at 30% utilization is more expensive than an L40S at 80% utilization for the same workload. The H100 makes economic sense when you have enough load to keep it busy.
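To make the utilization caveat concrete, the sketch below scales each card's peak throughput by its utilization before computing cost per token. It is a simplified model that assumes delivered tokens scale linearly with utilization; the function name is illustrative, and the prices and throughputs are the ones quoted above.

```python
def effective_eur_per_1k(hourly_eur, peak_tokens_per_sec, utilization):
    """Cost per 1K tokens when the GPU delivers only `utilization` of its peak throughput."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_sec * utilization * 3600
    return hourly_eur / tokens_per_hour * 1000


h100 = effective_eur_per_1k(3.20, 2400, 0.30)
l40s = effective_eur_per_1k(1.40, 620, 0.80)
print(f"H100 at 30%: €{h100:.5f}  L40S at 80%: €{l40s:.5f}")

# The H100 is cheaper per token only if its utilization exceeds this
# fraction of the L40S's utilization.
ratio = (3.20 / 2400) / (1.40 / 620)
print(f"break-even utilization ratio: {ratio:.2f}")
```

With these figures the H100 at 30% costs about €0.00123 per 1K tokens against about €0.00078 for the L40S at 80%. The break-even ratio of about 0.59 means an L40S at 80% utilization is only beaten by an H100 running above roughly 47%.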
Use H100 when: you're training or fine-tuning large models, you have sustained high-throughput inference workloads, you need FP8 acceleration, you want maximum tokens/sec per VPS.
The decision framework
Three questions:
- What model size are you serving? <13B → L4 or L40S. 13-70B → L40S or H100. 70B+ in FP16 → multiple H100s, since the weights alone (about 140 GB for 70B) exceed a single 80 GB card.
- What's your sustained load? Sporadic (under 30% utilization) → L4 or L40S. Sustained → moving up a tier becomes more cost-efficient.
- What's your latency target? First-token under 100ms with large models → H100 (FP8 + Transformer Engine). Otherwise L4/L40S work.
The vLLM advantage
Whatever GPU you pick, use vLLM (or TGI, or TensorRT-LLM) instead of bare PyTorch. Continuous batching admits new requests into the running batch as soon as earlier ones finish, so the GPU doesn't sit idle between requests. Throughput improvements of 5-10× are typical compared to naive one-request-at-a-time serving loops.
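A toy simulation shows where continuous batching saves time. This is not vLLM's scheduler. It is a deliberately simplified model that counts decode steps only and assumes every active request emits one token per step regardless of how full the batch is. The function names and request lengths are made up for illustration.

```python
import heapq


def static_batch_steps(lengths, batch_size):
    """Fixed batches: each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batch_steps(lengths, batch_size):
    """Continuous batching: a freed slot is refilled from the queue immediately."""
    slots = []   # min-heap of the step at which each busy slot becomes free
    finish = 0
    for n in lengths:
        # Use an empty slot if one exists, otherwise wait for the earliest to free up.
        start = 0 if len(slots) < batch_size else heapq.heappop(slots)
        end = start + n
        heapq.heappush(slots, end)
        finish = max(finish, end)
    return finish


output_lengths = [10, 100, 20, 80, 30, 60, 40, 50]  # tokens per request
print(static_batch_steps(output_lengths, 4))      # 160
print(continuous_batch_steps(output_lengths, 4))  # 130
```

In the static case, short requests finish early and their slots sit empty until the longest request in the batch completes. The gap between the two numbers widens as output lengths vary more, and the gap against a loop that serves one request at a time is larger still.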
Our standard GPU VPS images come with vLLM pre-installed. Spinning up a Llama 3 70B inference endpoint is roughly 4 commands once the VPS is provisioned.
Real-world cost comparison
For a workload of ~5M output tokens per day on a 70B model:
- OpenAI GPT-4o: ~€75/day
- Anthropic Claude Sonnet: ~€60/day
- L40S running Llama 3 70B Q4: ~€34/day if utilized 24/7 (€1.40 × 24), €15/day if scaled down at night
- H100 running Llama 3 70B FP8: €77/day if utilized 24/7, but with roughly 4× the L40S's peak throughput (2,400 vs. 620 tokens/sec) available at that same cost
Self-hosted is cheaper at scale. The crossover point depends on your token volume — usually somewhere between 1M and 5M tokens per day, depending on which GPU and which API you'd otherwise use.
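The crossover can be estimated from the figures above. This sketch derives each API's per-token price from its daily cost at 5M tokens, then asks how many tokens per day it takes for the API bill to match a fixed daily GPU cost. The helper name is hypothetical, and the model ignores input-token charges, ops time, and whether the GPU can actually serve that volume.

```python
def crossover_tokens_per_day(gpu_eur_per_day, api_eur_per_day, api_tokens_per_day=5_000_000):
    """Daily token volume at which the API bill equals the fixed GPU cost."""
    api_eur_per_token = api_eur_per_day / api_tokens_per_day
    return gpu_eur_per_day / api_eur_per_token


# Daily costs taken from the comparison above (GPU rows computed as hourly price x 24).
gpus = {
    "L40S 24/7": 1.40 * 24,
    "L40S scaled down at night": 15.0,
    "H100 24/7": 3.20 * 24,
}
apis = {"GPT-4o": 75.0, "Claude Sonnet": 60.0}

for gpu, gpu_cost in gpus.items():
    for api, api_cost in apis.items():
        millions = crossover_tokens_per_day(gpu_cost, api_cost) / 1e6
        print(f"{gpu} vs {api}: {millions:.2f}M tokens/day")
```

Under these assumptions the L40S rows cross over at roughly 1.0M to 2.8M tokens per day, and an H100 running around the clock at roughly 5.1M to 6.4M.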