GPU VPS for LLM inference — L4 vs L40S vs H100

If you're serving LLM inference in production, the choice of GPU is the single biggest variable affecting cost and performance. NVIDIA's current data center lineup gives you three relevant options for VPS-scale deployments: L4, L40S, and H100. This post is the practitioner's comparison — what each is good for, what they cost, and how to think about the tradeoff.

The contenders, briefly

Spec	L4	L40S	H100 SXM5
Memory	24 GB GDDR6	48 GB GDDR6	80 GB HBM3
FP16 TFLOPS (dense, Tensor Core)	121	362	989
FP8 TFLOPS (dense, Tensor Core)	242	733	1,979
Memory bandwidth	300 GB/s	864 GB/s	3,000 GB/s
Power	72 W	350 W	700 W
FranceVPS hourly price	€0.40	€1.40	€3.20

L4: cost-efficient inference

The L4 is the smallest of the three and the cheapest. 24 GB is enough to host:

Llama 3 8B in FP16 (16 GB) with room for KV cache
Mistral 7B Instruct in FP16
Quantized 13B models (8-bit GPTQ, 4-bit AWQ)
Stable Diffusion XL (10 GB model + workspace)
Embeddings models (BGE, GTE, multilingual-e5) — these fit in 4-6 GB easily

What it can't do: serve large models without aggressive quantization, run full fine-tuning (gradients and optimizer state overwhelm 24 GB), or process very long contexts (the KV cache for 32K+ tokens eats memory fast).
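To get a feel for how fast the KV cache grows, here is a rough estimate for Llama 3 8B. The architecture numbers (32 layers, 8 KV heads, head dimension 128, FP16 values) come from the published model config rather than from this post, and the helper name is purely illustrative. It ignores allocator overhead and activation memory, so treat the result as a lower bound.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Approximate KV cache size: one key and one value vector per layer per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token

GIB = 1024 ** 3
for context in (8_192, 32_768, 131_072):
    size_gib = kv_cache_bytes(context) / GIB
    print(f"{context:>7} tokens: {size_gib:.1f} GiB of KV cache")
```

Under these assumptions a single 32K-token sequence needs about 4 GiB, which is roughly half of what is left on a 24 GB card once 16 GB of FP16 weights are loaded.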

Throughput for Llama 3 8B with vLLM on a single L4: roughly 1,400 tokens/sec aggregate across concurrent requests. At our €0.40/hour price, that's €0.40 / (1,400 × 3,600) × 1,000 ≈ €0.00008 per 1K output tokens. Hard to beat for pure inference.
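The per-token arithmetic is easy to get wrong by a factor of 1,000, so here is a small sketch of it. The function name is made up for illustration, and it assumes the GPU is fully utilized for the whole billed hour.

```python
def cost_per_1k_tokens(price_per_hour_eur, tokens_per_sec):
    """Cost in EUR of 1,000 output tokens at full utilization."""
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour_eur / tokens_per_hour * 1000

l4 = cost_per_1k_tokens(0.40, 1400)
print(f"L4, Llama 3 8B: EUR {l4:.5f} per 1K output tokens")
```

The same function applies to any other price and throughput pair; only the two inputs change.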

Use L4 when: you're serving small-to-medium models, you care about cost per token, you don't need ultra-low first-token latency, you're building embeddings infrastructure.

L40S: the workhorse

The L40S is a 350W GPU with 48 GB of GDDR6 memory and substantially more compute than the L4. It's our default recommendation for most production LLM workloads.

What 48 GB enables:

Llama 3 70B in 4-bit quantization (35-40 GB) with usable KV cache
Mixtral 8×7B in 4-bit (24 GB) with comfortable margin
FP16 inference of 13-22B models with full KV cache
Mid-scale fine-tuning (LoRA on 13B base models)
SDXL with multiple ControlNets loaded simultaneously
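A quick way to sanity-check what fits is to estimate the weight footprint from parameter count and bits per weight. This is a back-of-the-envelope sketch with an illustrative helper name. It counts weights only, so real quantized checkpoints come out somewhat larger because of scales and other metadata, and the KV cache comes on top.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * bits_per_weight / 8

GPU_MEMORY_GB = 48  # L40S

models = [
    ("Llama 3 70B, 4-bit", 70, 4),
    ("Llama 3 70B, FP16", 70, 16),
    ("13B model, FP16", 13, 16),
]
for name, params, bits in models:
    gb = weight_memory_gb(params, bits)
    verdict = "fits" if gb < GPU_MEMORY_GB else "does not fit"
    print(f"{name}: {gb:.0f} GB of weights, {verdict} in {GPU_MEMORY_GB} GB")
```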

Throughput for Llama 3 70B Q4, vLLM, single L40S: ~620 tokens/sec aggregate. At €1.40/hour, that's roughly €0.00063 per 1K output tokens for a 70B model — competitive with API pricing for most providers, and you don't pay per-input-token.

The L40S is also the right choice if you want to do continuous fine-tuning of mid-size models. With 48 GB you can LoRA-train a 13B model with batch size 4-8 and reasonable context lengths.

Use L40S when: you need 30-70B model inference, you want headroom for batching, you're doing fine-tuning of mid-size models, you need flexibility across multiple model sizes.

H100: serious training and high-throughput inference

The H100 is in a different league: 80 GB of HBM3 memory with 3 TB/sec bandwidth (3.5× the L40S), plus the Transformer Engine, which accelerates transformer layers by up to ~6× via FP8 support.

Where the H100 shines:

FP16 inference of 70B models with full context window when split across two or more H100s (the weights alone are ~140 GB), or FP8 on a single card
Multi-LoRA serving (loading dozens of LoRA adapters and routing per-request)
Continuous training and fine-tuning of large models
Throughput-critical inference where you'd otherwise need 4-8× L40S

Throughput for Llama 3 70B FP8 with vLLM on a single H100: ~2,400 tokens/sec aggregate. At €3.20/hour, that's roughly €0.00037 per 1K output tokens for FP8 70B, significantly cheaper per token than L40S Q4 if you can saturate the GPU.

The "if you can saturate" caveat matters. An H100 sitting at 30% utilization is more expensive than an L40S at 80% utilization for the same workload. The H100 makes economic sense when you have enough load to keep it busy.
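To make the utilization caveat concrete, here is a sketch that scales peak throughput by a utilization fraction, using the prices and throughput figures quoted in this post. The function name is illustrative, and treating utilization as a simple multiplier on peak throughput is a simplifying assumption.

```python
def effective_cost_per_1k(price_per_hour_eur, peak_tokens_per_sec, utilization):
    """EUR per 1K tokens when the GPU delivers only a fraction of its peak throughput."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_sec * utilization * 3600
    return price_per_hour_eur / tokens_per_hour * 1000

h100_30 = effective_cost_per_1k(3.20, 2400, 0.30)
l40s_80 = effective_cost_per_1k(1.40, 620, 0.80)
print(f"H100 at 30%: EUR {h100_30:.5f} per 1K tokens")
print(f"L40S at 80%: EUR {l40s_80:.5f} per 1K tokens")

# Ratio of H100 utilization to L40S utilization at which cost per token is equal.
break_even_ratio = (3.20 / 2400) / (1.40 / 620)
print(f"H100 breaks even at {break_even_ratio:.0%} of the L40S's utilization")
```

With these inputs the H100 only wins on cost per token once its utilization exceeds roughly 59% of whatever the L40S would run at, so an L40S at 80% is matched by an H100 at about 47%.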

Use H100 when: you're training or fine-tuning large models, you have sustained high-throughput inference workloads, you need FP8 acceleration, you want maximum tokens/sec per VPS.

The decision framework

Three questions:

What model size are you serving? <13B → L4 or L40S. 13-70B → L40S or H100. 70B+ in FP16 → H100, split across multiple GPUs.
What's your sustained load? Sporadic (under 30% utilization) → L4 or L40S. Sustained → moving up a tier becomes more cost-efficient.
What's your latency target? First-token under 100ms with large models → H100 (FP8 + Transformer Engine). Otherwise L4/L40S work.

The vLLM advantage

Whatever GPU you pick, use vLLM (or TGI, or TensorRT-LLM) instead of bare PyTorch. The continuous batching architecture means you process multiple requests simultaneously without idle GPU time between requests. Throughput improvements of 5-10× are typical compared to naive serve loops.
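The effect of continuous batching can be illustrated with a toy simulation. This is not how vLLM is implemented internally. It assumes every decode step takes the same time regardless of how many sequences are in flight, ignores prefill entirely, and uses hypothetical output lengths. It only compares a static batch, which waits for its longest request, with a scheduler that refills freed slots immediately.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes; finished slots sit idle."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """A freed slot is refilled from the queue before the next decode step."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decode step emits one token per active request
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

lengths = [20, 400, 35, 50, 380, 25, 60, 30]  # hypothetical output lengths in tokens
total_tokens = sum(lengths)
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(f"static:     {static_steps} steps, {total_tokens / static_steps:.2f} tokens/step")
print(f"continuous: {continuous_steps} steps, {total_tokens / continuous_steps:.2f} tokens/step")
```

The gap widens as output lengths become more uneven, because static batches spend more steps waiting on their slowest member.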

Our standard GPU VPS images come with vLLM pre-installed. Spinning up a Llama 3 70B inference endpoint is roughly 4 commands once the VPS is provisioned.

Real-world cost comparison

For a workload of ~5M output tokens per day on a 70B model:

OpenAI GPT-4o: ~€75/day
Anthropic Claude Sonnet: ~€60/day
L40S running Llama 3 70B Q4: €33/day if utilized 24/7, €15/day if scaled down at night
H100 running Llama 3 70B FP8: €77/day if running 24/7, with roughly 4× the L40S's throughput headroom (2,400 vs 620 tokens/sec) at that price

Self-hosted is cheaper at scale. The crossover point depends on your token volume — usually somewhere between 1M and 5M tokens per day, depending on which GPU and which API you'd otherwise use.
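The crossover can be estimated directly from the figures in this post. The sketch below derives an implied API rate from the daily costs quoted for 5M tokens and compares it with an always-on GPU. The function name is illustrative, and the calculation assumes the GPU runs 24 hours a day and can absorb the volume.

```python
def breakeven_tokens_per_day(gpu_price_per_hour_eur, api_cost_per_million_eur, hours_per_day=24):
    """Daily token volume above which an always-on GPU is cheaper than a per-token API."""
    gpu_cost_per_day = gpu_price_per_hour_eur * hours_per_day
    return gpu_cost_per_day / api_cost_per_million_eur * 1_000_000

# API cost per million tokens implied by the daily figures: daily cost / 5M tokens.
api_rates = {"GPT-4o": 75 / 5, "Claude Sonnet": 60 / 5}
gpu_prices = {"L40S": 1.40, "H100": 3.20}

for api_name, api_rate in api_rates.items():
    for gpu_name, gpu_price in gpu_prices.items():
        tokens = breakeven_tokens_per_day(gpu_price, api_rate)
        print(f"{gpu_name} vs {api_name}: break-even at {tokens / 1e6:.1f}M tokens/day")
```

With these inputs an always-on L40S breaks even at roughly 2-3M tokens per day, while an always-on H100 needs a little over 5M. Scaling the GPU down during quiet hours lowers the threshold further.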
