
GPU VPS for LLM inference — L4 vs L40S vs H100

Choosing the right GPU for your model size, expected throughput, and budget.

If you're serving LLM inference in production, the choice of GPU is the single biggest variable affecting cost and performance. NVIDIA's current data center lineup gives you three relevant options for VPS-scale deployments: L4, L40S, and H100. This post is the practitioner's comparison — what each is good for, what they cost, and how to think about the tradeoff.

The contenders, briefly

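| | L4 | L40S | H100 SXM5 |
|---|---|---|---|
| Memory | 24 GB GDDR6 | 48 GB GDDR6 | 80 GB HBM3 |
| FP16 TFLOPS | 30 | 91 | 989 |
| FP8 TFLOPS | 121 | 366 | 1,979 |
| Memory bandwidth | 300 GB/s | 864 GB/s | 3,000 GB/s |
| Power | 72 W | 350 W | 700 W |
| FranceVPS hourly price | €0.40 | €1.40 | €3.20 |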

L4: cost-efficient inference

The L4 is the smallest of the three and the cheapest. 24 GB is enough to host:

What it can't do: serve large models without aggressive quantization, fine-tune (memory pressure during gradient computation overwhelms 24 GB), or process very long contexts (the KV cache for 32K+ tokens eats memory fast).
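To make the long-context point concrete, here is a rough sketch of KV cache sizing. It assumes the published Llama 3 8B architecture (32 layers, 8 key/value heads, head dimension 128) and FP16 cache entries; the function name is illustrative, and real servers add paging and allocator overhead on top.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes of KV cache for one sequence: keys and values for every layer."""
    # The factor of 2 covers the separate key and value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed architecture: Llama 3 8B, FP16 cache (2 bytes per value).
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
per_32k_sequence = kv_cache_bytes(32, 8, 128, seq_len=32 * 1024)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{per_32k_sequence / 1024**3:.0f} GiB per 32K-token sequence")
```

Under these assumptions a single 32K-token sequence takes about 4 GiB of cache. With FP16 weights for an 8B model already using roughly 16 GB, a 24 GB card has room for only one or two such sequences.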

Throughput for Llama 3 8B with vLLM, single L4: roughly 1,400 tokens/sec aggregate across concurrent requests. At our €0.40/hour price, that's €0.40 / (1,400 × 3,600) × 1,000 ≈ €0.00008 per 1K output tokens, assuming the GPU stays saturated. Hard to beat for pure inference.
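The per-token arithmetic is easy to reuse for your own measurements. A minimal sketch, assuming a fully saturated GPU and using the price and throughput quoted above (the function name is illustrative):

```python
def cost_per_1k_tokens(hourly_price_eur, tokens_per_sec):
    """Cost in euros per 1,000 output tokens at full saturation."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_price_eur / tokens_per_hour * 1000


# L4 figures from this post: €0.40/hour, ~1,400 tokens/sec aggregate.
l4 = cost_per_1k_tokens(0.40, 1400)
print(f"L4: €{l4:.5f} per 1K output tokens")
```

Swap in your own measured throughput; the quoted figure depends heavily on batch size, prompt length, and output length.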

Use L4 when: you're serving small-to-medium models, you care about cost per token, you don't need ultra-low first-token latency, you're building embeddings infrastructure.

L40S: the workhorse

The L40S is a 350W GPU with 48 GB of GDDR6 memory and substantially more compute than the L4. It's our default recommendation for most production LLM workloads.

What 48 GB enables:

Throughput for Llama 3 70B Q4, vLLM, single L40S: ~620 tokens/sec aggregate. At €1.40/hour, that's roughly €0.00063 per 1K output tokens for a 70B model — competitive with API pricing for most providers, and you don't pay per-input-token.
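Why a 70B model fits on 48 GB only when quantized comes down to simple arithmetic on the weights. A rough sketch, counting weights only and treating Q4 as exactly 4 bits per weight (real quantization formats carry some extra overhead for scales, so treat these as lower bounds):

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * bits_per_weight / 8


for label, bits in [("FP16", 16), ("FP8", 8), ("Q4", 4)]:
    print(f"70B {label}: ~{weight_memory_gb(70, bits):.0f} GB")
```

At 4 bits the weights come to about 35 GB, leaving roughly 13 GB of the L40S's 48 GB for the KV cache and activations. At FP16 the same model needs about 140 GB for weights alone.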

The L40S is also the right choice if you want to do continuous fine-tuning of mid-size models. With 48 GB you can LoRA-train a 13B model with batch size 4-8 and reasonable context lengths.

Use L40S when: you need 30-70B model inference, you want headroom for batching, you're doing fine-tuning of mid-size models, you need flexibility across multiple model sizes.

H100: serious training and high-throughput inference

The H100 is in a different league. It pairs 80 GB of HBM3 memory at 3 TB/s of bandwidth (about 3.5× the L40S) with the Transformer Engine, which uses FP8 support to accelerate attention computation by up to ~6×.
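Memory bandwidth matters because token-by-token decoding has to read the full set of weights for every generated token. A common back-of-the-envelope ceiling for a single sequence is bandwidth divided by weight size. The sketch below applies it to the bandwidth figures from the table; it ignores KV cache reads and compute time, so real numbers will be lower.

```python
def single_stream_decode_ceiling(bandwidth_gb_s, weights_gb):
    """Upper bound on tokens/sec for one sequence: one full weight read per token."""
    return bandwidth_gb_s / weights_gb


h100_fp8_70b = single_stream_decode_ceiling(3000, 70)  # 70B at 1 byte per weight
l40s_q4_70b = single_stream_decode_ceiling(864, 35)    # 70B at 4 bits per weight

print(f"H100, 70B FP8: at most ~{h100_fp8_70b:.0f} tokens/sec per sequence")
print(f"L40S, 70B Q4:  at most ~{l40s_q4_70b:.0f} tokens/sec per sequence")
```

Batching is what lifts aggregate throughput well above these per-sequence ceilings, since one pass over the weights serves every sequence in the batch.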

Where the H100 shines:

Throughput for Llama 3 70B FP8, vLLM, single H100: ~2,400 tokens/sec aggregate. At €3.20/hour, that's roughly €0.00037 per 1K output tokens for FP8 70B, about 40% cheaper per token than L40S Q4 if you can saturate the GPU.

The "if you can saturate" caveat matters. An H100 sitting at 30% utilization is more expensive than an L40S at 80% utilization for the same workload. The H100 makes economic sense when you have enough load to keep it busy.
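The utilization effect can be checked with the prices and peak throughputs quoted in this post. A small sketch, assuming throughput scales linearly with utilization (a simplification) and using an illustrative function name:

```python
def effective_cost_per_1k(hourly_price_eur, peak_tokens_per_sec, utilization):
    """Cost per 1K output tokens when the GPU is busy only part of the time."""
    tokens_per_hour = peak_tokens_per_sec * utilization * 3600
    return hourly_price_eur / tokens_per_hour * 1000


h100_at_30 = effective_cost_per_1k(3.20, 2400, 0.30)
l40s_at_80 = effective_cost_per_1k(1.40, 620, 0.80)
print(f"H100 at 30%: €{h100_at_30:.5f} per 1K tokens")
print(f"L40S at 80%: €{l40s_at_80:.5f} per 1K tokens")

# Ratio of H100 utilization to L40S utilization at which per-token cost is equal.
breakeven_ratio = (3.20 / 2400) / (1.40 / 620)
print(f"Break-even utilization ratio: {breakeven_ratio:.2f}")
```

With these figures the H100 at 30% costs about €0.00123 per 1K tokens against about €0.00078 for the L40S at 80%. The H100 only wins per token once its utilization is at least roughly 0.59 times that of the L40S, for example above about 47% when the L40S would run at 80%.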

Use H100 when: you're training or fine-tuning large models, you have sustained high-throughput inference workloads, you need FP8 acceleration, you want maximum tokens/sec per VPS.

The decision framework

Three questions:

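  1. What model size are you serving? <13B → L4 or L40S. 13-70B → L40S or H100. 70B+ FP16 → H100, and note that FP16 weights for a 70B model (about 140 GB) exceed a single 80 GB card, so plan for multiple GPUs or FP8.
  2. What's your sustained load? Sporadic (under 30% utilization) → L4 or L40S. Sustained → the next tier up becomes more cost-efficient per token.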
  3. What's your latency target? First-token under 100ms with large models → H100 (FP8 + Transformer Engine). Otherwise L4/L40S work.
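The three questions can be written down as a small helper. This is only a sketch of the rules above: the function and parameter names are hypothetical, the ">70B below FP16" branch and the "large model means 13B or more" threshold are assumptions the list does not spell out, and a real choice should also weigh measured throughput.

```python
def recommend_gpu(params_billions, fp16=False, sustained_load=False,
                  needs_sub_100ms_first_token=False):
    """Return one GPU tier following the three questions in this post."""
    # Question 1: model size narrows the field to one or two tiers.
    if params_billions < 13:
        candidates = ["L4", "L40S"]
    elif params_billions >= 70 and fp16:
        candidates = ["H100"]
    elif params_billions <= 70:
        candidates = ["L40S", "H100"]
    else:
        candidates = ["H100"]  # assumption: no rule is given for >70B below FP16

    # Question 3: tight first-token latency on large models points to the H100.
    # Assumption: "large" is read as 13B parameters or more.
    if needs_sub_100ms_first_token and params_billions >= 13:
        return "H100"

    # Question 2: sporadic load favours the cheaper tier, sustained load the larger.
    return candidates[-1] if sustained_load else candidates[0]


print(recommend_gpu(8))                       # small model, sporadic load
print(recommend_gpu(70, sustained_load=True)) # 70B quantized, sustained load
```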

The vLLM advantage

Whatever GPU you pick, use vLLM (or TGI, or TensorRT-LLM) instead of bare PyTorch. Continuous batching lets the server add new requests to the running batch as soon as earlier ones finish, so the GPU doesn't sit idle between requests. Throughput improvements of 5-10× are typical compared to naive serving loops.
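A toy simulation shows where the gain from continuous batching comes from. It is not how vLLM is implemented: it only counts decode steps, assumes every step costs the same, and uses made-up request lengths. Static batching waits for the longest request in each batch, while continuous batching refills a slot the moment a request finishes.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    queue = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        # Fill any free slots before the next decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Requests with one token left finish on this step and free their slot.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical workload: two long generations mixed with six short ones.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lengths, batch_size=4))      # 200
print(continuous_batching_steps(lengths, batch_size=4))  # 110
```

The same 260 tokens are produced in 110 steps instead of 200 because short requests no longer wait behind long ones. Real gains depend on the mix of request lengths and on GPU-side effects this model ignores.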

Our standard GPU VPS images come with vLLM pre-installed. Spinning up a Llama 3 70B inference endpoint is roughly 4 commands once the VPS is provisioned.

Real-world cost comparison

For a workload of ~5M output tokens per day on a 70B model:

Self-hosted is cheaper at scale. The crossover point depends on your token volume — usually somewhere between 1M and 5M tokens per day, depending on which GPU and which API you'd otherwise use.
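To find your own crossover point, compare the fixed daily cost of an always-on VPS with what the same tokens would cost through an API. The sketch below uses the hourly prices from this post; the API price is a placeholder, not a quote from any provider, and the function names are illustrative.

```python
def daily_cost_eur(hourly_price_eur, hours_per_day=24):
    """Fixed daily cost of a VPS that runs around the clock."""
    return hourly_price_eur * hours_per_day


def self_hosted_price_per_1m(hourly_price_eur, tokens_per_day):
    """Effective price per 1M output tokens at a given daily volume."""
    return daily_cost_eur(hourly_price_eur) / (tokens_per_day / 1_000_000)


def breakeven_tokens_per_day(hourly_price_eur, api_price_per_1m_eur):
    """Daily output tokens above which the always-on VPS is cheaper than the API."""
    return daily_cost_eur(hourly_price_eur) / api_price_per_1m_eur * 1_000_000


PLACEHOLDER_API_PRICE_PER_1M_EUR = 10.0  # hypothetical, replace with a real quote

for name, price in [("L40S", 1.40), ("H100", 3.20)]:
    per_1m = self_hosted_price_per_1m(price, 5_000_000)
    breakeven = breakeven_tokens_per_day(price, PLACEHOLDER_API_PRICE_PER_1M_EUR)
    print(f"{name}: €{daily_cost_eur(price):.2f}/day, "
          f"€{per_1m:.2f} per 1M tokens at 5M/day, "
          f"break-even at {breakeven / 1e6:.2f}M tokens/day")
```

At 5M output tokens per day, an always-on L40S works out to €33.60 per day, or €6.72 per 1M tokens, and an H100 to €76.80 per day, or €15.36 per 1M tokens. Whether that beats an API depends entirely on the price you plug in.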

