TL;DR
We benchmarked 4 open-source LLMs on two GPUs — the Scaleway L40S (48 GB VRAM, ~€450/mo) and the NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM, ~€900/mo) — under real Eridia production conditions: a ~6,000-token business system prompt, 5 declared tools, and realistic French-language queries.
The verdict: Google's Gemma 4 models win decisively. On the L40S they deliver the first token in 195 to 260 ms on average and generate at about 16 tokens/s; on the RTX PRO 6000, the first token arrives in 61 to 64 ms and generation runs at 35 to 40 tokens/s, roughly 2.5x the throughput. Alibaba's Qwen3.5 models are disqualified for interactive chat: their "thinking" mode, which cannot be disabled through the standard API, forces a 2-to-11-second wait before the first word appears.
Why self-host your AI in 2026?
For a European SMB handling sensitive documents — contracts, HR data, healthcare, finance — sending every request to OpenAI or Anthropic raises three problems:
- Data sovereignty — your documents transit through US servers, subject to the Cloud Act
- Unpredictable cost — with 20 active users, the API bill can exceed €2,000/month and varies every month
- Dependency — a pricing change, an outage, or a terms-of-service update, and your business tool stops working
Self-hosting answers all three: data stays in your datacenter (or with a European host like Scaleway), the cost is fixed, and you keep full control. One question remains: which model and which GPU? That is exactly what this benchmark answers.
The test bench
Two GPUs representative of the market
| | Scaleway L40S | RTX PRO 6000 Blackwell |
|---|---|---|
| VRAM | 48 GB GDDR6 | 96 GB GDDR7 |
| Memory bandwidth | 864 GB/s | ~1,600 GB/s |
| Architecture | Ada Lovelace (2023) | Blackwell (2025) |
| TDP | 350 W | 600 W |
| Cloud price | ~€450/mo (Scaleway) | ~€900/mo (RunPod) |
The L40S is the most accessible GPU for serious LLM inference in Europe: 48 GB of VRAM fits models up to ~35B parameters in FP8. The RTX PRO 6000 Blackwell, with double the VRAM and nearly double the bandwidth, is the new reference point for a cloud-API-grade experience.
Four models, two architectures
| Model | Architecture | Parameters | Context | Vendor |
|---|---|---|---|---|
| Gemma 4 31B-it | Dense | 31B (all active) | 128K | Google DeepMind |
| Gemma 4 26B-A4B-it | MoE | 26B (~4B active) | 128K | Google DeepMind |
| Qwen3.5 27B | Dense + thinking | 27B (all active) | 128K | Alibaba Cloud |
| Qwen3.5 35B-A3B | MoE + thinking | 35B (~3B active) | 128K | Alibaba Cloud |
Dense or MoE? A dense model uses all of its parameters for every token. An MoE (Mixture of Experts) routes each token to a subset of experts only (~4B out of 26B for Gemma): less compute per token, but all weights must still fit in VRAM.
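To make the routing idea concrete, here is a toy sketch of top-k expert routing in plain Python. It is an illustration only, not Gemma's actual router: the function names, the four scalar "experts", and `top_k=2` are all assumptions chosen for readability.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, experts, top_k=2):
    """Route one token to the top_k highest-scoring experts only.

    Every expert has to exist in memory, but only top_k of them run.
    """
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalize the router scores over the chosen experts.
    weights = softmax([router_logits[i] for i in chosen])
    output = sum(w * experts[i](x) for w, i in zip(weights, chosen))
    return output, chosen

# Four hypothetical experts; each one just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
out, used = moe_forward(10.0, [0.1, 2.0, -1.0, 2.0], experts, top_k=2)
print(out, used)
```

All four experts are held in the `experts` list (the VRAM cost), yet only two are evaluated for this token (the compute saving).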
FP8 quantization: fitting 31B into 48 GB
A 31-billion-parameter model at full precision (FP32) would weigh ~124 GB. Quantization reduces weight precision to compress the model:
| Format | Bits/param | Size for 31B | Quality |
|---|---|---|---|
| FP32 | 32 | ~124 GB | Reference (training) |
| FP16 / BF16 | 16 | ~62 GB | Near identical |
| FP8 (our choice) | 8 | ~31 GB | < 1% degradation |
| INT4 / GPTQ | 4 | ~16 GB | Noticeable degradation |
FP8 is the 2026 sweet spot: half the size of FP16, near-zero degradation in academic benchmarks, native hardware acceleration on Ada Lovelace and Blackwell, and single-flag vLLM support. All four models ship native FP8 checkpoints on Hugging Face.
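The sizes in the table follow from simple arithmetic: parameters times bits per parameter, divided by 8 to get bytes. The short sketch below reproduces them, assuming 1 GB = 10^9 bytes and counting weights only (KV cache, activations, and runtime overhead come on top). The helper name is ours.

```python
def weight_size_gb(params_billions, bits_per_param):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * bits_per_param / 8

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name:10s} {weight_size_gb(31, bits):6.1f} GB")
```

For a 31B model this gives 124, 62, 31, and 15.5 GB, which is why FP8 is the first format that leaves room for the KV cache on a 48 GB card.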
Methodology: real conditions, not an academic benchmark
Most LLM benchmarks measure isolated tasks (MMLU, HumanEval…). We wanted to measure what users actually experience inside Eridia:
- Full system prompt — the real Eridia prompt (~6,000 tokens): business instructions, user context, safety rules
- 5 declared tools — file search, code execution, document creation, meeting search, user interaction (OpenAI function calling format)
- 7 typical French-language queries — from a simple "Hello" to complex legal analysis
- 3 runs per query, measured in streaming: TTFT (Time To First Token), generation tokens/s, tool-call reliability
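For readers unfamiliar with the OpenAI function calling format, a tool declaration looks roughly like the sketch below. The tool name, description, and parameters are hypothetical placeholders, not Eridia's actual definitions; only the outer structure (`type`, `function`, JSON Schema `parameters`) follows the format.

```python
import json

# Hypothetical file-search tool; names and fields are illustrative only.
search_files_tool = {
    "type": "function",
    "function": {
        "name": "search_files",
        "description": "Search the user's uploaded files.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Free-text search terms"},
                "uploaded_after": {"type": "string", "description": "ISO 8601 date lower bound"},
            },
            "required": ["query"],
        },
    },
}

tools = [search_files_tool]  # the benchmark declares five such tools
print(json.dumps(tools, indent=2))
```

Every declared tool adds tokens to the prompt, which is one reason the benchmark measures with all five tools present rather than with a bare system prompt.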
| # | Query | Category |
|---|---|---|
| 1 | "Hello, how can you help me?" | Simple chat |
| 2 | "Find my recent contracts, the PDFs uploaded this week" | Tool call |
| 3 | "Summarize my last meeting with the product team" | Tool call |
| 4 | "A 3-year non-compete clause — is it valid?" | Reasoning |
| 5 | "Create an action plan from this morning's meeting" | Reasoning + tool |
| 6 | "Explain GDPR for health data in detail" | Long output |
| 7 | Multi-turn conversation about Eridia vs ChatGPT | Multi-turn |
The primary inference engine is vLLM (v0.19+, native FP8, 24K context). We also tested llama.cpp + TurboQuant on the L40S to compare both approaches — results below.
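The two streaming metrics can be computed from any token iterator. The sketch below is one reasonable way to do it, not the benchmark's actual script: it assumes one item per generated token, defines TTFT as the delay until the first item, and defines tokens/s over the interval after the first token. The scripted clock is only there to make the demo deterministic.

```python
import time

def measure_stream(token_stream, clock=time.perf_counter):
    """Consume a token iterator and return (ttft_s, tokens_per_s, n_tokens)."""
    start = clock()
    first = None
    last = start
    n = 0
    for _ in token_stream:
        last = clock()
        if first is None:
            first = last  # timestamp of the first visible token
        n += 1
    if n == 0:
        return None, 0.0, 0
    ttft = first - start
    gen_time = last - first
    # Rate over the tokens that arrived after the first one.
    rate = (n - 1) / gen_time if n > 1 and gen_time > 0 else 0.0
    return ttft, rate, n

# Deterministic demo: a scripted clock stands in for a real server.
ticks = iter([0.0, 0.5, 0.6, 0.7])
ttft, rate, n = measure_stream(["Bon", "jour", "!"], clock=lambda: next(ticks))
print(f"TTFT={ttft:.3f}s rate={rate:.1f} tok/s tokens={n}")
```

In a real run, `token_stream` would be the chunk iterator returned by a streaming client pointed at the inference server, and the default `time.perf_counter` clock would be used.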
Results
Overview
Averages across the 7 queries, 3 runs each, with vLLM:
| Model | GPU | Avg TTFT | Tokens/s | Tool calls | Verdict |
|---|---|---|---|---|---|
| Gemma 4 26B-A4B | RTX PRO 6000 | 61 ms | 39.5 | 100% | Outright champion |
| Gemma 4 31B | RTX PRO 6000 | 64 ms | 34.5 | 100% | Best quality |
| Gemma 4 31B | L40S | 195 ms | 15.8 | 100% | Best value |
| Gemma 4 26B-A4B | L40S | 260 ms | 16.3 | 100% | Good quality/VRAM ratio |
| Qwen3.5 27B | RTX PRO 6000 | 3.5 s | 28.4 | 100% | Prohibitive TTFT |
| Qwen3.5 35B-A3B | RTX PRO 6000 | 4.3 s | 24.0 | 100% | Prohibitive TTFT |
| Qwen3.5 27B | L40S | 8.4 s | 8.9 | 100% | Too slow for chat |
| Qwen3.5 35B-A3B | L40S | 10.5 s | 9.1 | 100% | Too slow for chat |
Three immediate takeaways:
- The Gemma 4 models are alone in the race for interactive chat: first token under 300 ms everywhere, and under 70 ms on Blackwell.
- Tool calling is 100% reliable across all 8 configurations — no longer a differentiator in 2026.
- The Qwen3.5 models pay for their thinking mode: generation itself is fine, but users stare at an empty screen for several seconds before every answer.
What Blackwell buys you
| Model | TTFT (L40S → Blackwell) | Tokens/s (L40S → Blackwell) | Throughput gain |
|---|---|---|---|
| Gemma 4 26B-A4B | 260 ms → 61 ms | 16.3 → 39.5 | ×2.4 |
| Gemma 4 31B | 195 ms → 64 ms | 15.8 → 34.5 | ×2.2 |
| Qwen3.5 27B | 8.4 s → 3.5 s | 8.9 → 28.4 | ×3.2 |
| Qwen3.5 35B-A3B | 10.5 s → 4.3 s | 9.1 → 24.0 | ×2.6 |
The RTX PRO 6000 delivers an average ×2.6 throughput gain and divides TTFT by 3 to 4 for the Gemma 4 models (by about 2.4 for the Qwen3.5 models). The explanation is direct: memory bandwidth is the limiting factor of LLM inference, and the Blackwell's GDDR7 (~1,600 GB/s vs 864 GB/s) nearly doubles it.
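The gains can be recomputed from the throughput figures, and the bandwidth argument can be turned into a rough ceiling. The sketch below assumes a simplified model in which single-stream decoding of a dense model reads every weight once per token, so tokens/s cannot exceed bandwidth divided by weight size. The ~31 GB figure is the FP8 size of the dense Gemma 4 31B; the helper name is ours.

```python
# (L40S tokens/s, RTX PRO 6000 tokens/s), taken from the results tables
throughput = {
    "Gemma 4 26B-A4B": (16.3, 39.5),
    "Gemma 4 31B": (15.8, 34.5),
    "Qwen3.5 27B": (8.9, 28.4),
    "Qwen3.5 35B-A3B": (9.1, 24.0),
}

gains = {model: rtx / l40s for model, (l40s, rtx) in throughput.items()}
mean_gain = sum(gains.values()) / len(gains)
bandwidth_ratio = 1600 / 864  # GB/s, RTX PRO 6000 vs L40S

def decode_ceiling(bandwidth_gb_s, weights_gb):
    """Upper bound on single-stream tokens/s if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb

ceiling_l40s = decode_ceiling(864, 31)   # dense 31B in FP8
ceiling_rtx = decode_ceiling(1600, 31)

for model, g in gains.items():
    print(f"{model:16s} x{g:.1f}")
print(f"mean gain x{mean_gain:.1f}, bandwidth ratio x{bandwidth_ratio:.2f}")
print(f"dense 31B ceiling: {ceiling_l40s:.0f} tok/s (L40S), {ceiling_rtx:.0f} tok/s (RTX PRO 6000)")
```

Under this model the measured 15.8 and 34.5 tokens/s for the dense 31B sit below the respective ceilings of roughly 28 and 52 tokens/s. The measured gains are somewhat larger than the bandwidth ratio alone, so other architectural differences presumably contribute as well.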
Zooming in: the champion's profile
Per-query detail for the best configuration, Gemma 4 26B-A4B on RTX PRO 6000:
| Query | TTFT | Tokens/s | Tokens generated |
|---|---|---|---|
| Simple chat | 61 ms | 39.4 | 375 |
| File search (tool) | 189 ms | 33.6 | 96 |
| Meeting summary (tool) | 190 ms | 37.8 | 99 |
| Legal reasoning | 166 ms | 40.8 | 874 |
| Action plan (tool) | 173 ms | 41.4 | 202 |
| GDPR explanation (long output) | 261 ms | 38.4 | 1,210 |
| Multi-turn | 62 ms | 43.3 | 568 |
Two patterns hold across every configuration we tested: tool-call queries have a TTFT roughly 3x higher than simple chat (here about 175 to 190 ms vs 61 ms), because the model must decide to call the tool before emitting anything, and throughput stays stable even on long 1,200+ token outputs. The Qwen3.5 models, by contrast, show TTFT spikes up to 28 seconds on reasoning queries: thinking mode runs away precisely where users expect a fast answer.
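Both patterns can be checked directly against the per-query table. The snippet below is a small sanity check on those published figures; the dictionary layout and variable names are ours.

```python
from statistics import mean

# Query -> (TTFT in ms, tokens/s), Gemma 4 26B-A4B on RTX PRO 6000
per_query = {
    "Simple chat": (61, 39.4),
    "File search (tool)": (189, 33.6),
    "Meeting summary (tool)": (190, 37.8),
    "Legal reasoning": (166, 40.8),
    "Action plan (tool)": (173, 41.4),
    "GDPR explanation (long output)": (261, 38.4),
    "Multi-turn": (62, 43.3),
}

tool_ttft = mean(ttft for name, (ttft, _) in per_query.items() if "(tool)" in name)
tool_ratio = tool_ttft / per_query["Simple chat"][0]
rates = [rate for _, rate in per_query.values()]

print(f"tool-call TTFT: {tool_ttft:.0f} ms, x{tool_ratio:.1f} vs simple chat")
print(f"throughput range: {min(rates)} to {max(rates)} tokens/s")
```

The tool-call queries average 184 ms, about 3x the simple-chat TTFT, and throughput stays between 33.6 and 43.3 tokens/s across all seven queries.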
vLLM or llama.cpp?
We replayed the same queries on the same L40S with llama.cpp (TurboQuant fork, GGUF models) to compare the two reference inference engines:
| Metric | vLLM FP8 | llama.cpp Q4_K_M |
|---|---|---|
| TTFT (first token) | 195–260 ms | 3.6–11 s |
| Generation tokens/s | 15–16 | up to 130 |
| VRAM (Gemma 4 26B MoE) | ~40 GB | ~17 GB |
| Multi-user | Excellent (continuous batching) | Limited |
| Setup | Python stack | Single compiled binary |
The tradeoff is fundamental and symmetrical:
- vLLM is optimized for prompt processing: specialized CUDA kernels mean the first token arrives almost instantly, even with a 6,000-token system prompt. In exchange, generation tops out around 16 tokens/s on the L40S.
- llama.cpp is optimized for sequential generation: pure C++, minimal overhead, up to 130 tokens/s — 8x faster than vLLM on the same GPU. But prefill is far less optimized: 3 to 11 seconds before the first token.
Add the VRAM argument: in GGUF Q4_K_M with TurboQuant's compressed KV cache (~6% speed penalty), the same model fits in 17 GB instead of 40 — enough headroom to target 64K+ token contexts without saturating the GPU.
In practice: interactive chat and multi-user workloads → vLLM; long document generation, batch processing, very long contexts or tight VRAM budgets → llama.cpp + TurboQuant.
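The rule of thumb can be made quantitative with a simple single-request model: total completion time is TTFT plus output length divided by generation speed. The sketch below uses figures from the comparison table (vLLM at 260 ms and 16 tokens/s, llama.cpp at 3.6 to 11 s and its peak 130 tokens/s) and ignores batching and concurrency, so treat the result as an order of magnitude rather than a measurement. The function names are ours.

```python
def completion_time_s(ttft_s, tokens_per_s, n_tokens):
    """Time until the last token of a single request, under a linear model."""
    return ttft_s + n_tokens / tokens_per_s

def crossover_tokens(ttft_a, rate_a, ttft_b, rate_b):
    """Output length at which engine B (slower start, faster generation) catches up with A."""
    return (ttft_b - ttft_a) / (1 / rate_a - 1 / rate_b)

best_case = crossover_tokens(0.26, 16, 3.6, 130)    # llama.cpp at its lowest TTFT
worst_case = crossover_tokens(0.26, 16, 11.0, 130)  # llama.cpp at its highest TTFT

print(f"llama.cpp finishes first beyond ~{best_case:.0f} to ~{worst_case:.0f} output tokens")
print(f"1,200-token answer: vLLM {completion_time_s(0.26, 16, 1200):.0f} s, "
      f"llama.cpp {completion_time_s(11.0, 130, 1200):.0f} s")
```

Under these assumptions the crossover falls somewhere between roughly 60 and 200 output tokens for a single user. Short conversational replies feel faster on vLLM because of the first-token delay, while long generations complete sooner on llama.cpp.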
Analysis: why such gaps?
Qwen3.5's thinking mode, a trap for chat
The Qwen3.5 models embed internal reasoning (similar to OpenAI's o1/o3): before each answer, the model generates hidden reasoning tokens that consume GPU time while displaying nothing. This mode is enabled by default and cannot be disabled through the standard API. For batch analysis, fine; for a conversational assistant, it is a deal-breaker.
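A back-of-the-envelope estimate gives a feel for how much hidden reasoning the wait could represent. The sketch below assumes, as a deliberate simplification, that the whole TTFT is spent generating hidden tokens at the measured visible generation rate. Prompt processing also takes part of that time, so the figures are upper bounds, not measurements.

```python
# (GPU, model) -> (avg TTFT in seconds, tokens/s), from the overview table
qwen_runs = {
    ("RTX PRO 6000", "Qwen3.5 27B"): (3.5, 28.4),
    ("RTX PRO 6000", "Qwen3.5 35B-A3B"): (4.3, 24.0),
    ("L40S", "Qwen3.5 27B"): (8.4, 8.9),
    ("L40S", "Qwen3.5 35B-A3B"): (10.5, 9.1),
}

def max_hidden_tokens(ttft_s, tokens_per_s):
    """Upper bound on hidden tokens if the entire wait were spent generating."""
    return ttft_s * tokens_per_s

estimates = {key: max_hidden_tokens(*vals) for key, vals in qwen_runs.items()}
for (gpu, model), tokens in estimates.items():
    print(f"{model:16s} on {gpu:12s}: at most ~{tokens:.0f} hidden tokens")
```

Under this assumption the average wait corresponds to at most about 75 to 105 hidden tokens per answer, which is small in token terms but costly when each token takes 40 to 110 ms.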
MoE vs dense: an advantage that only shows with bandwidth
On the L40S, the MoE Gemma 4 26B-A4B is not meaningfully faster than the dense 31B (16.3 vs 15.8 tokens/s). Our reading is that inference there is bound by memory traffic rather than compute, so activating only ~4B parameters per token brings no measurable benefit, while all 26B parameters still occupy VRAM. On the RTX PRO 6000, the memory bottleneck loosens and the MoE's compute advantage emerges: 39.5 tokens/s vs 34.5 for the dense model, a lead that we expect to grow with multi-user workloads thanks to the lower compute load per request.
What does it cost?
| Solution | Monthly cost | Cost per user at 20 users |
|---|---|---|
| Self-hosted L40S (Scaleway) | ~€450 fixed | ~€22/user |
| RTX PRO 6000 (RunPod) | ~€900 fixed | ~€45/user |
| OpenAI GPT-4o API | Variable | ~€50–150/user* |
| Anthropic Claude API | Variable | ~€50–200/user* |
* Estimate for ~500 requests/user/month with a comparable system prompt.
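At 20 users and under the estimate above, the L40S setup costs at least 2 times less per user than either API (~€22 vs €50 or more), while the RTX PRO 6000 setup roughly matches the low end of the API range and costs 3 to 4 times less than the high end. In both cases the cost is fixed and predictable, and data never leaves Europe. That is exactly the model behind Eridia's custom deployment offer.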
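Because the GPU cost is fixed, the per-user figure depends entirely on headcount. The sketch below recomputes the table and finds the team size at which a fixed GPU bill drops to the API estimate; the €50/user figure is the low end of the API range above, and the helper names are ours.

```python
import math

def cost_per_user(fixed_monthly_eur, users):
    """Per-user monthly cost of a fixed-price GPU shared by `users` people."""
    return fixed_monthly_eur / users

def break_even_users(fixed_monthly_eur, api_cost_per_user_eur):
    """Smallest user count at which the fixed GPU cost is at or below the API bill."""
    return math.ceil(fixed_monthly_eur / api_cost_per_user_eur)

for name, price in [("L40S", 450), ("RTX PRO 6000", 900)]:
    print(f"{name:13s} {cost_per_user(price, 20):5.2f} EUR/user at 20 users, "
          f"break-even vs 50 EUR/user API at {break_even_users(price, 50)} users")
```

Against the cheapest API estimate, the L40S breaks even at 9 users and the RTX PRO 6000 at 18. Below those headcounts the APIs may be cheaper on cost alone, which leaves sovereignty and predictability as the deciding arguments.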
Our recommendations
Top pick — Gemma 4 26B-A4B (MoE) on RTX PRO 6000, ~€900/mo. 40 tokens/s, first token in 61 ms: an experience indistinguishable from premium cloud APIs, with full sovereignty. The 96 GB of VRAM leaves headroom for long contexts and multi-user batching. Ideal for 10–50 users.
Tight budget — Gemma 4 31B-it on Scaleway L40S, ~€450/mo. The best answer quality at this price: sub-200 ms TTFT, 16 tokens/s, reliable tool calling, excellent French. Perfect to start with 10–30 users (~€22/user/month), hosted in France.
Maximum quality — Gemma 4 31B-it on RTX PRO 6000, ~€900/mo. If reasoning depth matters more than raw speed: 35 tokens/s, 64 ms TTFT, and the most structured answers of the panel.
Batch workloads and long contexts — Gemma 4 26B-A4B in GGUF Q4_K_M + TurboQuant on L40S, ~€450/mo. 130 tokens/s with only 17 GB of VRAM: the right choice for automation pipelines (summaries, data extraction, report generation) where TTFT is not critical.
Want this stack without managing it yourself? Eridia installs these models turnkey on your infrastructure, with native security and GDPR compliance — let's talk.
What's next?
- Multi-user tests — measuring degradation with 5, 10, 20 concurrent requests
- Long-context benchmark — f16 vs turbo4 KV cache at 64K and 128K tokens
- New models — Mistral Small 3.1 and Llama 4 Scout as soon as they are available
- Hybrid setup — vLLM for chat + llama.cpp for batch on the same GPU
Benchmark run with Eridia v2 — eridia.ai. Engines: vLLM 0.19+ (native FP8), llama.cpp TurboQuant fork (GGUF Q4_K_M/Q8_0, turbo4 KV cache). GPUs: Scaleway L40S (48 GB, 864 GB/s), RunPod RTX PRO 6000 Blackwell (96 GB, ~1,600 GB/s). 24K context, 7 French-language queries, 3 runs per query, full system prompt (~6,000 tokens). Detailed methodology and scripts available on request.