2026 Open-source LLM Benchmark: Gemma 4 vs Qwen 3.5 on L40S and RTX PRO 6000

TL;DR

We benchmarked 4 open-source LLMs on two GPUs — the Scaleway L40S (48 GB VRAM, ~€450/mo) and the NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM, ~€900/mo) — under real Eridia production conditions: a ~6,000-token business system prompt, 5 declared tools, and realistic French-language queries.

The verdict: Google's Gemma 4 models win decisively. On the L40S they deliver a first token in 195 to 260 ms at about 16 tokens/s; on the RTX PRO 6000, in as little as 61 ms at up to 40 tokens/s, or 2.5x the throughput. Alibaba's Qwen3.5 models are disqualified for interactive chat: their "thinking" mode, which cannot be disabled through the standard API, forces a 2-to-11-second wait before the first word appears.

Why self-host your AI in 2026?

For a European SMB handling sensitive documents — contracts, HR data, healthcare, finance — sending every request to OpenAI or Anthropic raises three problems:

Data sovereignty — your documents transit through US servers, subject to the Cloud Act
Unpredictable cost — with 20 active users, the API bill can exceed €2,000/month and varies every month
Dependency — a pricing change, an outage, or a terms-of-service update, and your business tool stops working

Self-hosting answers all three: data stays in your datacenter (or with a European host like Scaleway), the cost is fixed, and you keep full control. One question remains: which model and which GPU? That is exactly what this benchmark answers.

The test bench

Two GPUs representative of the market

	Scaleway L40S	RTX PRO 6000 Blackwell
VRAM	48 GB GDDR6	96 GB GDDR7
Memory bandwidth	864 GB/s	~1,600 GB/s
Architecture	Ada Lovelace (2023)	Blackwell (2025)
TDP	350 W	600 W
Cloud price	~€450/mo (Scaleway)	~€900/mo (RunPod)

The L40S is the most accessible GPU for serious LLM inference in Europe: 48 GB of VRAM fits models up to ~35B parameters in FP8. The RTX PRO 6000 Blackwell, with double the VRAM and nearly double the bandwidth, is the new reference point for a cloud-API-grade experience.

Four models, two architectures

Model	Architecture	Parameters	Context	Vendor
Gemma 4 31B-it	Dense	31B (all active)	128K	Google DeepMind
Gemma 4 26B-A4B-it	MoE	26B (~4B active)	128K	Google DeepMind
Qwen3.5 27B	Dense + thinking	27B (all active)	128K	Alibaba Cloud
Qwen3.5 35B-A3B	MoE + thinking	35B (~3B active)	128K	Alibaba Cloud

Dense or MoE? A dense model uses all of its parameters for every token. An MoE (Mixture of Experts) routes each token to a subset of experts only (~4B out of 26B for Gemma): less compute per token, but all weights must still fit in VRAM.
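To make the routing idea concrete, here is a minimal sketch of top-k expert selection in plain Python. It is a generic illustration, not Gemma 4's actual router: the expert count (8), the value of k (2), and the router scores are made-up values, and `route_token` is a hypothetical helper name.

```python
import math

def route_token(router_scores, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    # Indices of the k experts with the highest router score.
    top = sorted(range(len(router_scores)),
                 key=lambda i: router_scores[i], reverse=True)[:k]
    # Softmax over the selected scores only (shifted by the max for stability).
    peak = router_scores[top[0]]
    exps = [math.exp(router_scores[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Hypothetical router scores for one token over 8 experts.
scores = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
selected = route_token(scores, k=2)
active_fraction = len(selected) / len(scores)

print("selected experts and weights:", selected)
print("fraction of experts used for this token:", active_fraction)
```

Only the selected experts run for this token, which is where the compute saving comes from. The other experts stay idle but still occupy VRAM, because the next token may be routed to them.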

FP8 quantization: fitting 31B into 48 GB

A 31-billion-parameter model at full precision (FP32) would weigh ~124 GB. Quantization reduces weight precision to compress the model:

Format	Bits/param	Size for 31B	Quality
FP32	32	~124 GB	Reference (training)
FP16 / BF16	16	~62 GB	Near identical
FP8 (our choice)	8	~31 GB	< 1% degradation
INT4 / GPTQ	4	~16 GB	Noticeable degradation

FP8 is the 2026 sweet spot: half the size of FP16, near-zero degradation in academic benchmarks, native hardware acceleration on Ada Lovelace and Blackwell, and single-flag vLLM support. All four models ship native FP8 checkpoints on Hugging Face.
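The sizes in the table follow from simple arithmetic: parameters times bits per parameter, divided by 8 bits per byte. The sketch below reproduces them; it counts weights only and ignores KV cache, activations, and runtime overhead, and `weight_size_gb` is a helper name of our own.

```python
def weight_size_gb(params_billion, bits_per_param):
    """Approximate weight footprint in GB (10^9 bytes), weights only."""
    return params_billion * bits_per_param / 8

FORMATS = [("FP32", 32), ("FP16/BF16", 16), ("FP8", 8), ("INT4", 4)]

for name, bits in FORMATS:
    print(f"{name:10s} {weight_size_gb(31, bits):6.1f} GB")
```

INT4 comes out at 15.5 GB, which the table rounds to ~16 GB. At FP8, the 31 GB of weights leave about 17 GB of a 48 GB L40S for everything else.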

Methodology: real conditions, not an academic benchmark

Most LLM benchmarks measure isolated tasks (MMLU, HumanEval…). We wanted to measure what users actually experience inside Eridia:

Full system prompt — the real Eridia prompt (~6,000 tokens): business instructions, user context, safety rules
5 declared tools — file search, code execution, document creation, meeting search, user interaction (OpenAI function calling format)
7 typical French-language queries — from a simple "Hello" to complex legal analysis
3 runs per query, measured in streaming: TTFT (Time To First Token), generation tokens/s, tool-call reliability
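For readers who want to reproduce the streaming metrics, the sketch below shows one common way to derive TTFT and tokens/s from chunk arrival times. It is an assumption about the convention, not the actual benchmark script: here tokens/s counts the tokens after the first one, divided by the time elapsed since the first token. The function names and timings are illustrative.

```python
from statistics import mean

def stream_metrics(request_sent_at, chunk_times, tokens_generated):
    """Return (TTFT in ms, generation tokens/s) for one streamed response.

    chunk_times: arrival time in seconds of each streamed chunk, in order.
    """
    if not chunk_times:
        raise ValueError("no chunks received")
    ttft_ms = (chunk_times[0] - request_sent_at) * 1000
    decode_seconds = chunk_times[-1] - chunk_times[0]
    if decode_seconds <= 0:
        return ttft_ms, float("nan")
    # Tokens after the first one, divided by the time spent producing them.
    tokens_per_s = (tokens_generated - 1) / decode_seconds
    return ttft_ms, tokens_per_s

def average_runs(runs):
    """Average (ttft_ms, tokens_per_s) pairs over several runs of one query."""
    ttfts, rates = zip(*runs)
    return mean(ttfts), mean(rates)

# Made-up timings: request sent at t=0, five chunks of one token each.
run = stream_metrics(0.0, [0.2, 0.3, 0.4, 0.5, 0.6], tokens_generated=5)
print(run)
```

In a real harness the timestamps would come from `time.perf_counter()` calls around the streaming loop of the HTTP client.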

#	Query	Category
1	"Hello, how can you help me?"	Simple chat
2	"Find my recent contracts, the PDFs uploaded this week"	Tool call
3	"Summarize my last meeting with the product team"	Tool call
4	"A 3-year non-compete clause — is it valid?"	Reasoning
5	"Create an action plan from this morning's meeting"	Reasoning + tool
6	"Explain GDPR for health data in detail"	Long output
7	Multi-turn conversation about Eridia vs ChatGPT	Multi-turn

The primary inference engine is vLLM (v0.19+, native FP8, 24K context). We also tested llama.cpp + TurboQuant on the L40S to compare both approaches — results below.

Results

Overview

Averages across the 7 queries, 3 runs each, with vLLM:

Model	GPU	Avg TTFT	Tokens/s	Tool calls	Verdict
Gemma 4 26B-A4B	RTX PRO 6000	61 ms	39.5	100%	Outright champion
Gemma 4 31B	RTX PRO 6000	64 ms	34.5	100%	Best quality
Gemma 4 31B	L40S	195 ms	15.8	100%	Best value
Gemma 4 26B-A4B	L40S	260 ms	16.3	100%	Good quality/VRAM ratio
Qwen3.5 27B	RTX PRO 6000	3.5 s	28.4	100%	Prohibitive TTFT
Qwen3.5 35B-A3B	RTX PRO 6000	4.3 s	24.0	100%	Prohibitive TTFT
Qwen3.5 27B	L40S	8.4 s	8.9	100%	Too slow for chat
Qwen3.5 35B-A3B	L40S	10.5 s	9.1	100%	Too slow for chat

Three immediate takeaways:

The Gemma 4 models are alone in the race for interactive chat: first token under 300 ms everywhere, and under 70 ms on Blackwell.
Tool calling is 100% reliable across all 8 configurations — no longer a differentiator in 2026.
The Qwen3.5 models pay for their thinking mode: generation itself is fine, but users stare at an empty screen for several seconds before every answer.
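For reference, here is one way tool-call reliability could be scored. This is a sketch under assumptions: the article does not publish the tool schemas or the scoring rule, so the `search_files` tool, its parameters, and the validity criteria (declared name, JSON-parsable arguments, required fields present) are hypothetical. The dictionary shapes follow the OpenAI function calling format.

```python
import json

# Hypothetical tool declaration in OpenAI function calling format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_files",
        "description": "Search the user's uploaded files.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "file_type": {"type": "string"},
            },
            "required": ["query"],
        },
    },
}]

def is_valid_tool_call(message, tools=TOOLS):
    """True if every tool call names a declared tool and has usable arguments."""
    declared = {t["function"]["name"]: t["function"]["parameters"] for t in tools}
    calls = message.get("tool_calls") or []
    if not calls:
        return False
    for call in calls:
        fn = call.get("function", {})
        schema = declared.get(fn.get("name"))
        if schema is None:
            return False  # the model called a tool that was never declared
        try:
            args = json.loads(fn.get("arguments"))
        except (json.JSONDecodeError, TypeError):
            return False  # arguments are not valid JSON
        if not isinstance(args, dict):
            return False
        if any(key not in args for key in schema.get("required", [])):
            return False  # a required parameter is missing
    return True

def reliability(messages):
    """Share of assistant messages that contain a valid tool call."""
    return sum(is_valid_tool_call(m) for m in messages) / len(messages)

good = {"tool_calls": [{"type": "function", "function": {
    "name": "search_files", "arguments": '{"query": "contracts", "file_type": "pdf"}'}}]}
bad = {"tool_calls": [{"type": "function", "function": {
    "name": "search_files", "arguments": "{not json"}}]}
print(reliability([good, bad]))
```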

What Blackwell buys you

Model	TTFT (L40S → Blackwell)	Tokens/s (L40S → Blackwell)	Throughput gain
Gemma 4 26B-A4B	260 ms → 61 ms	16.3 → 39.5	×2.4
Gemma 4 31B	195 ms → 64 ms	15.8 → 34.5	×2.2
Qwen3.5 27B	8.4 s → 3.5 s	8.9 → 28.4	×3.2
Qwen3.5 35B-A3B	10.5 s → 4.3 s	9.1 → 24.0	×2.6

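The RTX PRO 6000 delivers an average ×2.6 throughput gain and divides TTFT by roughly 3 to 4 for the Gemma 4 models (about 2.4 for Qwen3.5). The main explanation is memory bandwidth, the usual limiting factor of LLM token generation: the Blackwell's GDDR7 (~1,600 GB/s vs 864 GB/s) nearly doubles it. Since the measured gains exceed that ratio, the newer architecture's extra compute presumably contributes as well.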
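A back-of-the-envelope check of the bandwidth argument: if single-stream decoding had to read every weight once per generated token, throughput could not exceed bandwidth divided by model size. The sketch below applies that simplified model to the dense Gemma 4 31B in FP8 (~31 GB). It ignores KV cache reads, compute time, and batching, so it gives a ceiling rather than a prediction; the function name is our own.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s, weights_read_gb):
    """Upper bound on tokens/s if each token requires reading all weights once."""
    return bandwidth_gb_s / weights_read_gb

BANDWIDTH_GB_S = {"L40S": 864, "RTX PRO 6000": 1600}   # from the GPU table
MEASURED_TOK_S = {"L40S": 15.8, "RTX PRO 6000": 34.5}  # Gemma 4 31B, vLLM FP8
WEIGHTS_GB = 31                                        # 31B parameters at 8 bits

for gpu, bandwidth in BANDWIDTH_GB_S.items():
    ceiling = decode_ceiling_tokens_per_s(bandwidth, WEIGHTS_GB)
    share = MEASURED_TOK_S[gpu] / ceiling
    print(f"{gpu}: ceiling {ceiling:.1f} tok/s, measured {MEASURED_TOK_S[gpu]} ({share:.0%} of ceiling)")
```

The ceilings come out at about 28 and 52 tokens/s, so the measured figures sit below the bound on both GPUs, as expected for a model that leaves out every other cost.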

Zooming in: the champion's profile

Per-query detail for the best configuration, Gemma 4 26B-A4B on RTX PRO 6000:

Query	TTFT	Tokens/s	Tokens generated
Simple chat	61 ms	39.4	375
File search (tool)	189 ms	33.6	96
Meeting summary (tool)	190 ms	37.8	99
Legal reasoning	166 ms	40.8	874
Action plan (tool)	173 ms	41.4	202
GDPR explanation (long output)	261 ms	38.4	1,210
Multi-turn	62 ms	43.3	568

Two patterns hold across every configuration we tested: tool-call queries have a TTFT roughly 3x higher than simple chat (the model must decide to call the tool before emitting anything), and throughput stays stable even on long 1,200+ token outputs. The Qwen3.5 models, by contrast, show TTFT spikes of up to 28 seconds on reasoning queries: thinking mode runs away precisely where users expect a fast answer.
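Both patterns can be checked directly against the per-query table. The short script below uses only the figures listed above for Gemma 4 26B-A4B on the RTX PRO 6000; the variable names are our own.

```python
from statistics import mean

# (TTFT in ms, tokens/s) per query, copied from the table above.
PER_QUERY = {
    "Simple chat": (61, 39.4),
    "File search (tool)": (189, 33.6),
    "Meeting summary (tool)": (190, 37.8),
    "Legal reasoning": (166, 40.8),
    "Action plan (tool)": (173, 41.4),
    "GDPR explanation (long output)": (261, 38.4),
    "Multi-turn": (62, 43.3),
}

tool_ttfts = [ttft for name, (ttft, _) in PER_QUERY.items() if "(tool)" in name]
tool_vs_chat = mean(tool_ttfts) / PER_QUERY["Simple chat"][0]

rates = [rate for _, rate in PER_QUERY.values()]

print(f"mean tool-call TTFT: {mean(tool_ttfts):.0f} ms")
print(f"tool-call TTFT vs simple chat: x{tool_vs_chat:.1f}")
print(f"throughput range: {min(rates)} to {max(rates)} tokens/s")
```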

vLLM or llama.cpp?

We replayed the same queries on the same L40S with llama.cpp (TurboQuant fork, GGUF models) to compare the two reference inference engines:

Metric	vLLM FP8	llama.cpp Q4_K_M
TTFT (first token)	195–260 ms	3.6–11 s
Generation tokens/s	15–16	up to 130
VRAM (Gemma 4 26B MoE)	~40 GB	~17 GB
Multi-user	Excellent (continuous batching)	Limited
Setup	Python stack	Single compiled binary

The tradeoff is fundamental and symmetrical:

vLLM is optimized for prompt processing: specialized CUDA kernels mean the first token arrives almost instantly, even with a 6,000-token system prompt. In exchange, generation tops out around 16 tokens/s on the L40S.
llama.cpp is optimized for sequential generation: pure C++, minimal overhead, up to 130 tokens/s — 8x faster than vLLM on the same GPU. But prefill is far less optimized: 3 to 11 seconds before the first token.

Add the VRAM argument: in GGUF Q4_K_M with TurboQuant's compressed KV cache (~6% speed penalty), the same model fits in 17 GB instead of 40 — enough headroom to target 64K+ token contexts without saturating the GPU.

In practice: interactive chat and multi-user workloads → vLLM; long document generation, batch processing, very long contexts or tight VRAM budgets → llama.cpp + TurboQuant.

Analysis: why such gaps?

Qwen3.5's thinking mode, a trap for chat

The Qwen3.5 models embed internal reasoning (similar to OpenAI's o1/o3): before each answer, the model generates hidden reasoning tokens that consume GPU time while displaying nothing. This mode is enabled by default and cannot be disabled through the standard API. For batch analysis, fine; for a conversational assistant, it is a deal-breaker.
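A rough way to picture the cost of hidden reasoning: if the wait before the first visible token is dominated by generating hidden tokens at the measured decode rate, then the number of hidden tokens is approximately TTFT times tokens/s. The sketch below applies this to the averages from the overview table. It ignores prefill time, so the results are order-of-magnitude estimates, not measured token counts; the function name is our own.

```python
def implied_hidden_tokens(ttft_s, tokens_per_s):
    """Estimate hidden reasoning tokens, assuming TTFT is pure decode time."""
    return ttft_s * tokens_per_s

# (average TTFT in seconds, tokens/s) from the overview table.
QWEN_RUNS = {
    ("Qwen3.5 27B", "RTX PRO 6000"): (3.5, 28.4),
    ("Qwen3.5 35B-A3B", "RTX PRO 6000"): (4.3, 24.0),
    ("Qwen3.5 27B", "L40S"): (8.4, 8.9),
    ("Qwen3.5 35B-A3B", "L40S"): (10.5, 9.1),
}

for (model, gpu), (ttft_s, rate) in QWEN_RUNS.items():
    estimate = implied_hidden_tokens(ttft_s, rate)
    print(f"{model} on {gpu}: ~{estimate:.0f} hidden tokens before the first visible one")
```

Under these assumptions the averages correspond to something on the order of 75 to 105 hidden tokens per answer, which is enough to push the first visible word several seconds out at these decode rates.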

MoE vs dense: an advantage that only shows with bandwidth

On the L40S, the MoE Gemma 4 26B-A4B is not faster than the dense 31B: inference there is bound by reading weights from memory, and an MoE must read all its weights even if only a fraction is active. On the RTX PRO 6000, the memory bottleneck loosens and the MoE's compute advantage emerges: 39.5 tokens/s vs 34.5 for the dense model — a lead that will grow with multi-user workloads thanks to the lower compute load per request.

What does it cost?

Solution	Monthly cost	Cost for 20 users
Self-hosted L40S (Scaleway)	~€450 fixed	~€22/user
RTX PRO 6000 (RunPod)	~€900 fixed	~€45/user
OpenAI GPT-4o API	Variable	~€50–150/user*
Anthropic Claude API	Variable	~€50–200/user*

* Estimate for ~500 requests/user/month with a comparable system prompt.

At 20 users, self-hosting costs roughly 2 to 4 times less than a ~€2,000/month API bill (about €100/user), with a fixed, predictable cost and data that never leaves Europe. That is exactly the model behind Eridia's custom deployment offer.
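The per-user figures are a simple division of the fixed monthly price by the number of users. The sketch below reproduces them and adds a break-even team size against the low end of the API estimate (€50/user). That break-even is our own illustrative calculation, and it covers GPU rental only, not operations or staff time.

```python
import math

def cost_per_user(monthly_cost_eur, users):
    """Fixed monthly GPU cost spread evenly across users."""
    return monthly_cost_eur / users

def break_even_users(monthly_cost_eur, api_cost_per_user_eur):
    """Smallest team size at which the fixed cost is no higher than API spend."""
    return math.ceil(monthly_cost_eur / api_cost_per_user_eur)

GPU_MONTHLY_EUR = {"L40S (Scaleway)": 450, "RTX PRO 6000 (RunPod)": 900}
API_LOW_ESTIMATE_EUR = 50  # low end of the per-user API estimate in the table

for gpu, price in GPU_MONTHLY_EUR.items():
    per_user = cost_per_user(price, 20)
    threshold = break_even_users(price, API_LOW_ESTIMATE_EUR)
    print(f"{gpu}: {per_user:.2f} EUR/user at 20 users, break-even at {threshold} users")
```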

Our recommendations

Top pick: Gemma 4 26B-A4B (MoE) on RTX PRO 6000, ~€900/mo. 40 tokens/s, first token in 61 ms: an experience indistinguishable from premium cloud APIs, with full sovereignty. The 96 GB of VRAM leaves headroom for long contexts and multi-user batching. Ideal for 10–50 users.

Tight budget: Gemma 4 31B-it on Scaleway L40S, ~€450/mo. The best answer quality at this price: sub-200 ms TTFT, 16 tokens/s, reliable tool calling, excellent French. Perfect to start with 10–30 users (~€22/user/month at 20 users), hosted in France.

Maximum quality: Gemma 4 31B-it on RTX PRO 6000, ~€900/mo. If reasoning depth matters more than raw speed: 35 tokens/s, 64 ms TTFT, and the most structured answers of the panel.

Batch workloads and long contexts: Gemma 4 26B-A4B in GGUF Q4_K_M + TurboQuant on L40S, ~€450/mo. Up to 130 tokens/s with only 17 GB of VRAM: the right choice for automation pipelines (summaries, data extraction, report generation) where TTFT is not critical.

Want this stack without managing it yourself? Eridia installs these models turnkey on your infrastructure, with native security and GDPR compliance — let's talk.

What's next?

Multi-user tests — measuring degradation with 5, 10, 20 concurrent requests
Long-context benchmark — f16 vs turbo4 KV cache at 64K and 128K tokens
New models — Mistral Small 3.1 and Llama 4 Scout as soon as they are available
Hybrid setup — vLLM for chat + llama.cpp for batch on the same GPU

Benchmark run with Eridia v2 (eridia.ai). Engines: vLLM 0.19+ (native FP8), llama.cpp TurboQuant fork (GGUF Q4_K_M/Q8_0, turbo4 KV cache). GPUs: Scaleway L40S (48 GB, 864 GB/s), RunPod RTX PRO 6000 Blackwell (96 GB, ~1,600 GB/s). 24K context, 7 French-language queries, 3 runs per query, full system prompt (~6,000 tokens). Detailed methodology and scripts available on request.
