Enterprise AI: which open-source model and GPU for a European SMB?

A full benchmark of 4 open-source LLMs (Gemma 4, Qwen3.5) on the L40S and RTX PRO 6000 Blackwell GPUs, using the Eridia stack. TTFT, tok/s, costs, and our recommendation for European SMBs in 2026.

Eridia

June 11, 2026 · 18 min read

TL;DR

We benchmarked 4 open-source LLMs on two GPUs — the Scaleway L40S (48 GB VRAM, ~€450/mo) and the NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM, ~€900/mo) — under real Eridia production conditions: a ~6,000-token business system prompt, 5 declared tools, and realistic French-language queries.

The verdict: Google's Gemma 4 models win decisively. On the L40S they deliver the first token in 195 to 260 ms and generate at about 16 tokens/s; on the RTX PRO 6000, the first token arrives in 61 ms and generation runs at 40 tokens/s, about 2.5x the throughput. Alibaba's Qwen3.5 models are disqualified for interactive chat: their "thinking" mode, which cannot be turned off, forces a 2-to-11-second wait before the first word appears.

Why self-host your AI in 2026?

For a European SMB handling sensitive documents — contracts, HR data, healthcare, finance — sending every request to OpenAI or Anthropic raises three problems:

  • Data sovereignty — your documents transit through US servers, subject to the Cloud Act
  • Unpredictable cost — with 20 active users, the API bill can exceed €2,000/month and varies every month
  • Dependency — a pricing change, an outage, or a terms-of-service update, and your business tool stops working

Self-hosting answers all three: data stays in your datacenter (or with a European host like Scaleway), the cost is fixed, and you keep full control. One question remains: which model and which GPU? That is exactly what this benchmark answers.

The test bench

Two GPUs representative of the market

| | Scaleway L40S | RTX PRO 6000 Blackwell |
|---|---|---|
| VRAM | 48 GB GDDR6 | 96 GB GDDR7 |
| Memory bandwidth | 864 GB/s | ~1,600 GB/s |
| Architecture | Ada Lovelace (2023) | Blackwell (2025) |
| TDP | 350 W | 600 W |
| Cloud price | ~€450/mo (Scaleway) | ~€900/mo (RunPod) |

The L40S is the most accessible GPU for serious LLM inference in Europe: 48 GB of VRAM fits models up to ~35B parameters in FP8. The RTX PRO 6000 Blackwell, with double the VRAM and nearly double the bandwidth, is the new reference point for a cloud-API-grade experience.

Four models, two architectures

| Model | Architecture | Parameters | Context | Vendor |
|---|---|---|---|---|
| Gemma 4 31B-it | Dense | 31B (all active) | 128K | Google DeepMind |
| Gemma 4 26B-A4B-it | MoE | 26B (~4B active) | 128K | Google DeepMind |
| Qwen3.5 27B | Dense + thinking | 27B (all active) | 128K | Alibaba Cloud |
| Qwen3.5 35B-A3B | MoE + thinking | 35B (~3B active) | 128K | Alibaba Cloud |

Dense or MoE? A dense model uses all of its parameters for every token. An MoE (Mixture of Experts) routes each token to a subset of experts only (~4B out of 26B for Gemma): less compute per token, but all weights must still fit in VRAM.
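To make the routing idea concrete, here is a minimal sketch of top-k expert selection for a single token. The number of experts, the value of `top_k`, and the router scores are all illustrative assumptions; they are not the actual configuration of Gemma 4 or Qwen3.5.

```python
import math

def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and return {expert_index: weight}."""
    # Softmax over the router logits (shifted by the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the top_k most probable experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize so the kept weights sum to 1.
    kept = sum(probs[i] for i in chosen)
    return {i: probs[i] / kept for i in chosen}

# Hypothetical router output for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
weights = route_token(logits, top_k=2)
print(weights)  # only the two highest-scoring experts run for this token
```

Only the selected experts do any computation for that token, which is where the compute saving comes from; the other experts still occupy VRAM.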

FP8 quantization: fitting 31B into 48 GB

A 31-billion-parameter model at full precision (FP32) would weigh ~124 GB. Quantization reduces weight precision to compress the model:

| Format | Bits/param | Size for 31B | Quality |
|---|---|---|---|
| FP32 | 32 | ~124 GB | Reference (training) |
| FP16 / BF16 | 16 | ~62 GB | Near identical |
| FP8 (our choice) | 8 | ~31 GB | < 1% degradation |
| INT4 / GPTQ | 4 | ~16 GB | Noticeable degradation |

FP8 is the 2026 sweet spot: half the size of FP16, near-zero degradation in academic benchmarks, native hardware acceleration on Ada Lovelace and Blackwell, and single-flag vLLM support. All four models ship native FP8 checkpoints on Hugging Face.
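The sizes in the table follow from simple arithmetic: parameters times bits per parameter, divided by 8. The sketch below reproduces them, assuming 1 GB = 10^9 bytes and counting weights only. The KV cache and activations need additional VRAM on top, so "fits" here is a necessary condition, not a sufficient one.

```python
def weights_size_gb(params_billion, bits_per_param):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8

FORMATS = {"FP32": 32, "FP16/BF16": 16, "FP8": 8, "INT4": 4}
L40S_VRAM_GB = 48

for name, bits in FORMATS.items():
    size = weights_size_gb(31, bits)
    verdict = "fits" if size < L40S_VRAM_GB else "does not fit"
    print(f"{name:10s} {size:6.1f} GB  -> {verdict} in {L40S_VRAM_GB} GB (weights only)")
```

At FP8 the 31B model leaves roughly 17 GB of the L40S for everything else, which is why the benchmark caps the context at 24K.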

Methodology: real conditions, not an academic benchmark

Most LLM benchmarks measure isolated tasks (MMLU, HumanEval…). We wanted to measure what users actually experience inside Eridia:

  • Full system prompt — the real Eridia prompt (~6,000 tokens): business instructions, user context, safety rules
  • 5 declared tools — file search, code execution, document creation, meeting search, user interaction (OpenAI function calling format)
  • 7 typical French-language queries — from a simple "Hello" to complex legal analysis
  • 3 runs per query, measured in streaming: TTFT (Time To First Token), generation tokens/s, tool-call reliability

| # | Query | Category |
|---|---|---|
| 1 | "Hello, how can you help me?" | Simple chat |
| 2 | "Find my recent contracts, the PDFs uploaded this week" | Tool call |
| 3 | "Summarize my last meeting with the product team" | Tool call |
| 4 | "A 3-year non-compete clause — is it valid?" | Reasoning |
| 5 | "Create an action plan from this morning's meeting" | Reasoning + tool |
| 6 | "Explain GDPR for health data in detail" | Long output |
| 7 | Multi-turn conversation about Eridia vs ChatGPT | Multi-turn |

The primary inference engine is vLLM (v0.19+, native FP8, 24K context). We also tested llama.cpp + TurboQuant on the L40S to compare both approaches — results below.
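For readers who want to reproduce the two streaming metrics, the sketch below shows one way to derive TTFT and generation tokens/s from any token stream. It is not the benchmark's actual script: `measure_stream` and its `clock` parameter are hypothetical names, and the sketch assumes the stream yields one item per generated token (real streaming APIs may pack several tokens into one chunk).

```python
import time

def measure_stream(chunks, clock=time.perf_counter):
    """Consume a token stream; return (ttft_s, tokens_per_s, n_tokens)."""
    start = clock()
    first = last = None
    n_tokens = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now  # first token arrived: this marks the TTFT
        last = now
        n_tokens += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    # Generation speed is measured after the first token, so that prompt
    # processing time does not distort it.
    gen_time = last - first
    tps = (n_tokens - 1) / gen_time if gen_time > 0 else float("nan")
    return ttft, tps, n_tokens

# Deterministic demo: a fake clock stands in for real arrival times.
ticks = iter([0.0, 0.2, 0.3, 0.4])
ttft, tps, n = measure_stream(["Bon", "jour", "!"], clock=lambda: next(ticks))
print(f"TTFT {ttft * 1000:.0f} ms, {tps:.1f} tok/s over {n} tokens")
```

In a real run, `chunks` would be the streamed response of the inference server and the default `time.perf_counter` clock would be used. Separating TTFT from generation speed matters here because the two metrics diverge sharply between engines and models.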

Results

Overview

Averages across the 7 queries, 3 runs each, with vLLM:

| Model | GPU | Avg TTFT | Tokens/s | Tool calls | Verdict |
|---|---|---|---|---|---|
| Gemma 4 26B-A4B | RTX PRO 6000 | 61 ms | 39.5 | 100% | Outright champion |
| Gemma 4 31B | RTX PRO 6000 | 64 ms | 34.5 | 100% | Best quality |
| Gemma 4 31B | L40S | 195 ms | 15.8 | 100% | Best value |
| Gemma 4 26B-A4B | L40S | 260 ms | 16.3 | 100% | Good quality/VRAM ratio |
| Qwen3.5 27B | RTX PRO 6000 | 3.5 s | 28.4 | 100% | Prohibitive TTFT |
| Qwen3.5 35B-A3B | RTX PRO 6000 | 4.3 s | 24.0 | 100% | Prohibitive TTFT |
| Qwen3.5 27B | L40S | 8.4 s | 8.9 | 100% | Too slow for chat |
| Qwen3.5 35B-A3B | L40S | 10.5 s | 9.1 | 100% | Too slow for chat |

Three immediate takeaways:

  1. The Gemma 4 models are alone in the race for interactive chat: first token under 300 ms everywhere, and under 70 ms on Blackwell.
  2. Tool calling is 100% reliable across all 8 configurations — no longer a differentiator in 2026.
  3. The Qwen3.5 models pay for their thinking mode: generation itself is fine, but users stare at an empty screen for several seconds before every answer.

What Blackwell buys you

| Model | TTFT (L40S → Blackwell) | Tokens/s (L40S → Blackwell) | Gain |
|---|---|---|---|
| Gemma 4 26B-A4B | 260 ms → 61 ms | 16.3 → 39.5 | ×2.4 |
| Gemma 4 31B | 195 ms → 64 ms | 15.8 → 34.5 | ×2.2 |
| Qwen3.5 27B | 8.4 s → 3.5 s | 8.9 → 28.4 | ×3.2 |
| Qwen3.5 35B-A3B | 10.5 s → 4.3 s | 9.1 → 24.0 | ×2.6 |

The RTX PRO 6000 delivers an average ×2.6 throughput gain and cuts TTFT by a factor of about 3 to 4 for the Gemma 4 models and about 2.4 for the Qwen3.5 models. The main explanation: memory bandwidth is the limiting factor of LLM inference, and the Blackwell's GDDR7 (~1,600 GB/s vs 864 GB/s) nearly doubles it.
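The gains can be recomputed directly from the overview figures. The short sketch below does so and compares them with the raw bandwidth ratio; all numbers are copied from the tables above, and the dictionary names are ours.

```python
# (L40S, Blackwell) figures copied from the overview table.
TOKENS_PER_S = {
    "Gemma 4 26B-A4B": (16.3, 39.5),
    "Gemma 4 31B": (15.8, 34.5),
    "Qwen3.5 27B": (8.9, 28.4),
    "Qwen3.5 35B-A3B": (9.1, 24.0),
}
TTFT_S = {
    "Gemma 4 26B-A4B": (0.260, 0.061),
    "Gemma 4 31B": (0.195, 0.064),
    "Qwen3.5 27B": (8.4, 3.5),
    "Qwen3.5 35B-A3B": (10.5, 4.3),
}

gains = {m: fast / slow for m, (slow, fast) in TOKENS_PER_S.items()}
ttft_ratios = {m: slow / fast for m, (slow, fast) in TTFT_S.items()}
mean_gain = sum(gains.values()) / len(gains)
bandwidth_ratio = 1600 / 864  # GB/s, Blackwell vs L40S

for m in gains:
    print(f"{m:16s} throughput x{gains[m]:.1f}  TTFT /{ttft_ratios[m]:.1f}")
print(f"mean throughput gain x{mean_gain:.1f}, bandwidth ratio x{bandwidth_ratio:.2f}")
```

The measured throughput gain is larger than the bandwidth ratio alone, which suggests that other generational improvements contribute alongside the faster memory.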

Zooming in: the champion's profile

Per-query detail for the best configuration, Gemma 4 26B-A4B on RTX PRO 6000:

| Query | TTFT | Tokens/s | Tokens generated |
|---|---|---|---|
| Simple chat | 61 ms | 39.4 | 375 |
| File search (tool) | 189 ms | 33.6 | 96 |
| Meeting summary (tool) | 190 ms | 37.8 | 99 |
| Legal reasoning | 166 ms | 40.8 | 874 |
| Action plan (tool) | 173 ms | 41.4 | 202 |
| GDPR explanation (long output) | 261 ms | 38.4 | 1,210 |
| Multi-turn | 62 ms | 43.3 | 568 |

Two patterns hold across every configuration we tested: tool-call queries have a TTFT roughly 3x higher (the model must decide to call the tool before emitting anything), and throughput stays stable even on long 1,200+ token outputs. The Qwen3.5 models, by contrast, show TTFT spikes up to 28 seconds on reasoning queries — thinking mode runs away precisely where users expect a fast answer.
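The "roughly 3x" figure for tool calls can be checked against the per-query table for the champion configuration. The sketch below groups the rows by whether they are marked as tool queries; the grouping and the variable names are ours.

```python
from statistics import mean

# (query, uses_tool, ttft_ms) copied from the per-query table above.
ROWS = [
    ("Simple chat", False, 61),
    ("File search", True, 189),
    ("Meeting summary", True, 190),
    ("Legal reasoning", False, 166),
    ("Action plan", True, 173),
    ("GDPR explanation", False, 261),
    ("Multi-turn", False, 62),
]

tool_ttft = mean(t for _, tool, t in ROWS if tool)
chat_ttft = ROWS[0][2]  # simple chat as the baseline
overall_ttft = mean(t for *_, t in ROWS)

print(f"tool-call mean TTFT: {tool_ttft:.0f} ms")
print(f"ratio vs simple chat: x{tool_ttft / chat_ttft:.1f}")
print(f"mean over all 7 queries: {overall_ttft:.0f} ms")
```

The mean over all seven rows comes out near 157 ms, higher than the 61 ms headline figure in the overview, which coincides with the simple-chat row.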

vLLM or llama.cpp?

We replayed the same queries on the same L40S with llama.cpp (TurboQuant fork, GGUF models) to compare the two reference inference engines:

| Metric | vLLM FP8 | llama.cpp Q4_K_M |
|---|---|---|
| TTFT (first token) | 195–260 ms | 3.6–11 s |
| Generation tokens/s | 15–16 | up to 130 |
| VRAM (Gemma 4 26B MoE) | ~40 GB | ~17 GB |
| Multi-user | Excellent (continuous batching) | Limited |
| Setup | Python stack | Single compiled binary |

The tradeoff is fundamental and symmetrical:

  • vLLM is optimized for prompt processing: specialized CUDA kernels mean the first token arrives almost instantly, even with a 6,000-token system prompt. In exchange, generation tops out around 16 tokens/s on the L40S.
  • llama.cpp is optimized for sequential generation: pure C++, minimal overhead, up to 130 tokens/s, about 8x faster than vLLM on the same GPU. But prefill is far less optimized: 3.6 to 11 seconds before the first token.

Add the VRAM argument: in GGUF Q4_K_M with TurboQuant's compressed KV cache (~6% speed penalty), the same model fits in 17 GB instead of 40 — enough headroom to target 64K+ token contexts without saturating the GPU.

In practice: interactive chat and multi-user workloads → vLLM; long document generation, batch processing, very long contexts or tight VRAM budgets → llama.cpp + TurboQuant.
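The tradeoff can be quantified with a simple latency model: time to the last token is TTFT plus output length divided by generation speed. The sketch below finds the output length at which llama.cpp finishes before vLLM, using the L40S figures from the table. It assumes constant generation speed and takes the "up to 130 tokens/s" peak at face value, so the real crossover is likely somewhat higher.

```python
def total_latency_s(ttft_s, tokens_per_s, n_tokens):
    """Time until the last token arrives: wait for the first, then stream the rest."""
    return ttft_s + n_tokens / tokens_per_s

def crossover_tokens(fast_start, slow_start):
    """Output length at which two engines finish at the same time.

    Each engine is a (ttft_s, tokens_per_s) pair; fast_start has the lower
    TTFT and the lower throughput.
    """
    (ttft_a, tps_a), (ttft_b, tps_b) = fast_start, slow_start
    return (ttft_b - ttft_a) / (1 / tps_a - 1 / tps_b)

VLLM = (0.195, 16)            # best-case L40S figures from the table
LLAMA_CPP_BEST = (3.6, 130)   # lowest TTFT, peak throughput
LLAMA_CPP_WORST = (11.0, 130)

for label, engine in [("best", LLAMA_CPP_BEST), ("worst", LLAMA_CPP_WORST)]:
    n = crossover_tokens(VLLM, engine)
    print(f"llama.cpp ({label}-case TTFT) finishes first beyond ~{n:.0f} tokens")

n_out = 1000
print(f"1,000-token answer: vLLM {total_latency_s(*VLLM, n_out):.1f} s, "
      f"llama.cpp {total_latency_s(*LLAMA_CPP_BEST, n_out):.1f} s")
```

For chat, perceived responsiveness depends on how soon text starts streaming, not on when the answer completes, which is why the low-TTFT engine remains the better fit there despite the lower throughput.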

Analysis: why such gaps?

Qwen3.5's thinking mode, a trap for chat

The Qwen3.5 models embed internal reasoning (similar to OpenAI's o1/o3): before each answer, the model generates hidden reasoning tokens that consume GPU time while displaying nothing. This mode is enabled by default and cannot be disabled through the standard API. For batch analysis, fine; for a conversational assistant, it is a deal-breaker.

MoE vs dense: an advantage that only shows with bandwidth

On the L40S, the MoE Gemma 4 26B-A4B is not faster than the dense 31B: inference there is bound by reading weights from memory, and an MoE must read all its weights even if only a fraction is active. On the RTX PRO 6000, the memory bottleneck loosens and the MoE's compute advantage emerges: 39.5 tokens/s vs 34.5 for the dense model — a lead that will grow with multi-user workloads thanks to the lower compute load per request.
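For the dense model, the memory-bound argument gives a simple ceiling: if every generated token requires reading all the weights once, single-stream decode speed cannot exceed bandwidth divided by model size. The sketch below applies this to Gemma 4 31B in FP8. It is a back-of-the-envelope estimate that ignores KV cache reads, activations, and kernel overhead, and it covers only the dense model.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s, bytes_read_per_token_gb):
    """Upper bound on single-stream decode speed when memory reads dominate."""
    return bandwidth_gb_s / bytes_read_per_token_gb

DENSE_31B_FP8_GB = 31  # ~1 byte per parameter, weights only

# GPU -> (memory bandwidth in GB/s, measured tokens/s for Gemma 4 31B)
GPUS = {"L40S": (864, 15.8), "RTX PRO 6000": (1600, 34.5)}

for gpu, (bandwidth, measured) in GPUS.items():
    ceiling = decode_ceiling_tokens_per_s(bandwidth, DENSE_31B_FP8_GB)
    print(f"{gpu:13s} ceiling {ceiling:5.1f} tok/s, measured {measured} "
          f"({measured / ceiling:.0%} of ceiling)")
```

The measured speeds land at a bit more than half of this ceiling on both GPUs, which is consistent with decoding being largely limited by memory bandwidth.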

What does it cost?

| Solution | Monthly cost | Cost for 20 users |
|---|---|---|
| Self-hosted L40S (Scaleway) | ~€450 fixed | ~€22/user |
| RTX PRO 6000 (RunPod) | ~€900 fixed | ~€45/user |
| OpenAI GPT-4o API | Variable | ~€50–150/user* |
| Anthropic Claude API | Variable | ~€50–200/user* |

* Estimate for ~500 requests/user/month with a comparable system prompt.

At 20 users, self-hosting costs 2 to 4 times less than the APIs — with a fixed, predictable cost and data that never leaves Europe. That is exactly the model behind Eridia's custom deployment offer.
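The per-user figures are the fixed monthly GPU price divided by the number of users, which also gives a break-even team size against a per-user API bill. The sketch below uses the prices from the table and the low end of the API estimate (€50/user); it counts GPU rental only and leaves out operations and maintenance effort.

```python
import math

def cost_per_user(monthly_fixed_eur, users):
    """Fixed monthly GPU price spread evenly across users."""
    return monthly_fixed_eur / users

def break_even_users(monthly_fixed_eur, api_cost_per_user_eur):
    """Smallest team size at which the fixed GPU bill is no more than the API bill."""
    return math.ceil(monthly_fixed_eur / api_cost_per_user_eur)

for name, fixed in [("L40S", 450), ("RTX PRO 6000", 900)]:
    print(f"{name}: EUR {cost_per_user(fixed, 20):.1f}/user at 20 users, "
          f"break-even vs a EUR 50/user API bill at {break_even_users(fixed, 50)} users")
```

Because the GPU cost is fixed, every additional user lowers the per-user figure, whereas the API bill grows roughly in proportion to usage.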

Our recommendations

Top pick — Gemma 4 26B-A4B (MoE) on RTX PRO 6000, ~€900/mo. 40 tokens/s, first token in 61 ms: an experience indistinguishable from premium cloud APIs, with full sovereignty. The 96 GB of VRAM leaves headroom for long contexts and multi-user batching. Ideal for 10–50 users.

Tight budget — Gemma 4 31B-it on Scaleway L40S, ~€450/mo. The best answer quality at this price: sub-200 ms TTFT, 16 tokens/s, reliable tool calling, excellent French. Perfect to start with 10–30 users (~€22/user/month), hosted in France.

Maximum quality — Gemma 4 31B-it on RTX PRO 6000, ~€900/mo. If reasoning depth matters more than raw speed: 35 tokens/s, 64 ms TTFT, and the most structured answers of the panel.

Batch workloads and long contexts — Gemma 4 26B-A4B in GGUF Q4_K_M + TurboQuant on L40S, ~€450/mo. 130 tokens/s with only 17 GB of VRAM: the right choice for automation pipelines (summaries, data extraction, report generation) where TTFT is not critical.

Want this stack without managing it yourself? Eridia installs these models turnkey on your infrastructure, with native security and GDPR compliance. Let's talk.

What's next?

  • Multi-user tests — measuring degradation with 5, 10, 20 concurrent requests
  • Long-context benchmark — f16 vs turbo4 KV cache at 64K and 128K tokens
  • New models — Mistral Small 3.1 and Llama 4 Scout as soon as they are available
  • Hybrid setup — vLLM for chat + llama.cpp for batch on the same GPU

Benchmark run with Eridia v2 — eridia.ai. Engines: vLLM 0.19+ (native FP8), llama.cpp TurboQuant fork (GGUF Q4_K_M/Q8_0, turbo4 KV cache). GPUs: Scaleway L40S (48 GB, 864 GB/s), RunPod RTX PRO 6000 Blackwell (96 GB, ~1,600 GB/s). 24K context, 7 French-language queries, 3 runs per query, full system prompt (~6,000 tokens). Detailed methodology and scripts available on request.

#Benchmark #LLM #Self-hosted #Gemma #vLLM #GPU
