Engineering Blog

Published on 4 May 2026

Self-hosting coding LLMs: experiences from running them on cloudscale


Self-hosting coding LLMs is possible today, but getting them to work well requires significant tuning. We share our experience running and evaluating models on cloudscale.

In today’s world, engineers and businesses increasingly expect to develop software with the help of coding agents. There is a wide range of agents and models available, each with its own trade-offs.

At cloudscale, we care deeply about data privacy and digital sovereignty. We run our own GPU instances, which we fully control. When discussing coding agents, it therefore felt natural to compare self-hosting against hosted models and public offerings like Anthropic's Claude Code.

We spent several weeks evaluating this setup. The short answer: self-hosting works, but you inherit complexity that hosted providers hide from you.

The test setup

cloudscale.ch offers GPU instances equipped with NVIDIA RTX PRO 6000 Max-Q GPUs with 96GB VRAM each. Up to four GPUs can be attached to a single VM. Our setup consisted of one VM with one GPU attached:

                ┌───────────────┐
                │   Clients     │
                │ (opencode,    │
                │  agents, …)   │
                └──────┬────────┘
                ┌──────▼────────┐
                │    Caddy      │
                │    (TLS)      │
                └──────┬────────┘
                ┌──────▼────────┐
                │   LiteLLM     │
                │ (API gateway) │
                │ OpenAI compat │
                └──────┬────────┘
                ┌──────▼────────┐
                │     vLLM      │
                │  (inference)  │
                └──────┬────────┘
                ┌──────▼────────┐
                │     GPU       │
                │ RTX 6000 96GB │
                └───────────────┘

Caddy handles TLS termination. LiteLLM provides an API gateway with model routing. vLLM loads and serves the model using the underlying GPU.

Why LiteLLM

LiteLLM gives you a single API endpoint that can route requests across multiple backends: a local model on vLLM, an alternative local model, or a hosted provider as fallback. It also provides access control and usage tracking, among many other features.

            ┌────────────────────┐
            │     LiteLLM        │
            │  (routing layer)   │
            └─────────┬──────────┘
        ┌─────────────┼─────────────┐
        │             │             │
 ┌──────▼──────┐ ┌────▼──────┐ ┌────▼────────┐
 │ Local 35B   │ │ Local alt │ │ Hosted LLM  │
 └─────────────┘ └───────────┘ └─────────────┘
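
To a client, all of this is invisible: it talks to a single OpenAI-compatible endpoint and picks a backend purely by model name. As a minimal sketch of a client call (the endpoint URL, key, and model alias are placeholders, not our actual configuration):

# Any OpenAI-compatible client can talk to the LiteLLM gateway.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.example.com/v1",  # placeholder gateway URL
    api_key="sk-litellm-xxxx",              # placeholder virtual key
)

# "local-coder" is a hypothetical alias defined in LiteLLM; the gateway
# decides whether the request goes to vLLM, an alternative local model,
# or a hosted fallback.
response = client.chat.completions.create(
    model="local-coder",
    messages=[{"role": "user", "content": "Reverse a string in Python."}],
)
print(response.choices[0].message.content)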

What fits in 96GB

Very early on, we had to determine what fits into 96GB of VRAM. The sweet spot seems to be ~26-35B models and we had good success with FP8 quantization. A natural next step is NVFP4: our RTX PRO 6000 Max-Q GPUs use NVIDIA's Blackwell architecture, which has native hardware support for this 4-bit floating-point format. Compared to FP8, NVFP4 should either let us fit larger models into 96GB or leave more headroom for the KV cache and concurrent requests. We haven't tested it yet, but it's high on our list.
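
A rough back-of-the-envelope calculation shows why (a simplified sketch that only counts weights and ignores activations and runtime overhead):

# Estimate weight VRAM at different quantization levels for a 35B model.
# Simplified: weights only, no activations or runtime overhead.
def weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    return params_billion * 1e9 * (bits_per_param / 8) / 1e9

for bits, name in [(16, "BF16"), (8, "FP8"), (4, "NVFP4")]:
    weights = weight_vram_gb(35, bits)
    print(f"35B at {name}: ~{weights:.0f} GB weights, ~{96 - weights:.0f} GB left for KV cache")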

Our test setup was minimal: a single engineer comparing results between the self-hosted models and Claude Opus 4.6/4.7 using Claude Code. Towards the end of the testing period we also ran some limited concurrency tests; a broader, production-like test is still pending.

Model evaluation

Initially we tested Qwen2.5-Coder-32B and deepseek-coder-33b-awq, but neither was sufficient for our use case. This was partly because our setup was still suboptimal at that stage, and partly because newer models are simply much better than older ones.

When Gemma 4 was released, we immediately switched to it (first gemma-4-31B-it, then gemma-4-26B-A4B-it). The improvement was night and day: this model worked much better in all use cases. Compared to Claude Code, however, it still lagged significantly behind in code quality.

Qwen3.6-35B was finally the model that gave us confidence that self-hosting is viable. Its coding quality is quite good, and its reasoning ability is solid as well. Compared to Claude Opus 4.6/4.7 it's not quite there yet, but it's genuinely usable.

To broaden the comparison, we also configured LiteLLM to proxy models from Infomaniak: Qwen3-VL-235B, GPT-OSS-120B, and Kimi-K2.5. We chose Infomaniak because it is a Swiss provider that hosts these models on its own infrastructure in Swiss data centers.

Comparing these three, there was a clear winner in reasoning and coding ability: Kimi-K2.5. We did notice a latency issue, though: during the day, time to first token (TTFT) was much higher than in the evening. This appears to be a provider-side issue and may improve with further usage or coordination.
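
TTFT is straightforward to measure with any streaming client. A minimal sketch, again with a placeholder endpoint and model alias:

# Measure time to first token (TTFT) via a streaming chat completion.
import time
from openai import OpenAI

client = OpenAI(base_url="https://llm.example.com/v1", api_key="sk-xxxx")

start = time.monotonic()
stream = client.chat.completions.create(
    model="kimi-k2.5",  # placeholder model alias
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {time.monotonic() - start:.2f}s")
        break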

The time-consuming part: configuration and iteration

The biggest surprise was not getting a model to run, but getting it to behave like a reliable coding assistant.

Early on, we noticed that configuring LiteLLM and vLLM is non-trivial. There are lots of knobs to turn before the system runs smoothly, and each model required its own set of parameters to be tuned and iterated on. Context window, max token length, KV cache, and VRAM usage were among the most challenging aspects to figure out. We now have a configuration that mostly works, but it's not fully there yet and still needs more iteration.

Example configuration

Example flags we used for Qwen3.6-35B-A3B-FP8:

--max-model-len 131072
--gpu-memory-utilization 0.92
--enable-prefix-caching
--enable-auto-tool-choice
--reasoning-parser qwen3
--tool-call-parser qwen3_coder
--kv-cache-dtype fp8
--max-num-seqs 8
--max-num-batched-tokens 8192

These parameters are highly model-specific, and finding a good configuration still requires experimentation. In particular, gpu-memory-utilization and kv-cache-dtype influenced how efficiently we could use the available VRAM, while max-model-len had a direct impact on memory usage and latency. These settings are not final, and we expect further improvements as our understanding of vLLM, LiteLLM, and agent behavior evolves.
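
A rough KV cache calculation illustrates how these flags interact (the model dimensions in this sketch are made-up placeholders, not the actual ones):

# Per-token KV cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes.
# Hypothetical model dimensions, for illustration only.
layers, kv_heads, head_dim = 48, 8, 128

def kv_cache_gb(context_len: int, bytes_per_elem: int) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return context_len * per_token / 1e9

for bytes_per_elem, name in [(2, "FP16"), (1, "FP8")]:
    gb = kv_cache_gb(131072, bytes_per_elem)
    print(f"Full 131072-token context at {name}: ~{gb:.1f} GB per sequence")

With these placeholder dimensions, halving the KV cache precision to FP8 roughly doubles how many long-context sequences fit into the VRAM left over after the weights.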

Agent

The agent we used also has a significant impact on the overall experience. We initially wanted to run all tests through the Claude Code CLI to minimize the differences between the options we compared. However, it turned out to be non-trivial to make Claude Code work against self-hosted systems, because it speaks a different API (Anthropic's Messages API rather than the Responses/Chat Completions API). While LiteLLM has a translation layer, it did not work reliably in our case, most likely because the logic that decides which translation layer to use is not configured for self-hosted models by default.
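
The two APIs differ at the wire level, which is why a translation layer is needed in the first place. Roughly, with simplified payloads for illustration:

# Anthropic Messages API (what Claude Code speaks): POST /v1/messages
anthropic_request = {
    "model": "claude-opus-4-6",  # placeholder model name
    "max_tokens": 1024,          # required by the Messages API
    "system": "You are a coding assistant.",  # system prompt is a top-level field
    "messages": [{"role": "user", "content": "Refactor this function."}],
}

# OpenAI Chat Completions API (what vLLM and LiteLLM expose): POST /v1/chat/completions
openai_request = {
    "model": "local-coder",
    # the system prompt is just another message; max_tokens is optional
    "messages": [
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Refactor this function."},
    ],
}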

Beyond the privacy and sandbox trade-offs that come with any agent choice, agent compatibility turned out to be a bigger constraint than expected.

Opencode worked very well from the get-go. It's highly customizable, and it's easy to configure different models for different agents. Still, it takes time to figure out which model to use for which agent (plan vs. build vs. explore vs. general), and settings such as temperature have a big effect on output quality.

Is self-hosting viable?

Getting a self-hosted system to perform well requires tuning across multiple layers: model, inference stack, and agent integration. There are many parameters to tune before you get a decent system, and even then it may not support all the features of hosted solutions (for example, exposing model reasoning is not consistently supported across models).

What you get with self-hosting is full control, data privacy, and predictable cost. But it takes time to tune, debug, and maintain. Case in point: during our testing period, we upgraded vLLM and LiteLLM almost daily.

With hosted providers, you get good defaults, mostly stable behavior, and minimal setup. However, you also take on a heavy dependency on the reliability of that service: when the provider has an outage, degrades performance, or changes pricing or model behavior, your engineering workflow is directly affected.

To sum up, self-hosting is viable, but not plug-and-play: you inherit the complexity that hosted providers hide. This partially also applies to hosted models like those from Infomaniak: while you get an already optimized model setup, choosing which model to use for which task and keeping costs predictable are still non-trivial. In both cases, the number of users and the expected context window are important parameters.

For cloudscale.ch, we have not yet reached a final conclusion. What is clear, however, is that self-hosting coding LLMs is viable but introduces a non-trivial operational and engineering cost that is often underestimated.


If you would like to share comments or corrections with us, you can reach our engineers at engineering-blog@cloudscale.ch.
