<?xml version="1.0" encoding="utf-8" ?>
    <rss
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:content="http://purl.org/rss/1.0/modules/content/"
      xmlns:atom="http://www.w3.org/2005/Atom"
      version="2.0"
    >
      <channel>
        <title><![CDATA[Cloudscale Engineering Blog RSS Feed]]></title>
        <description>
          <![CDATA[Posts on technical topics written by cloudscale engineers, unfiltered and in depth.]]>
        </description>
        <link>https://www.cloudscale.ch</link>
        <language>de</language>
        <lastBuildDate>Mon, 04 May 2026 00:00:00 GMT</lastBuildDate>
        <atom:link href="https://www.cloudscale.ch/rss-engineering-blog-de.xml" rel="self" type="application/rss+xml" />
        
        <item>
          <title><![CDATA[Self-hosting coding LLMs: lessons from running them on cloudscale]]></title>
          <link>https://www.cloudscale.ch/en/engineering-blog/2026/05/04/self-hosting-coding-llms</link>
          <pubDate>Mon, 04 May 2026 00:00:00 GMT</pubDate>
          <guid isPermaLink="false">https://www.cloudscale.ch/en/engineering-blog/2026/05/04/self-hosting-coding-llms</guid>
          <description>
            <![CDATA[<p>Self-hosting coding LLMs is possible today, but getting them to work well requires significant tuning. We share our experience running and evaluating models on cloudscale.</p>]]>
          </description>
          <content:encoded><![CDATA[<p>Engineers and businesses increasingly expect to develop software with the help of coding agents.
There is a wide range of agents and models available, each with its own trade-offs.</p>
<p>At cloudscale, we care deeply about <a href="https://www.cloudscale.ch/en/engineering-blog/2025/09/17/digital-sovereignty">data privacy and digital sovereignty</a>. We have our own GPU instances which we fully control.
When discussing coding agents, it therefore felt natural to compare self-hosting with hosted models and public providers like Anthropic’s Claude Code.</p>
<p>We spent several weeks evaluating this setup. The short answer: self-hosting works, but you inherit complexity that hosted providers hide from you.</p>
<h3>The test setup</h3>
<p>cloudscale.ch offers GPU instances equipped with NVIDIA RTX PRO 6000 Max-Q GPUs with 96GB VRAM each. Up to four GPUs can be attached to a single VM.
Our setup consisted of one VM with one GPU attached:</p>
<pre><code>                ┌───────────────┐
                │   Clients     │
                │ (opencode,    │
                │  agents, …)   │
                └──────┬────────┘
                       │
                ┌──────▼────────┐
                │    Caddy      │
                │    (TLS)      │
                └──────┬────────┘
                       │
                ┌──────▼────────┐
                │   LiteLLM     │
                │ (API gateway) │
                │ OpenAI compat │
                └──────┬────────┘
                       │
                ┌──────▼────────┐
                │     vLLM      │
                │  (inference)  │
                └──────┬────────┘
                       │
                ┌──────▼────────┐
                │     GPU       │
                │ RTX 6000 96GB │
                └───────────────┘
</code></pre>
<p>Caddy handles TLS termination. LiteLLM provides an API gateway with model routing. vLLM loads and serves the model using the underlying GPU.</p>
<h4>Why LiteLLM</h4>
<p>LiteLLM gives you a single API endpoint that can route requests across multiple backends: a local model on vLLM, an alternative local model, or a hosted provider as fallback.
It also provides access control and usage tracking, among other features.</p>
<pre><code>            ┌────────────────────┐
            │     LiteLLM        │
            │  (routing layer)   │
            └─────────┬──────────┘
                      │
        ┌─────────────┼─────────────┐
        │             │             │
 ┌──────▼──────┐ ┌────▼──────┐ ┌────▼────────┐
 │ Local 35B   │ │ Local alt │ │ Hosted LLM  │
 └─────────────┘ └───────────┘ └─────────────┘
</code></pre>
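<p>The fallback idea behind this routing layer can be sketched in a few lines of Python. This is only an illustration of the concept, not how LiteLLM is implemented; the backend functions and names below are hypothetical stand-ins for real model endpoints.</p>

```python
from typing import Callable, Sequence

# Hypothetical backend type: takes a prompt, returns a completion or raises.
Backend = Callable[[str], str]


def route_with_fallback(prompt: str, backends: Sequence[tuple[str, Backend]]) -> tuple[str, str]:
    """Try each backend in order and return (name, completion) from the first that succeeds."""
    errors = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real gateway would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


def local_35b(prompt: str) -> str:
    # Simulates an unreachable local inference server.
    raise ConnectionError("vLLM not reachable")


def hosted_llm(prompt: str) -> str:
    # Simulates a hosted provider that answers.
    return "echo: " + prompt


name, answer = route_with_fallback("hello", [("local-35b", local_35b), ("hosted", hosted_llm)])
print(name, answer)
```

<p>In this sketch the order of the list encodes the preference: the local model is tried first and the hosted provider only serves as a fallback.</p>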
<h4>What fits in 96GB</h4>
<p>Very early on, we had to determine what fits into 96GB of VRAM. The sweet spot seems to be ~26-35B models and we had good success with FP8 quantization.
A natural next step is NVFP4: our RTX PRO 6000 Max-Q GPUs use NVIDIA&#x27;s Blackwell architecture, which has native hardware support for this 4-bit floating-point format.
Compared to FP8, NVFP4 should either let us fit larger models into 96GB or leave more headroom for the KV cache and concurrent requests. We haven&#x27;t tested it yet, but it&#x27;s high on our list.</p>
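<p>A rough back-of-the-envelope calculation shows why the quantization format matters so much. The sketch below only counts model weights and ignores activations, KV cache, and runtime overhead. The 35B parameter count is taken from the models discussed here; the 4.5 effective bits for NVFP4 is our assumption to account for scaling metadata, not a measured value.</p>

```python
VRAM_GB = 96  # per GPU, as in our setup


def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight footprint in GB (10^9 bytes), weights only."""
    return params_billion * bits_per_param / 8


formats = [
    ("BF16", 16),
    ("FP8", 8),
    ("NVFP4 (assumed 4.5 effective bits)", 4.5),
]

for label, bits in formats:
    size = weight_gb(35, bits)
    print(f"{label}: {size:.1f} GB weights, {VRAM_GB - size:.1f} GB left for everything else")
```

<p>Whatever is left after loading the weights has to cover the KV cache, activations, and concurrent requests, which is why a smaller weight format translates directly into more usable context.</p>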
<p>Our test setup was minimal: a single engineer comparing results between the self-hosted models and Claude Opus 4.6/4.7 using Claude Code.
Towards the end of our testing period, we briefly tested concurrency, but only to a limited extent. A broader, production-like test is still pending.</p>
<h3>Model evaluation</h3>
<p>Initially we tested Qwen2.5-Coder-32B and deepseek-coder-33b-awq. Neither was sufficient for our use case, partly because our setup was still suboptimal
at that stage, but also because newer models are simply much better than older ones.</p>
<p>When Gemma 4 was released, we immediately switched to it (first gemma-4-31B-it, then gemma-4-26B-A4B-it).
The improvement was night and day: Gemma 4 worked much better in all our use cases. However, it still
lagged significantly behind Claude Code in code quality.</p>
<p>Qwen3.6-35B was finally the model which gave us confidence that self-hosting is viable. Coding quality is quite good,
and reasoning ability is solid as well. Compared to Claude Opus 4.6/4.7, it&#x27;s not quite there yet, but it&#x27;s actually usable.</p>
<p>To broaden the comparison, we also configured LiteLLM to proxy models from Infomaniak: Qwen3-VL-235B, GPT-OSS-120B, and Kimi-K2.5.
We chose Infomaniak because it is a Swiss provider that hosts these models on its own infrastructure in Swiss data centers.</p>
<p>Comparing these three, there was a clear winner when it comes to reasoning and coding ability: Kimi-K2.5.
We did notice a latency issue, though: during the day, the time to first token (TTFT) was much higher than in the evening.
This appears to be a provider-side issue and may improve with further usage or coordination.</p>
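<p>For reference, TTFT can be measured on the client side by timing how long a streaming response takes to deliver its first chunk. The following is a minimal sketch, not the tooling we used: <code>fake_stream</code> is a hypothetical stand-in for a real streaming response, and the delay is simulated.</p>

```python
import time
from typing import Iterable, Iterator


def measure_ttft(stream: Iterable[str]) -> tuple[float, str]:
    """Return (seconds until the first chunk arrives, full concatenated text)."""
    start = time.perf_counter()
    iterator = iter(stream)
    first = next(iterator, "")  # blocks until the first chunk is available
    ttft = time.perf_counter() - start
    return ttft, first + "".join(iterator)


def fake_stream(delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming completion."""
    time.sleep(delay_s)  # simulated queueing and prefill time
    yield "Hello"
    yield ", world"


ttft, text = measure_ttft(fake_stream(0.05))
print(f"TTFT: {ttft * 1000:.0f} ms, text: {text!r}")
```

<p>Running such a measurement at different times of day is a simple way to confirm whether latency varies with provider load.</p>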
<h3>The time-consuming part: Configuration &amp; Iterations</h3>
<p>The biggest surprise was not getting a model to run, but getting it to behave like a reliable coding assistant.</p>
<p>Early on, we noticed that configuring LiteLLM and vLLM is non-trivial. There are lots of knobs to tune
to get a system running smoothly. Each model required its own set of parameters to be tuned and iterated on.
Context window, max token length, KV cache, and VRAM usage were among the most challenging aspects to figure out.
We do have a configuration which mostly works now, but it&#x27;s not fully there yet. It still requires more iterations.</p>
<h4>Example configuration</h4>
<p>Example flags we used for <code>Qwen3.6-35B-A3B-FP8</code>:</p>
<pre><code>--max-model-len 131072
--gpu-memory-utilization 0.92
--enable-prefix-caching
--enable-auto-tool-choice
--reasoning-parser qwen3
--tool-call-parser qwen3_coder
--kv-cache-dtype fp8
--max-num-seqs 8
--max-num-batched-tokens 8192
</code></pre>
<p>These parameters are highly model-specific, and finding a good configuration still requires more experimentation.
In particular, <code>--gpu-memory-utilization</code> and <code>--kv-cache-dtype</code> influenced how efficiently we could use the available VRAM,
while <code>--max-model-len</code> had a direct impact on memory usage and latency. These settings are not final,
and we expect further improvements as our understanding of vLLM, LiteLLM, and agent behavior evolves.</p>
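<p>To get a feeling for how these settings interact, a rough KV cache budget can be estimated as follows. This is a simplified sketch for a plain transformer with a standard KV cache. The architecture numbers (48 layers, 8 KV heads, head dimension 128) and the 4 GB of runtime overhead are placeholders we made up for illustration, not the real values of any specific model, and vLLM's actual accounting is more detailed.</p>

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """One key and one value vector per layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_cache_tokens(vram_gb, utilization, weights_gb, overhead_gb, bytes_per_token) -> int:
    """Tokens that fit into what remains of the memory budget after weights and overhead."""
    budget_bytes = (vram_gb * utilization - weights_gb - overhead_gb) * 1e9
    return int(max(budget_bytes, 0) // bytes_per_token)


# Placeholder architecture, FP8 KV cache (1 byte per element).
per_token = kv_bytes_per_token(layers=48, kv_heads=8, head_dim=128, bytes_per_elem=1)

# 96 GB GPU at 0.92 utilization, about 35 GB of FP8 weights, 4 GB assumed overhead.
tokens = kv_cache_tokens(96, 0.92, weights_gb=35, overhead_gb=4, bytes_per_token=per_token)

print(f"{per_token} bytes per token, about {tokens} cacheable tokens")
print(f"about {tokens / 131072:.1f} sequences at the full 131072-token context")
```

<p>Even with made-up numbers, the shape of the trade-off is visible: a longer maximum context or a wider KV cache data type directly reduces how many sequences can be served at once.</p>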
<h4>Agent</h4>
<p>The agent we used also has a significant impact on the overall experience. We initially wanted to test with the Claude Code CLI in order to
minimize the differences between the options we tested. However, it turned out that it&#x27;s not trivial to make
Claude Code work against self-hosted systems because it uses a different API (the Messages API, not the Responses or Chat Completions API).
While LiteLLM has a translation layer, it did not work reliably in our case, most likely because the code path that decides which
translation layer to use is not configured for self-hosted models by default.</p>
<p>Beyond the <a href="https://www.cloudscale.ch/en/engineering-blog/2026/04/24/navigating-ai-coding-agents">privacy and sandbox trade-offs</a> that come with any agent choice, agent compatibility turned out to be a bigger constraint than expected.</p>
<p>Opencode worked very well from the get-go. It&#x27;s highly customizable and makes it easy to configure different
models for different agents. Still, it takes time to figure out which model to use for which agent (plan vs. build vs. explore vs. general), and
settings such as <a href="https://opencode.ai/docs/agents/#temperature"><code>temperature</code></a> have a big effect on the quality.</p>
<h3>Is self-hosting viable?</h3>
<p>Getting a self-hosted system to perform well requires tuning across multiple layers: model, inference stack, and agent integration.
There are many parameters to tune until you get a decent system. Even then, it may not support all features
provided by hosted solutions (e.g. exposing model reasoning is not consistently supported across models).</p>
<p>What you get with self-hosted is full control, data privacy, and predictable cost. But it requires time to tune, debug, and maintain.
Case in point: during our testing period, we upgraded vLLM and LiteLLM almost daily.</p>
<p>With hosted providers, you get good defaults, mostly stable behavior, and enjoy minimal setup. However, you also take on a heavy dependency on the reliability of that service:
when the provider has an outage, degrades performance, or changes pricing or model behavior, your engineering workflow is directly affected.</p>
<p>To sum up, self-hosting is viable, but not plug-and-play. You inherit the complexity that hosted providers hide.
This partially also applies to hosted models like those from Infomaniak. While you get an already optimized model setup,
choosing which model to use for which task and ensuring cost predictability are also non-trivial.
In both cases, the number of users and expected context window are important parameters.</p>
<p>For cloudscale.ch, we have not yet reached a final conclusion.
What is clear, however, is that self-hosting coding LLMs is viable
but introduces a non-trivial operational and engineering cost that is often underestimated.</p>]]></content:encoded>
        </item>
        <item>
          <title><![CDATA[Digital Sovereignty and Security: Navigating AI Coding Agents]]></title>
          <link>https://www.cloudscale.ch/en/engineering-blog/2026/04/24/navigating-ai-coding-agents</link>
          <pubDate>Fri, 24 Apr 2026 00:00:00 GMT</pubDate>
          <guid isPermaLink="false">https://www.cloudscale.ch/en/engineering-blog/2026/04/24/navigating-ai-coding-agents</guid>
          <description>
            <![CDATA[<p>I believe close-to-bare-metal freedom with AI coding agents is possible and important. But I&#x27;d be doing you a disservice if I pretended it was easy, or that sovereignty alone makes you safe.</p>]]>
          </description>
          <content:encoded><![CDATA[<p>Let me explain what I mean.</p>
<img width="480" src="https://static.cloudscale.ch/img/engineering-blog-house-of-ai-agents-7a0163562cef.png" alt="engineering-blog-house-of-ai-agents.png" caption="The House of Safe Coding Agents: my current mental model for safe and sovereign use. We will see how it ages."/>
<p>At cloudscale, the belief has always been that where control and independence matter, operating one&#x27;s own stack is the best approach: from our Linux-based OS to our Ceph storage and OpenStack cloud, all the way down. Self-hosted inference is simply the next logical layer of that same commitment: As AI becomes core infrastructure, relying on external providers would contradict the principles of digital sovereignty, as outlined in <a href="https://www.cloudscale.ch/en/engineering-blog/2025/09/17/digital-sovereignty">this article</a>. cloudscale offers <a href="https://www.cloudscale.ch/en/gpu">GPU infrastructure</a> for its customers and self-hosted inference is putting it to use in-house.</p>
<h3>The Clipboard Problem</h3>
<p>Even on the most privacy-conscious engineering team, you are always one mishap away from a leak. This isn&#x27;t hypothetical. We&#x27;ve all seen it happen with Google Translate. Someone works on a ticket written in a foreign language, pastes a block of customer data to translate it, and suddenly that data has touched a server none of us approved.</p>
<p>Coding agents introduce the same failure mode at higher velocity, and they don&#x27;t wait to be asked. They reach for what they need: a schema, an env file, a support thread sitting open in context. The agent helpfully processes it. You get your answer. And somewhere, something you didn&#x27;t intend to share has left the building.</p>
<p>These are the stakes: we&#x27;re not just protecting source code. We&#x27;re protecting proprietary logic, credentials, customer data, and whatever else ends up in that prompt. Self-hosting your inference layer is one of the most meaningful choices you can make here. But it&#x27;s a starting point, not a finish line.</p>
<h3>The Obvious Part: Self-Hosted Inference</h3>
<p>If you decide to go down this road, you probably picture something like the Ollama experience: pull a model, run a command, done. The go-kart version. Clean, fast, fun. And for one engineer, on one machine, running one task at a time, it genuinely is.</p>
<p>Then the second engineer joins. Then the third. Then someone runs Claude Code with four parallel sub-agents on a long-context refactor, while two colleagues are doing the same. Suddenly you need to think about who gets GPU time and memory bandwidth and when, how requests queue, and whether one heavy session can starve everyone else. You didn&#x27;t buy a bigger go-kart. That&#x27;s how you end up with a commercial jetliner on the runway while barely understanding why you need flaps. Not because you made a bad decision, but because a tool that works beautifully for one person becomes infrastructure the moment it serves a team. And infrastructure has a completely different set of requirements than a local dev tool.</p>
<p>And here&#x27;s the uncomfortable truth: if you actually care about safety, you need to go deep. Blindly accepting whatever your favorite AI suggests as a fix for your performance problem is exactly the behavior you&#x27;re trying to protect against, so the same standard applies here. You need to actually understand what&#x27;s happening, and that means going deep on dynamic and continuous batching, KV caching, and before long you&#x27;re dusting off your understanding of attention mechanisms just to reason about how many long-context multi-agent users fit into a given amount of VRAM. The good news: this stuff is genuinely fascinating. The bad news: it&#x27;s probably not what you had in mind when you started this undertaking on a Tuesday afternoon hoping to offload the boring parts of your development.</p>
<p>This isn&#x27;t a reason to avoid it. I genuinely think it&#x27;s worth it, if you value privacy. But not as a side project, not as a one-afternoon setup that then runs unattended.</p>
<p><em>Update: Michael explains <a href="https://www.cloudscale.ch/en/engineering-blog/2026/05/04/self-hosting-coding-llms">more technical details</a> in a separate post.</em></p>
<h3>The Overlooked Part: The Chatty Client Problem</h3>
<p>Here&#x27;s the assumption that bites people: &quot;I&#x27;m running the model myself, so I&#x27;m private.&quot;</p>
<p>Maybe. But <em>the model</em> and <em>the client</em> are two separate things, and they have two separate surfaces that can phone home.</p>
<p>Your inference backend, Ollama, vLLM, whatever you&#x27;re running, may have diagnostics, telemetry, or update-check behavior baked in. That&#x27;s worth auditing, but it&#x27;s usually the easier half to control.</p>
<p>Your agent client, Claude Code, Cursor, Copilot, whatever your developers are actually typing into, is a different story entirely. The client can, and most of the well-known ones do, send telemetry, crash reports, usage analytics, and in some cases full prompt content to vendor servers, regardless of where the model lives. &quot;Open source&quot; on the label doesn&#x27;t automatically mean audited data practices; some of the most popular tools in this space are more opaque than they appear.</p>
<p>The categories to look for when you audit:</p>
<ul>
<li><strong>Usage analytics</strong> - what gets counted and reported</li>
<li><strong>Crash reporting</strong> - Sentry-style tools that may capture context around errors</li>
<li><strong>Prompt feedback loops</strong> - the big one: is the vendor training on your prompts? Read the terms carefully.</li>
<li><strong>License and update pings</strong> - lower risk, but worth knowing about</li>
</ul>
<p>Now, the honest part: reading privacy policies is the boring part of this work. I know. We became engineers so we wouldn&#x27;t have to do this. But for this specific use case, the privacy policy is load-bearing. It&#x27;s the document that tells you what actually happens to your data.</p>
<h3>The Danger Zone: What Developers Need to Understand Now</h3>
<p>Everything we&#x27;ve discussed so far, the careful inference setup, the audited client, the dialed-in permissions, can be silently undone by what happens in this section. A single successful prompt injection can turn your sovereign, self-hosted, privacy-preserving setup into an exfiltration machine. All that careful work, on fire, from the inside.</p>
<p>Every single one of these risks applies equally if you&#x27;re running Claude Code against Anthropic&#x27;s API, Cursor against OpenAI, or any self-hosted solution. The threat model doesn&#x27;t care where the model lives. It cares what the agent is allowed to do.</p>
<p>This is my current shortlist, not exhaustive, and guaranteed to look different in six months. But these are the concepts I keep coming back to. (As it turns out, OWASP agrees: they published an <a href="https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/">OWASP Top 10 for Agentic Applications for 2026</a>, and the overlap is uncomfortably exact.)</p>
<p><strong>Prompt injection.</strong> An attacker plants instructions inside something the agent will read, a README, a code comment, an API response, a file in the repo. The agent reads it and treats it as instruction, the same way it treats yours. Your codebase, and everything it touches, is now an attack surface. This is why tools like Context7 and other MCP servers that pull in external content by design give me cold sweats. The injection surface isn&#x27;t a bug someone introduced, it&#x27;s the feature.</p>
<p><strong>Slopsquatting (supply chain attack via model hallucination).</strong> Models confidently suggest packages that don&#x27;t exist. Attackers register those hallucinated package names with malicious payloads, waiting for the agent, or the developer approving it, to run the install. It&#x27;s the npm install problem applied to the space between what the model knows and what it invents. Your agent doesn&#x27;t know the difference between a real package and a plausible-sounding one.</p>
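<p>One simple guard, sketched below, is to compare whatever the agent wants to install against a list of packages you have already vetted. This is an illustration of the idea, not a complete defense: the allowlist contents and the suggested package names are made up, and in practice the list might come from a lockfile or an internal registry.</p>

```python
import re

# Hypothetical allowlist of vetted packages.
APPROVED = {"requests", "numpy", "pyyaml"}


def normalize(name: str) -> str:
    """Normalize a Python package name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def unapproved(packages, approved=APPROVED):
    """Return the packages that are not on the allowlist, in their original order."""
    allowed = {normalize(p) for p in approved}
    return [p for p in packages if normalize(p) not in allowed]


# Made-up agent suggestion: two known packages and one plausible-sounding invention.
suggested = ["requests", "PyYAML", "numpy-fast-utils"]
print(unapproved(suggested))
```

<p>Anything this check returns deserves a human look before an install command runs, no matter how confident the suggestion sounded.</p>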
<p><strong>Tool permissions and sandbox scope.</strong> An agent that can read files, run shell commands, and make outbound HTTP requests is extremely powerful, and default configurations are almost always too permissive, at least for me to sleep well. What makes this worse than a misconfiguration: agents don&#x27;t shrug when a tool is denied. They optimize around the constraint, chaining whatever&#x27;s available to reach the same destination. Before you trust that your setup is locked down, it&#x27;s worth reading and testing (!) exactly what your sandbox setting actually covers. The answer is usually narrower than the name implies.</p>
<p><em>So: do you think the AI named after the father of information theory can read and exfiltrate <code>/etc/passwd</code> while running in sandboxed mode?</em></p>
<p>If we want to use these systems safely and well, we need to make space for people to learn how they fail, how to test their boundaries, and how to build with them responsibly.</p>
<h3>The Mail Server Lesson</h3>
<p>Running your own mail server never protected you from phishing. Sovereignty and security are related, but they&#x27;re not the same thing, and if privacy is actually your goal, you need both. Sovereignty gives you control over where your data lives. Security determines whether it stays there. One without the other is a half-measure. You can self-host everything perfectly and still get hit through a prompt injection in a README. You can have airtight security practices and still leak prompts through a client you never audited.</p>
<p>I want to be clear about what this piece is not. It&#x27;s not a warning to stop using agents. It&#x27;s not a criticism of the tools or the teams building them. This is an exciting moment; I think this is one of the more interesting periods to be writing software.</p>
<p>But we as an industry have also quietly accumulated new problems most of us didn&#x27;t know existed a year ago. And they are solvable, but not alone. The OWASP Top 10 for Agentic Applications is the industry&#x27;s clearest current attempt to name these problems together. Within each organization it takes engineers willing to go deep, and management willing to make that depth possible. And it takes the security team treating developers as allies to educate, not risks to contain. The teams that treat this as a shared problem will be fine. The ones that throw it over a wall won&#x27;t.</p>
<p>We&#x27;ve been here before.</p>]]></content:encoded>
        </item>
        <item>
          <title><![CDATA[Snapshots Are Not Backups: Disaster Recovery for Kubernetes Workloads]]></title>
          <link>https://www.cloudscale.ch/en/engineering-blog/2026/03/31/snapshots-are-not-backups-disaster-recovery-for-k8s</link>
          <pubDate>Tue, 31 Mar 2026 00:00:00 GMT</pubDate>
          <guid isPermaLink="false">https://www.cloudscale.ch/en/engineering-blog/2026/03/31/snapshots-are-not-backups-disaster-recovery-for-k8s</guid>
          <description>
            <![CDATA[<p>Snapshots make Kubernetes operations safer, but they are only one part of a Disaster Recovery strategy. In this Engineering Blog post, we show how Kubernetes workloads on cloudscale infrastructure can be restored even after losing an entire cluster, and what is required to turn snapshots into real recovery capabilities.</p>]]>
          </description>
          <content:encoded><![CDATA[<h3>Introduction</h3>
<p>When we introduced <a href="https://www.cloudscale.ch/en/news/2026/03/31/volume-snapshots-with-csi">CSI snapshot support in our Kubernetes CSI driver</a>, we stressed an important distinction: snapshots help with operational recovery, but they do not replace backups.</p>
<p>Snapshots are ideal for rolling back changes or recovering from failed deployments. However, because they remain tied to the original storage environment, they do not protect against the loss of a full Kubernetes setup.</p>
<p>In this post, we outline how to build a Disaster Recovery workflow for Kubernetes workloads on cloudscale infrastructure, combining CSI snapshots, Velero orchestration, and cross-zone Object Storage to restore applications and their data after a complete environment loss.</p>
<h3>Snapshots vs Backups</h3>
<p>Before touching YAML or CLI commands, we need a shared terminology.</p>
<pre><code class="language-text">| Term              | Meaning                                                 |
|-------------------|---------------------------------------------------------|
| Snapshot          | Point-in-time copy stored on the same storage cluster   |
| Backup            | Recoverable copy independent of original infrastructure |
| Disaster Recovery | Ability to rebuild workloads after infrastructure loss  |
</code></pre>
<p>Snapshots are fast because they live close to the source volume. On cloudscale, they are implemented using copy-on-write
technology inside the storage cluster.</p>
<p>That makes them ideal for:</p>
<ul>
<li>Upgrade safety nets</li>
<li>Migrations</li>
<li>Quick rollback scenarios</li>
<li>Cloning production data into test environments</li>
</ul>
<p>But if the storage cluster itself disappears, snapshots disappear with it. This is why we explicitly recommend keeping
a copy of your data at another geographic location. The remainder of this article shows how to implement exactly that,
using standard Kubernetes tooling.</p>
<h3>What CSI Snapshots Change</h3>
<p>With the release of CSI snapshot support, Kubernetes gains native awareness of storage recovery points. The driver
exposes the standard Kubernetes VolumeSnapshot API, which means snapshots are no longer
something managed exclusively through a provider interface or our Control Panel. Instead, they become first-class
Kubernetes resources.</p>
<p>At first glance this may look like a small technical addition. Operationally, however, it changes how backup and
recovery workflows can be designed. Once snapshots exist as Kubernetes objects, ecosystem tools can interact with them
directly. Backup software such as Velero can request snapshots, track them as part of a backup operation, and later use
them during restores, all through standard Kubernetes APIs. The result is a workflow that remains portable,
automation-friendly and aligned with upstream Kubernetes concepts.</p>
<p>This capability also highlights a limitation that often goes unnoticed. Many Kubernetes backup strategies rely on
exported manifests combined with storage snapshots. As long as the cluster and its storage remain available, recovery
appears straightforward: deleted namespaces or failed deployments can usually be restored without difficulty.</p>
<p>The situation changes once the underlying storage is no longer accessible. Recreating Kubernetes objects is rarely the
challenge; recovering the data they depend on is.</p>
<p>Disaster Recovery therefore requires separating three independent concerns: Kubernetes resource state, a consistent
recovery source for volume data, and a durable copy stored outside the original infrastructure. CSI snapshots address
only one of these aspects. They provide fast recovery points, but they remain bound to the same storage environment.</p>
<p>This distinction leads directly to the hybrid approach described next.</p>
<h3>Demo</h3>
<p>Before starting, you need:</p>
<ul>
<li>A Kubernetes cluster (version 1.28 or newer)</li>
<li><a href="https://github.com/cloudscale-ch/csi-cloudscale">cloudscale CSI driver</a> installed (at least v4.0.0)</li>
<li><a href="https://velero.io/docs/v1.18/basic-install/">Velero CLI</a> installed locally (tested with v1.18.0)</li>
<li>S3-compatible Object Storage</li>
</ul>
<p>With that, we build a hybrid approach:</p>
<ul>
<li>CSI snapshots provide consistent recovery sources.</li>
<li>Velero orchestrates backups and restores.</li>
<li>Object Storage stores durable copies in another region.</li>
</ul>
<p>In this example, the Kubernetes cluster runs in LPG, while backup data is written to Object Storage in RMA, separating
recovery data from the original infrastructure. The same approach also works with other providers.</p>
<p>The core idea is simple: snapshots provide consistency, while Object Storage provides survivability.</p>
<h4>1. Create Object Storage Backup Location</h4>
<p>First, create an Object User and save its credentials to a file:</p>
<pre><code class="language-bash">cat &lt;&lt;EOF &gt; credentials-velero
[default]
aws_access_key_id=&lt;ACCESS_KEY&gt;
aws_secret_access_key=&lt;SECRET_KEY&gt;
EOF
</code></pre>
<p>Then create a bucket in a different region than the cluster. In this example we name it <code>velero-backups</code>.
As our Object Storage exposes an S3-compatible API, Velero&#x27;s AWS plugin works without modifications.</p>
<h4>2. Install Velero with CSI Support</h4>
<p>The important part is enabling both CSI snapshots and the Data Mover.</p>
<pre><code class="language-bash">velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --use-node-agent \
  --backup-location-config \
  region=RMA,s3ForcePathStyle=true,s3Url=https://objects.rma.cloudscale.ch \
  --snapshot-location-config region=LPG \
  --features=EnableCSI,EnableCSIDataMover
</code></pre>
<p>What this configuration establishes:</p>
<ul>
<li>Snapshots are created in LPG (the region where the cluster is running)</li>
<li>Backup data is stored in RMA (this must be the site where the bucket was created)</li>
</ul>
<p>This way, restores do not depend on the original storage cluster. This separation is the foundation of Disaster
Recovery.</p>
<h4>3. Create a Demo Workload</h4>
<p>We deploy a minimal namespace containing a PVC and a Pod writing data to a file.</p>
<pre><code class="language-bash">kubectl create ns backup-demo
</code></pre>
<p>Create a PersistentVolumeClaim:</p>
<pre><code class="language-bash">cat &lt;&lt;EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc
  namespace: backup-demo
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: cloudscale-volume-ssd
EOF
</code></pre>
<p>Writer Pod:</p>
<pre><code class="language-bash">cat &lt;&lt;EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: writer
  namespace: backup-demo
spec:
  restartPolicy: Never
  containers:
    - name: writer
      image: busybox
      command:
        - sh
        - -c
        - |
          while true; do
            echo &quot;cloudscale recovery demo \$(date &#x27;+%Y-%m-%d %H:%M:%S&#x27;)&quot; &gt;&gt; /data/value.txt
            sleep 5
          done
      volumeMounts:
        - mountPath: /data
          name: vol
  volumes:
    - name: vol
      persistentVolumeClaim:
        claimName: demo-pvc
EOF
</code></pre>
<p>Verify data exists:</p>
<pre><code class="language-bash">kubectl exec -n backup-demo writer -- cat /data/value.txt
</code></pre>
<h4>4. Create the Backup</h4>
<p>Create the Disaster Recovery backup:</p>
<pre><code class="language-bash">velero backup create dr-demo \
  --include-namespaces backup-demo \
  --snapshot-move-data \
  --wait
</code></pre>
<p>Velero now performs a coordinated backup workflow.</p>
<p>First, Velero stores Kubernetes resource metadata in Object Storage.</p>
<p>Next, Velero requests CSI snapshots for all PersistentVolumeClaims in the namespace. The cloudscale CSI driver creates
the snapshots without interrupting running workloads.</p>
<p>Velero then prepares the data transfer by creating a temporary volume from each snapshot. The Velero Node Agent mounts
this volume and reads its contents using the CSI Data Mover.</p>
<p>The snapshot data is copied into Object Storage in RMA, while the production PVC remains untouched throughout the
process. The temporary volumes are removed again.</p>
<p>These steps will look like this in the Control Panel:</p>
<video width="100%" autoPlay="" loop=""><source src="/media/velero-data-mover-demo.mp4" type="video/mp4"/></video>
<h4>5. Simulate Catastrophic Failure</h4>
<p>A backup that is never restored is only a theory, so let&#x27;s simulate loss of the environment:</p>
<pre><code class="language-bash">kubectl delete ns backup-demo
</code></pre>
<p>Everything disappears: Pods, PVCs, and Kubernetes objects are removed, leaving only the off-site backup. This can be verified in the Control Panel, where the PVCs are no longer present.</p>
<h4>6. Restore the Namespace</h4>
<p>We can now load the backup to recreate the workload.</p>
<pre><code class="language-bash">velero restore create dr-demo-restore \
  --from-backup dr-demo \
  --wait
</code></pre>
<p>Velero recreates the namespace, creates a PVC, restores the volume content and starts a Pod.
This can take a moment. After the process is done, you can verify data:</p>
<pre><code class="language-bash">kubectl exec -n backup-demo writer -- tail /data/value.txt
</code></pre>
<p>If the original values are present again and new values are being appended, the Disaster Recovery test succeeded.</p>
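<p>The check can also be scripted. The following Python sketch assumes the output of the <code>kubectl exec</code> command above has been captured as a string; the sample lines and the helper names <code>parse_timestamps</code> and <code>largest_gap</code> are made up for illustration. It looks for the longest pause between two entries, which should correspond to the window between the backup and the restored Pod starting to write again:</p>
<pre><code class="language-python">from datetime import datetime, timedelta

PREFIX = "cloudscale recovery demo "


def parse_timestamps(text):
    """Extract the timestamp from every line written by the writer Pod."""
    stamps = []
    for line in text.splitlines():
        if line.startswith(PREFIX):
            stamps.append(datetime.strptime(line[len(PREFIX):], "%Y-%m-%d %H:%M:%S"))
    return stamps


def largest_gap(stamps):
    """Return the longest pause between two consecutive entries."""
    gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    return max(gaps, default=timedelta(0))


# Made-up sample: two entries from the backup, two written after the restore.
sample = "\n".join([
    "cloudscale recovery demo 2026-01-01 10:00:00",
    "cloudscale recovery demo 2026-01-01 10:00:05",
    "cloudscale recovery demo 2026-01-01 10:07:30",
    "cloudscale recovery demo 2026-01-01 10:07:35",
])
stamps = parse_timestamps(sample)
print(largest_gap(stamps))
</code></pre>
<p>A gap well above the 5 second write interval, followed by fresh entries, indicates that the old data was restored and the new Pod is writing again.</p>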
<h3>Conclusion</h3>
<p>CSI snapshot support brings cloudscale storage into Kubernetes-native recovery workflows. Using standard APIs and tools
like Velero, snapshots can evolve from simple rollback mechanisms into real recovery strategies.</p>
<p>The example shown here is intentionally minimal: back up a namespace, remove it, and restore it from scratch. The goal
is not complexity, but proof that recovery works independently of the original environment.</p>
<p>The key takeaway is straightforward: <em>Disaster Recovery begins where recovery no longer depends on the infrastructure that failed.</em></p>]]></content:encoded>
        </item>
        <item>
          <title><![CDATA[Object Storage: Coping With Increased Load and Improving Stability
]]></title>
          <link>https://www.cloudscale.ch/en/engineering-blog/2025/12/04/handling-object-storage-load-increase</link>
          <pubDate>Thu, 04 Dec 2025 00:00:00 GMT</pubDate>
          <guid isPermaLink="false">https://www.cloudscale.ch/en/engineering-blog/2025/12/04/handling-object-storage-load-increase</guid>
          <description>
            <![CDATA[<p>Since the <a href="https://www.cloudscale.ch/en/news/2025/01/30/object-storage-lower-price-and-practical-information">price reduction of our Object Storage</a> in February, usage grew significantly, including total storage consumption. This rapid growth surfaced some issues during load peaks and in our metrics pipeline.</p>]]>
          </description>
          <content:encoded><![CDATA[<h2>RGW internals</h2>
<p>Before diving into our load mitigation strategy, let&#x27;s introduce some RADOS Gateway (RGW) internals.
When an S3 request hits <code>objects.&lt;region&gt;.cloudscale.ch</code>, it travels through two major layers before hitting our Ceph Storage Cluster.</p>
<p><strong>Frontend:</strong>
The part of RGW that accepts client HTTPS requests, parses S3/Swift APIs, and places those requests into the internal queue of RGW. It behaves essentially like the &quot;public-facing door&quot;.</p>
<p><strong>Backend:</strong>
The part of RGW that executes the actual operations against Ceph, such as reading objects, writing data, listing bucket contents, and performing metadata lookups. Backends are the &quot;workers&quot; that do the heavy lifting.</p>
<p>We run multiple frontends and backends to handle high loads. When a client sends a request, the path looks like this:</p>
<pre><code>                   ┌─────────────────────┐
Client  ──https─►  │    RGW Frontend     │
                   │ (HTTP server layer) │
                   └──────────┬──────────┘
                              ▼
                   ┌─────────────────────┐
                   │    RGW Backend      │
                   │ (RADOS operations)  │
                   └──────────┬──────────┘
                              ▼
                   ┌─────────────────────┐
                   │        Ceph         │
                   │  (storage cluster)  │
                   └─────────────────────┘
</code></pre>
<h2>Load Handling Improvements</h2>
<p>New customers bring new load patterns, and one such pattern has caused our RGW backend processes to be strained. <a href="https://grafana.com/oss/loki/">Loki</a>, an increasingly popular log aggregation system, uses S3 for storage and for log queries. These queries use concurrency to speed up retrieval, showing up as sharp spikes on our frontends.</p>
<p>While our services have been tuned for increasing amounts of traffic over the years, these tunings are never final and new load patterns require additional traffic-engineering.
Very spiky peak loads could be handled by adding lots of backends, but that is neither economical nor sustainable: we want to target a certain base load with some room for peaks, but we should not provision our services for loads that are orders of magnitude larger than our average.</p>
<p>Maybe counter to intuition, we achieved better peak-load behavior by limiting the number of requests we accept in the backend, instead of adding more backends to handle increased load.</p>
<p>Instead of accepting whatever peaks we get, we can also try and flatten the traffic curve, which is what we do now, limiting the number of concurrent requests at every step:</p>
<ul>
<li>The backends have a certain limit of requests they accept.</li>
<li>Once the backends are at capacity, the frontend queues requests for them.</li>
<li>Once the frontend is at capacity, the kernel queues requests for it.</li>
<li>Once the kernel is at capacity, requests may fail.</li>
</ul>
<p>As these queues fill, we are informed by our monitoring. So in practice, only the first limit is ever reached. When this happens, some interesting properties come into play:</p>
<ul>
<li>We start returning <code>503 Slow Down</code> to the clients that cause the spikes: This is the standard HTTP code used by S3 to signal to clients to back off.</li>
<li>We start preferring certain requests over others: For now we give priority to read over write requests, but we might consider giving priority to clients with few concurrent requests over clients with many concurrent requests (for increased fairness).</li>
</ul>
<p>So far this has proven to be very successful in preventing spikes from overwhelming our infrastructure.</p>
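<p>The idea of limiting concurrency at one stage and queueing the overflow can be sketched in a few lines of Python. This is a simplified model and not how RGW implements it: the class <code>BoundedStage</code>, its parameters and the decision to reject once the queue is full are assumptions made for illustration, and request prioritization is left out.</p>
<pre><code class="language-python">from collections import deque


class BoundedStage:
    """One stage in the chain: runs up to max_active requests, queues up to max_queued."""

    def __init__(self, max_active, max_queued):
        self.max_active = max_active
        self.max_queued = max_queued
        self.active = 0
        self.queue = deque()

    def submit(self, request):
        """Return 'run', 'queued' or 'rejected' for a new request."""
        if self.max_active > self.active:
            self.active += 1
            return "run"
        if self.max_queued > len(self.queue):
            self.queue.append(request)
            return "queued"
        return "rejected"  # the caller would answer with 503 Slow Down

    def finish(self):
        """A running request completed; start the next queued one, if any."""
        self.active -= 1
        if self.queue:
            self.active += 1
            return self.queue.popleft()
        return None


stage = BoundedStage(max_active=2, max_queued=1)
results = [stage.submit(name) for name in ["a", "b", "c", "d"]]
started = stage.finish()
print(results, started)
</code></pre>
<p>With two slots and a queue of one, the fourth request of a burst is turned away while the third one starts as soon as a slot frees up. The spike is flattened instead of being passed on to the storage cluster.</p>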
<h2>Handling RGW slow-downs in rgw-metrics</h2>
<p>The rgw-metrics service periodically queries the RGW for usage information about every bucket. This data is used in the Control Panel in the Object Storage tab to display historical usage information and for the billing. For a deeper explanation, see our engineering blog <a href="https://www.cloudscale.ch/en/engineering-blog/2025/01/29/improving-metrics-collection-for-object-storage">Improving metrics collection for our Object Storage</a>.</p>
<p>Normally RGW runs smoothly and rgw-metrics receives reliable responses. This had always been the case, so the system was never designed to handle transient slow-down responses.</p>
<p>When RGW became busy under heavy production load, it became less responsive and returned <code>503 Slow Down</code> responses. You may recognize those from the section above. rgw-metrics interpreted these errors as hard failures. Unfortunately, this triggered alerts for the on-call team as the usage data collection service restarted repeatedly. This reduced the number of collected metrics, lowering the resolution of the usage curves.</p>
<p>We now treat <code>503 Slow Down</code> responses as temporary conditions: instead of failing immediately, the service backs off and retries after a longer interval. This allows the metrics service to recover automatically after high-load peaks and prevents waking on-call engineers in the middle of the night.</p>
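<p>A retry loop of this kind might look like the following Python sketch. It is not the actual rgw-metrics code: the exception <code>SlowDown</code>, the function <code>fetch_with_backoff</code> and the doubling of the delay are assumptions chosen for illustration.</p>
<pre><code class="language-python">import time


class SlowDown(Exception):
    """Raised by the (hypothetical) fetch function on a 503 Slow Down response."""


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() and retry with growing delays while it raises SlowDown."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except SlowDown:
            if attempt == max_attempts:
                raise  # still busy after all attempts: report a real failure
            sleep(delay)
            delay *= 2  # assumption: double the wait after every 503


# Demo with a fake RGW that is busy twice and then answers.
responses = [SlowDown(), SlowDown(), {"bucket": "demo", "bytes": 1024}]
waits = []


def fake_fetch():
    item = responses.pop(0)
    if isinstance(item, SlowDown):
        raise item
    return item


usage = fetch_with_backoff(fake_fetch, sleep=waits.append)
print(usage, waits)
</code></pre>
<p>Passing <code>sleep</code> as a parameter keeps the sketch testable without real waiting. Only a request that is still rejected after the last attempt surfaces as an error.</p>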
<p>The core lesson: Load-related optimizations don&#x27;t only affect customers, they also impact our internal systems. Our metrics service was an unintended casualty of a change that otherwise improved production stability.</p>
<h2>Outlook</h2>
<p>We continue to monitor the system closely and have extended our tests to include more concurrency and load-related scenarios, so that we can catch similar edge cases earlier and ensure a smooth experience for our users.</p>]]></content:encoded>
        </item>
        <item>
          <title><![CDATA[Generating Truly Sequential IDs in PostgreSQL
]]></title>
          <link>https://www.cloudscale.ch/en/engineering-blog/2025/10/09/generieren-von-garantiert-sequentiellen-ids-in-postgresql</link>
          <pubDate>Thu, 09 Oct 2025 00:00:00 GMT</pubDate>
          <guid isPermaLink="false">https://www.cloudscale.ch/en/engineering-blog/2025/10/09/generieren-von-garantiert-sequentiellen-ids-in-postgresql</guid>
          <description>
            <![CDATA[<p>Postgres provides <em>sequence generators</em> to assign unique values to primary key columns. They have very low overhead and are ideal for this purpose. But surprisingly, they do not actually guarantee that values are assigned in increasing order, which was required for us to make audit logs available in our <a href="https://www.cloudscale.ch/en/api/v1">public API</a>. In this post, I&#x27;ll explain under what circumstances generated IDs can appear in the database out of order and what solution we came up with to prevent this while keeping the impact on performance low.</p>]]>
          </description>
          <content:encoded><![CDATA[<link rel="preload" as="image" href="https://www.cloudscale.ch/img/generating-truly-sequential-ids-in-postgres-concurrent-transactions.png"/><h3>The Audit Log API</h3>
<p>We recently added an <a href="https://www.cloudscale.ch/en/api/v1#project-logs">endpoint</a> to our public API that allows retrieving audit log records for a specific project:</p>
<pre><code class="language-json">{
  &quot;next&quot;: &quot;https://api.cloudscale.ch/v1/project-logs?cursor=10225&quot;,
  &quot;poll_more&quot;: &quot;https://api.cloudscale.ch/v1/project-logs?cursor=10225&quot;,
  &quot;results&quot;: [
    {
      &quot;ip_address&quot;: &quot;185.98.122.110&quot;,
      &quot;action&quot;: &quot;volume_create&quot;,
      &quot;message&quot;: &quot;Volume &#x27;cache&#x27; has been created&quot;,
      &quot;timestamp&quot;: &quot;2025-09-03T11:15:19.391447Z&quot;,
      ...
    },
    ... // Up to 199 more records.
  ]
}
</code></pre>
<p>All actions performed in our control panel by logged-in users or using API tokens are recorded in the audit log. Because the audit log can contain an unbounded number of records, the API uses pagination to return the results.</p>
<h3>Polling in the Audit Log API</h3>
<p>The audit log API allows a client to poll for new records that were added since the last request. For this reason, it returns a URL in the response&#x27;s <code>poll_more</code> field. Requesting that URL will return all records (possibly spread across multiple pages) that have not yet been returned.</p>
<p>The URLs in the <code>poll_more</code> field have a <code>cursor</code> parameter (I&#x27;ll call those URLs <em>cursor URLs</em>). This parameter simply contains the <em>primary key</em> of the database row of the previous request&#x27;s last record. In our case, the primary key column is called <code>id</code>:</p>
<pre><code>dev=# \d log
                                 Table &quot;public.log&quot;
   Column   |           Type           | Nullable |             Default
------------+--------------------------+----------+----------------------------------
 id         | integer                  | not null | generated by default as identity
 ip_address | inet                     |          |
 action     | character varying(50)    | not null |
 timestamp  | timestamp with time zone | not null |
[some more fields omitted]
Indexes:
    &quot;log_pkey&quot; PRIMARY KEY, btree (id)
</code></pre>
<p>When the application receives a request for a cursor URL, it uses the <code>cursor</code> parameter to determine the set of records to return to the client.</p>
<p>The API guarantees that when polling for records using the <code>poll_more</code> URL, no records will be skipped or returned twice. This is easy to accomplish if the set of records have IDs that are <em>monotonically increasing</em> (foreshadowing): simply return all records with a primary key bigger than <code>cursor</code>, sorted by the primary key.</p>
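<p>As a rough Python sketch of that selection, using an in-memory list instead of a database: the function name <code>select_page</code> and the page size of 200 (inferred from the example response above) are assumptions.</p>
<pre><code class="language-python">PAGE_SIZE = 200  # assumption, inferred from the example response


def select_page(records, cursor, page_size=PAGE_SIZE):
    """Return one page of records with id greater than cursor, plus the new cursor."""
    newer = sorted((r for r in records if r["id"] > cursor), key=lambda r: r["id"])
    page = newer[:page_size]
    new_cursor = page[-1]["id"] if page else cursor
    return page, new_cursor


records = [{"id": i, "action": "volume_create"} for i in (3, 1, 2, 5, 4)]
page, cursor = select_page(records, cursor=2, page_size=2)
print([r["id"] for r in page], cursor)
</code></pre>
<p>The new cursor is the ID of the last returned record, so the next request continues exactly where this one stopped. This only works as long as no record with a smaller ID can show up later.</p>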
<h3>PostgreSQL Sequence Generators</h3>
<p>As you have seen in the previous section, our primary key column <code>id</code> is set up to be filled in by Postgres automatically. The important bit is the <code>generated by default as identity</code>, which is documented in <a href="https://www.postgresql.org/docs/17/sql-createtable.html#SQL-CREATETABLE-PARMS-GENERATED-IDENTITY">CREATE TABLE</a>. This uses an implicitly created <em>sequence generator</em>, which is the standard way in Postgres to automatically fill an <code>integer primary key</code> column. Sequence generators are documented separately under <a href="https://www.postgresql.org/docs/17/sql-createsequence.html">CREATE SEQUENCE</a>.</p>
<p>The implementation of Postgres&#x27; sequence generators needs to fulfill a few requirements:</p>
<ol>
<li>The generated values need to be unique (meaning it can&#x27;t e.g. use the current time, because due to granularity or system clock changes, it could return the same value twice).</li>
<li>The generated values have to fit in an <code>integer</code> or <code>bigint</code> column (meaning e.g. random UUIDs can&#x27;t be used).</li>
<li>It needs to be fast (meaning it can&#x27;t search the existing data for an unused ID).</li>
<li>It needs to be thread-safe without using locks that last until the end of the transaction (meaning <code>select max(id) + 1 from ...</code> can&#x27;t be used).</li>
</ol>
<p>To accomplish this, Postgres basically uses a global integer variable per sequence generator, accessed using a short-lived lock. Each database process can increment the corresponding variable and use the new value (the <code>cache</code> parameter of <code>create sequence</code> can be used to change this to a more efficient but also more complicated scheme).</p>
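<p>Conceptually, this resembles the following Python sketch. It is a model of the idea and not of the Postgres implementation; the class <code>SequenceGenerator</code> is hypothetical and uses threads where Postgres uses processes.</p>
<pre><code class="language-python">import threading


class SequenceGenerator:
    """A counter protected by a lock that is held only while incrementing."""

    def __init__(self, start=1):
        self._next = start
        self._lock = threading.Lock()

    def nextval(self):
        with self._lock:  # short-lived: released before the caller continues
            value = self._next
            self._next += 1
        return value


seq = SequenceGenerator()
values = []


def worker():
    for _ in range(1000):
        values.append(seq.nextval())  # list.append is thread-safe in CPython


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(values), len(set(values)))
</code></pre>
<p>Every caller gets a unique value, and the lock is released long before the caller finishes its own work. Nothing in this scheme ties the order in which values are handed out to the order in which callers complete.</p>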
<p>This implementation fulfills all the requirements listed above. There&#x27;s a fifth requirement that the implementation appears to fulfill, but in reality doesn&#x27;t: That generated IDs appear in increasing order in the database. Look at the following example of two concurrently processed transactions:</p>
<img src="https://static.cloudscale.ch/img/generating-truly-sequential-ids-in-postgres-concurrent-transactions-f7211ce97d78.png" alt="A UML sequence diagram with three lifelines. Two transactions &quot;T1&quot; and &quot;T2&quot;, and a sequence generator &quot;S&quot;. Messages are exchanged in this sequence: 1. T1 request a new sequence value from S. 2. S returns &quot;4&quot; to T1. 3. T2 request a new sequence value from S. 4. S returns &quot;5&quot; to T2. 5. An external actor requests T2 to commit and T2 terminates. 6. A bold dashed line indicates the point in time. 7. An external actor requests T1 to commit and T1 terminates."/>
<p>In the example, two transactions request a value from the sequence generator and are handed out consecutive values (which are then used to set the <code>id</code> column of rows inserted into our audit log record table). Both transactions commit, but in the reverse of the order in which they requested a value from the sequence generator.</p>
<p>At the time indicated by the bold dashed line, the transaction that used ID &quot;5&quot; has already committed, and its result is visible to other transactions, but the transaction that used ID &quot;4&quot; has not yet committed. After that transaction has also committed, both rows will be visible.</p>
<p>If a client requested the most recent audit logs at the point in time of the bold dashed line, the result&#x27;s <code>poll_more</code> URL would have <code>cursor</code> set to &quot;5&quot;, because that&#x27;s the largest value in the <code>id</code> column. Even after the transaction that used ID &quot;4&quot; later committed, that row would never be returned to the client. This would violate the guarantee that no rows are skipped when polling for records using the <code>poll_more</code> URLs.</p>
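<p>The scenario can be replayed with a small Python simulation. The list <code>visible</code> stands in for the committed rows of the table, and the previous cursor value of 3 is an assumption for the example.</p>
<pre><code class="language-python">visible = []  # IDs of rows whose transaction has committed


def poll(cursor):
    """Return committed IDs greater than cursor and the new cursor."""
    rows = sorted(row_id for row_id in visible if row_id > cursor)
    return rows, (rows[-1] if rows else cursor)


# T1 draws ID 4, T2 draws ID 5, but T2 commits first.
visible.append(5)
first, cursor = poll(cursor=3)   # the client polls at the dashed line
visible.append(4)                # T1 commits afterwards
second, cursor = poll(cursor)

print(first, second)
</code></pre>
<p>The first poll returns ID 5 and moves the cursor to 5. The second poll returns nothing, although the row with ID 4 is now committed: it has been skipped for good.</p>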
<h3>Implementing Truly Sequential IDs</h3>
<p>So we went to look for a solution that matched the following criteria:</p>
<ul>
<li>
<p>Generate sequential IDs.<br/>Audit log records need to be assigned IDs in the order in which the transactions are committed, even if transactions are running in parallel. This is required by how we use cursor URLs to allow a client to poll for new records.</p>
</li>
<li>
<p>Assign the sequential IDs within the transaction.<br/>Delaying assigning the sequential IDs to after committing the transaction would prevent requests immediately following the transaction from returning the new audit log records.</p>
</li>
<li>
<p>Low contention between transactions.<br/>Some of our transactions can run for up to a few hundred milliseconds. This is because we have to talk to systems external to our control panel application within some transactions. Acquiring locks for the whole duration of the transaction would prevent processing these transactions in parallel, resulting in bad performance.</p>
</li>
</ul>
<p>After looking at a lot of different approaches, we came up with the following solution. The implementation has 4 parts:</p>
<ul>
<li>An additional column <code>sequence_id</code> added to our log record table <code>log</code>.</li>
<li>A sequence generator used to generate values for <code>sequence_id</code>.</li>
<li>A table <code>log_sequence_id_seq_lock</code>, which is only used for locking and does not store data.</li>
<li>A deferred trigger on <code>log</code> that populates <code>log.sequence_id</code>.</li>
</ul>
<p>The implementation lives fully in the database schema. No code in our application that writes audit log records needed to be modified. This is the full implementation (explanations below):</p>
<pre><code class="language-sql">-- Add new column used to store the sequential IDs.
alter table log add column sequence_id integer unique;

-- Sequence generator used to generate values for log.sequence_id.
create sequence log_sequence_id_seq;

-- Table that is only used for locking by
-- log_sequence_id_trigger_fn().
create table log_sequence_id_seq_lock();

-- Trigger function that assigns values to log.sequence_id.
create function log_sequence_id_trigger_fn()
    returns trigger
    language plpgsql
as $$ declare
begin
    -- Prevent the function from running in parallel.
    lock table log_sequence_id_seq_lock in exclusive mode;

    -- Populate sequence_id of the newly inserted row.
    update log l
    set sequence_id = nextval(&#x27;log_sequence_id_seq&#x27;)
    where l.id = new.id;

    -- Insert already happened, so no need to return the row.
    return null;
end $$;

-- Trigger that runs log_sequence_id_trigger_fn() at the
-- end of the transaction.
create constraint trigger log_sequence_id_trigger
    after insert on log
    initially deferred
    for each row
execute function log_sequence_id_trigger_fn();
</code></pre>
<p><code>log_sequence_id_trigger_fn()</code> is set up to run whenever a row is inserted into <code>log</code> using a trigger. The trigger is set to <em>deferred</em>, so the function will run at the end of the transaction, instead of immediately when the row is inserted. The function should complete very quickly, which is important because it acquires a lock that prevents the function from running in parallel.</p>
<p>The lock used by <code>log_sequence_id_trigger_fn()</code> is a relation-level lock on table <code>log_sequence_id_seq_lock</code>. That table is used only for locking; no data is stored in it. An alternative would have been to use Postgres&#x27; <a href="https://www.postgresql.org/docs/17/explicit-locking.html#ADVISORY-LOCKS">Advisory Locks</a>, but these are less convenient because they only allow integer values to be used as lock names. Using a table with a descriptive name seemed less obscure.</p>
<p>After acquiring the lock, <code>log_sequence_id_trigger_fn()</code> uses <a href="https://www.postgresql.org/docs/17/functions-sequence.html"><code>nextval()</code></a> to get a value from the sequence generator <code>log_sequence_id_seq</code> and update <code>sequence_id</code> of the inserted row with it. The lock is automatically released when the transaction finally commits.</p>
<p>This sequence generator works exactly the same way as one that gets implicitly created by Postgres when creating a <code>generated by default as identity</code> column. But because of the explicit locking around its usage, the problems mentioned above are prevented.</p>
<p>When processing API requests, the implementation can reliably select records added since the last request using the values of <code>sequence_id</code>.</p>
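<p>To see why this closes the gap, the earlier scenario can be replayed against a simplified Python model. The class <code>AuditLog</code> is hypothetical: its lock stands in for <code>log_sequence_id_seq_lock</code>, its counter for <code>log_sequence_id_seq</code>, and <code>commit()</code> combines the deferred trigger and the commit into one step, which matches the database only insofar as the lock is held until the transaction commits.</p>
<pre><code class="language-python">import threading


class AuditLog:
    """Rows get their sequence_id at commit time, one commit at a time."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for log_sequence_id_seq_lock
        self._next_sequence_id = 1     # stands in for log_sequence_id_seq
        self.visible = []              # committed rows as (sequence_id, id)

    def commit(self, row_id):
        with self._lock:
            sequence_id = self._next_sequence_id
            self._next_sequence_id += 1
            self.visible.append((sequence_id, row_id))

    def poll(self, cursor):
        """Return rows with sequence_id greater than cursor and the new cursor."""
        rows = sorted(r for r in self.visible if r[0] > cursor)
        return rows, (rows[-1][0] if rows else cursor)


log = AuditLog()
log.commit(5)                      # T2 commits first
first, cursor = log.poll(0)
log.commit(4)                      # T1 commits afterwards
second, cursor = log.poll(cursor)
print(first, second)
</code></pre>
<p>The row with <code>id</code> 5 receives <code>sequence_id</code> 1 and the row with <code>id</code> 4 receives <code>sequence_id</code> 2, so the second poll picks it up. The order of <code>sequence_id</code> follows the commit order, which is what the cursor needs.</p>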
<h3>Conclusion</h3>
<p>We&#x27;ve seen how Postgres&#x27; <em>sequence generators</em>, which work well in most circumstances, especially when high transaction throughput is necessary, can&#x27;t directly be used to guarantee sequential IDs on inserted rows. Nonetheless, Postgres has all the tools necessary to implement a solution that provides this guarantee without sacrificing too much in terms of performance or complexity.</p>]]></content:encoded>
        </item>
        <item>
          <title><![CDATA[Digital Sovereignty: Why We Run Our Own Stack
]]></title>
          <link>https://www.cloudscale.ch/en/engineering-blog/2025/09/17/digital-sovereignty</link>
          <pubDate>Wed, 17 Sep 2025 00:00:00 GMT</pubDate>
          <guid isPermaLink="false">https://www.cloudscale.ch/en/engineering-blog/2025/09/17/digital-sovereignty</guid>
          <description>
            <![CDATA[<p>At cloudscale, we like to exert as much control over our infrastructure as possible. In this blog post, I share some insights about our internal tooling and infrastructure, the risks of relying on U.S. vendors, and what all of that means regarding digital sovereignty.</p>]]>
          </description>
          <content:encoded><![CDATA[<p>In a universe dominated by a few U.S. tech giants, it&#x27;s not always easy maintaining decent control over your digital assets.</p>
<p>Unless you&#x27;re willing to take rather extreme approaches, your choice of mobile phone operating system is one between Apple and Google.
The cloud service market is dominated by Amazon, Microsoft, and Google.
To fulfill your SaaS groupware needs, feel free to choose between Microsoft and Google.</p>
<h3>U.S. Big Tech</h3>
<p>While there are few determined souls not using any services of the &quot;Big Five&quot; (Amazon, Apple, Meta, Microsoft, and Google), most of us will happily rely on some of their free offerings for personal use.
As for satisfying your business needs regarding groupware or cloud services, no one in IT procurement has ever been fired for choosing &quot;industry standards&quot; like AWS, GCP, or Azure.</p>
<p>Entrusting the large U.S. players with your own or your customers&#x27; data comes with significant downsides and risks though.</p>
<h4>Marketing ahoy</h4>
<p>Social media platforms are a lovely way to stay in touch with people from around the world. Seeing Marco on his latest kite surfing adventure in Egypt, or Rachel relaxing on her beach vacation in Greece, is heart-warming and allows me to stay in touch with friends I don&#x27;t interact with very often. But keep in mind: <a href="https://www.forbes.com/sites/marketshare/2012/03/05/if-youre-not-paying-for-it-you-become-the-product/">If You&#x27;re Not Paying For It, You Become The Product</a>.</p>
<h4>Vendor lock-in</h4>
<p>While migrating your services to any of the mentioned vendors is made easy by a plethora of migration tools, you&#x27;ll often find yourself in quite some pain trying to migrate away.
In combination with purposefully obscure cost and subscription models, these kinds of relationships may often leave you feeling somewhat frustrated and powerless.</p>
<h4>U.S. Cloud Act</h4>
<p>After Microsoft refused to hand over data related to an FBI investigation - arguing the emails in question were located in an Ireland data center and therefore outside of U.S. jurisdiction - Congress passed the <a href="https://en.wikipedia.org/wiki/CLOUD_Act">Cloud Act</a> in 2018, which:</p>
<ul>
<li>Expands the reach of U.S. law enforcement over data held by U.S. providers abroad.</li>
<li>Facilitates foreign law enforcement access to data from U.S. providers.</li>
</ul>
<h4>U.S. jurisdiction extended</h4>
<p>With the Cloud Act extending U.S. jurisdiction to any server operated by a U.S. company, if you entrust them with your data, you&#x27;ll find it to be under the control of the U.S. government, no matter where it&#x27;s physically stored.</p>
<p>In a recent example, after the Trump administration passed sanctions against the chief prosecutor of the International Criminal Court (ICC), Mr. Khan had <a href="https://apnews.com/article/icc-trump-sanctions-karim-khan-court-a4b4c02751ab84c09718b1b95cbd5db3">his email account suspended</a> by Microsoft, forcing him to resort to Switzerland-based <a href="https://proton.me/mail">ProtonMail</a>.</p>
<p>As for hosting your data in a location of your choice, in sworn testimony before a French Senate inquiry, <a href="https://www.forbes.com/sites/emmawoollacott/2025/07/22/microsoft-cant-keep-eu-data-safe-from-us-authorities/">Microsoft France&#x27;s director of public and legal affairs was asked whether he could guarantee that French citizen data would never be transmitted to U.S. authorities without explicit French authorization. And, he replied, &quot;No, I cannot guarantee it.&quot;</a></p>
<p>The same applies to any digital service offering by U.S. tech firms.</p>
<p>While entrusting U.S. tech companies with personal information is a choice everyone has to make for themselves, doing so with customer data is not only negligent, it is <a href="https://wire.com/en/blog/cloud-act-eu-data-sovereignty">likely to be against the law</a>.</p>
<h3>Open Source to the Rescue</h3>
<p>Considering the above, combined with our desire to maintain as much control over our infrastructure as possible, cloudscale has made a decision to rely on external service providers as little as possible. We design, build and operate our own services, on our own hardware, as much as we can.</p>
<h4>FOSS FTW!</h4>
<p>While <a href="https://en.wikipedia.org/wiki/Free_and_open-source_software">Free and open source software (FOSS)</a> can sometimes feel a bit chaotic, it promises to best fulfill specific use cases and is open to contributions by anyone, ensuring maximum amounts of transparency, trustworthiness, and adaptability.
It allows us to customize and extend functionality based on our needs, and we strive to contribute our improvements back upstream for the greater good of other parties relying on these projects.</p>
<p>Because if many people and organizations share a common technological need - like running a web server for example - the most reliable, transparent and customizable solution will often end up being an open source project, where everyone collaborates to achieve the best possible result.</p>
<p>Microsoft spent hundreds of millions trying to establish their proprietary, commercial web server <a href="https://iis.net">Internet Information Server (IIS)</a> as the tool of choice, only to get pretty much <a href="https://w3techs.com/technologies/history_overview/web_server/ms/y">wiped off the face of the internet</a> by open source solutions.</p>
<h4>FOSS @cloudscale</h4>
<p>The vast majority of our customer-facing services are built from open source components managed by us, tailored to our specific needs.
Our operating system of choice is Linux, with <a href="https://debian.org">Debian</a> being our favorite distribution.
Front-facing web services are handled by <a href="https://haproxy.org">HAproxy</a> and <a href="https://nginx.org">nginx</a>.
We use <a href="https://ceph.io">Ceph</a> for storage, and <a href="https://openstack.org">OpenStack</a> for our cloud service offering.</p>
<h3>Running Our Own Stack</h3>
<p>While we do use some commercial software like an on-prem <a href="https://gitlab.com">GitLab</a> instance, we also try to rely on free open source components for our daily work wherever reasonably possible.</p>
<p>Centralized authentication of our employees happens through an <a href="https://openldap.org">OpenLDAP</a> directory, with the credentials required to operate cloudscale infrastructure stored in an internal <a href="https://github.com/hashicorp/vault">Vault</a>-based setup.
Our monitoring is based on <a href="https://zabbix.com">Zabbix</a>, and we rely on <a href="https://grafana.com/oss/loki/">Loki</a> and <a href="https://grafana.com/oss/grafana/">Grafana</a> for log aggregation and analysis.
We use <a href="https://netboxlabs.com/docs/netbox/">Netbox</a> to manage our hardware inventory, and our database needs are fulfilled by <a href="https://postgresql.org">PostgreSQL</a> and <a href="https://mariadb.com">MariaDB</a> clusters.
For our custom-tailored mail setup, we use components like <a href="https://postfix.org">Postfix</a>, <a href="https://dovecot.org">Dovecot</a> and <a href="https://rspamd.com">Rspamd</a>.</p>
<p>Combined with our high standards regarding security, reliability and maintainability, the design, implementation and operation of these services takes significant time and effort. We like to call the extra work required to take a technical implementation from fulfilling the basic requirements to something everyone in the team can fully agree to and be proud of &quot;vergolden&quot; (gold plating).</p>
<p>And while doing so not only for customer-facing components, but applying the same rigorous standards to our internal service landscape - used by a fairly small number of employees - might seem extreme, we feel like the benefits outweigh the drawbacks, since this level of attention to detail usually guarantees we won&#x27;t have to worry about a certain component for a long time.</p>
<h4>Ownership and Responsibility</h4>
<p>As a cloud service provider, self-hosting is the only approach giving us full control over our own and our customers&#x27; digital assets.</p>
<p>The reasoning behind this approach, and its benefits, are:</p>
<h4>Security and Privacy</h4>
<ul>
<li>We can mitigate any threats as soon as we become aware.</li>
<li>Our internal communications never pass through third-party systems.</li>
<li>Our employee records aren&#x27;t sitting in some overseas data center.</li>
<li>Our customers can rest assured their information is only processed by systems managed by us.</li>
</ul>
<h4>Transparency</h4>
<ul>
<li>Relying on open source software allows us to review any code we use.</li>
<li>We know exactly what’s running on our systems, there are no black boxes or mystery outages.</li>
<li>If something misbehaves or breaks, we can analyze the anomaly and learn from it, preventing similar kinds of incidents from happening in the future.</li>
</ul>
<h4>Adaptability</h4>
<ul>
<li>We can customize and optimize any component to our specific needs, whether it&#x27;s adapting it to our specific use case, or tweaking it for maximum performance.</li>
<li>We’re not limited by someone else&#x27;s roadmap.</li>
</ul>
<h4>Resilience and Independence</h4>
<ul>
<li>When you rely on SaaS, you’ll find yourself at the mercy of someone else’s uptime, policy changes, or business decisions.</li>
<li>By building our own infrastructure, we gain a higher degree of operational independence.</li>
<li>We&#x27;re not tied to the fate of a service or company that might get acquired, discontinued, or priced out of reach.</li>
</ul>
<h4>In for the long game</h4>
<ul>
<li>While paying an external company for a service subscription might be attractive short-term, the cost benefits tend to diminish over time.</li>
<li>Once we have our tools set up and working the way we want, only a little maintenance is required over the course of the next few years.</li>
<li>While the initial effort required might be high, our total cost of ownership decreases over time.</li>
</ul>
<h4>Skill Development</h4>
<ul>
<li>Managing our own infrastructure pushes us to develop deep technical skills. We stay sharp. We learn. We grow.</li>
<li>That experience directly translates into better services for our customers, as well as a stronger internal culture of capability, confidence and ownership.</li>
</ul>
<h4>Not Always Easy - But Worth It</h4>
<p>Self-hosting takes time, skills, and a willingness to put in the required work. But the return on investment — in control, security, and integrity — is priceless.</p>
<p>For us, it’s not about rejecting SaaS entirely, but about being intentional. Wherever control, customization or independence matters, we choose to own the stack. Where commodification makes sense, we might consider outsourcing.</p>
<h3>Digital Sovereignty</h3>
<p>Digital sovereignty means being in control of your digital destiny - your IT infrastructure, data, and operations. It ensures your authority over how your data is stored, who can access it, and how your systems are run.</p>
<h4>Key Aspects</h4>
<ul>
<li>Digital Ownership: Full control over where data is stored, who accesses it, and how it’s processed.</li>
<li>Software Autonomy: Freedom to debug, modify, and adapt software to your needs without vendor-imposed limitations.</li>
<li>Infrastructure Independence: Ability to host your services wherever you feel comfortable.</li>
</ul>
<h3>Final Thoughts</h3>
<p>In a world dominated by a few large players, digital sovereignty might seem like a radical act. For us, it&#x27;s merely a conscious commitment to ownership, responsibility, and ultimately freedom.</p>
<p>And that&#x27;s a choice we’ll keep making.</p>]]></content:encoded>
        </item>
        <item>
          <title><![CDATA[The Magic of uv within IntelliJ IDEA
]]></title>
          <link>https://www.cloudscale.ch/en/engineering-blog/2025/08/22/uv-mit-intellij-verwenden</link>
          <pubDate>Fri, 22 Aug 2025 00:00:00 GMT</pubDate>
          <guid isPermaLink="false">https://www.cloudscale.ch/en/engineering-blog/2025/08/22/uv-mit-intellij-verwenden</guid>
          <description>
            <![CDATA[<p>Learn how to effectively use <code>uv</code> package manager within IntelliJ IDEA for Python projects, with practical setup steps and workarounds for current limitations.</p>]]>
          </description>
          <content:encoded><![CDATA[<p>Recently, I took the time to delve into the <a href="https://docs.astral.sh/uv/">uv</a> package and project manager,
which you&#x27;ve likely heard of if you&#x27;re involved in the Python community in any way. <code>uv</code> is maintained by the
same company as the <a href="https://docs.astral.sh/ruff/">Ruff</a> linter and code formatter. The command-line interface (CLI)
felt fast and intuitive. But of course, I wanted to use IntelliJ IDEA for my project and
at the same time take advantage of <code>uv</code>&#x27;s features. This process was much less straightforward than I had expected,
since while there is <code>uv</code> support in IntelliJ IDEA, it&#x27;s currently not as well integrated as it could be.</p>
<p>That&#x27;s why I have decided to write a blog post about how I currently set up my <code>uv</code>-based projects,
primarily for my future self as a reference - but I guess it could also be helpful to others even if you
have no prior experience with <code>uv</code>. In that sense, this blog post is opinionated and just reflects my
current approach.</p>
<p>It&#x27;s important to note that this is a snapshot; everything might change with future versions.</p>
<h3>The Goal and Intro to this Guide</h3>
<p>For me, these are the minimal requirements for a good development environment:</p>
<ul>
<li>No, or at least almost no, typing of commands into the terminal. Instead, I want to rely on Run Configurations
and similar first-class IDE features.</li>
<li>An exception to the above point is the initialization of a project, where I actually prefer the CLI.</li>
<li>Imports must be correctly resolved within the IDE</li>
<li>Debugger must work for the production as well as test code</li>
<li>IDE is aware of the Python dependencies installed in the Project</li>
</ul>
<p>If you want to follow along, please prepare the following:</p>
<ul>
<li>A <code>uv</code> binary in your <code>PATH</code> (see <a href="https://docs.astral.sh/uv/getting-started/installation/">Installing uv</a>)</li>
<li>An IntelliJ IDEA installation with the following plugins installed: Python and Python Community Edition. PyCharm should work as well, but I did not test it.</li>
<li>An installed Python interpreter is optional; we&#x27;ll use <code>uv</code> to set one up.</li>
</ul>
<p>As of this writing the uv version I am using is <code>0.8.0</code>, together with IntelliJ IDEA <code>2025.2</code> and version
<code>252.23892.458</code> of the two Python plugins.</p>
<h3>Setting Up a Project with <code>uv</code></h3>
<p>In this section, we will set up a project with <code>uv</code> and add some code. The goal is to see <code>uv</code> and some
of its highlights in action.</p>
<p>Let&#x27;s go ahead and create a new project. I pass the <code>--package</code> flag to <code>uv init</code> since I prefer to
use <a href="https://packaging.python.org/en/latest/discussions/src-layout-vs-flat-layout/">src-layout</a>.</p>
<pre><code class="language-bash">mkdir my-hello-project
cd my-hello-project
uv init --package
</code></pre>
<p>The working tree now looks like this:</p>
<pre><code class="language-text">.
|-- .git [truncated]
|-- .gitignore
|-- .python-version
|-- pyproject.toml
|-- README.md
`-- src
    `-- my_hello_project
        `-- __init__.py
</code></pre>
<p>Since this is pretty boring, let&#x27;s add some code to the project, including two dependencies:</p>
<pre><code class="language-bash"># Empty the __init__.py file
cat &lt;&lt; EOF &gt;  src/my_hello_project/__init__.py
EOF

# Add some non-trivial code that requires a dependency
cat &lt;&lt; EOF &gt; src/my_hello_project/hello.py
import cowsay
import sys
import random


def run_hello() -&gt; None:
    characters = [&quot;cow&quot;, &quot;tux&quot;]
    random_char = random.choice(characters)
    python_version, *_ = sys.version.split()
    print(cowsay.get_output_string(random_char, f&quot;Running on {python_version}&quot;))
EOF

# Make my_hello_project a module
cat &lt;&lt; EOF &gt; src/my_hello_project/__main__.py
from my_hello_project.hello import run_hello

if __name__ == &quot;__main__&quot;:
    run_hello()
EOF

# Create tests
mkdir tests
cat &lt;&lt; EOF &gt; tests/test_hello.py
import sys

import pytest

from my_hello_project.hello import run_hello


def test_hello(capsys: pytest.CaptureFixture) -&gt; None:
    # act
    run_hello()

    # assert
    major, minor, micro, *_ = sys.version_info
    expected = f&quot;{major}.{minor}.{micro}&quot;
    captured = capsys.readouterr()
    assert expected in captured.out
EOF
</code></pre>
<p>Now we use <code>uv</code> to add the two dependencies to <code>pyproject.toml</code>:</p>
<pre><code class="language-bash">uv add cowsay
uv add --dev pytest
</code></pre>
<p>And now we are ready for some action. Let&#x27;s run the tests:</p>
<pre><code class="language-bash">uv run -m pytest      # or -m my_hello_project to run the application
</code></pre>
<p>The cool thing is that this single command did multiple things:</p>
<ul>
<li>A Python interpreter was downloaded if none was found in your <code>PATH</code>.</li>
<li>The <code>uv.lock</code> file that pins the exact versions of all dependencies was created.</li>
<li>A virtual env was created in <code>.venv</code>.</li>
<li>The dependencies were installed.</li>
<li>An editable install of <code>my_hello_project</code> was created (see <code>uv pip freeze</code>).</li>
<li>The tests were run.</li>
</ul>
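<p>To check some of these effects from inside Python, a small script can report which interpreter is running and whether a virtual env is active. This is only a minimal sketch: the function name <code>describe_interpreter</code> is made up for illustration, and the lock file check assumes the script is started from the project root.</p>

```python
import sys
from pathlib import Path


def describe_interpreter() -> dict:
    """Collect a few facts about the interpreter that is currently running."""
    return {
        # Path of the Python binary; under `uv run` this should point into .venv
        "executable": sys.executable,
        # sys.prefix differs from sys.base_prefix when a virtual env is active
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # Same version extraction as in hello.py
        "version": sys.version.split()[0],
        # Assumption: the current working directory is the project root
        "lock_file_present": Path("uv.lock").is_file(),
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

<p>Saved under a hypothetical name such as <code>check_env.py</code>, it could be started with <code>uv run check_env.py</code> to compare the reported values with the list above.</p>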
<h3>Running the Project from the IDE</h3>
<p>Now let&#x27;s get our project running smoothly from within the IDE!
Here are my step-by-step instructions on how to run the project from the IDE:</p>
<p>Important: Do not confuse <code>my-hello-project</code> (the name of the project) and <code>my_hello_project</code> (the name of the package).</p>
<ol>
<li>Open the main <code>my-hello-project</code> directory from within IntelliJ IDEA.</li>
<li>Set up a Python SDK (Screenshot 1):
<ul>
<li>Navigate to File &gt; Project Structure &gt; Project Settings &gt; SDKs.</li>
<li>Click the <code>+</code> button and select <code>Add Python SDK from Disk</code>.</li>
<li>In the dialog, select the following values:
<ul>
<li>Environment: Select Existing</li>
<li>Type: uv</li>
<li>Path to uv: <code>/path/to/your/uv</code></li>
<li>Uv venv use: <code>/path/to/python3/in/your/.venv/directory</code></li>
</ul>
</li>
<li>Remember the name of the SDK you have just created.</li>
</ul>
</li>
<li>Set the SDK for the project (Screenshot 2):
<ul>
<li>Navigate to File &gt; Project Structure &gt; Project Settings &gt; Project</li>
<li>Select the SDK you have just created in the SDK dropdown.</li>
</ul>
</li>
<li>Set the SDK for the module (Screenshot 3):
<ul>
<li>Navigate to File &gt; Project Structure &gt; Modules &gt; my_hello_project &gt; Dependencies</li>
<li>Select the SDK you have just created in the Dependency dropdown.</li>
</ul>
</li>
<li>Create a new Run Configuration (Screenshots 4 and 5):
<ul>
<li>Navigate to Run &gt; Edit Configurations</li>
<li>Click the <code>+</code> button and select <code>Python</code></li>
<li>Ensure the following values are set:
<ul>
<li><code>Use SDK of Module</code>: <code>my-hello-project</code></li>
<li><code>module</code>: <code>my_hello_project</code></li>
</ul>
</li>
<li>Add a &quot;Before Launch&quot; task:
<ul>
<li>Click &quot;Modify Options&quot;</li>
<li>Choose &quot;Add before launch task&quot;</li>
<li>Choose &quot;Run External tool&quot; and click <code>+</code>:</li>
<li>Choose the following values and close the dialog:
<ul>
<li>Name: <code>uv sync</code></li>
<li>Program: <code>uv</code></li>
<li>Arguments: <code>sync</code></li>
<li>Working directory: <code>$ProjectFileDir$</code></li>
</ul>
</li>
<li>Caveat: If your &quot;Before Launch&quot; task is not shown in the Run Configuration,
make sure to not only set the checkbox but also select the <code>uv sync</code> entry before clicking <code>OK</code>.</li>
</ul>
</li>
</ul>
</li>
</ol>
<div style="display:grid;grid-template-columns:repeat(3, 1fr);grid-template-rows:repeat(2, min-content);width:100%"><img width="200" src="https://static.cloudscale.ch/img/uv-add-sdk-edd0e2b11db9.png" alt="uv-add-sdk.png" caption="Screenshot 1: Adding the Python Interpreter"/><img width="200" src="https://static.cloudscale.ch/img/uv-project-sdk-d5584f5677a0.png" alt="uv-project-sdk.png" caption="Screenshot 2: Set the SDK for the project"/><img width="200" src="https://static.cloudscale.ch/img/uv-module-sdk-f96d1d27282a.png" alt="uv-module-sdk.png" caption="Screenshot 3: Set the SDK for the module"/><img width="200" src="https://static.cloudscale.ch/img/uv-external-tool-7cd8b8b50024.png" alt="uv-external-tool.png" caption="Screenshot 4: The External Tool Configuration"/><img width="200" src="https://static.cloudscale.ch/img/uv-run-config-5fb88a072b9e.png" alt="uv-run-config.png" caption="Screenshot 5: The Completed Run Configuration"/><img width="200" src="https://static.cloudscale.ch/img/uv-test-config-66f9ba6fc55d.png" alt="uv-test-config.png" caption="Screenshot 6: The Test Run Configuration"/></div>
<p>Now you can use the Run Configuration to run the project as expected. You should see some
ASCII art and the Python version printed in the Run Output.</p>
<pre><code class="language-text">  _________________
| Running on 3.13.5 |
  =================
                 \
                  \
                    ^__^
                    (oo)\_______
                    (__)\       )\/\
                        ||----w |
                        ||     ||
</code></pre>
<p>You are probably wondering what the &quot;Before Launch&quot; task is for. <code>uv sync</code> ensures that the
virtual environment is up to date and uses a Python version that matches the specifier in
<code>.python-version</code>. Let&#x27;s see that in action by switching to a different Python version (a pre-release
in this case):</p>
<pre><code class="language-bash">echo &#x27;3.14.0b4&#x27; &gt; .python-version
</code></pre>
<p>Run the program again through the Run Configuration, and you are now on the pre-release version:</p>
<pre><code class="language-text">  ___________________
| Running on 3.14.0b4 |
  ===================
                        \
                         \
                          \
                           .--.
                          |o_o |
                          |:_/ |
                         //   \ \
                        (|     | )
                       /&#x27;\_   _/`\
                       \___)=(___/
</code></pre>
<p><code>uv</code> has successfully upgraded the Python version used in the virtual env, even without us
thinking a single second about any CLI commands. Isn&#x27;t that great? :-)</p>
<p>Setting up a Run Configuration for tests is straightforward: just right-click the <code>tests</code> folder
and select &quot;Run &#x27;Python tests in tests&#x27;&quot; and you are set (Screenshot 6). I&#x27;d suggest then also
adding a &quot;Before Launch&quot; task to run <code>uv sync</code> as described above. An even better approach is
to add the &quot;Before Launch&quot; task to the relevant Run/Debug Configuration Templates.</p>
<h3>Epilogue: The &quot;uv run&quot; Run Configuration and Setting up an Entrypoint Script</h3>
<p>You might have noticed that there&#x27;s also a Run Configuration called &quot;uv run.&quot;
So, why did I opt for the lower level &quot;Python&quot; type with a &quot;Before Launch&quot; task
to synchronize the virtual environment and execute the program?
I faced two challenges:</p>
<ul>
<li>I could only get it to work in &quot;Debug&quot; mode, but not in &quot;Run&quot; (see also <a href="https://youtrack.jetbrains.com/issue/PY-83207/uv-run-config-is-not-working-in-run-mode-but-in-debug-mode">this issue</a>).</li>
<li>When upgrading to Python <code>3.14.0b4</code> it stopped working altogether.</li>
</ul>
<p>But I guess it&#x27;s just a matter of time until these issues are resolved.</p>
<p>Finally, if your application is a CLI program, it&#x27;s a good practice to set up an entrypoint script.
This allows users to run your program with a simple command when installed on their system.
For example, if you want your program to be callable by the name <code>hello</code>, set up the following:</p>
<pre><code class="language-bash">cat &lt;&lt; EOF &gt;&gt; pyproject.toml
[project.scripts]
hello = &quot;my_hello_project.hello:run_hello&quot;
EOF
</code></pre>
<p>The program can now be run with <code>uv run hello</code>. Or if you build and install it:</p>
<pre><code class="language-bash">uv build
pipx install dist/*.whl
hello
</code></pre>]]></content:encoded>
        </item>
        <item>
          <title><![CDATA[From Retro Frustration to Tech Debt Clarity: How We Prioritize
]]></title>
          <link>https://www.cloudscale.ch/en/engineering-blog/2025/07/29/von_retro_frustration_zu_klarheit_über_technische_schulden_wie_wir_prioritäten_setzen</link>
          <pubDate>Tue, 29 Jul 2025 00:00:00 GMT</pubDate>
          <guid isPermaLink="false">https://www.cloudscale.ch/en/engineering-blog/2025/07/29/von_retro_frustration_zu_klarheit_über_technische_schulden_wie_wir_prioritäten_setzen</guid>
          <description>
            <![CDATA[<p>Want to learn how our development team tackled the common struggle of prioritizing tech debt and what it has to do with beers and our next vacations?</p>]]>
          </description>
          <content:encoded><![CDATA[<h3>The Problem: Everything Feels Important</h3>
<p>During our regular retrospectives, we frequently encounter various tech debt issues, such as
bringing older code up to our current coding standards and style,
overly complex code that could benefit from refactoring,
and libraries that require replacement or upgrading.</p>
<p>As engineers, our instinct is to want to fix everything – it&#x27;s in our nature. But with
limited time and competing product priorities, we needed a better way to decide what
actually deserved our attention first.</p>
<h3>The Solution: A 4x4 Matrix</h3>
<p>That&#x27;s when we decided to dedicate a focused workshop session to really understand our
tech debt. I&#x27;d recently come across an idea from Dave Farley&#x27;s &quot;Modern Software Engineering&quot;
YouTube channel in his video <a href="https://www.youtube.com/watch?v=OS6gzabM0pI">&quot;Why Software Estimations Are Always Wrong,&quot;</a>
and thought we could adapt it for our needs.</p>
<p>The concept is simple: a 4x4 matrix that frames tech debt prioritization
through two key questions:</p>
<ul>
<li><strong>Y-Axis (Value)</strong>: How much would you pay a colleague if they solved this issue?</li>
<li><strong>X-Axis (Cost)</strong>: How much would you want to get paid to solve this issue yourself?</li>
</ul>
<p>Rather than falling into traditional estimation methods, we embraced deliberately
imprecise but universally relatable categories: <strong>a beer, a vacation, a car, or a house</strong>.
These rough buckets allowed the team to make <strong>quick value-versus-cost comparisons</strong> without
getting caught in the false precision that typically derails prioritization discussions.</p>
<h3>Setting Up the Workshop</h3>
<p>We gathered the entire development team. I prepared a large 4x4 grid on our
virtual whiteboard, with &quot;beer&quot; to &quot;house&quot; marked on both axes. We initially
started with the same layout orientation shown in Dave Farley&#x27;s original video,
but after working with it for some time, we found we needed to change the approach.
We switched to positioning the origin (0,0) a.k.a. (Beer,Beer) at the bottom-left corner – the standard
Cartesian coordinate system we&#x27;re all familiar with. This conventional layout proved
much more intuitive when working with more than a few stickies.</p>
<p>Each team member wrote down tech debt items on sticky notes, and we placed them
on the matrix based on our collective assessment. The discussions that emerged
were incredibly valuable.</p>
<h3>The Results Spoke for Themselves</h3>
<p>What emerged was a clear visual representation of our priorities:</p>
<ul>
<li><strong>Top-left</strong>: High value, low cost – our obvious quick wins</li>
<li><strong>Bottom-right</strong>: Low value, high cost – items we should probably ignore</li>
<li><strong>Bottom-left</strong>: Low value, low cost – nice-to-haves</li>
<li><strong>Top-right</strong>: High value, high cost – strategic investments worth planning for</li>
</ul>
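<p>The quadrant reading above can be sketched in a few lines of Python. This is only an illustration, not the tooling we used in the workshop: the function name <code>quadrant</code> is made up, and treating the two cheaper buckets as &quot;low&quot; and the two more expensive ones as &quot;high&quot; is an assumption.</p>

```python
# The four buckets from the workshop, ordered from cheapest to most expensive
BUCKETS = ["beer", "vacation", "car", "house"]


def quadrant(value: str, cost: str) -> str:
    """Map the value and cost buckets of a sticky note to a quadrant."""
    # Assumption: "beer" and "vacation" count as low, "car" and "house" as high
    high_value = BUCKETS.index(value) >= 2
    high_cost = BUCKETS.index(cost) >= 2
    if high_value and not high_cost:
        return "quick win"
    if high_value and high_cost:
        return "strategic investment"
    if not high_value and not high_cost:
        return "nice-to-have"
    return "probably ignore"


if __name__ == "__main__":
    # Hypothetical sticky notes: (description, value bucket, cost bucket)
    notes = [
        ("upgrade an outdated library", "car", "beer"),
        ("rewrite a legacy module", "house", "house"),
        ("rename a few variables", "beer", "beer"),
    ]
    for description, value, cost in notes:
        print(f"{description}: {quadrant(value, cost)}")
```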
<img src="https://static.cloudscale.ch/img/engineering-blog-tech-debt-board-bf6eb3a37dcf.png" alt="Our template for the 4x4 matrix." caption="Our template for the 4x4 matrix. The color gradient aids in identifying issues with a good cost-to-value ratio."/>
<h3>Beyond Prioritization: Building Shared Understanding</h3>
<p>The workshop&#x27;s real success wasn&#x27;t just the prioritized items we created.
It was the shared understanding we developed as a team. We had <strong>honest
conversations</strong> about complexity and about the pain points we each face
daily.</p>
<p>As an additional advantage, the resulting matrix serves as an effective tool for
communicating with stakeholders, facilitating discussions about the
trade-offs between competing priorities.</p>
<h3>Lessons Learned</h3>
<p>Sometimes the simplest tools are the most powerful. A few sticky notes, a
virtual whiteboard, and some beer-to-house analogies gave us the clarity we&#x27;d been
missing for months.</p>
<p>There is just one final thing I need to note:
none of our developers actually owns or wants a car.
Maybe we should go with <strong>a beer, GA Travelcard, a vacation, or a house</strong>
next time ;-).</p>]]></content:encoded>
        </item>
        <item>
          <title><![CDATA[Insights from a Professional Third-Party Penetration Test
]]></title>
          <link>https://www.cloudscale.ch/en/engineering-blog/2025/05/21/was-uns-ein-externer-penetrationstest-gelehrt-hat</link>
          <pubDate>Wed, 21 May 2025 00:00:00 GMT</pubDate>
          <guid isPermaLink="false">https://www.cloudscale.ch/en/engineering-blog/2025/05/21/was-uns-ein-externer-penetrationstest-gelehrt-hat</guid>
          <description>
            <![CDATA[<p>At cloudscale, we take security very seriously. Our approach includes rigorous code and architecture reviews, extensive automated testing, and leveraging battle-tested open-source frameworks like <a href="https://www.djangoproject.com/">Django</a>. For a recent external penetration test in the context of our ISO certification, we engaged a specialized provider to examine our <a href="https://control.cloudscale.ch">Control Panel</a> and <a href="https://api.cloudscale.ch/">API</a> for potential security vulnerabilities.</p>]]>
          </description>
          <content:encoded><![CDATA[<p>We provided the penetration testers with access to the source code and several dedicated test user accounts, giving them as much access as possible to help them identify any weaknesses.
We are satisfied with the overall results of the test, which reflect the robustness of our services.
Still, there are always opportunities to learn and improve: the test identified areas where we could further harden our systems, and we addressed them promptly.</p>
<h3>XSS in the Login Form</h3>
<p>Let us start with the most critical finding.
After user authentication, our code executed <code>window.location = variable</code> to redirect users to a predefined page.
What we overlooked was that <code>window.location</code> accepts JavaScript URIs, creating an unexpected cross-site scripting vulnerability.
While we assumed the worst-case scenario would be users landing on unintended pages, attackers could actually exploit this by setting values like this:</p>
<pre><code class="language-javascript">window.location = &quot;javascript:alert(&#x27;xss&#x27;)&quot;
</code></pre>
<p>The penetration testers demonstrated how an attacker could craft a malicious URL, send it to a legitimate user and, if that user logged into our control panel using that link, execute code after authentication.
This experience serves as an important reminder that even seemingly innocuous code patterns can harbor significant security risks.</p>
<p>We fixed this vulnerability in less than 24 hours and conducted a review of our logs to ensure no signs of exploitation were present.</p>
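<p>One common way to guard against this class of bug is to validate a redirect target before it is used. The following Python sketch shows the idea under stated assumptions: it is not the fix we actually deployed, the function name <code>is_safe_redirect</code> is made up, and the host in <code>ALLOWED_HOSTS</code> is a placeholder.</p>

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; a real deployment would use its own host names
ALLOWED_HOSTS = {"control.example.com"}


def is_safe_redirect(target: str) -> bool:
    """Accept only same-site paths or http(s) URLs on an allowed host."""
    # Backslashes and control characters are handled inconsistently by browsers
    if "\\" in target or any(not ch.isprintable() for ch in target):
        return False
    try:
        parts = urlsplit(target)
    except ValueError:
        return False
    if not parts.scheme and not parts.netloc:
        # Relative reference: require exactly one leading slash
        return target.startswith("/") and not target.startswith("//")
    # Everything else must be http(s) on a known host; this rejects javascript:
    return parts.scheme in {"http", "https"} and parts.hostname in ALLOWED_HOSTS


if __name__ == "__main__":
    for candidate in ["/dashboard", "javascript:alert('xss')", "//evil.example/x"]:
        print(candidate, is_safe_redirect(candidate))
```

<p>The design choice here is an allow-list: instead of trying to enumerate dangerous schemes, only the few known-good forms are accepted.</p>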
<h3>Outdated Dependencies</h3>
<p>Providing the source code allowed the testers to examine our dependencies in detail, uncovering several vulnerabilities related to known CVEs with recent fixes. Fortunately, we use <a href="https://github.com/renovatebot/renovate">Renovate</a> to automatically keep our dependencies up to date. Our self-hosted Renovate bot continuously monitors for new CVEs and creates merge requests as soon as patches are available.</p>
<p>As a result, the vulnerabilities identified during the test were already addressed before the report reached our inbox.</p>
<h3>Proxy Headers</h3>
<p>Another feature that immediately caught the attention of the penetration testers was the <a href="https://www.cloudscale.ch/en/api/v1#custom-image-imports">Custom Image Import</a> functionality. Instructing our servers to download content from “any” location seemed suspicious to them, we suppose. They discovered that our forward proxy, through which our internal systems perform downloads, included <code>X-Forwarded-For</code> headers that contained internal host names. Additionally, the versions of the <code>requests</code> and similar libraries were revealed.</p>
<p>As a consequence, we have since disabled these headers.</p>
<h3>Staying Vigilant</h3>
<p>External penetration tests are invaluable for uncovering blind spots and reinforcing security practices. While we are satisfied with the results of this test, we know that maintaining security is an ongoing effort. That is why we have also added a <code>.well-known/security.txt</code> file to our website, making it easy for security researchers to find contact information for responsible disclosure. A big thank-you goes out to <a href="https://blog.hartwork.org/posts/companies-fail-to-serve-security-txt-rfc-9116/">Sebastian Pipping</a> for nudging us to implement this.</p>
<p>If you host a site or service, we highly recommend adding a <a href="https://securitytxt.org/">security.txt</a> file as well. It is a simple but effective way to facilitate vulnerability reporting.</p>
<p>Stay safe!</p>]]></content:encoded>
        </item>
        <item>
          <title><![CDATA[Correct Quoting of Non-Interactive SSH Commands
]]></title>
          <link>https://www.cloudscale.ch/de/engineering-blog/2025/05/05/korrektes-quoting-von-nicht-interaktiven-ssh-kommandos</link>
          <pubDate>Mon, 05 May 2025 00:00:00 GMT</pubDate>
          <guid isPermaLink="false">https://www.cloudscale.ch/de/engineering-blog/2025/05/05/korrektes-quoting-von-nicht-interaktiven-ssh-kommandos</guid>
          <description>
            <![CDATA[<p>I keep stumbling over insufficient quoting and escaping of commands that I want to run via SSH. In this article, I try to describe how exactly the problem arises and how it can be solved automatically with the help of features of modern shells.</p>]]>
          </description>
          <content:encoded><![CDATA[<h3>A Simple Task</h3>
<p>I use a typical, modern Unix shell and have a simple goal: using <a href="https://manpages.debian.org/bookworm/grep/grep.1.en.html"><code>grep</code></a>, I want to find all scripts under <code>/etc/init.d</code> that contain the string <code>case &quot;$1&quot; in</code>:</p>
<pre><code class="language-text">michi@myserver$ grep -RF &#x27;case &quot;$1&quot; in&#x27; /etc/init.d
/etc/init.d/udev:case &quot;$1&quot; in
/etc/init.d/nullmailer:case &quot;$1&quot; in
/etc/init.d/dbus:case &quot;$1&quot; in
/etc/init.d/haveged:case &quot;$1&quot; in
...
</code></pre>
<blockquote>
<p>I use <a href="https://www.gnu.org/software/bash/">Bash 5.2</a>. Another shell that is <a href="https://www.reddit.com/r/linux4noobs/comments/12wfzb8/best_shell_in_your_opinion_2023/">popular</a> among Linux users is the <a href="https://www.zsh.org">Z shell</a> (<code>zsh</code>). Everything described here also works in the Z shell. Where there are differences, I will point them out.</p>
</blockquote>
<p>That works well. Now I want to do the same thing in an automated way on several servers. <a href="https://manpages.debian.org/bookworm/openssh-client/ssh.1.en.html">OpenSSH</a> allows you to specify a command on the command line, after all other arguments. The command is executed on the server, then the connection is closed again:</p>
<pre><code class="language-text">$ ssh myserver uname -sr
Linux 5.4.0-165-generic
</code></pre>
<p>So I can do the following:</p>
<pre><code class="language-text">$ ssh myserver grep -RF &#x27;case &quot;$1&quot; in&#x27; /etc/init.d
grep: : No such file or directory
grep: in: No such file or directory
/etc/init.d/udev:    case &quot;$type&quot; in
/etc/init.d/udev:case &quot;$1&quot; in
</code></pre>
<p>Unfortunately, that does not work, or only halfway. It looks as if <code>grep</code> had been called with entirely different arguments than the ones I specified. To get to the bottom of this, I will use the following shell function. I add it at the beginning of my <code>~/.bashrc</code> (<code>~/.zshrc</code> in the Z shell), both locally and on one of the servers I want to connect to:</p>
<pre><code class="language-shell">show-args() {
    printf &quot;%s\n&quot; &quot;$@&quot; | nl -ba
}
</code></pre>
<p><code>show-args</code> lets me see which arguments would have been passed to a program, instead of executing the command. When I test this locally, I see the following:</p>
<pre><code class="language-text">$ show-args grep -RF &#x27;case &quot;$1&quot; in&#x27; /etc/init.d
     1	grep
     2	-RF
     3	case &quot;$1&quot; in
     4	/etc/init.d
</code></pre>
<p>So <code>grep</code> would be called with 3 arguments: the flags, the string to search for, and the directory path to search. Passing this command to <code>ssh</code> does not change that. <code>ssh</code> is called with <code>grep</code> as the command and the same 3 arguments:</p>
<pre><code class="language-text">$ show-args ssh myserver grep -RF &#x27;case &quot;$1&quot; in&#x27; /etc/init.d
     1	ssh
     2	myserver
     3	grep
     4	-RF
     5	case &quot;$1&quot; in
     6	/etc/init.d
</code></pre>
<p>So the problem must be on the other side. By swapping <code>show-args</code> and <code>ssh myserver</code> on the command line, we can check how <code>grep</code> is called on the server:</p>
<pre><code class="language-text">$ ssh myserver show-args grep -RF &#x27;case &quot;$1&quot; in&#x27; /etc/init.d
     1	grep
     2	-RF
     3	case
     4
     5	in
     6	/etc/init.d
</code></pre>
<p>The search string <code>case &quot;$1&quot; in</code> was thus split into 3 arguments, as if it had been written without quotes. What happened here?</p>
<h3>SSH and Non-Interactive Commands</h3>
<p>The <a href="https://manpages.debian.org/bookworm/openssh-client/ssh.1.en.html#DESCRIPTION">ssh man page</a> gives us a hint about the observed behavior:</p>
<blockquote>
<code>If a <em>command</em> is specified, it will be executed on the remote host instead of a login shell. A complete command line may be specified as <em>command</em>, or it may have additional arguments. If supplied, the arguments will be appended to the command, separated by spaces, before it is sent to the server to be executed.</code>
</blockquote>
<p>When <code>ssh</code> is called with a command to execute, the SSH client uses the <code>&quot;exec&quot;</code> request (<a href="https://www.rfc-editor.org/rfc/rfc4254#section-6.5">RFC 4254, section 6.5</a>). This request specifies that a single string is passed as the command. On the server, this string is then handed to the login shell and executed by it. For our example, this means:</p>
<ol>
<li><code>ssh</code> receives 5 arguments as the command: <code>show-args</code>, <code>grep</code>, <code>-RF</code>, <code>case &quot;$1&quot; in</code> and <code>/etc/init.d</code>. It is important to note here that the single quotes (<code>&#x27;&#x27;</code>) are not part of the 4th argument. They have already been removed by the locally running shell.</li>
<li><code>ssh</code> joins the arguments into one string: <code>show-args grep -RF case &quot;$1&quot; in /etc/init.d</code>. This string is sent to the server over the SSH connection.</li>
<li>The shell on the server parses this string and turns it into 7 parts: <code>show-args</code>, <code>grep</code>, <code>-RF</code>, <code>case</code>, <code>&quot;$1&quot;</code>, <code>in</code> and <code>/etc/init.d</code>.</li>
<li>Since the single quotes have been removed, <code>$1</code> is treated as a variable reference. Since this variable is not set, it is replaced by the empty string. The result is an argument of length 0.</li>
<li>The function <code>show-args</code> is called with the remaining 6 arguments.</li>
</ol>
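<p>The steps above can be reproduced locally without a server. The following is a minimal sketch, not how <code>ssh</code> is implemented: <code>remote_parse</code> is a hypothetical helper that joins its arguments with spaces (step 2) and lets a second <code>sh</code> process parse the resulting string (step 3). It prints the number of arguments the second shell ends up with.</p>

```shell
# Hypothetical helper, for illustration only: mimics the double parsing locally.
remote_parse() {
    # "$*" joins all arguments with single spaces (step 2).
    # The inner sh parses that string a second time (step 3), and
    # "set --" turns the resulting words into its positional parameters.
    # \$# is escaped so that the inner shell expands it, not the outer one.
    sh -c "set -- $*; echo \$#"
}

# Single quotes only: they are gone after the first parse.
remote_parse grep -RF 'case "$1" in' /etc/init.d

# A second layer of quotes survives the first parse.
remote_parse grep -RF "'case \"\$1\" in'" /etc/init.d
```

<p>The first call prints <code>6</code>, which includes the empty argument produced by the unset <code>$1</code>. The second call prints <code>4</code>.</p>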
<p>Part of the problem is that the typed command is parsed twice: once by the local shell and a second time by the shell on the server. The single quotes around <code>&#x27;case &quot;$1&quot; in&#x27;</code> are therefore not sufficient. One option is to wrap the argument in a second layer of quotes. In addition, all characters that have a special meaning for a Unix shell must be escaped with <code>\</code>:</p>
<pre><code class="language-text">$ ssh myserver show-args grep -RF &quot;&#x27;case \&quot;\$1\&quot; in&#x27;&quot; /etc/init.d
     1	grep
     2	-RF
     3	case &quot;$1&quot; in
     4	/etc/init.d
</code></pre>
<p>Now the command is split into the desired 4 parts when it is parsed on the server. And if we remove <code>show-args</code> from the command again, the command also works as intended:</p>
<pre><code class="language-text">$ ssh myserver grep -RF &quot;&#x27;case \&quot;\$1\&quot; in&#x27;&quot; /etc/init.d
/etc/init.d/udev:case &quot;$1&quot; in
/etc/init.d/nullmailer:case &quot;$1&quot; in
/etc/init.d/dbus:case &quot;$1&quot; in
/etc/init.d/haveged:case &quot;$1&quot; in
...
</code></pre>
<h3>An automated solution</h3>
<p>The quoting and escaping in the last example worked, but it can become tedious and hard to read.</p>
<p>Since I run into this situation fairly often, I built myself a simple tool that mitigates the problem somewhat: I use another shell function, which I call <code>q</code> (for &quot;quoting&quot;):</p>
<pre><code class="language-shell"># Bash
q() { echo &quot;${@@Q}&quot;; }
# Z shell
q() { echo &quot;${(q)@}&quot;; }
</code></pre>
<p>This function can again be placed in <code>~/.bashrc</code> or <code>~/.zshrc</code>, respectively.</p>
<p>The function uses a feature of the respective shell to quote strings automatically (<a href="https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html">Bash Manual, Shell Parameter Expansion</a> and <a href="https://zsh.sourceforge.io/Doc/Release/Expansion.html#Parameter-Expansion-Flags">Z shell Manual, Parameter Expansion Flags</a>, respectively). The Q operator effectively reverses the quote removal that happens when the shell parses a command.</p>
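<p>The <code>${@@Q}</code> operator requires Bash 4.4 or newer. For shells without such a built-in, the same idea can be sketched in plain POSIX <code>sh</code>. This is an illustrative variant under assumptions, not part of the original tool: <code>q_posix</code> is a hypothetical name, and arguments ending in newlines are not preserved because command substitution strips them.</p>

```shell
# Hypothetical POSIX sh variant of q, for shells without a quoting operator.
# The body runs in a subshell, so its variables do not leak to the caller.
q_posix() (
    out=""
    for arg in "$@"; do
        # Replace every ' with '\'' so it can live inside a single-quoted string.
        escaped=$(printf '%s\n' "$arg" | sed "s/'/'\\\\''/g")
        out="$out '$escaped'"
    done
    # Drop the leading space and print the quoted command line.
    printf '%s\n' "${out# }"
)

q_posix grep -RF 'case "$1" in' /etc/init.d
```

<p>The call prints <code>&#x27;grep&#x27; &#x27;-RF&#x27; &#x27;case &quot;$1&quot; in&#x27; &#x27;/etc/init.d&#x27;</code>. Parsing this string once more yields the original four arguments.</p>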
<p>To use the function, simply wrap the entire command that is passed to <code>ssh</code> in <code>$(q ...)</code>. No additional quotes are necessary:</p>
<pre><code class="language-text">$ ssh myserver show-args $(q grep -RF &#x27;case &quot;$1&quot; in&#x27; /etc/init.d)
     1	grep
     2	-RF
     3	case &quot;$1&quot; in
     4	/etc/init.d
</code></pre>
<pre><code class="language-text">$ ssh myserver $(q grep -RF &#x27;case &quot;$1&quot; in&#x27; /etc/init.d)
/etc/init.d/udev:case &quot;$1&quot; in
/etc/init.d/nullmailer:case &quot;$1&quot; in
/etc/init.d/dbus:case &quot;$1&quot; in
/etc/init.d/haveged:case &quot;$1&quot; in
...
</code></pre>
<p>One use case where this can be very handy is when variables are to be inserted into the command:</p>
<pre><code class="language-text">$ for i in &#x27;##&#x27; &#x27;$1&#x27; &#x27;a &gt; b&#x27;; do
&gt;   ssh myserver $(q echo &quot;\$i = $i&quot;);
&gt; done
$i = ##
$i = $1
$i = a &gt; b
</code></pre>
<p>Many other applications are possible. For example, the output of <code>q</code> can be written to a file that can be executed later.</p>]]></content:encoded>
        </item>
        <item>
          <title><![CDATA[DIY AI Chatbot with Ollama, Open WebUI & DeepSeek-R1 on NVIDIA L40S
]]></title>
          <link>https://www.cloudscale.ch/en/engineering-blog/2025/04/14/ki-chatbot-marke-eigenbau</link>
          <pubDate>Mon, 14 Apr 2025 00:00:00 GMT</pubDate>
          <guid isPermaLink="false">https://www.cloudscale.ch/en/engineering-blog/2025/04/14/ki-chatbot-marke-eigenbau</guid>
          <description>
            <![CDATA[<p>I used the launch of the dedicated GPUs as an opportunity to show off some of the new possibilities they add to our platform. In this Engineering Blog post, I walk through the setup of a self-hosted AI chatbot using Ollama and Open WebUI, powered by the DeepSeek-R1 70B model running on one of our brand-new NVIDIA L40S GPUs.</p>]]>
          </description>
          <content:encoded><![CDATA[<link rel="preload" as="image" href="https://www.cloudscale.ch/img/engineering-blog-diy-ai-chatbot-conversation.png"/><h3>But why? There are already many free AI chatbots available…</h3>
<p>With so many AI chatbots available for free, why go through the trouble of setting up a self-hosted one, besides technical curiosity? The answer lies in the often repeated and always neglected concerns about privacy, data protection, and trust:</p>
<ul>
<li><strong>Privacy &amp; data protection:</strong> Your chat conversations are often logged, analyzed, and stored. Even if providers claim to anonymize data, you can never be certain how your information is being used.</li>
<li><strong>Company secrets:</strong> Using an external AI chatbot risks exposing trade secrets, internal strategies, or confidential client information.</li>
<li><strong>If a service is free, you are the product:</strong> The phrase is worn out by now, but it still holds true.</li>
</ul>
<h3>Evaluation</h3>
<blockquote>
<p>Disclaimer: I did not spend much time evaluating the stack and got a lot of suggestions from my teammate Alain, who already has more experience with LLMs (<a href="https://en.wikipedia.org/wiki/Large_language_model">large language models</a>).</p>
</blockquote>
<p>TLDR:</p>
<ul>
<li>Web UI: <a href="https://github.com/open-webui/open-webui">Open WebUI</a></li>
<li>LLM Management: <a href="https://ollama.com/">Ollama</a></li>
<li>LLM: <a href="https://ollama.com/library/deepseek-r1:70b">DeepSeek-R1 70B</a></li>
</ul>
<h4>Web-Based Chat Interface</h4>
<p>In my short research, I found two very promising open-source tools that provide a web UI for an AI chatbot:</p>
<ul>
<li><a href="https://github.com/open-webui/open-webui">Open WebUI</a>: Simpler, 2 Docker services</li>
<li><a href="https://github.com/danny-avila/LibreChat">LibreChat</a>: More advanced, ~5 Docker services</li>
</ul>
<p>I went for the seemingly more lightweight solution, which is Open WebUI.</p>
<h4>LLM Management Tool</h4>
<p>I went with <a href="https://ollama.com/">Ollama</a> because it is an easy-to-use solution for managing LLMs, is natively supported by Open WebUI (no additional configuration needed), and also installs all necessary drivers and dependencies.</p>
<h4>Large Language Model</h4>
<p>Here comes the tricky part: selecting the right model. To ensure maximum performance, it needs to fit into the GPU&#x27;s VRAM. Each of the NVIDIA L40S GPUs offered at cloudscale has 48 GB of GDDR6 VRAM. Even though it is possible to use multiple L40S GPUs to run even larger models, I wanted to use a single GPU for this setup. The VRAM requirements can be found on the Ollama website or in other model repositories. The <a href="https://ollama.com/library/deepseek-r1:70b">DeepSeek-R1 70B</a> model requires approximately 41 GB of VRAM, making it a great fit for this GPU.</p>
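<p>The sizing check can be written down as a small calculation. The following is a minimal sketch: <code>fits_in_vram</code> is a hypothetical helper, and the default headroom of 4 GB for context and runtime overhead is an assumption for illustration, not a measured value.</p>

```shell
# Hypothetical helper: checks whether a model fits into VRAM with some headroom.
fits_in_vram() {
    # $1: model size in GB, $2: available VRAM in GB,
    # $3: headroom in GB for context and runtime overhead (assumed default: 4).
    [ $(( $1 + ${3:-4} )) -le "$2" ]
}

# Values from the text: roughly 41 GB for the model, 48 GB of VRAM.
if fits_in_vram 41 48; then
    echo "fits on a single L40S"
else
    echo "does not fit on a single L40S"
fi
```

<p>With the numbers from above, the check succeeds and leaves about 7 GB of VRAM unused by the model itself.</p>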
<h3>Instructions</h3>
<h4>Setting up a GPU server on cloudscale</h4>
<blockquote>
<p>GPU servers are subject to <a href="https://www.cloudscale.ch/en/gpu.pdf">Addendum for GPU servers</a> / <a href="https://www.cloudscale.ch/de/gpu.pdf">Vertragszusatz für GPU-Server</a>. If you are interested or have any questions, <a href="mailto:support@cloudscale.ch">please contact support</a>.</p>
</blockquote>
<p>I just created a GPU VM via the cloudscale control panel with the following specifications:</p>
<ul>
<li>Flavor: GPU1-160-20-1-400 (could also be a GPU flavor with less RAM or fewer CPU cores, the GPU is what matters)</li>
<li>GPU Type: 1x NVIDIA L40S</li>
<li>Scratch Disk: 400 GB on RAID 1 (this will be handy for persisting the LLM)</li>
<li>Source Image: Debian 12 - Bookworm</li>
</ul>
<p>If you want to follow this guide, I strongly recommend using Debian as the base image, as it ensures that all the steps will work as expected.</p>
<h4>Mount the Scratch Disk</h4>
<p>Even though we can load models up to 48 GB directly into the NVIDIA L40S&#x27;s VRAM, we need to download and persist the models before we can run them. For this reason, at cloudscale, every GPU server comes equipped with an additional, local Scratch Disk. The Scratch Disk is tied directly to the server, but offers better performance than our usual volumes. Like other volumes, we have to mount it in our VM first:</p>
<pre><code class="language-bash"># List all disks
lsblk -l

# Create a new folder for the mount
sudo mkdir -p /mnt/scratch

# Identify and mount Scratch Disk, e.g.: /dev/sdb
sudo mount /dev/sdb /mnt/scratch

# Get the device&#x27;s UUID (note it somewhere or copy to clipboard)
sudo blkid /dev/sdb

# Persist the Scratch Disk mount
sudo vim /etc/fstab

# Add the following line
UUID=&lt;device-uuid&gt; /mnt/scratch ext4 defaults 0 0
</code></pre>
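<p>Instead of copying the UUID by hand, the <code>/etc/fstab</code> line can also be generated. The following is a minimal sketch, assuming the same device, mount point and ext4 file system as in the steps above; <code>fstab_entry</code> is a hypothetical helper name.</p>

```shell
# Hypothetical helper: prints an fstab line for the given UUID and mount point.
fstab_entry() {
    # $1: file system UUID, $2: mount point.
    # File system type and options are the ones used in the guide above.
    printf 'UUID=%s %s ext4 defaults 0 0\n' "$1" "$2"
}

# Possible usage on the server (assumes /dev/sdb is the Scratch Disk):
# fstab_entry "$(sudo blkid -s UUID -o value /dev/sdb)" /mnt/scratch | sudo tee -a /etc/fstab
```

<p>Running <code>sudo mount -a</code> afterwards is a quick way to check that the new entry is accepted before the next reboot.</p>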
<p>Later, we will configure Ollama to use the Scratch Disk to download and persist the models.</p>
<h4>Manually install NVIDIA drivers (optional)</h4>
<blockquote>
<p>This step can be skipped entirely, because the <a href="https://ollama.com/download/linux">Ollama install script</a> will install all necessary dependencies automatically. Follow this guide if you are interested in how to manually install the NVIDIA GPU drivers on a VM.</p>
</blockquote>
<p>In the following section, I will give you a step-by-step guide for installing the necessary NVIDIA GPU driver on Debian 12 - Bookworm.</p>
<pre><code class="language-bash"># SSH into the server
ssh debian@&lt;public-ip&gt;

# Upgrade Debian to the latest version
# In the presented dialog, you can just confirm the default setting
sudo apt update &amp;&amp; sudo apt upgrade -y

# Edit the sources list to enable non-free software (e.g. NVIDIA Drivers)
sudo vim /etc/apt/sources.list
</code></pre>
<p>The file should be updated as follows:</p>
<pre><code class="language-bash"># See /etc/apt/sources.list.d/debian.sources
deb http://deb.debian.org/debian bookworm main contrib non-free non-free-firmware
deb http://deb.debian.org/debian-security bookworm-security main contrib non-free non-free-firmware
deb http://deb.debian.org/debian bookworm-updates main contrib non-free non-free-firmware
</code></pre>
<pre><code class="language-bash"># Update the package list
sudo apt update

# Install the NVIDIA driver package
# Multiple dialogs will pop up, I just confirmed the default settings as they seemed reasonable enough
sudo apt install nvidia-driver

# Reboot VM as recommended
sudo reboot

# SSH into the rebooted server
ssh debian@&lt;public-ip&gt;

# Verify that the NVIDIA GPU drivers are installed and the GPU is correctly recognized by the VM
nvidia-smi
</code></pre>
<h4>Install <a href="https://ollama.com/">Ollama</a></h4>
<pre><code class="language-bash"># SSH into the server
ssh debian@&lt;public-ip&gt;

# Download and execute the Ollama install script
curl -fsSL https://ollama.com/install.sh | sh

# Reboot VM as recommended
sudo reboot

# SSH into the rebooted server
ssh debian@&lt;public-ip&gt;

# Verify that the NVIDIA GPU drivers are installed and the GPU is correctly recognized by the VM
nvidia-smi

# Verify that Ollama is installed successfully
ollama -v
sudo systemctl status ollama

# Configure Ollama to use the mounted Scratch Disk instead of the root volume
sudo mkdir -p /mnt/scratch/ollama_models
sudo chown ollama:ollama /mnt/scratch/ollama_models/

# Edit the ollama service configuration
sudo vim /etc/systemd/system/ollama.service

# Add the following line as the last one in the [Service] section
Environment=&quot;OLLAMA_MODELS=/mnt/scratch/ollama_models&quot;

# Reload the ollama service
sudo systemctl daemon-reload
sudo systemctl restart ollama

# Download a small model to verify that it&#x27;s stored on the Scratch Disk, the directory should not be empty any more
ollama pull smollm
ls /mnt/scratch/ollama_models/

# Cleanup the unused model
ollama rm smollm

# Install and run DeepSeek&#x27;s R1
ollama run deepseek-r1:70b

# You can now test the LLM via CLI, exit with ctrl+d
</code></pre>
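<p>A possible alternative to editing the unit file directly is a systemd drop-in, which leaves the unit file written by the installer untouched. The following is a minimal sketch, assuming the same model directory as above; <code>ollama_dropin</code> and the file name <code>override.conf</code> are chosen for illustration.</p>

```shell
# Hypothetical helper: prints a systemd drop-in that sets OLLAMA_MODELS.
ollama_dropin() {
    # $1: directory in which Ollama should store its models.
    printf '[Service]\nEnvironment="OLLAMA_MODELS=%s"\n' "$1"
}

# Possible usage on the server:
# sudo mkdir -p /etc/systemd/system/ollama.service.d
# ollama_dropin /mnt/scratch/ollama_models | sudo tee /etc/systemd/system/ollama.service.d/override.conf
# sudo systemctl daemon-reload
# sudo systemctl restart ollama
```

<p>systemd merges all <code>*.conf</code> files in <code>ollama.service.d</code> with the main unit, so the setting should stay in place even if the unit file itself is replaced.</p>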
<h4>Secure your service (optional)</h4>
<blockquote>
<p>If you plan to keep the service online, it&#x27;s strongly advised to follow this section, or to implement alternative measures to protect it from unwanted access. This setup requires that you have a domain name.</p>
</blockquote>
<h4>Install nginx with certbot</h4>
<p>In this guide, I will install and configure the web server <a href="https://nginx.org">nginx</a> with <a href="https://certbot.eff.org">certbot</a> to restrict access and enable HTTPS for Open WebUI. Before you start, make sure that your domain&#x27;s or subdomain&#x27;s A and AAAA records are pointing to the GPU server&#x27;s public IPv4 and IPv6 addresses respectively.</p>
<pre><code class="language-bash"># Install nginx, certbot and certbot nginx plugin
sudo apt install nginx certbot python3-certbot-nginx

# Verify installation
sudo nginx -v

# Edit the nginx default configuration
sudo vim /etc/nginx/sites-available/default
</code></pre>
<pre><code class="language-nginx"># Replace &lt;your-ip-v4&gt; and &lt;your-ip-v6&gt; with the addresses you are using to access the internet.
# If you don&#x27;t want to protect your service from unwanted access or you have other measures in place (e.g.: HTTP BasicAuth), the following three lines can be removed.
allow &lt;your-ip-v4&gt;;
allow &lt;your-ip-v6&gt;;
deny all;

upstream open_webui {
    # We will configure Open WebUI to listen on this port
    server 127.0.0.1:8080;
}

server {
    listen 80 default_server;
    listen [::]:80 default_server;

    root /var/www/html;

    index index.html;

    # Replace &lt;your-domain&gt; with the domain where you configured the A and AAAA records, the certbot plugin will also use this to automatically extend this file with the HTTPS configuration
    server_name &lt;your-domain&gt;;

    # The proxy configuration is taken from: https://docs.openwebui.com/tutorials/https-nginx/#steps
    location / {
            proxy_pass http://open_webui;

            # Add WebSocket support (Necessary for version 0.5.0 and up)
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection &quot;upgrade&quot;;

            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            # (Optional) Disable proxy buffering for better streaming response from models
            proxy_buffering off;

            # (Optional) Increase max request size for large attachments and long audio messages
            client_max_body_size 20M;
            proxy_read_timeout 10m;
        }
}
</code></pre>
<pre><code class="language-bash"># Verify that the nginx config is valid
sudo nginx -t

# Reload configuration
sudo nginx -s reload

# Setup certbot
sudo certbot --nginx -d &lt;your-domain&gt;

# Check the changes made by certbot:
# There should be new entries &#x27;managed by certbot&#x27;
cat /etc/nginx/sites-available/default
</code></pre>
<p>Now you can open <code>https://&lt;your-domain&gt;/</code> in a browser to verify that the TLS certificate is set up correctly for your domain and that the page is served via HTTPS. The response will be &quot;502 Bad Gateway&quot;, as the Open WebUI service is not reachable yet. If you receive a &quot;403 Forbidden&quot;, this probably means that your IP address is not covered by the <code>allow</code> rules and you have to change the nginx configuration.</p>
<h4>Install UFW (<a href="https://wiki.debian.org/Uncomplicated%20Firewall%20%28ufw%29">Uncomplicated Firewall</a>)</h4>
<pre><code class="language-bash"># Install UFW
sudo apt install ufw

# Allow egress (we will monitor egress with OpenSnitch)
sudo ufw default allow outgoing

# Block ingress
sudo ufw default deny incoming

# Allow services, if you forget to allow SSH you will be blocked from accessing the VM!
sudo ufw allow ssh
sudo ufw allow vnc
sudo ufw allow http
sudo ufw allow https

# Enable UFW
sudo ufw enable

# Check status
sudo ufw status

# Verify that apt and certbot still can do their job
sudo apt update
sudo certbot renew --dry-run
</code></pre>
<h4>Install Open WebUI</h4>
<p>In this section, I will install <a href="https://github.com/open-webui/open-webui">Open WebUI</a> into a Python Virtualenv, but there are other installation methods (e.g.: with <a href="https://github.com/open-webui/open-webui?tab=readme-ov-file#quick-start-with-docker-">Docker</a>).</p>
<pre><code class="language-bash"># Install dependencies
sudo apt install python3 python3-venv

# Verify python version is 3.11 to avoid compatibility issues
python3 --version

# Create a new user for running Open WebUI
sudo useradd -m open_webui

# Change directory and switch to user open_webui
cd /home/open_webui
sudo su open_webui

# Create a virtual environment in the new directory venv
python3 -m venv venv

# Install Open WebUI in the virtual environment
venv/bin/pip install open-webui

# Start Open WebUI via the virtual environment, the default port is 8080
venv/bin/open-webui serve
</code></pre>
<p>You can now verify that your own Open WebUI is running, and create your admin account:</p>
<ul>
<li>If you skipped the &quot;Secure your service&quot; section: <code>http://&lt;public-ip&gt;:8080</code></li>
<li>Otherwise: <code>https://&lt;your-domain&gt;</code></li>
</ul>
<p>Now let&#x27;s create a systemd service, which runs Open WebUI in the background:</p>
<pre><code class="language-bash"># Ctrl+c to stop Open WebUI and exit from the user open_webui
exit

# Create a new systemd service configuration
sudo vim /etc/systemd/system/open-webui.service
</code></pre>
<pre><code class="language-ini">[Unit]
Description=Open WebUI Service
After=network.target

[Service]
Type=simple
WorkingDirectory=/home/open_webui
ExecStart=/home/open_webui/venv/bin/open-webui serve
KillSignal=SIGTERM
KillMode=mixed
Restart=always
RestartSec=3
StandardOutput=journal
StandardError=journal
User=open_webui
Group=open_webui

[Install]
WantedBy=multi-user.target
</code></pre>
<pre><code class="language-bash"># Reload systemd
sudo systemctl daemon-reload

# Enable and start open-webui service
sudo systemctl enable open-webui.service
sudo systemctl start open-webui.service

# Check for any errors
sudo systemctl status open-webui.service
</code></pre>
<img src="https://static.cloudscale.ch/img/engineering-blog-diy-ai-chatbot-conversation-6f363c9d7291.png" alt="Asking DeepSeek-R1 via Open WebUI to tell a joke." caption="Asking DeepSeek-R1 via Open WebUI to tell a joke."/>]]></content:encoded>
        </item>
        <item>
          <title><![CDATA[Today I Learned: GitLab Fleeting Edition
]]></title>
          <link>https://www.cloudscale.ch/en/engineering-blog/2025/03/27/heute-lernte-ich-gitlab-fleeting-ausgabe</link>
          <pubDate>Thu, 27 Mar 2025 00:00:00 GMT</pubDate>
          <guid isPermaLink="false">https://www.cloudscale.ch/en/engineering-blog/2025/03/27/heute-lernte-ich-gitlab-fleeting-ausgabe</guid>
          <description>
            <![CDATA[<p>How we collaborated with Puzzle ITC to publish a Fleeting plugin for GitLab, enabling our customers to autoscale their GitLab CI workload.</p>]]>
          </description>
          <content:encoded><![CDATA[<h3>Today I Learned: GitLab Fleeting Edition</h3>
<p>In collaboration with <a href="https://www.puzzle.ch/">Puzzle ITC</a>&#x27;s <a href="https://github.com/ioboi">Yannik Dällenbach</a>, I recently worked on publishing <a href="https://github.com/cloudscale-ch/fleeting-plugin-cloudscale"><code>fleeting-plugin-cloudscale</code></a>. With this plugin it is possible to dynamically scale GitLab runner instances on cloudscale.ch.</p>
<p>While the bulk of the work is all thanks to Yannik&#x27;s efforts, I had the chance to integrate his work into our open source landscape. This is what I learned.</p>
<h4>GitLab Fleeting Works Well</h4>
<p>Originally, GitLab used the Docker Machine driver to offer autoscaled runner instances. This tool has been deprecated for years. The alternative takes the form of a Go plugin.</p>
<p>With only a few short functions, Yannik was able to implement the necessary logic to launch, list, and clean up runner instances. It&#x27;s easy to write such a framework in a way that makes things cumbersome. GitLab has avoided that pitfall.</p>
<p>The GitLab runner process uses the plugin to react to tasks popping up in the CI pipeline, ensuring there is always enough capacity when it is actually needed.</p>
<h4>Fleeting Plugin Packaging Is Different</h4>
<p>I figured that publishing this plugin would involve a container image. This turned out to be right, but how GitLab does it was a bit unexpected:</p>
<p>Using a bespoke <a href="https://gitlab.com/gitlab-org/fleeting/fleeting-artifact"><code>fleeting-artifact</code></a> command, we build a container image, given a set of architecture-specific binaries. Go&#x27;s cross-compilation story is top-notch, so providing these binaries is easy, but it is unexpected to not use generic container image build tools.</p>
<p>I think the approach has its merits, but I might have saved some time had I known about it sooner. I only realized that my container images did not work after I had built them using <a href="https://ko.build/"><code>ko</code></a>.</p>
<h4>Go 1.24 Tools</h4>
<p>This was my first project where I got to try out Go 1.24&#x27;s <code>go tool</code> command. It integrates the tools used for developing a project into the project&#x27;s dependency tracking. In our case, this enabled me to integrate <code>goreleaser</code> and the mentioned <code>fleeting-artifact</code>.</p>
<p>To install these tools, I ran the following:</p>
<pre><code class="language-bash">go get -tool -modfile tool.mod github.com/goreleaser/goreleaser/v2@latest
go get -tool -modfile tool.mod gitlab.com/gitlab-org/fleeting/fleeting-artifact/cmd/fleeting-artifact
go mod tidy -modfile tool.mod
</code></pre>
<p>This makes these tools available as follows:</p>
<pre><code class="language-bash">go tool -modfile tool.mod fleeting-artifact
go tool -modfile tool.mod goreleaser
</code></pre>
<p>What I like about this approach, though it is a bit verbose, is the fact that I can run the exact same tool locally, and in the CI, and across all our workstations.</p>
<p>Dev-tooling should be versioned as well, to keep its results stable, and <code>go tool</code> accomplishes that.</p>
<h4>Zizmor</h4>
<p>I like linters, static analyzers, language-servers, and so on. Anything that supplements my knowledge with community-wisdom. I didn&#x27;t know about Zizmor before, but I rarely write GitHub workflows, and I figured: Someone must have written a validator for these.</p>
<p>Indeed someone has: <a href="https://github.com/woodruffw/zizmor">Zizmor</a>&#x27;s goal is to protect the user from the following (and more):</p>
<ul>
<li>Template injection vulnerabilities, leading to attacker-controlled code execution.</li>
<li>Accidental credential persistence and leakage.</li>
<li>Excessive permission scopes and credential grants to runners.</li>
<li>Impostor commits and confusable <code>git</code> references.</li>
</ul>
<p>It&#x27;s not precisely a validator, or at least I did not come to rely on it as such, but it pointed out some potential security issues with my GitHub workflows, and I learned some newer directives I missed before.</p>
<p>As a result, the GitHub workflows in place for <code>fleeting-plugin-cloudscale</code> are now hardened and checked for issues on each commit.</p>
<h4>Conclusion</h4>
<p>Adopting and packaging someone else&#x27;s hard work was a somewhat new experience for me. With most of the groundwork already laid, I was able to spend more time on packaging and testing, which I think the final result reflects.</p>
<p>If you want to check out our plugin, see:</p>
<p><a href="https://github.com/cloudscale-ch/fleeting-plugin-cloudscale">https://github.com/cloudscale-ch/fleeting-plugin-cloudscale</a></p>
<p>If you want to demo it, use our Ansible playbook that configures a GitLab instance, adds a GitLab runner, the <code>fleeting-plugin-cloudscale</code> plugin, and optionally a distributed S3 cache, all in one go:</p>
<p><a href="https://github.com/cloudscale-ch/gitlab-runner">https://github.com/cloudscale-ch/gitlab-runner</a></p>
<p>You&#x27;ll be presented with a GitLab instance where CI jobs automatically work, scale, and share their cache. It&#x27;s a great demo, and a good starting point to introduce GitLab into your own organization. Everything inside our infrastructure, far away from <a href="https://en.wikipedia.org/wiki/Five_Eyes">Five Eyes</a>.</p>]]></content:encoded>
        </item>
        <item>
          <title><![CDATA[Improving metrics collection for our Object Storage
]]></title>
          <link>https://www.cloudscale.ch/en/engineering-blog/2025/01/29/improving-metrics-collection-for-object-storage</link>
          <pubDate>Wed, 29 Jan 2025 00:00:00 GMT</pubDate>
          <guid isPermaLink="false">https://www.cloudscale.ch/en/engineering-blog/2025/01/29/improving-metrics-collection-for-object-storage</guid>
          <description>
            <![CDATA[<p>How do we know how much we should charge for your Object Storage usage? A journey into <code>rgw-metrics</code>.</p>]]>
          </description>
          <content:encoded><![CDATA[<link rel="preload" as="image" href="https://www.cloudscale.ch/img/engineering-blog-rgw-metrics-object-stoage-tab.png"/><p>Cloudscale offers S3-compatible Object Storage built on top of our Ceph storage cluster with three-fold replication.
To provide the S3 service, we use Ceph&#x27;s <a href="https://docs.ceph.com/en/reef/radosgw/">RADOS Gateway</a> (radosgw). While radosgw includes built-in usage tracking,
we found its metrics insufficient for the needs of a public cloud provider like us.</p>
<img src="https://static.cloudscale.ch/img/engineering-blog-rgw-metrics-object-stoage-tab-31e2162c70d9.png" alt="Historical Object Storage usage data in the Control Panel." caption="Historical Object Storage usage data in the Control Panel."/>
<p>In the Objects tab in our Control Panel, customers can view their exact usage over time and, for example,
see the number of requests on a specific date. This detailed data is not just for customer insights;
it is also critical for accurate billing. Besides the number of requests, the metrics include the number of objects,
the used storage and the network traffic.</p>
<p>To bridge the gap in capabilities, we developed our own solution: <code>rgw-metrics</code>.</p>
<h3>What is rgw-metrics?</h3>
<p><code>rgw-metrics</code> is a microservice that repeatedly collects the current usage data for every bucket from radosgw.
This data is aggregated into hourly segments, which are persisted.
The Control Panel then queries this usage data through an API provided by <code>rgw-metrics</code>.
This API is quite narrow and has been stable over the years.
It only allows fetching metrics for one or for multiple object users.</p>
<pre><code>┌─────────────────┐       ┌───────────────┐       ┌─────────────────┐
│  Control Panel  ├──────►│  rgw-metrics  ├──────►│  RADOS Gateway  │
└─────────────────┘       └───────────────┘       └─────────────────┘
</code></pre>
<p>Because it is designed as a standalone microservice running at both of our sites, it operates independently of the
Control Panel. This independence ensures metrics are consistently collected, even during extended maintenance periods.</p>
<h3>A journey of evolution</h3>
<p>The first version of <code>rgw-metrics</code> was written with <a href="https://flask.palletsprojects.com/en/stable/">Flask</a> back in 2017
when we first <a href="https://www.cloudscale.ch/en/news/2017/06/30/launch-of-our-s3-compatible-object-storage">introduced our S3 storage</a>.
While functional, the application had received little maintenance since its launch.
Over time, this led to challenges: the outdated dependencies, the manual deployment steps, and the fact that the Control
Panel is built with a different framework, <a href="https://www.djangoproject.com/">Django</a>, made engineers cautious about
touching the application.</p>
<p>To address these issues, we decided on a black-box rewrite of <code>rgw-metrics</code>, transitioning it from Flask to Django.</p>
<h3>The black-box rewrite approach</h3>
<p>To ensure a seamless transition, we prioritized maintaining the existing API&#x27;s behavior.
That way, we were able to create a collection of tests to validate the new service against the existing one.
For instance, we compared the historical usage data of the past year from our public acceptance tests,
together with that of countless other internal projects using the Object Storage. During development,
we ran a script to compare the output of the new Django-based service with the original Flask-based implementation.
This ensured the output of the new service matched the old one under various scenarios.</p>
<pre><code class="language-sh"># essentially, it was automating these steps:
curl -H &quot;$AUTH_HEADER&quot; &quot;https://old-api.cloudscale.ch/v1/metrics/buckets?start=2023-12-31&amp;end=2024-01-01&quot; &gt; &quot;export_flask/metrics.json&quot;
curl -H &quot;$AUTH_HEADER&quot; &quot;https://api.cloudscale.ch/v1/metrics/buckets?start=2023-12-31&amp;end=2024-01-01&quot; &gt; &quot;metrics.json&quot;
diff export_flask/metrics.json metrics.json
</code></pre>
<p>Thanks to this test-driven method, we actually found multiple bugs, including one in our data migration scripts.
An existing column was copied to the wrong target column.</p>
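<p>The comparison shown above covers a single request. The following is a minimal sketch of how such a check could be repeated over many saved responses. It is not the script we actually ran: <code>compare_exports</code> and the directory layout (one directory of JSON responses per service, with matching file names) are assumptions for illustration.</p>

```shell
# Hypothetical helper: compares saved responses of the old and the new service.
# The body runs in a subshell, so its variables do not leak to the caller.
compare_exports() (
    # $1: directory with responses from the old service,
    # $2: directory with responses from the new service.
    mismatches=0
    for old in "$1"/*.json; do
        name=$(basename "$old")
        # cmp -s is silent and succeeds only if both files are byte-identical.
        if cmp -s "$old" "$2/$name"; then
            echo "OK   $name"
        else
            echo "DIFF $name"
            mismatches=$((mismatches + 1))
        fi
    done
    # Succeed only if every response matched.
    [ "$mismatches" -eq 0 ]
)

# Possible usage, with export_django as an assumed directory name:
# compare_exports export_flask export_django
```

<p>A byte-wise comparison is strict: it also reports differences in key order or whitespace, so both services need to serialize their JSON in the same way for this check to be meaningful.</p>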
<h3>What is up next for rgw-metrics?</h3>
<p>With the rewrite complete, <code>rgw-metrics</code> now benefits from up-to-date dependencies, a container-based deployment
similar to that of our main application, and a similar structure, which will help us develop additional features.</p>
<p>With the foundation strengthened, we are ready to tackle upcoming improvements like the efficient detection
of large buckets: Each bucket has a limit of 10 million objects. Beyond this threshold, performance may degrade.
Currently, we proactively contact users approaching this limit. However, gathering the necessary data through
the current endpoints is suboptimal, as it requires iterating over every bucket for each object user.
The number of object users is growing every day, which forces us to extend the API to allow for more efficient
queries on large buckets, reducing overhead and improving responsiveness.</p>
<p>Stay tuned as we continue to enhance our metrics system and provide an even better Object Storage experience for our users.</p>]]></content:encoded>
        </item>
        <item>
          <title><![CDATA[Using Our Load Balancer to Set Up a Highly Available Kubernetes Control Plane
]]></title>
          <link>https://www.cloudscale.ch/en/engineering-blog/2025/01/07/eine-hochverfügbare-kubernetes-control-plane-mit-einem-load-balancer-einrichten</link>
          <pubDate>Tue, 07 Jan 2025 00:00:00 GMT</pubDate>
          <guid isPermaLink="false">https://www.cloudscale.ch/en/engineering-blog/2025/01/07/eine-hochverfügbare-kubernetes-control-plane-mit-einem-load-balancer-einrichten</guid>
          <description>
            <![CDATA[<p>Set up a highly available Kubernetes control plane using cloudscale&#x27;s load balancer. This guide walks you through provisioning with Terraform, configuring Kubernetes, and testing failover step by step.</p>]]>
          </description>
          <content:encoded><![CDATA[<link rel="preload" as="image" href="https://www.cloudscale.ch/img/engineering-blog-lb-kube-api.png"/><p>One of the great things about working as an engineer at cloudscale is
that we get to work on many different products, technologies, and projects.
When we began developing the load balancer as a service product,
a crucial requirement was that it must be usable for creating highly available
Kubernetes control planes. During the whole
development process, I was looking forward to bootstrapping my first cluster using
this new product. And now that we have this blog, I want to share
a few notes on how to create a highly available, stacked, Kubernetes control
plane on three cloudscale Ubuntu VMs using containerd.</p>
<h3>Provisioning the Cloud Infrastructure</h3>
<p>The <a href="https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/">Kubernetes Documentation</a>
instructs us to set up the following:</p>
<blockquote>
<p>In a cloud environment you should place your control plane nodes behind a
TCP forwarding load balancer. This load balancer distributes traffic to
all healthy control plane nodes in its target list. The health check for
an apiserver is a TCP check on the port the kube-apiserver listens on
(default value <code>:6443</code>).</p>
</blockquote>
<p>This means that we&#x27;ll need:</p>
<ul>
<li>A Load Balancer</li>
<li>A Load Balancer Listener using port <code>6443</code></li>
<li>A Load Balancer Pool with round-robin algorithm</li>
<li>A Load Balancer Pool Member for each VM</li>
<li>A Load Balancer Health Monitor checking Port <code>6443</code> on the VMs</li>
</ul>
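<p>To get a feel for what such a TCP health check does, here is a minimal Python sketch. It is not how the load balancer implements its health monitor; the function name <code>tcp_check</code> and the timeout value are assumptions made for illustration. The function simply tries to open a TCP connection to the given port.</p>
<pre><code class="language-python">import socket


def tcp_check(host, port=6443, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # A successful connect is all a plain TCP check verifies
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False
</code></pre>
<p>Once the cluster is running, calling <code>tcp_check(&quot;10.11.12.21&quot;)</code> from a machine in the private network should tell you whether the API server on the first node accepts connections.</p>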
<p>Since the easiest way to set all this up and get all the VMs running is with Terraform, I have
provided a Terraform file (see appendix) if you want a quick start. The Terraform
file sets up the following:</p>
<img src="https://static.cloudscale.ch/img/engineering-blog-lb-kube-api-4492c992665a.png" alt="Three VMs/nodes and a load balancer are connected privately. The load balancer consists of multiple VMs." caption="Three VMs/nodes and a load balancer are connected privately. The load balancer consists of multiple VMs."/>
<p><a href="https://developer.hashicorp.com/terraform/install">Ensure Terraform is installed on your machine</a>. Then, navigate to the Terraform file’s directory and initialize it:</p>
<pre><code class="language-text">terraform init
</code></pre>
<p>Once initialized, export a read/write API token from a (preferably empty) project and create the infrastructure by running:</p>
<pre><code class="language-text">export CLOUDSCALE_API_TOKEN=&quot;...&quot;
terraform apply
</code></pre>
<p>Terraform will display a preview of the resources it plans to create or update and prompt you for confirmation.
Type <code>yes</code> to proceed.</p>
<p>Terraform will also output three variables at the end: <code>kube_api_lb_ip</code>, <code>server_ips_private</code>,
and <code>server_ips_public</code>. We&#x27;ll need this information later on. Ensure that you can SSH
into all VMs as the <code>ubuntu</code> user via their <code>public</code> IP addresses.</p>
<h3>Installing kubeadm and containerd</h3>
<p>Now, fasten your seatbelt. Here is a condensed summary of the following
articles from the Kubernetes Documentation:
<a href="https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/">Installing kubeadm</a>,
<a href="https://kubernetes.io/docs/setup/production-environment/container-runtimes/">Container Runtimes</a>,
<a href="https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/">Creating Highly Available Clusters with kubeadm</a>. I have taken some
shortcuts and left out a few things I deem irrelevant for non-production setups.</p>
<p>All commands must be run on all nodes.</p>
<p>Configure Kubernetes’ apt repository. Replace <code>1.32</code> with the desired Kubernetes version.</p>
<pre><code class="language-text">curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.32/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo &#x27;deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.32/deb/ /&#x27; | sudo tee /etc/apt/sources.list.d/kubernetes.list
</code></pre>
<p>Install the necessary packages.</p>
<pre><code class="language-text">sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
</code></pre>
<p>Enable IP forwarding.</p>
<pre><code class="language-text">echo &quot;net.ipv4.ip_forward = 1&quot; | sudo tee /etc/sysctl.d/k8s.conf
sudo sysctl --system
</code></pre>
<p>Install containerd and configure the systemd cgroup driver for runc.</p>
<pre><code class="language-text">sudo apt install -y containerd
sudo mkdir -p /etc/containerd
containerd config default | sed &#x27;s/SystemdCgroup = false/SystemdCgroup = true/&#x27; | sudo tee /etc/containerd/config.toml
sudo systemctl restart containerd
</code></pre>
<h3>Initializing the Cluster and Installing a CNI Plugin</h3>
<p>Next, we&#x27;ll initialize the cluster from <code>control-node-1</code> and install Cilium as a
CNI (Container Network Interface) plugin.
In the <code>kubeadm init</code> command, pass the IPv4 IP address of the load balancer as
<code>--control-plane-endpoint</code> (shown as <code>kube_api_lb_ip</code> in <code>terraform show</code>)
and the private IP address of node 1 as <code>--apiserver-advertise-address</code>.</p>
<pre><code class="language-text">sudo kubeadm init --control-plane-endpoint &quot;KUBE_API_LB_IP:6443&quot; --apiserver-advertise-address=&quot;10.11.12.21&quot; --upload-certs
</code></pre>
<p>Next, set up your <code>$HOME/.kube/config</code> file as shown in the output of <code>kubeadm init</code> and
keep the join commands in a safe place.</p>
<p>At this point, I usually check the nodes and pods in my cluster to see if everything looks good.
It&#x27;s perfectly normal that the node is <code>NotReady</code> and that the <code>coredns</code> pods are not yet running.
But the other pods should be running.</p>
<pre><code class="language-text">kubectl get nodes -o wide
kubectl get pods -A -o wide
</code></pre>
<p>In our experience, Cilium is the most worry-free CNI plugin to install, so let&#x27;s do
that and admire some colorful ASCII art until it is ready.</p>
<pre><code class="language-text">wget https://github.com/cilium/cilium-cli/releases/download/v0.16.22/cilium-linux-amd64.tar.gz
tar xvf cilium-linux-amd64.tar.gz
./cilium install
./cilium status --wait
</code></pre>
<p>Now the node should be ready and <code>coredns</code> pods should also come up within a short amount of time.</p>
<h3>Joining the Other Nodes</h3>
<p>Now join the other two nodes to the cluster using kubeadm. The
command was shown in the <code>kubeadm init</code> output. Be sure to add
the <code>--apiserver-advertise-address=&quot;10.11.12.2x&quot;</code> flag using the
respective private IPs (<code>.22</code> and <code>.23</code>, shown as <code>server_ips_private</code> in <code>terraform show</code>).</p>
<pre><code class="language-text">sudo kubeadm join &lt;your-kube-api-lb-ip&gt;:6443 --token &lt;your-token&gt; \
  --discovery-token-ca-cert-hash sha256:&lt;your-ca-cert-hash&gt; \
  --control-plane --certificate-key &lt;your-certificate-key&gt; \
  --apiserver-advertise-address=&quot;10.11.12.&lt;your-server-private-ip&gt;&quot;
</code></pre>
<p>After a short while, all nodes listed in <code>kubectl get nodes -o wide</code> should be marked as <code>Ready</code>, and the <code>coredns</code>
pods should be running.</p>
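<p>If you prefer to check readiness from a script rather than by reading the table, a minimal sketch could parse the output of <code>kubectl get nodes -o json</code>. The function name <code>not_ready_nodes</code> is an assumption made for illustration; the fields it reads (<code>items</code>, <code>status.conditions</code>, and a condition of type <code>Ready</code>) follow the Kubernetes Node object.</p>
<pre><code class="language-python">import json


def not_ready_nodes(nodes_json):
    """Return the names of all nodes that do not report a true Ready condition."""
    not_ready = []
    for node in json.loads(nodes_json)["items"]:
        conditions = node.get("status", {}).get("conditions", [])
        # A node counts as ready only if its Ready condition has status "True"
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            not_ready.append(node["metadata"]["name"])
    return not_ready
</code></pre>
<p>Feed it the standard output of <code>kubectl get nodes -o json</code>, for example captured with <code>subprocess.run</code>. An empty list means that every node is ready.</p>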
<h3>Seeing It in Action: Shutting Down a Node</h3>
<p>Now, this blog post would, of course, not be complete without proving
that we can take down a control node. I suggest that you copy the
<code>$HOME/.kube/config</code> to your local machine and use kubectl from there.</p>
<p>Let’s list all pods in the cluster. Pay attention to the <code>coredns</code> pods. They’re most likely scheduled on <code>control-node-1</code>.</p>
<pre><code class="language-text">kubectl get pods -A -o wide
</code></pre>
<p>Now drain <code>control-node-1</code>:</p>
<pre><code class="language-text">kubectl drain control-node-1 --ignore-daemonsets
</code></pre>
<p>After a few seconds, the <code>coredns</code> pods should have been successfully moved to the other control nodes.</p>
<pre><code class="language-text">kubectl get pods -A -o wide
</code></pre>
<p>It’s now safe to shut down the drained node.</p>
<pre><code class="language-text">sudo init 0
</code></pre>
<p>If everything worked well, <code>kubectl</code> still works like a charm from
your local machine and the other two control nodes. Another fun thing
to do is to navigate to the private network named <code>backend</code> in the
cloudscale Control Panel and look at the Ports tab. There, you&#x27;ll
see the network ports of the load balancer and the VMs. <code>control-node-1</code>
should be shown as down. The same applies to the <code>&quot;monitor_status&quot;</code> property
if you query the pool members using our API.</p>
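<p>If you want to script this check, a minimal sketch could look like the following. It assumes that you have already fetched the pool members from the API and parsed the JSON into a list of dictionaries. The helper name <code>members_down</code>, the sample data, and the status values <code>up</code> and <code>down</code> are assumptions made for illustration; only the <code>monitor_status</code> property and the member names from the Terraform file are taken from this post.</p>
<pre><code class="language-python">def members_down(members):
    """Return the names of pool members whose health monitor reports them as down."""
    return [m["name"] for m in members if m.get("monitor_status") == "down"]


# Hypothetical, shortened sample of what the parsed API response might contain
sample = [
    {"name": "kube-api-0", "monitor_status": "down"},
    {"name": "kube-api-1", "monitor_status": "up"},
    {"name": "kube-api-2", "monitor_status": "up"},
]
print(members_down(sample))
</code></pre>
<p>With the Terraform file from the appendix, the pool member <code>kube-api-0</code> belongs to <code>control-node-1</code>, so that is the entry to look for after the shutdown.</p>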
<p>After restarting the node, make it schedulable again.</p>
<pre><code class="language-text">kubectl uncordon control-node-1
</code></pre>
<p>You are now ready to add worker nodes to the
cluster, install our <a href="https://github.com/cloudscale-ch/cloudscale-cloud-controller-manager">Cloud Controller Manager (CCM)</a> to configure a load balancer
for managing external traffic, or set up our
<a href="https://github.com/cloudscale-ch/csi-cloudscale">Container Storage Interface (CSI)</a> driver for persistent storage.</p>
<h3>Lessons Learned and Final Words</h3>
<p>In my initial test cluster, I made the mistake of not adding load balancer pool members for <code>control-node-2</code> and <code>control-node-3</code>. As a result, when I shut down <code>control-node-1</code>, everything stopped working. Once again, I was reminded that HA systems are worthless if failover is never tested.</p>
<p>I hope this guide was interesting to you. If you find this type of content valuable, please send us an email, because this could be the beginning of a miniseries on running Kubernetes on cloudscale infrastructure. As mentioned earlier, there’s much more to cover.</p>
<p>Have fun experimenting with Kubernetes on our infrastructure, but please read the complete documentation linked above before deploying production workloads!</p>
<h3>Appendix: Terraform File</h3>
<pre><code class="language-HCL">terraform {
  required_providers {
    cloudscale = {
      source  = &quot;cloudscale-ch/cloudscale&quot;
      version = &quot;4.4.0&quot;
    }
  }
}

provider &quot;cloudscale&quot; {
  # Add your provider configuration here if necessary
}

variable &quot;control_node_count&quot; {
  description = &quot;Number of control nodes&quot;
  type        = number
  default     = 3
}

variable &quot;network_cidr&quot; {
  description = &quot;CIDR block for the backend network&quot;
  type        = string
  default     = &quot;10.11.12.0/24&quot;
}

variable &quot;zone_slug&quot; {
  description = &quot;Zone slug for the resources&quot;
  type        = string
  default     = &quot;lpg1&quot;
}

variable &quot;ssh_key_path&quot; {
  description = &quot;Path to the SSH public key file&quot;
  type        = string
  default     = &quot;~/.ssh/id_ed25519.pub&quot; # Replace with your SSH key file path
}

# Create a network
resource &quot;cloudscale_network&quot; &quot;backend&quot; {
  name                    = &quot;backend&quot;
  zone_slug               = var.zone_slug
  auto_create_ipv4_subnet = false
}

# Create a subnet
resource &quot;cloudscale_subnet&quot; &quot;backend-subnet&quot; {
  cidr         = var.network_cidr
  network_uuid = cloudscale_network.backend.id
}

# Server Group for Control Nodes
resource &quot;cloudscale_server_group&quot; &quot;control-plane-group&quot; {
  name      = &quot;control-plane-group&quot;
  type      = &quot;anti-affinity&quot;
  zone_slug = var.zone_slug
}

# Control Nodes
resource &quot;cloudscale_server&quot; &quot;control-nodes&quot; {
  count            = var.control_node_count
  name             = &quot;control-node-${count.index + 1}&quot;
  flavor_slug      = &quot;flex-8-4&quot;
  image_slug       = &quot;ubuntu-24.04&quot;
  volume_size_gb   = 50
  ssh_keys         = [file(var.ssh_key_path)]
  server_group_ids = [cloudscale_server_group.control-plane-group.id]
  zone_slug        = var.zone_slug

  interfaces {
    type = &quot;public&quot;
  }

  interfaces {
    type = &quot;private&quot;
    addresses {
      subnet_uuid = cloudscale_subnet.backend-subnet.id
      address     = &quot;10.11.12.${count.index + 21}&quot;
    }
  }
}

# Kube-API Load Balancer
resource &quot;cloudscale_load_balancer&quot; &quot;kube-api-lb&quot; {
  name        = &quot;kube-api-lb&quot;
  flavor_slug = &quot;lb-standard&quot;
  zone_slug   = var.zone_slug
}


# Create a load balancer pool
resource &quot;cloudscale_load_balancer_pool&quot; &quot;kube-api-pool&quot; {
  name               = &quot;kube-api-pool&quot;
  algorithm          = &quot;round_robin&quot;
  protocol           = &quot;tcp&quot;
  load_balancer_uuid = cloudscale_load_balancer.kube-api-lb.id
}

# Create a load balancer listener
resource &quot;cloudscale_load_balancer_listener&quot; &quot;kube-api-listener&quot; {
  name          = &quot;kube-api-listener&quot;
  pool_uuid     = cloudscale_load_balancer_pool.kube-api-pool.id
  protocol      = &quot;tcp&quot;
  protocol_port = 6443
}


# Create a load balancer pool member
resource &quot;cloudscale_load_balancer_pool_member&quot; &quot;kube-api-pool-member&quot; {
  count         = var.control_node_count
  name          = &quot;kube-api-${count.index}&quot;
  pool_uuid     = cloudscale_load_balancer_pool.kube-api-pool.id
  protocol_port = 6443

  # Get the private IP address of the control node
  address = flatten([
    for iface in cloudscale_server.control-nodes[count.index].interfaces : [
      for addr in iface.addresses : addr.address
      if iface.type == &quot;private&quot;
    ]
  ])[0]
  subnet_uuid = cloudscale_subnet.backend-subnet.id
}

# Create a load balancer health monitor
resource &quot;cloudscale_load_balancer_health_monitor&quot; &quot;lb1-health-monitor&quot; {
  pool_uuid = cloudscale_load_balancer_pool.kube-api-pool.id
  type      = &quot;tcp&quot;
}


output &quot;kube_api_lb_ip&quot; {
  value       = cloudscale_load_balancer.kube-api-lb.vip_addresses[0].address
  description = &quot;IPv4 Address of the Load Balancer&quot;
}

output &quot;server_ips_public&quot; {
  value = [
    for node in cloudscale_server.control-nodes :
    flatten([
      for iface in node.interfaces : [
        for addr in iface.addresses : addr.address
        if iface.type == &quot;public&quot;
      ]
    ])[0]
  ]
  description = &quot;The public IP addresses of the control nodes.&quot;
}

output &quot;server_ips_private&quot; {
  value = [
    for node in cloudscale_server.control-nodes :
    flatten([
      for iface in node.interfaces : [
        for addr in iface.addresses : addr.address
        if iface.type == &quot;private&quot;
      ]
    ])[0]
  ]
  description = &quot;The private IP addresses of the control nodes.&quot;
}
</code></pre>]]></content:encoded>
        </item>
        <item>
          <title><![CDATA[Executable Ansible Playbooks
]]></title>
          <link>https://www.cloudscale.ch/en/engineering-blog/2024/12/12/ausfuehrbare-ansible-playbooks</link>
          <pubDate>Thu, 12 Dec 2024 00:00:00 GMT</pubDate>
          <guid isPermaLink="false">https://www.cloudscale.ch/en/engineering-blog/2024/12/12/ausfuehrbare-ansible-playbooks</guid>
          <description>
            <![CDATA[<p>Why we use <code>chmod +x</code> on our playbooks and why that is a good idea.</p>]]>
          </description>
          <content:encoded><![CDATA[<p>We use Ansible to provision, configure, and orchestrate our infrastructure. The code has grown over the years, and systems have sprung up that we need to interact with when a playbook is running. We manage our inventory in <a href="https://www.cloudscale.ch/en/news/2024/06/28/netbox-as-a-source-of-truth">Netbox</a>, we record playbook runs using <a href="https://ara.recordsansible.org/">ARA</a>, and store secrets in <a href="https://www.vaultproject.io/">Vault</a>/<a href="https://openbao.org/">OpenBao</a>.</p>
<p>As a result, our playbooks require more and more knowledge about our environment, and the chance of &quot;holding them wrong&quot; increases.</p>
<h3>How it Started</h3>
<p>Like most Ansible users, we initially called all playbooks like this:</p>
<pre><code class="language-bash">$ ansible-playbook playbooks/state/all.yml -i inventory/prod-rma1 -l www --diff
</code></pre>
<p>This works great, but it is more verbose than necessary. Some in our team, including - if not limited to - me, prefer short commands over long ones. Especially if they need to be used frequently.</p>
<p>So we took our existing playbooks like this one:</p>
<pre><code class="language-yaml">- name: Base network and boot setup
  hosts: &#x27;{{ playbook_hosts | default(&quot;ansible-managed&quot;) }}&#x27;
</code></pre>
<p>And added a <a href="https://en.wikipedia.org/wiki/Shebang_%28Unix%29">shebang</a>:</p>
<pre><code class="language-yaml">#!/usr/bin/env ansible-playbook
- name: Base network and boot setup
  hosts: &#x27;{{ playbook_hosts | default(&quot;ansible-managed&quot;) }}&#x27;
</code></pre>
<p>Then we made the file executable:</p>
<pre><code class="language-shell">$ chmod +x playbooks/state/all.yml
</code></pre>
<p>And presto, we could drop the command name from our CLI, calling the same playbook as follows:</p>
<pre><code class="language-shell">$ playbooks/state/all.yml -i inventory/prod-rma1 -l www
</code></pre>
<h3>How it is Going</h3>
<p>When integrating ARA logging, we discovered that operators would always have to use <code>--diff / -D</code> for differences to actually be logged.</p>
<p>We figured that typing <code>--diff</code> is not something we wanted to add to our documentation. We wanted to enforce it. We also had some configuration we needed to apply, depending on the inventory selection.</p>
<p>Maybe there is another way, but we figured: Since we already use our playbooks somewhat like CLIs, why not wrap <code>ansible-playbook</code> and go all-in?</p>
<p>So that&#x27;s what we did. We wrote our own Python CLI called <code>ap</code> that inspects the arguments destined for <code>ansible-playbook</code> and introduces some of its own arguments for a better user experience.</p>
<p>The script uses <a href="https://typer.tiangolo.com/">Typer</a> and looks roughly as follows (this is a minimized version to illustrate the concept):</p>
<pre><code class="language-python">import os
import subprocess
import sys

from typer import Context
from typer import Option
from typer import Typer
from typing import Annotated


cli = Typer(add_completion=False, add_help_option=False, no_args_is_help=True)


@cli.command(name=&#x27;ap&#x27;, context_settings={
    &quot;allow_extra_args&quot;: True,
    &quot;ignore_unknown_options&quot;: True
}, add_help_option=False, no_args_is_help=True)
def main(
    ctx: Context,
    inventory: Annotated[list[str], Option(&quot;--inventory&quot;, &quot;-i&quot;, help=(
        &quot;Inventories to use, each either a site or a full inventory path&quot;
    ))] = [],
) -&gt; None:
    args = [&#x27;ansible-playbook&#x27;, *ctx.args]

    # Configure inventories
    for i in inventory:
        i = i.removeprefix(&#x27;inventory/&#x27;)

        args.append(&#x27;-i&#x27;)
        args.append(f&#x27;inventory/{i}&#x27;)

    # Always use diff
    if &#x27;-D&#x27; not in args and &#x27;--diff&#x27; not in args:
        args.append(&#x27;--diff&#x27;)

    # Configure systems (the ara_env, netbox_env, and secrets_env helpers are omitted here)
    os.environ.update(ara_env(args))
    os.environ.update(netbox_env(args))
    os.environ.update(secrets_env(args))

    # Execute
    result = subprocess.run(args, env=os.environ)
    sys.exit(result.returncode)


if __name__ == &#x27;__main__&#x27;:
    cli()
</code></pre>
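<p>The argument handling can also be tried out in isolation. The following sketch expresses the same logic as a pure function, without Typer and without the environment setup; the name <code>build_args</code> is an assumption made for this example and is not part of our CLI.</p>
<pre><code class="language-python">def build_args(extra_args, inventories):
    """Build the ansible-playbook command line the way the wrapper above does."""
    args = ["ansible-playbook", *extra_args]

    # Accept both "prod-rma1" and "inventory/prod-rma1"
    for inventory in inventories:
        name = inventory.removeprefix("inventory/")
        args += ["-i", f"inventory/{name}"]

    # Always use diff, but do not add it twice
    if "-D" not in args and "--diff" not in args:
        args.append("--diff")

    return args


print(build_args(["playbooks/state/all.yml", "-l", "www"], ["prod-rma1"]))
</code></pre>
<p>Keeping this part free of side effects makes it easy to cover with unit tests, while the wrapper itself only has to deal with the environment and the subprocess call.</p>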
<p>Aside from enforcing arguments, this also gave us the flexibility to shorten our inventory calls a little, as the <code>inventory</code> prefix is really always the same. So while we started with this:</p>
<pre><code class="language-bash">$ ansible-playbook playbooks/state/all.yml -i inventory/prod-rma1 -l www --diff
</code></pre>
<p>We can now call this, which is equivalent:</p>
<pre><code class="language-bash">$ playbooks/state/all.yml -i prod-rma1 -l www
</code></pre>
<p>And now we can use all these saved keystrokes on useful things, like our engineering blog!</p>]]></content:encoded>
        </item>
        <item>
          <title><![CDATA[pyastgrep: Searching Python Code Using Its AST and XPath
]]></title>
          <link>https://www.cloudscale.ch/de/engineering-blog/2024/11/29/python-code-anhand-des-asts-und-xpath-durchsuchen</link>
          <pubDate>Fri, 29 Nov 2024 00:00:00 GMT</pubDate>
          <guid isPermaLink="false">https://www.cloudscale.ch/de/engineering-blog/2024/11/29/python-code-anhand-des-asts-und-xpath-durchsuchen</guid>
          <description>
            <![CDATA[<p>Effectively combing through code and precisely locating specific places in it becomes an important task in large software projects. Tools like regular expressions eventually reach their limits. I present a tool that can search Python code based on its AST.</p>]]>
          </description>
          <content:encoded><![CDATA[<h3>A Medium-Sized Software Project</h3>
<p>Here at cloudscale, I mainly work on the <em>control panel</em>, a medium-sized Python/TypeScript application. The control panel is the application available at <a href="https://control.cloudscale.ch/">control.cloudscale.ch</a>. Development began 8 years ago as a fresh start, and the application has been continuously developed and maintained ever since.</p>
<p>In this article, I focus on the Python part, which currently comprises ~120k lines of code:</p>
<pre><code class="language-text">$ git ls-files &#x27;*.py&#x27; | wc -l
     674
$ git ls-files &#x27;*.py&#x27; | parallel -Xj1 cat | wc -l
  119248
</code></pre>
<h3>Finding Places in the Code</h3>
<p>As with any software project, working on the control panel regularly requires finding or changing places in the code that are scattered widely across the application. For example, I may want to find out whether an internal API is used in a particular way, or whether it is still used at all. Or I may want to check whether an outdated or problematic code pattern I have discovered occurs elsewhere in the application. The goal is always to make the code easier to understand and easier to extend and maintain.</p>
<p>The first tool I reach for is the &quot;Search &amp; Replace&quot; feature of my IDE or another, similar tool. Here, <a href="https://www.regular-expressions.info">regular expressions</a> let me search for places in the code with little effort. This approach is very fast and therefore interactive. The speed of <code>git grep</code> is unlikely to be matched by any tool that first has to parse the Python source:</p>
<pre><code class="language-text">$ time git grep -P &#x27;\bsend_email\(&#x27; &#x27;*.py&#x27; &gt; /dev/null

real	0m0.030s
user    0m0.014s
sys     0m0.098s
</code></pre>
<p>But regular expressions are not suitable when more complex relationships in the code need to be recognized. One example would be finding all uses of a function where an optional argument is passed. There are better-suited tools for this, although they also require more preparation to use. In the following, I show one of the tools I have worked with a lot in recent months.</p>
<h3>pyastgrep</h3>
<p><a href="https://github.com/spookylukey/pyastgrep/">pyastgrep</a> is a library and a CLI application that can be used to search Python code based on its AST (<a href="https://de.wikipedia.org/wiki/Syntaxbaum#Abstrakte_Syntaxb%C3%A4ume">abstract syntax tree</a>). Internally, pyastgrep exposes the Python AST of a single source file or an entire directory as an XML tree, which can then be searched with XPath expressions. In any case, it is very helpful to keep the documentation of the Python module <a href="https://docs.python.org/3.12/library/ast.html"><code>ast</code></a> and an <a href="https://devhints.io/xpath">XPath cheatsheet</a> at hand.</p>
<p>As an example, I show how I can find all places in the code where the function <code>send_email()</code> is called and a value is passed for the optional argument <code>reply_to_address</code>. To do so, I build up an XPath expression step by step.</p>
<p>In the first step, I select all calls to functions named <code>send_email()</code>.</p>
<pre><code class="language-text">$ pyastgrep &#x27;.//Call[func/Name[@id=&quot;send_email&quot;]]&#x27; src
src/db/access/member_helper.py:57:9:        send_email(
src/services/openstack/functions.py:103:5:    send_email(
</code></pre>
<blockquote>
<p><code>Call</code> and <code>Name</code> are the nodes in the syntax tree that represent a function call and the use of a global or local variable, respectively. <code>Name[@id=&quot;send_email&quot;]</code> selects all uses of variables named <code>send_email</code>.</p>
</blockquote>
<p>So far, so good! However, I know that far more places in the code call this function. The problem is that the function can be called either as <code>send_email()</code> or as <code>email.send_email()</code>. Since this is the only function with this name, I can be a little imprecise and select all places where a function with this name is called on any object or module:</p>
<pre><code class="language-text">$ pyastgrep &#x27;.//Call[func[Name[@id=&quot;send_email&quot;] or Attribute[@attr=&quot;send_email&quot;]]]&#x27; src
src/panel/signals.py:18:13:            email.send_email(
src/panel/invoices/__init__.py:169:30:    to_address, cc_address = email.send_email(
src/panel/payment/__init__.py:46:9:        email.send_email(
src/panel/email/tests/test_template_rendering.py:15:5:    email.send_email(
src/panel/billing/notifications.py:28:30:    to_address, cc_address = email.send_email(
[... 10 more results]
</code></pre>
<blockquote>
<p><code>Attribute</code> nodes are those where the <code>.</code> operator is used to access an attribute of another object, as in <code>email.send_email</code>. <code>func[... or ...]</code> selects the function calls of both variants (local/global variable and attribute).</p>
</blockquote>
<p>Finally, I narrow the search down to all places where the keyword-only argument <code>reply_to_address</code> is passed:</p>
<pre><code class="language-text">$ pyastgrep &#x27;.//Call[func[Name[@id=&quot;send_email&quot;] or Attribute[@attr=&quot;send_email&quot;]] and keywords/keyword[@arg=&quot;reply_to_address&quot;]]&#x27; src
src/panel/signals.py:18:13:            email.send_email(
src/panel/billing/notifications.py:51:5:    email.send_email(
src/project/tests/test_email_backend.py:30:5:    email.send_email(
src/db/access/member_helper.py:57:9:        send_email(
src/db/access/user/tickets.py:51:9:        email.send_email(
[... 5 more results]
</code></pre>
<blockquote>
<p><code>Call[... and ...]</code> selects all function calls that satisfy both conditions (function name and presence of the keyword argument). <code>keywords/keyword[...]</code> iterates over all keyword arguments of the function call. <code>@arg=&quot;reply_to_address&quot;</code> selects the keyword arguments that use the keyword <code>reply_to_address</code> (<code>send_email(..., reply_to_address=...)</code>).</p>
</blockquote>
<p>The condition for the keyword argument can also easily be inverted. As a last example, I select all calls to <code>send_email()</code> where the argument <code>reply_to_address</code> is <em>not</em> passed:</p>
<pre><code class="language-text">$ pyastgrep &#x27;.//Call[func[Name[@id=&quot;send_email&quot;] or Attribute[@attr=&quot;send_email&quot;]] and not(keywords/keyword[@arg=&quot;reply_to_address&quot;])]&#x27; src
src/panel/invoices/__init__.py:169:30:    to_address, cc_address = email.send_email(
src/panel/payment/__init__.py:46:9:        email.send_email(
src/panel/email/tests/test_template_rendering.py:15:5:    email.send_email(
src/panel/billing/notifications.py:28:30:    to_address, cc_address = email.send_email(
src/services/openstack/functions.py:103:5:    send_email(
</code></pre>
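<p>The same search can also be sketched with Python&#x27;s built-in <code>ast</code> module, which provides the tree that pyastgrep exposes as XML. This is only an illustration under simplifying assumptions: the helper name <code>find_calls</code> and the sample source are made up, and like the XPath expression above it matches any function or attribute named <code>send_email</code>.</p>
<pre><code class="language-python">import ast


def find_calls(source, func_name, keyword):
    """Yield (line, has_keyword) for every call to func_name in source."""
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name):
            # Plain call such as send_email(...)
            name = func.id
        elif isinstance(func, ast.Attribute):
            # Attribute call such as email.send_email(...)
            name = func.attr
        else:
            continue
        if name == func_name:
            # Check whether the keyword argument is passed explicitly
            yield node.lineno, any(kw.arg == keyword for kw in node.keywords)


# Hypothetical sample source, not taken from the control panel
sample = '''
send_email("a@example.com", reply_to_address="b@example.com")
email.send_email("a@example.com")
other("x")
'''

for line, has_keyword in sorted(find_calls(sample, "send_email", "reply_to_address")):
    print(line, has_keyword)
</code></pre>
<p>For one-off searches the XPath expression is quicker to write. A small script like this becomes more attractive once the condition is too involved to express comfortably in XPath.</p>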
<h3>Accuracy Is Always a Trade-Off</h3>
<p>In closing, I would like to note that the example shown above can return wrong results, that is, too many or too few, for various reasons. Each of these deviations can be addressed, each with a different amount of effort and not always perfectly. Here are a few examples of inaccuracies I accepted in the example above:</p>
<ul>
<li>There could be other functions named <code>send_email</code> besides the one I am looking for. To counter this, the <code>import</code> statements in every source file as well as the presence of local variables would have to be analyzed.</li>
<li>The function <code>send_email</code> could have been imported under a different name in some file, e.g. with <code>from panel.email import send_email as send_email_</code>, perhaps to avoid a naming conflict. This, too, would require analyzing the <code>import</code> statements.</li>
<li>The argument <code>reply_to_address</code> could be passed as a positional argument. In that case, the argument would have to be selected by its position in the argument list rather than by the keyword <code>reply_to_address</code>.</li>
<li>The argument <code>reply_to_address</code> could be passed dynamically via <code>**kwargs</code>. This case is very difficult to detect completely in an automated way. In our case, the most effective approach would have been to find the places where <code>**kwargs</code> is used automatically and then check them manually.</li>
</ul>
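<p>The dynamic cases can at least be flagged for manual review. In Python&#x27;s <code>ast</code> module, an argument passed as <code>**kwargs</code> appears as a <code>keyword</code> node whose <code>arg</code> is <code>None</code>, and an argument passed as <code>*args</code> appears as a <code>Starred</code> node. A minimal sketch, in which the helper name <code>calls_to_review</code> and the sample source are assumptions made for illustration:</p>
<pre><code class="language-python">import ast


def calls_to_review(source, func_name):
    """Return the line numbers of calls to func_name that pass *args or **kwargs."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        # Name nodes carry the name in "id", Attribute nodes in "attr"
        name = getattr(node.func, "id", None) or getattr(node.func, "attr", None)
        if name != func_name:
            continue
        has_double_star = any(kw.arg is None for kw in node.keywords)
        has_star = any(isinstance(arg, ast.Starred) for arg in node.args)
        if has_double_star or has_star:
            lines.append(node.lineno)
    return sorted(lines)


# Hypothetical sample source, not taken from the control panel
sample = '''
options = {"reply_to_address": "b@example.com"}
email.send_email("a@example.com", **options)
send_email("a@example.com", reply_to_address="b@example.com")
'''

print(calls_to_review(sample, "send_email"))
</code></pre>
<p>This does not tell you whether <code>reply_to_address</code> is actually among the unpacked arguments. It only narrows down the places that need a human look.</p>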
<p>In most cases there is no perfect solution, or the effort required outweighs the benefit. In these cases, you are forced to make a trade-off between accuracy, flexibility, and effort. In my experience, the larger an application becomes, the more weight accuracy and flexibility gain.</p>]]></content:encoded>
        </item>
        <item>
          <title><![CDATA[Filling the Fridge - My onboarding @ cloudscale
]]></title>
          <link>https://www.cloudscale.ch/en/engineering-blog/2024/11/28/den-kuehlschrank-fuellen-mein-onboarding-bei-cloudscale</link>
          <pubDate>Thu, 28 Nov 2024 00:00:00 GMT</pubDate>
          <guid isPermaLink="false">https://www.cloudscale.ch/en/engineering-blog/2024/11/28/den-kuehlschrank-fuellen-mein-onboarding-bei-cloudscale</guid>
          <description>
            <![CDATA[<p>When a new fridge arrives, the urge to just plug it in, fill it with beverages, and enjoy a cold one is strong. But the setup actually requires quite a few steps to ensure that the device runs reliably for a long time with low maintenance, and so does a proper onboarding process. In this blog post I will use this odd analogy to describe my onboarding process as a Software Engineer in the Dev Team at cloudscale.</p>]]>
          </description>
          <content:encoded><![CDATA[<h3>Place and Position the Fridge</h3>
<p>My onboarding started with a one-on-one session with the Team Lead and included a mix of setup activities:</p>
<ul>
<li>Unboxing my personal hardware, basic MacBook and backup setup</li>
<li>Creating accounts for internal systems like LDAP, VPN, SSH, etc., according to the password policy</li>
<li>Reading and signing papers and more</li>
</ul>
<p>While commuting from Biel to Zürich and in the home office, I could customize my setup and familiarize myself with company tools and workflows.
I also got time to start working through the cloud exercises available on <a href="https://github.com/cloudscale-ch/cloud-exercises">GitHub</a>,
which gave me a better understanding of how the API works and how customers interact with our system.</p>
<h3>Let the Fridge Settle</h3>
<p>Once I had everything set up, I was introduced to the software projects I would be working on at cloudscale.
In one of the next daily standups, I was also assigned my first small task: Include Server Name in Extra Traffic Transaction Description.
Soon after, I was introduced to the code review / QA process, which enabled me to review and test the work of my teammates.</p>
<p>One by one, I attended the company&#x27;s different meeting formats:</p>
<ul>
<li>One-on-One: Weekly retrospective with my team lead</li>
<li>Sprint planning: Every two weeks the team discusses and decides the scope of the next sprint</li>
<li>Dev Team Insights: Each month our team presents an interesting insight; the last one was about how we use end-to-end testing in current projects</li>
<li>All-Hands Retrospective: A workshop where the whole company sits together to identify roadblocks and propose solutions for resolving them</li>
<li>Brownbag: A voluntary format, usually held over lunch (hence the name), in which an employee presents a topic (e.g. Cloud Native Days)</li>
</ul>
<h3>Set the Temperature</h3>
<p>A key experience for understanding the company was the introduction day with Mänu, the CEO of cloudscale.
We took a dive into the company’s history, structure, and where we are located in the market and the cloud pyramid,
covering everything from our server centers to the networks and topologies that support our infrastructure.
Alongside historical and technical insights, I learned about cloudscale&#x27;s core values:</p>
<ul>
<li>Quality - Go the extra mile</li>
<li>360° Transparency - Internally and externally</li>
<li>Privacy / Security - From software to communication channels</li>
<li>Swissness - Location, reliability and secrecy</li>
<li>Simplicity / Approachability - Meeting customers, business partners and coworkers at eye level</li>
</ul>
<h3>Load Beverages Gradually</h3>
<p>To avoid overloading a fridge, the beverages should not be loaded all at once.
Analogously, I was gradually introduced to further important topics and given more responsibility and autonomy:</p>
<ul>
<li>Further sessions about implementation details in our projects</li>
<li>Working on bigger tasks</li>
<li>Writing this blog post</li>
<li>Information Security Management System (ISMS) and ISO/IEC 27002</li>
<li>Introduction to the system engineering team and which technologies they are using</li>
</ul>
<p>Once the above topics have cooled down, the fridge can be loaded further:</p>
<ul>
<li>Visiting the server centers</li>
<li>Introduction to Support</li>
<li>Holding the next Dev Team Insights Presentation</li>
<li>Organizing All-Hands Retrospective Meeting</li>
</ul>
<h3>Enjoy a cold one</h3>
<p>For me, switching jobs triggered some insecurities, but thanks to a well-structured onboarding process and a very supportive team, the transition to cloudscale has been a smooth experience.
Cheers!</p>
<blockquote>
<p>Disclaimer: We just got a shiny new fridge with tasty beverages. I was not involved in the actual fridge project, and it was filled by the CEO himself.</p>
</blockquote>]]></content:encoded>
        </item>
        <item>
          <title><![CDATA[Staggering Restarts in Ceph
]]></title>
          <link>https://www.cloudscale.ch/en/engineering-blog/2024/11/27/gestaffelte-neustarts-in-ceph</link>
          <pubDate>Wed, 27 Nov 2024 00:00:00 GMT</pubDate>
          <guid isPermaLink="false">https://www.cloudscale.ch/en/engineering-blog/2024/11/27/gestaffelte-neustarts-in-ceph</guid>
          <description>
            <![CDATA[<p>How we restart OSDs in our Ceph clusters to avoid customer impact.</p>]]>
          </description>
          <content:encoded><![CDATA[<link rel="preload" as="image" href="https://www.cloudscale.ch/img/engineering-blog-staggering-restarts-in-ceph-osd-concurrent-read.png"/><link rel="preload" as="image" href="https://www.cloudscale.ch/img/engineering-blog-staggering-restarts-in-ceph-osd-concurrent-write.png"/><link rel="preload" as="image" href="https://www.cloudscale.ch/img/engineering-blog-staggering-restarts-in-ceph-osd-5s-read.png"/><link rel="preload" as="image" href="https://www.cloudscale.ch/img/engineering-blog-staggering-restarts-in-ceph-osd-5s-write.png"/><link rel="preload" as="image" href="https://www.cloudscale.ch/img/engineering-blog-staggering-restarts-in-ceph-osd-concurrent-read-small.png"/><link rel="preload" as="image" href="https://www.cloudscale.ch/img/engineering-blog-staggering-restarts-in-ceph-osd-concurrent-write-small.png"/><link rel="preload" as="image" href="https://www.cloudscale.ch/img/engineering-blog-staggering-restarts-in-ceph-osd-5s-read-small.png"/><link rel="preload" as="image" href="https://www.cloudscale.ch/img/engineering-blog-staggering-restarts-in-ceph-osd-5s-write-small.png"/><p>When customers use a disk in our cloud, they talk to one of our Ceph storage clusters. Every byte written is sent to such a cluster, and every byte read is received from one. The underlying physical hardware is abstracted away, shielding the VMs from unexpected disk failures and planned maintenance procedures.</p>
<p>To manage our clusters, we sometimes have to restart their services. Here&#x27;s how we do that while minimizing customer impact.</p>
<h3>Impact of Restarts</h3>
<p>When a single disk dies, it typically takes one or two OSDs with it. Our customers rarely notice this - after all, this is what Ceph is all about.</p>
<p>However, restarting a lot of OSDs concurrently causes throughput drops and latency spikes. A drop in throughput can be noticeable because, for example, a PostgreSQL server might suddenly be much slower at scanning its tables. A latency spike is especially noticeable in clusters like Etcd, where high latency might cause leader elections.</p>
<blockquote>
<p><strong>💡 Tip</strong></p>
<p>If you use Etcd in the cloud, tuning its time parameters may help avoid unnecessary leader elections:</p>
<ul>
<li><a href="https://etcd.io/docs/v3.4/tuning/">https://etcd.io/docs/v3.4/tuning/</a></li>
<li><a href="https://www.redhat.com/en/blog/introducing-selectable-profiles-for-etcd">https://www.redhat.com/en/blog/introducing-selectable-profiles-for-etcd</a></li>
</ul>
</blockquote>
<p>If we restart as many OSDs as possible simultaneously (a third of a cluster), the impact on customer VMs is quite visible in benchmarks:</p>
<table><tbody><tr><td><img src="https://static.cloudscale.ch/img/engineering-blog-staggering-restarts-in-ceph-osd-concurrent-read-ae436b43172e.png" alt="Concurrent OSD restarts - reads." caption="Concurrent OSD restarts - reads."/></td><td><img src="https://static.cloudscale.ch/img/engineering-blog-staggering-restarts-in-ceph-osd-concurrent-write-8a2a79c9c4a3.png" alt="Concurrent OSD restarts - writes." caption="Concurrent OSD restarts - writes."/></td></tr></tbody></table>
<p>If we manually stagger the OSD restarts five seconds apart, we get more spread out numbers, especially when it comes to latency:</p>
<table><tbody><tr><td><img src="https://static.cloudscale.ch/img/engineering-blog-staggering-restarts-in-ceph-osd-5s-read-43db7da718e2.png" alt="5s staggered OSD restarts - reads." caption="5s staggered OSD restarts - reads."/></td><td><img src="https://static.cloudscale.ch/img/engineering-blog-staggering-restarts-in-ceph-osd-5s-write-32a592dcd5b8.png" alt="5s staggered OSD restarts - writes." caption="5s staggered OSD restarts - writes."/></td></tr></tbody></table>
<p>The effect on throughput is not too dramatic, but that&#x27;s a function of the cluster size. On our smaller, internal clusters, the effect is more pronounced:</p>
<table><tbody><tr><td><img src="https://static.cloudscale.ch/img/engineering-blog-staggering-restarts-in-ceph-osd-concurrent-read-small-c563aa30aeca.png" alt="Concurrent OSD restarts on small cluster - reads." caption="Concurrent OSD restarts on small cluster - reads."/></td><td><img src="https://static.cloudscale.ch/img/engineering-blog-staggering-restarts-in-ceph-osd-concurrent-write-small-6d0f651c4584.png" alt="Concurrent OSD restarts on small cluster - writes." caption="Concurrent OSD restarts on small cluster - writes."/></td></tr></tbody></table>
<p>By staggering starts, we spread out the negative effects:</p>
<table><tbody><tr><td><img src="https://static.cloudscale.ch/img/engineering-blog-staggering-restarts-in-ceph-osd-5s-read-small-52117353d921.png" alt="5s staggered OSD restarts on small cluster - reads." caption="5s staggered OSD restarts on small cluster - reads."/></td><td><img src="https://static.cloudscale.ch/img/engineering-blog-staggering-restarts-in-ceph-osd-5s-write-small-d298014bb294.png" alt="5s staggered OSD restarts on small cluster - writes." caption="5s staggered OSD restarts on small cluster - writes."/></td></tr></tbody></table>
<p>Note that staggering stops has no positive impact. Ceph uses <code>kill -9</code> on its OSDs when stopping them and it is designed to deal well with suddenly disappearing OSDs.</p>
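<p>A manual staggered start like the one benchmarked above can be sketched in a few lines of Python. This is an illustration under assumptions, not the procedure we actually used: it assumes that OSDs are started through their <code>ceph-osd@.service</code> systemd units on the local host, and the helper name <code>start_staggered</code> and its parameters are made up.</p>

```python
import subprocess
import time
from typing import Callable, Iterable, List


def start_staggered(
    osd_ids: Iterable[int],
    delay: float = 5.0,
    run: Callable[..., object] = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[str]]:
    """Start the given OSD units one after another, `delay` seconds apart."""
    commands = []
    for index, osd_id in enumerate(osd_ids):
        if index > 0:
            # Pause between starts, but not before the first one
            sleep(delay)

        # Assumption: each OSD runs as a ceph-osd@<id>.service unit
        command = ["systemctl", "start", f"ceph-osd@{osd_id}.service"]
        run(command, check=True)
        commands.append(command)

    return commands
```

<p>Calling <code>start_staggered([12, 13, 14])</code> on an OSD host (the IDs are placeholders) would start the three units five seconds apart. The <code>run</code> and <code>sleep</code> parameters exist only so that the function can be tried out without touching systemd.</p>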
<h3>Automating Staggered Restarts</h3>
<p>While we are able to manually start OSDs in a staggered fashion, we have situations where we cannot do that. We want staggered OSD starts when a host unexpectedly reboots, when it gets reinstalled, when we do a package upgrade and so on. Ideally, we don&#x27;t want to have to think about it.</p>
<p>To achieve this, we wrote a Python script that uses cluster-wide locking to ensure that OSDs are started slightly apart:</p>
<details><summary><code>stagger-osds.py</code></summary><pre><code class="language-python">#!/usr/bin/env python3
&quot;&quot;&quot;

About
-----

A set of commands to force OSDs to start in a staggered fashion, avoiding
excessive peering load on the cluster.

Implementation
--------------

Uses an exclusive lock on an object in RADOS:

1. The script starts two processes: a main and a detached lock process.
2. The main process waits for the lock process to acquire a lock.
3. The main process calls `exec` on the additional commandline arguments.
4. The lock process runs until the lock expires.

Signals
-------

Signals are handled separately by the lock and the exec process.

It is possible to use a signal to kill the exec process before it runs the
command. In this case the command will not be executed. Once the command is
executing, signal handling is up to that command.

The lock process can also be killed, which causes it to release the lock
early.

Short-Lived Commands
--------------------

This command is meant for OSDs that may be up for weeks or months. If used
on a short-lived command (or an OSD that exits immediately), the lock is
not held for the whole duration.

The lock process detects if the parent pid has exited. In this case, the
lock is released early and the lock process exits.

This behavior relies on the fact that the exec process&#x27;s pid is not used by
another process. This is generally no problem with short lock durations,
and a normal amount of processes. Should the pid be reused for some reason,
the effect will be a lock that is held for too long.

To avoid that race condition, we could use os.pidfd_open, but that is only
available from Python 3.9 onwards.

Requirements
------------

To run, this command expects the following packages to be present:

- librados for python: https://docs.ceph.com/en/reef/rados/api/librados-intro
- click
- pydantic

The current Python target is Python 3.8.

&quot;&quot;&quot;
from __future__ import annotations

import click
import code
import json
import logging
import mmap
import os
import random
import secrets
import shlex
import signal
import socket
import sys
import time
import yaml

from configparser import ConfigParser
from contextlib import contextmanager
from datetime import datetime
from datetime import timedelta
from pydantic import BaseModel
from pydantic import Field
from pydantic.types import FilePath
from rados import Ioctx
from rados import ObjectBusy
from rados import ObjectNotFound
from rados import Rados
from threading import Event
from types import FrameType
from typing import cast
from typing import Iterator
from typing import List
from typing import Optional

# The default config path
DEFAULT_CONFIG = &#x27;/etc/stagger-osds.yml&#x27;

# The ID of the client
CLIENT = &#x27;stagger-osds&#x27;

config_option = click.option(
    &#x27;--config&#x27;,
    default=DEFAULT_CONFIG,
    type=click.Path(exists=True, dir_okay=False),
)

log = logging.getLogger(CLIENT)


@click.group()
def cli() -&gt; None:
    pass


class Config(BaseModel):

    # The path to the global ceph config
    ceph_config: FilePath

    # Keyring to authenticate with
    keyring: FilePath

    # Name of the pool to use
    pool: str = Field(
        ...,
        min_length=3,

        # Mypy 1.7+ fails here, the correct solution would be to use
        # Annotated[str, Field(min_length=3, strip_whitespace=True)], but that
        # is not supported on the proxmox host.
        # type: ignore[call-arg]
        strip_whitespace=True,
    )

    # Key of the object in the selected pool
    key: str = Field(
        ...,
        min_length=3,

        # Mypy 1.7+ fails here, the correct solution would be to use
        # Annotated[str, Field(min_length=3, strip_whitespace=True)], but that
        # is not supported on the proxmox host.
        # type: ignore[call-arg]
        strip_whitespace=True,
    )

    # How long to maximally keep the lock active (in seconds)
    duration: int = Field(..., gt=0, lt=61)

    # How long to try and acquire a lock (in seconds)
    timeout: int = Field(..., gt=0, lt=3601)

    @classmethod
    def from_yaml(cls, path: str) -&gt; Config:
        with open(path, &#x27;r&#x27;) as f:
            return cls.parse_obj(yaml.safe_load(f))

    def configure_logging(self, process: str, verbose: bool) -&gt; None:
        level = verbose and logging.DEBUG or logging.WARNING
        fmt = f&quot;{process} [%(levelname)s] %(message)s&quot;

        logging.basicConfig(format=fmt, level=level)

    @contextmanager
    def cluster_connection(self) -&gt; Iterator[Rados]:
        &quot;&quot;&quot; Yields a connection to the cluster, by using the global ceph
        config for mon discovery, the keyring for client authentication.

        &quot;&quot;&quot;
        conf = ConfigParser()
        conf.read(self.ceph_config)
        conf = dict(conf[&quot;global&quot;].items())

        client = f&quot;client.{CLIENT}&quot;
        conffile = str(self.keyring)

        with Rados(name=client, conf=conf, conffile=conffile) as cluster:
            yield cluster

    @contextmanager
    def ioctx(self) -&gt; Iterator[Ioctx]:
        &quot;&quot;&quot; Yields an ioctx. It is imperative that this is done on a new
        cluster connection each time, as there&#x27;s some internal state in regards
        to application metadata, causing values to be cached aggressively.

        &quot;&quot;&quot;
        with self.cluster_connection() as cluster:
            with cluster.open_ioctx(self.pool) as ioctx:
                yield ioctx

    def metadata_get(self, ioctx: Ioctx, key: str, default: str) -&gt; str:
        &quot;&quot;&quot; Gets the given metadata key, or a default value. &quot;&quot;&quot;

        try:
            return cast(str, ioctx.application_metadata_get(CLIENT, key))
        except KeyError:
            return default

    def metadata_set(self, ioctx: Ioctx, key: str, val: str) -&gt; None:
        &quot;&quot;&quot; Sets the given metadata key. Strings longer than 100 characters
        are truncated, as they otherwise cause errors.

        &quot;&quot;&quot;
        ioctx.application_metadata_set(CLIENT, key, val[:100])

    def metadata_remove(self, ioctx: Ioctx, key: str) -&gt; None:
        &quot;&quot;&quot; Deletes the given metadata key (idempotent). &quot;&quot;&quot;

        ioctx.application_metadata_remove(CLIENT, key)

    def disable(self, ioctx: Ioctx) -&gt; None:
        &quot;&quot;&quot; Disables lock acquisition with immediate effect. The current lock
        is released, and all OSDs waiting for a lock are started. As long
        as osd-staggering is disabled, locking is skipped completely.

        &quot;&quot;&quot;
        self.metadata_set(ioctx, &#x27;disabled&#x27;, &#x27;1&#x27;)

    def enable(self, ioctx: Ioctx) -&gt; None:
        &quot;&quot;&quot; Removes the &#x27;disabled&#x27; state. &quot;&quot;&quot;

        self.metadata_remove(ioctx, &#x27;disabled&#x27;)

    @property
    def is_disabled(self) -&gt; bool:
        &quot;&quot;&quot; True if locking is disabled. Uses its own connection to the cluster
        as the cache might otherwise return outdated values.

        &quot;&quot;&quot;

        with self.ioctx() as ioctx:
            return self.metadata_get(ioctx, &#x27;disabled&#x27;, &#x27;0&#x27;) == &#x27;1&#x27;


class LockHolder(BaseModel):
    &quot;&quot;&quot; Metadata about the current lock holder. &quot;&quot;&quot;

    # The hostname of the lock holder
    host: str

    # The command being executed as an argument list as expected by exec*
    command: List[str]

    # The PID of the exec process
    ppid: int

    # The time in UTC when this lock expires
    expires: datetime

    @property
    def is_expired(self) -&gt; bool:
        return datetime.utcnow() &gt; self.expires


class SharedBool:
    &quot;&quot;&quot; Anonymous memory mapped bool, that can be used by multiple processes
    at the same time. Currently used for a single-writer model - not sure if
    there are race-conditions in a multi-writer scenario.

    &quot;&quot;&quot;

    def __init__(self, default: bool = False):
        self.mmap = mmap.mmap(-1, 1)
        self.store(default)

    def __bool__(self) -&gt; bool:
        self.mmap.seek(0)
        return self.mmap.read(1) == b&#x27;1&#x27;

    def __repr__(self) -&gt; str:
        return f&quot;SharedBool({self and &#x27;True&#x27; or &#x27;False&#x27;})&quot;

    def store(self, value: bool) -&gt; None:
        self.mmap.seek(0)
        self.mmap.write(value and b&#x27;1&#x27; or b&#x27;0&#x27;)


def signal_handler() -&gt; Event:
    &quot;&quot;&quot; Provides signal handling for each process. Should be called after
    forking, as each process must have its own signal handling.

    &quot;&quot;&quot;
    interrupt = Event()

    def quit(signum: int, frame: Optional[FrameType]) -&gt; None:
        interrupt.set()

    for sig in (signal.SIGTERM, signal.SIGHUP, signal.SIGINT):
        signal.signal(sig, quit)

    return interrupt


def ensure_session_leader() -&gt; None:
    &quot;&quot;&quot; Ensures that the current process is promoted to session leader, if
    it is not one already.

    &quot;&quot;&quot;

    if os.getpid() != os.getpgid(0):
        os.setsid()


def is_pid_running(pid: int) -&gt; bool:
    &quot;&quot;&quot; Returns True if the given PID is of a running process. &quot;&quot;&quot;

    try:
        os.kill(pid, 0)
    except PermissionError:
        # The process exists, but we are not allowed to signal it
        return True
    except OSError:
        return False
    else:
        return True


def wait_for_lock_then_hold(start_osd: SharedBool, ppid: int, config: Config,
                            command: List[str]) -&gt; None:
    &quot;&quot;&quot; Tries to acquire a lock on the object with the given name, then holds
    it for the given duration.

    The lock is released early under the following circumstances:
        - The process with pid `ppid` no longer exists.
        - An interrupt is received.

    Meant to be run in a separate process.

    &quot;&quot;&quot;

    # Ensure we get separate signal handling
    ensure_session_leader()
    interrupt = signal_handler()

    # Lock configuration
    lock_cfg = {
        # The key of the object that is locked
        &#x27;key&#x27;: config.key,

        # The name of the lock on the object
        &#x27;name&#x27;: CLIENT,

        # Unique string belonging to the owner of the lock. This string is
        # required to unlock the lock prematurely.
        &#x27;cookie&#x27;: secrets.token_hex(8),
    }

    try:
        with config.ioctx() as ioctx:

            # Exit condition for the acquisition and hold loops
            def keep_running(until: float, warn_on_timeout: bool) -&gt; bool:
                if time.monotonic() &gt; until:
                    if warn_on_timeout:
                        log.warning(&quot;Timeout expired&quot;)
                    return False

                if interrupt.is_set():
                    log.info(&quot;Interrupt received&quot;)
                    return False

                if not is_pid_running(ppid):
                    log.info(&quot;Parent process exited&quot;)
                    return False

                if config.is_disabled:
                    log.info(&quot;Stagger disabled globally&quot;)
                    return False

                # Randomize the wait time slightly, to be extra sure that we
                # are not creating a thundering herd (not that that is
                # likely).
                return not interrupt.wait(random.uniform(0.9, 1.1))

            # Try to acquire the lock
            log.info(f&quot;Acquiring lock (timeout {config.timeout}s)&quot;)
            until = time.monotonic() + config.timeout

            while keep_running(until, warn_on_timeout=True):
                try:
                    ioctx.lock_exclusive(**lock_cfg, duration=config.duration)
                    log.info(&quot;Acquired lock, cleared for launch&quot;)
                    start_osd.store(True)
                    break
                except ObjectBusy:
                    continue
                except Exception:
                    log.exception(&quot;Unexpected error while trying to get lock&quot;)
                    break

            if not start_osd:

                # If we cannot get a lock due to a timeout or an error, we
                # signal to the OSD to start anyway.
                log.warning(&quot;Could not acquire lock, cleared for launch&quot;)
                start_osd.store(True)
                return

            # Write some information about the current lock holder
            holder = LockHolder(
                host=socket.gethostname(),
                command=command,
                ppid=ppid,
                expires=datetime.utcnow() + timedelta(seconds=config.duration)
            )

            ioctx.write_full(config.key, holder.json().encode(&#x27;utf-8&#x27;))

            # Keep the lock for the given duration
            try:
                log.info(f&quot;Holding lock for up to {config.duration}s&quot;)

                until = time.monotonic() + config.duration
                while keep_running(until, warn_on_timeout=False):
                    pass

            finally:

                # The lock vanishes on its own once the duration is up, so
                # by the time we try to unlock, it may already be gone.
                try:
                    ioctx.unlock(**lock_cfg)
                except ObjectNotFound:
                    pass

            # Release the lock
            log.info(&quot;Released lock&quot;)

    except Exception:

        # If there are *any* issues, we let the OSDs start.
        start_osd.store(True)
        log.exception(f&quot;Could not connect to cluster using {config.keyring}&quot;)
        sys.exit(1)


def wait_for_lock(start_osd: SharedBool, timeout: int) -&gt; bool:
    &quot;&quot;&quot; Waits for the `start_osd` variable to change to `True`, returning
    `True` if the OSD should be started, `False` otherwise.
    &quot;&quot;&quot;

    # Ensure we get separate signal handling
    ensure_session_leader()
    interrupt = signal_handler()

    # Wait to acquire the lock
    log.info(&quot;Awaiting lock&quot;)

    # The same timeout that eventually abandons the lock, can be used here,
    # so we don&#x27;t get stuck if the lock process dies.
    until = time.monotonic() + timeout
    while not (start_osd or interrupt.is_set()) and time.monotonic() &lt; until:
        interrupt.wait(0.5)

    # If the interrupt has been received, stop:
    if interrupt.is_set():
        log.info(&quot;Interrupt received: Aborting&quot;)
        return False

    return True


@cli.command(context_settings={
    &quot;allow_extra_args&quot;: True,
    &quot;ignore_unknown_options&quot;: True
})
@click.pass_context
@config_option
@click.option(&#x27;--verbose&#x27;, is_flag=True, default=False)
def start(ctx: click.Context, config: str, verbose: bool) -&gt; None:
    &quot;&quot;&quot; Wraps the given command with a cluster-wide lock, forcing the command
    to start at a staggered interval (spread out), even if run on multiple
    machines.

    &quot;&quot;&quot;

    # Load config
    config = Config.from_yaml(config)

    # Load command
    command = ctx.args

    if not command:
        print(&quot;You must specify a command to run, after &#x27;--&#x27;&quot;, file=sys.stderr)
        sys.exit(1)

    # Shared lock
    start_osd = SharedBool(False)

    # The PID of the process destined to be the command
    ppid = os.getpid()

    # Fork a process that launches the lock
    if os.fork() == 0:

        # Fork the lock process
        if os.fork() == 0:
            config.configure_logging(process=&quot;lock-process&quot;, verbose=verbose)
            wait_for_lock_then_hold(
                start_osd=start_osd,
                ppid=ppid,
                command=command,
                config=config)

        # Launcher and lock process are both awaited by the init process
        sys.exit(0)

    # Wait for the process that launches the lock to exit (this is quick)
    os.wait()

    config.configure_logging(process=&quot;exec-process&quot;, verbose=verbose)
    if wait_for_lock(start_osd=start_osd, timeout=config.timeout):
        log.info(shlex.join(command))
        os.execvp(command[0], command)


@cli.command()
@config_option
def status(config: str) -&gt; None:
    &quot;&quot;&quot; Show the current status. &quot;&quot;&quot;

    config = Config.from_yaml(config)
    print(&quot;Status:&quot;, config.is_disabled and &quot;Disabled&quot; or &quot;Enabled&quot;)

    with config.ioctx() as ioctx:
        try:
            holder = ioctx.read(config.key)
        except (KeyError, ObjectNotFound):
            holder = None
        else:
            holder = holder and LockHolder.parse_obj(json.loads(holder))

    if not holder or holder.is_expired:
        print(&quot;Last Lock: Expired&quot;)
    else:
        print(f&quot;Last Lock: Acquired on {holder.host}&quot;)
        print(f&quot;PPID: {holder.ppid}&quot;)
        print(f&quot;Command: {shlex.join(holder.command)}&quot;)


@cli.command()
@config_option
def enable(config: str) -&gt; None:
    &quot;&quot;&quot; Enable OSD staggering. &quot;&quot;&quot;

    config = Config.from_yaml(config)

    with config.ioctx() as ioctx:
        config.enable(ioctx)


@cli.command()
@config_option
def disable(config: str) -&gt; None:
    &quot;&quot;&quot; Disable OSD staggering, immediately ceasing all locking, even
    for locks that are currently in progress.

    &quot;&quot;&quot;

    config = Config.from_yaml(config)

    with config.ioctx() as ioctx:
        config.disable(ioctx)


@cli.command(hidden=True)
@config_option
def shell(config: str) -&gt; None:
    &quot;&quot;&quot; Drop into a Python shell, with an active ioctx. Only use if you know
    what you are doing!

    &quot;&quot;&quot;

    config = Config.from_yaml(config)

    with config.ioctx() as ioctx:
        code.interact(local={**globals(), **locals()})


if __name__ == &#x27;__main__&#x27;:
    cli()
</code></pre></details>
<p>This gives us the <code>stagger-osds</code> CLI tool, which we can use to wrap commands that are then run with a cluster-wide lock on a chosen object in a designated pool.</p>
<p>In a sense, it works a lot like <code>flock</code>. You can give it a command that is then run with an exclusive lock on the cluster:</p>
<pre><code class="language-bash">$ stagger-osds start --verbose -- sleep 5
exec-process [INFO] Awaiting lock
lock-process [INFO] Acquiring lock (timeout 50s)
lock-process [INFO] Acquired lock, cleared for launch
lock-process [INFO] Holding lock for up to 4s
exec-process [INFO] sleep 5
lock-process [INFO] Released lock
</code></pre>
<p>While the lock is held, it shows up in the status command:</p>
<pre><code class="language-bash">$ stagger-osds status
Status: Enabled
Last Lock: Acquired on lab-nvme1-a-lpg1
PPID: 3927556
Command: sleep 5
</code></pre>
<p>With <code>stagger-osds disable</code> we can disable the lock, causing all processes currently waiting for the lock to be started instantly.</p>
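<p>The hand-off between the lock process and the exec process relies on the <code>SharedBool</code> class from the script: a single byte of anonymous memory that stays shared across <code>fork()</code>. The following stripped-down sketch shows only that mechanism. The function name <code>flag_set_by_child</code> is made up, and unlike the script, the parent here simply waits for the child instead of polling the flag.</p>

```python
import mmap
import os


def flag_set_by_child() -> bool:
    """Fork a child that sets a shared flag; return what the parent sees."""

    # One byte of anonymous memory. On Unix the mapping is shared by default,
    # so parent and child keep seeing the same byte after fork().
    flag = mmap.mmap(-1, 1)
    flag.write(b'0')

    pid = os.fork()
    if pid == 0:
        # Child: set the flag and exit immediately, skipping parent cleanup
        flag.seek(0)
        flag.write(b'1')
        os._exit(0)

    # Parent: wait for the child to finish, then read the flag
    os.waitpid(pid, 0)
    flag.seek(0)
    return flag.read(1) == b'1'


print(flag_set_by_child())
```

<p>Because the memory is shared rather than copied, the write made by the child is visible to the parent, which is what allows the lock process to clear the exec process for launch.</p>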
<h3>Systemd Integration</h3>
<p>To start all <code>ceph-osd@.service</code> units wrapped in the <code>stagger-osds</code> wrapper, we use the following drop-in:</p>
<pre><code class="language-ini">[Service]
ExecStart=
ExecStart=/usr/local/bin/stagger-osds start --verbose -- /usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph
SyslogIdentifier=ceph-osd
TimeoutSec=148
</code></pre>
<p>The timeout needs to be configured to accommodate the start of 1/3 of all OSDs on the cluster (after the timeout is over, all OSDs are started right away). Here is the kind of config one might use on a cluster with 600 OSDs:</p>
<ul>
<li>1/3 of the Cluster: 600 OSDs / 3 = 200 OSDs</li>
<li>Timeout: 200 OSDs x 4s = 800s</li>
</ul>
<pre><code class="language-yaml">ceph_config: /etc/ceph/ceph.conf
keyring: /etc/ceph/ceph.client.stagger-osds.keyring
pool: sys
key: stagger-osds
duration: 4
timeout: 800
</code></pre>
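<p>The timeout arithmetic above can be written down as a small helper, for example to derive the value per cluster in configuration management. This is a sketch: the function name <code>stagger_timeout</code> and its parameters are made up, while the upper bounds mirror the limits of the <code>Config</code> model in the script.</p>

```python
import math

# Upper bounds taken from the Config model in stagger-osds.py
MAX_DURATION = 60
MAX_TIMEOUT = 3600


def stagger_timeout(total_osds: int, duration: int, groups: int = 3) -> int:
    """Seconds needed to start one group of OSDs, one lock duration apart."""
    if not 0 < duration <= MAX_DURATION:
        raise ValueError(f"duration must be between 1 and {MAX_DURATION}")

    # OSDs restarted together: one group out of `groups` (a third by default)
    osds_per_group = math.ceil(total_osds / groups)
    timeout = osds_per_group * duration

    if not 0 < timeout <= MAX_TIMEOUT:
        raise ValueError(f"timeout {timeout}s is outside 1..{MAX_TIMEOUT}s")

    return timeout


print(stagger_timeout(total_osds=600, duration=4))  # prints 800
```

<p>With the bounds enforced by the script, a lock duration of 4 seconds covers at most 900 OSDs per group. Larger groups would need a shorter duration or would start the remaining OSDs once the timeout expires.</p>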
<p>The keyring was created as follows:</p>
<pre><code class="language-bash">ceph auth get-or-create \
    client.stagger-osds \
    mon &quot;allow rw&quot; \
    osd &quot;allow rwx pool stagger-osds&quot; \
    -o /etc/ceph/ceph.client.stagger-osds.keyring
</code></pre>
<h3>Conclusion</h3>
<p>Stagger OSDs is active on all our Ceph clusters, and since we started using it, clusters run with more predictable performance, even during major upgrades.</p>
<p>Using built-in Ceph tools, we are also able to use global locks without the likes of Zookeeper, Etcd or Redis.</p>]]></content:encoded>
        </item>
      </channel>
    </rss>