Engineering Blog

Published on 31 March 2026 by

Snapshots Are Not Backups: Disaster Recovery for Kubernetes Workloads


Snapshots make Kubernetes operations safer, but they are only one part of a Disaster Recovery strategy. In this Engineering Blog post, we show how Kubernetes workloads on cloudscale infrastructure can be restored even after losing an entire cluster, and what is required to turn snapshots into real recovery capabilities.

Introduction

When we introduced CSI snapshot support in our Kubernetes CSI driver, we stressed an important distinction: snapshots help with operational recovery, but they do not replace backups.

Snapshots are ideal for rolling back changes or recovering from failed deployments. However, because they remain tied to the original storage environment, they do not protect against the loss of a full Kubernetes setup.

In this post, we outline how to build a Disaster Recovery workflow for Kubernetes workloads on cloudscale infrastructure, combining CSI snapshots, Velero orchestration, and cross-zone Object Storage to restore applications and their data after a complete environment loss.

Snapshots vs Backups

Before touching YAML or CLI commands, we need a shared terminology.

| Term              | Meaning                                                |
|-------------------|--------------------------------------------------------|
| Snapshot          | Point-in-time copy stored on the same storage cluster  |
| Backup            | Recoverable copy independent of original infrastructure|
| Disaster Recovery | Ability to rebuild workloads after infrastructure loss |

Snapshots are fast because they live close to the source volume. On cloudscale, they are implemented using copy-on-write technology inside the storage cluster.

That makes them ideal for:

  • Upgrade safety nets
  • Migrations
  • Quick rollback scenarios
  • Cloning production data into test environments

But if the storage cluster itself disappears, snapshots disappear with it. This is why we explicitly recommend keeping a copy of your data at another geographic location. The remainder of this article shows how to implement exactly that, using standard Kubernetes tooling.

What CSI Snapshots Change

With the release of CSI snapshot support, Kubernetes gains native awareness of storage recovery points. The driver exposes the standard Kubernetes VolumeSnapshot API, which means snapshots are no longer something managed exclusively through a provider interface or our Control Panel. Instead, they become first-class Kubernetes resources.

At first glance this may look like a small technical addition. Operationally, however, it changes how backup and recovery workflows can be designed. Once snapshots exist as Kubernetes objects, ecosystem tools can interact with them directly. Backup software such as Velero can request snapshots, track them as part of a backup operation, and later use them during restores, all through standard Kubernetes APIs. The result is a workflow that remains portable, automation-friendly and aligned with upstream Kubernetes concepts.
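To illustrate what "first-class Kubernetes resources" means in practice, here is a minimal sketch of requesting a snapshot through the standard API. The class name `cloudscale-snapshot` and the PVC name `demo-pvc` are illustrative and must match resources that exist in your cluster:

```shell
# Request a CSI snapshot of an existing PVC via the standard Kubernetes API.
# The volumeSnapshotClassName ("cloudscale-snapshot") and PVC ("demo-pvc")
# are assumptions for this sketch.
cat <<EOF | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: demo-snapshot
  namespace: backup-demo
spec:
  volumeSnapshotClassName: cloudscale-snapshot
  source:
    persistentVolumeClaimName: demo-pvc
EOF

# Wait until the CSI driver reports the snapshot as ready to use.
kubectl wait --for=jsonpath='{.status.readyToUse}'=true \
  volumesnapshot/demo-snapshot -n backup-demo --timeout=120s
```

Because the snapshot is an ordinary Kubernetes object, tools such as Velero can create, track and consume it without any provider-specific integration.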

This capability also highlights a limitation that often goes unnoticed. Many Kubernetes backup strategies rely on exported manifests combined with storage snapshots. As long as the cluster and its storage remain available, recovery appears straightforward: deleted namespaces or failed deployments can usually be restored without difficulty.

The situation changes once the underlying storage is no longer accessible. Recreating Kubernetes objects is rarely the challenge; recovering the data they depend on is.

Disaster Recovery therefore requires separating three independent concerns: Kubernetes resource state, a consistent recovery source for volume data, and a durable copy stored outside the original infrastructure. CSI snapshots address only one of these aspects. They provide fast recovery points, but they remain bound to the same storage environment.

This distinction leads directly to the hybrid approach described next.

Demo

Before starting, you need:

  • A Kubernetes cluster (version 1.28 or newer)
  • cloudscale CSI driver installed (at least v4.0.0)
  • Velero CLI installed locally (tested with v1.18.0)
  • S3-compatible Object Storage

With that, we build a hybrid approach:

  • CSI snapshots provide consistent recovery sources.
  • Velero orchestrates backups and restores.
  • Object Storage stores durable copies in another region.

In this example, the Kubernetes cluster runs in LPG, while backup data is written to Object Storage in RMA, separating recovery data from the original infrastructure. The same approach also works with other providers.

The core idea is simple: snapshots provide consistency, while Object Storage provides survivability.

1. Create Object Storage Backup Location

First, create an Object User and save its credentials to a file:

cat <<EOF > credentials-velero
[default]
aws_access_key_id=<ACCESS_KEY>
aws_secret_access_key=<SECRET_KEY>
EOF

Then create a bucket in a different region than the cluster. In this example we name it velero-backups. As our Object Storage exposes an S3-compatible API, Velero's AWS plugin works without modifications.
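Creating the bucket can be done from the Control Panel or with any S3 client. A sketch using the AWS CLI against the endpoint from the Velero configuration below (the bucket name `velero-backups` is the one used throughout this example):

```shell
# Create the backup bucket in RMA using the Object User credentials
# saved above. Any S3-compatible client (s3cmd, mc, ...) works as well.
AWS_SHARED_CREDENTIALS_FILE=./credentials-velero \
aws s3 mb s3://velero-backups \
  --endpoint-url https://objects.rma.cloudscale.ch
```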

2. Install Velero with CSI Support

The important part is enabling both CSI snapshots and the Data Mover.

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --use-node-agent \
  --backup-location-config \
  region=RMA,s3ForcePathStyle=true,s3Url=https://objects.rma.cloudscale.ch \
  --snapshot-location-config region=LPG \
  --features=EnableCSI,EnableCSIDataMover

What this configuration establishes:

  • Snapshots are created in LPG (the region where the cluster is running)
  • Backup data is stored in RMA (the region where the bucket was created)

This way, restores do not depend on the original storage cluster. This separation is the foundation of Disaster Recovery.
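Velero's CSI integration selects the VolumeSnapshotClass to use by label. A minimal sketch of such a class, assuming the cloudscale CSI driver registers itself as `csi.cloudscale.ch` (verify the actual name with `kubectl get csidrivers`):

```shell
# Velero picks the VolumeSnapshotClass that carries this label.
# The driver name ("csi.cloudscale.ch") is an assumption; check with:
#   kubectl get csidrivers
cat <<EOF | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: cloudscale-snapshot
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: csi.cloudscale.ch
deletionPolicy: Delete
EOF

# Confirm the backup location is reachable before creating backups.
velero backup-location get
```

The backup location should report `Available`; if it does not, recheck the credentials file and the S3 endpoint URL before proceeding.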

3. Create a Demo Workload

We deploy a minimal namespace containing a PVC and a pod writing data to a file.

kubectl create ns backup-demo

Create a PersistentVolumeClaim:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc
  namespace: backup-demo
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: cloudscale-volume-ssd
EOF

Writer pod:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: writer
  namespace: backup-demo
spec:
  restartPolicy: Never
  containers:
    - name: writer
      image: busybox
      command:
        - sh
        - -c
        - |
          while true; do
            echo "cloudscale recovery demo \$(date '+%Y-%m-%d %H:%M:%S')" >> /data/value.txt
            sleep 5
          done
      volumeMounts:
        - mountPath: /data
          name: vol
  volumes:
    - name: vol
      persistentVolumeClaim:
        claimName: demo-pvc
EOF

Verify data exists:

kubectl exec -n backup-demo writer -- cat /data/value.txt

4. Create the Backup

Create the Disaster Recovery backup:

velero backup create dr-demo \
  --include-namespaces backup-demo \
  --snapshot-move-data \
  --wait

Velero now performs a coordinated backup workflow.

First, Velero stores Kubernetes resource metadata in Object Storage.

Next, Velero requests CSI snapshots for all PersistentVolumeClaims in the namespace. The cloudscale CSI driver creates the snapshots without interrupting running workloads.

Velero then prepares the data transfer by creating a temporary volume from each snapshot. The Velero Node Agent mounts this volume and reads its contents using the CSI Data Mover.

The snapshot data is copied into Object Storage in RMA, while the production PVC remains untouched throughout the process. The temporary volumes are removed again.
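The individual steps of this workflow can be observed with standard Velero tooling, and the uploaded objects can be checked directly in the bucket:

```shell
# Inspect the finished backup; --details lists the resources included
# and the data-mover uploads per PVC.
velero backup describe dr-demo --details

# Optionally verify the objects written to the off-site bucket.
AWS_SHARED_CREDENTIALS_FILE=./credentials-velero \
aws s3 ls s3://velero-backups/backups/dr-demo/ \
  --endpoint-url https://objects.rma.cloudscale.ch
```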

5. Simulate Catastrophic Failure

A backup that is never restored is only a theory, so let's simulate loss of the environment:

kubectl delete ns backup-demo

Everything disappears: Pods, PVCs, and Kubernetes objects are removed, leaving only the off-site backup. This can be verified in the Control Panel, where the PVCs are no longer present.
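The loss can also be confirmed on the command line: the namespace is gone, while the backup itself remains listed because it lives in Object Storage, not in the cluster:

```shell
# The namespace and its PVCs no longer exist ...
kubectl get ns backup-demo

# ... but the off-site backup is still available.
velero backup get
```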

6. Restore the Namespace

We can now load the backup to recreate the workload.

velero restore create dr-demo-restore \
  --from-backup dr-demo \
  --wait

Velero recreates the namespace, creates a PVC, restores the volume content and starts a Pod. This can take a moment. After the process is done, you can verify data:

kubectl exec -n backup-demo writer -- tail /data/value.txt

If the file contains the entries written before the backup and the pod keeps appending new ones, the Disaster Recovery test succeeded.

Conclusion

CSI snapshot support brings cloudscale storage into Kubernetes-native recovery workflows. Using standard APIs and tools like Velero, snapshots can evolve from simple rollback mechanisms into real recovery strategies.

The example shown here is intentionally minimal: back up a namespace, remove it, and restore it from scratch. The goal is not complexity, but proof that recovery works independently of the original environment.
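In practice, the one-off backup from the demo would typically run on a schedule. A sketch using Velero's built-in scheduling (the schedule name, cron expression and retention are illustrative):

```shell
# Recurring variant of the demo backup: daily at 02:00, retained for 30 days.
velero schedule create dr-daily \
  --schedule "0 2 * * *" \
  --include-namespaces backup-demo \
  --snapshot-move-data \
  --ttl 720h
```

Each run produces an independent backup in Object Storage, so a restore can be taken from any retained recovery point.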

The key takeaway is straightforward: Disaster Recovery begins where recovery no longer depends on the infrastructure that failed.


If you would like to share comments or corrections with us, you can reach our engineers at engineering-blog@cloudscale.ch.
