Engineering Blog
Snapshots Are Not Backups: Disaster Recovery for Kubernetes Workloads
Snapshots make Kubernetes operations safer, but they are only one part of a Disaster Recovery strategy. In this Engineering Blog post, we show how Kubernetes workloads on cloudscale infrastructure can be restored even after losing an entire cluster, and what is required to turn snapshots into real recovery capabilities.
Introduction
When we introduced CSI snapshot support in our Kubernetes CSI driver, we stressed an important distinction: snapshots help with operational recovery, but they do not replace backups.
Snapshots are ideal for rolling back changes or recovering from failed deployments. However, because they remain tied to the original storage environment, they do not protect against the loss of a full Kubernetes setup.
In this post, we outline how to build a Disaster Recovery workflow for Kubernetes workloads on cloudscale infrastructure, combining CSI snapshots, Velero orchestration, and cross-zone Object Storage to restore applications and their data after a complete environment loss.
Snapshots vs Backups
Before touching YAML or CLI commands, we need shared terminology.
| Term | Meaning |
|-------------------|--------------------------------------------------------|
| Snapshot | Point-in-time copy stored on the same storage cluster |
| Backup | Recoverable copy independent of original infrastructure|
| Disaster Recovery | Ability to rebuild workloads after infrastructure loss |
Snapshots are fast because they live close to the source volume. On cloudscale, they are implemented using copy-on-write technology inside the storage cluster.
That makes them ideal for:
- Upgrade safety nets
- Migrations
- Quick rollback scenarios
- Cloning production data into test environments
But if the storage cluster itself disappears, snapshots disappear with it. This is why we explicitly recommend keeping a copy of your data at another geographic location. The remainder of this article shows how to implement exactly that, using standard Kubernetes tooling.
What CSI Snapshots Change
With the release of CSI snapshot support, Kubernetes gains native awareness of storage recovery points. The driver exposes the standard Kubernetes VolumeSnapshot API, which means snapshots are no longer something managed exclusively through a provider interface or our Control Panel. Instead, they become first-class Kubernetes resources.
At first glance this may look like a small technical addition. Operationally, however, it changes how backup and recovery workflows can be designed. Once snapshots exist as Kubernetes objects, ecosystem tools can interact with them directly. Backup software such as Velero can request snapshots, track them as part of a backup operation, and later use them during restores, all through standard Kubernetes APIs. The result is a workflow that remains portable, automation-friendly and aligned with upstream Kubernetes concepts.
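To make this concrete, a snapshot request is just another manifest. The following sketch writes one to a file for review; the PVC name, the namespace, and the VolumeSnapshotClass name are placeholders that do not come from this article.

```shell
# Write a VolumeSnapshot manifest to a file for review.
# Assumptions (placeholders): PVC "my-data", namespace "my-app",
# VolumeSnapshotClass "cloudscale-snapshots".
cat > volumesnapshot-example.yaml <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: my-data-before-upgrade
  namespace: my-app
spec:
  volumeSnapshotClassName: cloudscale-snapshots
  source:
    persistentVolumeClaimName: my-data
EOF

# To create the snapshot and inspect it like any other resource:
#   kubectl apply -f volumesnapshot-example.yaml
#   kubectl get volumesnapshot -n my-app
```

Because the snapshot is an ordinary API object, tools such as Velero can create and track it without talking to a provider-specific interface.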
This capability also highlights a limitation that often goes unnoticed. Many Kubernetes backup strategies rely on exported manifests combined with storage snapshots. As long as the cluster and its storage remain available, recovery appears straightforward: deleted namespaces or failed deployments can usually be restored without difficulty.
The situation changes once the underlying storage is no longer accessible. Recreating Kubernetes objects is rarely the challenge; recovering the data they depend on is.
Disaster Recovery therefore requires separating three independent concerns: Kubernetes resource state, a consistent recovery source for volume data, and a durable copy stored outside the original infrastructure. CSI snapshots address only one of these aspects. They provide fast recovery points, but they remain bound to the same storage environment.
This distinction leads directly to the hybrid approach described next.
Demo
Before starting, you need:
- A Kubernetes cluster (version 1.28 or newer)
- cloudscale CSI driver installed (at least v4.0.0)
- Velero CLI installed locally (tested with v1.18.0)
- S3-compatible Object Storage
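The CLI prerequisites listed above can be checked with a small script. This is a minimal sketch; the helper names `require_cmd` and `check_prerequisites` are hypothetical and not part of any tool.

```shell
# Hypothetical helper: succeeds if the given command is available on PATH.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "missing prerequisite: $1" >&2
    return 1
  fi
}

# Hypothetical helper: checks the CLI tools used in this demo and prints
# their versions so they can be compared with the list above.
check_prerequisites() {
  require_cmd kubectl || return 1
  require_cmd velero || return 1
  kubectl version               # client and server version
  velero version --client-only  # Velero CLI version only
}
```

Running `check_prerequisites` before starting shows whether the tools are installed; the CSI driver version still has to be checked against your cluster's installation.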
With that, we build a hybrid approach:
- CSI snapshots provide consistent recovery sources.
- Velero orchestrates backups and restores.
- Object Storage stores durable copies in another region.
In this example, the Kubernetes cluster runs in LPG, while backup data is written to Object Storage in RMA, separating recovery data from the original infrastructure. The same approach also works with other providers.
The core idea is simple: snapshots provide consistency, while Object Storage provides survivability.
1. Create Object Storage Backup Location
First, create an Object User and save its credentials to a file:
cat <<EOF > credentials-velero
[default]
aws_access_key_id=<ACCESS_KEY>
aws_secret_access_key=<SECRET_KEY>
EOF
Then create a bucket in a different region than the cluster. In this example we name it velero-backups.
As our Object Storage exposes an S3-compatible API, Velero's AWS plugin works without modifications.
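Because the API is S3-compatible, the bucket can also be created from the command line. The sketch below assumes the AWS CLI is installed, which this article does not require; the function name `create_backup_bucket` is hypothetical, and depending on your AWS CLI configuration you may additionally need to set a region.

```shell
# Hypothetical helper: create the backup bucket through the S3-compatible API.
# Assumes the AWS CLI is installed; it reads the same credentials file that
# is later passed to Velero.
create_backup_bucket() {
  bucket="${1:-velero-backups}"
  endpoint="${2:-https://objects.rma.cloudscale.ch}"
  AWS_SHARED_CREDENTIALS_FILE=./credentials-velero \
    aws s3 mb "s3://${bucket}" --endpoint-url "${endpoint}"
}
```

Calling `create_backup_bucket` without arguments uses the bucket name and endpoint from this example; both can be overridden as the first and second argument.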
2. Install Velero with CSI Support
The important part is enabling both CSI snapshots and the Data Mover.
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --use-node-agent \
  --backup-location-config \
    region=RMA,s3ForcePathStyle=true,s3Url=https://objects.rma.cloudscale.ch \
  --snapshot-location-config region=LPG \
  --features=EnableCSI,EnableCSIDataMover
What this configuration establishes:
- Snapshots are created in LPG (the region where the cluster is running)
- Backup data is stored in RMA (this must be the region where the bucket was created)
This way, restores do not depend on the original storage cluster. This separation is the foundation of Disaster Recovery.
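For CSI snapshots, Velero selects a VolumeSnapshotClass by looking for the label `velero.io/csi-volumesnapshot-class: "true"`. If your installation does not already provide a labelled class, a manifest along the following lines may be needed. This is a sketch under assumptions: the class name is hypothetical, and the driver name `csi.cloudscale.ch` should be confirmed against your cluster.

```shell
# Write a VolumeSnapshotClass manifest that Velero can pick up.
# Assumptions: the class name is a placeholder, and the driver name should
# be verified with: kubectl get csidriver
cat > velero-snapshotclass.yaml <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: cloudscale-velero-snapshots
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: csi.cloudscale.ch
deletionPolicy: Delete
EOF

# To apply it and confirm the backup location is reachable:
#   kubectl apply -f velero-snapshotclass.yaml
#   velero backup-location get
```

The `deletionPolicy` shown here is one possible choice; pick the value that matches how long snapshots should outlive their Kubernetes objects in your environment.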
3. Create a Demo Workload
We deploy a minimal namespace containing a PVC and a pod writing data to a file.
kubectl create ns backup-demo
Create a PersistentVolumeClaim:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc
  namespace: backup-demo
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: cloudscale-volume-ssd
EOF
Writer pod:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: writer
  namespace: backup-demo
spec:
  restartPolicy: Never
  containers:
    - name: writer
      image: busybox
      command:
        - sh
        - -c
        - |
          while true; do
            echo "cloudscale recovery demo \$(date '+%Y-%m-%d %H:%M:%S')" >> /data/value.txt
            sleep 5
          done
      volumeMounts:
        - mountPath: /data
          name: vol
  volumes:
    - name: vol
      persistentVolumeClaim:
        claimName: demo-pvc
EOF
Verify data exists:
kubectl exec -n backup-demo writer -- cat /data/value.txt
4. Create the Backup
Create the Disaster Recovery backup:
velero backup create dr-demo \
  --include-namespaces backup-demo \
  --snapshot-move-data \
  --wait
Velero now performs a coordinated backup workflow.
First, Velero stores Kubernetes resource metadata in Object Storage.
Next, Velero requests CSI snapshots for all PersistentVolumeClaims in the namespace. The cloudscale CSI driver creates the snapshots without interrupting running workloads.
Velero then prepares the data transfer by creating a temporary volume from each snapshot. The Velero Node Agent mounts this volume and reads its contents using the CSI Data Mover.
The snapshot data is copied into Object Storage in RMA, while the production PVC remains untouched throughout the process. Finally, the temporary volumes are removed.
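The outcome of this workflow can be checked before relying on the backup. The following sketch assumes Velero runs in its default `velero` namespace; the helper names `backup_completed` and `list_data_uploads` are hypothetical.

```shell
# Hypothetical helper: print the phase of a Velero backup and succeed only
# if it is "Completed". Assumes Velero is installed in the "velero" namespace.
backup_completed() {
  phase="$(kubectl get backups.velero.io "$1" -n velero -o jsonpath='{.status.phase}')"
  echo "backup $1: ${phase}"
  [ "${phase}" = "Completed" ]
}

# Hypothetical helper: list the DataUpload objects that track the transfer
# of snapshot data into Object Storage.
list_data_uploads() {
  kubectl get datauploads.velero.io -n velero
}
```

For example, `backup_completed dr-demo && list_data_uploads` only lists the transfers if the backup finished successfully. More detail is available with `velero backup describe dr-demo --details`.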
5. Simulate Catastrophic Failure
A backup that is never restored is only a theory, so let's simulate loss of the environment:
kubectl delete ns backup-demo
Everything disappears: Pods, PVCs, and all other Kubernetes objects in the namespace are removed, leaving only the off-site backup. This can be verified in the Control Panel, where the volumes that backed the PVCs are no longer present.
6. Restore the Namespace
We can now load the backup to recreate the workload.
velero restore create dr-demo-restore \
  --from-backup dr-demo \
  --wait
Velero recreates the namespace, creates a PVC, restores the volume content and starts a Pod. This can take a moment. After the process is done, you can verify data:
kubectl exec -n backup-demo writer -- tail /data/value.txt
If the file still contains the original entries and new entries are being appended, the Disaster Recovery test succeeded.
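This comparison can also be scripted instead of done by eye. The sketch below records the first line of the file before the namespace is deleted in step 5 and compares it after the restore; the function names and the file `reference.txt` are hypothetical.

```shell
# Hypothetical helper: print the first line the writer pod stored on the volume.
first_line() {
  kubectl exec -n backup-demo writer -- head -n 1 /data/value.txt
}

# Run before deleting the namespace (step 5) to record a reference value.
record_reference() {
  first_line > reference.txt
}

# Run after the restore: succeeds only if a non-empty reference exists and
# the restored volume starts with the same line, which indicates that the
# data came from the backup rather than from a fresh, empty volume.
verify_restore() {
  [ -s reference.txt ] && [ "$(first_line)" = "$(cat reference.txt)" ]
}
```

A typical sequence would be `record_reference`, then steps 5 and 6, then `verify_restore && echo "restore verified"`.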
Conclusion
CSI snapshot support brings cloudscale storage into Kubernetes-native recovery workflows. Using standard APIs and tools like Velero, snapshots can evolve from simple rollback mechanisms into real recovery strategies.
The example shown here is intentionally minimal: back up a namespace, remove it, and restore it from scratch. The goal is not complexity, but proof that recovery works independently of the original environment.
The key takeaway is straightforward: Disaster Recovery begins where recovery no longer depends on the infrastructure that failed.
If you have comments or corrections to share, you can reach our engineers at engineering-blog@cloudscale.ch.