Incident Report: Storage Cluster
Over the past three days, we faced problems with our storage cluster. Because the underlying file system reported inaccurate usage data, we ran out of storage capacity without noticing. We replaced Btrfs with XFS and added more SSDs to restore normal operations. The cluster is up and running again, and I/O performance is back at full speed. We are very sorry for any trouble this may have caused.
About our storage setup
At cloudscale.ch, we use a distributed storage cluster powered by Ceph. On top of our SSD-only setup, Ceph further increases I/O performance by accessing multiple storage nodes in parallel, while providing virtually unlimited scalability and data replication for fault tolerance.
On the actual SSD drives (object storage devices, or OSDs, in Ceph speak) we used Btrfs as the file system. While Btrfs is relatively new, it is widely regarded as the future standard and is supported by all major Linux distributions; SUSE Linux Enterprise Server, for example, uses Btrfs as its default file system.
How it all started
On the evening of Sunday, Aug 2, 2015, our server monitoring indicated that all systems were operating as usual. However, users started reporting I/O performance problems, which we were able to reproduce quickly.
Our first investigation revealed increased CPU load on one of the storage nodes. Because the node was largely unresponsive at that point, we had to hard-reset it. From there on, the Ceph cluster was expected to recover automatically, implying only a small performance decrease on your virtual servers.
Narrowing the problem down
On Monday morning, Ceph recovery still had not completed but was stuck close to 100%. Process behavior on the storage nodes indicated that this might be a Btrfs-related problem, so after consulting with external Ceph specialists we decided to upgrade the storage nodes to Linux kernel 3.19. Since Ceph's data replication was already degraded, we had to cut off cluster I/O entirely for a couple of minutes while rebooting the storage nodes one by one, in order to prevent data loss.
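Ceph provides cluster-wide maintenance flags for exactly this kind of rolling reboot. A minimal sketch of the standard procedure with a current Ceph CLI (the exact commands we ran are not reproduced here, and the host name is hypothetical):

```shell
# Keep Ceph from marking rebooting OSDs "out" and rebalancing data
ceph osd set noout

# Halt all client I/O entirely while replication is degraded
ceph osd set pause

# Reboot each storage node in turn, waiting for its OSDs to rejoin
ssh storage-node-1 reboot   # hypothetical host name
# ...wait until "ceph -s" reports the node's OSDs as up again...

# Re-enable client I/O and normal rebalancing behavior
ceph osd unset pause
ceph osd unset noout
```

The "pause" flag is what makes the I/O cutoff explicit: with replication degraded, blocking writes for a few minutes is safer than risking inconsistent data during the reboots.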
After the kernel upgrade, CPU load immediately returned to normal and Ceph recovery continued. We decided to perform a clean rebuild of the storage nodes (again, one by one) as soon as data replication had completed.
Pinpointing the root cause and final recovery
By Tuesday morning, Ceph recovery was stuck again. Digging deeper, it turned out that disk usage had been reported incorrectly by Btrfs all along: "df -h", for instance, showed a usage of only 55%. In reality there was virtually no free space left, causing most operations to freeze. As Btrfs again appeared to be the source of the problem, we decided to cut off cluster I/O once more and switch to XFS, temporarily moving all data out to spare spinning disks and then transferring it back to the freshly XFS-formatted SSDs. We also added a couple of spare SSDs to the storage cluster to create some breathing room. Additional SSDs are already on their way and expected to arrive later this week.
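Because of copy-on-write and separate data/metadata chunk allocation, the numbers classic tools report on Btrfs can diverge badly from reality; only Btrfs's own accounting is reliable. A minimal sketch of the comparison, assuming a Btrfs mount point such as /var/lib/ceph (hypothetical path, overridable via MOUNT):

```shell
MOUNT=${MOUNT:-.}   # point this at the Btrfs mount, e.g. /var/lib/ceph

# What the kernel reports for the mount (can be misleading on Btrfs)
df -h "$MOUNT"

# Total size of the files as applications see them
du -sh "$MOUNT"

# Btrfs's own view: data vs. metadata chunks, total vs. used
command -v btrfs >/dev/null && btrfs filesystem df "$MOUNT" || true
```

When "btrfs filesystem df" shows data or metadata chunks fully allocated while "df" still reports free space, the file system is effectively full, which matches the freezes we observed.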
After these changes, Ceph cluster recovery finally completed. Data replication has been fully restored and the virtual servers are operating at the performance you would expect from us. Most importantly, we achieved our primary goal: not losing any of your data.
We are sorry!
We have been in beta for quite some time and, in retrospect, this incident has proved that decision right. Nevertheless, we remain committed to providing a service you can use productively, and being virtually down for almost three days is by no means how we want to earn your trust. We are very sorry for the trouble, big or small, this incident has caused. Rest assured that we will review all actions taken and make the necessary improvements to live up to our standards.
Last but not least, we want to thank the team at Hastexo for supporting our engineers throughout this incident, and on such short notice.
your cloudscale.ch team