Our Monitoring and Alerting Journey
The cloudscale.ch infrastructure not only forms the basis for services we offer, but also provides the backbone for everything that our customers build on it. This is why our monitoring continuously checks that all our components are "up" and interacting as they should and raises an alert if an intervention is required. Over time, we have increasingly fine-tuned and optimized our monitoring, which means that even problems within the monitoring setup do not remain undetected and that, at the same time, unnecessary alerts are reduced to a minimum.
Our tried-and-tested monitoring basis
Thanks to redundancy at all levels, our customers are unaffected by most isolated problems. Irrespective of whether a cable, a hard disk or a load balancer fails, the overall system continues without interruption. It goes without saying that we need to follow up on such cases and, for example, reinstate redundancy in order to ensure that our cloud remains as reliable as it is. Beyond detecting and reporting defective components, our monitoring also covers performance data and the correct functioning of complete end-to-end processes, which enables us to identify any required action in good time.
From the very beginning, we have relied on Zabbix as the linchpin of our monitoring and alerting at cloudscale.ch. Thanks to its versatility and adaptability, we have been able to use this tool to cover most of our monitoring requirements, which have increased in number and complexity over time. To complement our internal Zabbix, we also added external monitoring early on. On the one hand, this has allowed us to cover additional use cases and to better reflect the user perspective. On the other hand, it covers cases where our own monitoring is itself affected by a problem or where the generated alerts cannot be sent out for some reason.
Wide range of optimizations
In addition to our internal Zabbix, two external monitoring services now check a wide variety of aspects of our cloud that are "visible" from the outside, from object storage to API calls. All of this converges on the Opsgenie platform, where we store our on-call schedules and which passes alerts on to the right person. If the responsible engineer is ever unable to respond to an alert immediately, it is automatically escalated to further designated people. Naturally, complexity increases with the number of services involved, which is why we use regular automated dummy alerts to test whether alert processing is working correctly and whether the setup is operational all the way to the mobile phone of the designated on-call engineer.
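The idea behind the dummy alerts can be sketched in a few lines of Python. This is only an illustration, not our actual setup: the function name, the interval and the grace period are assumptions chosen for the example. The check simply asks whether the most recent dummy alert arrived within the expected window; if not, the alerting pipeline itself needs attention.

```python
from datetime import datetime, timedelta, timezone


def dummy_alert_overdue(last_received, now, interval, grace):
    """Return True if the last dummy alert is older than interval + grace.

    All parameters are hypothetical: `interval` is how often dummy alerts
    are sent, `grace` is the tolerated delivery delay.
    """
    return now - last_received > interval + grace


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    last = now - timedelta(minutes=70)
    # With an hourly dummy alert and 5 minutes of grace, 70 minutes is too old.
    print(dummy_alert_overdue(last, now, timedelta(hours=1), timedelta(minutes=5)))
```

A check like this is best run from outside the pipeline it verifies, so that a failure of the pipeline cannot also silence the check.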
Escalating an identified problem correctly is only one part of the process. We also put a great deal of effort into optimizing the underlying data from which anomalies are derived. Starting with an already broad set of values that monitoring systems typically read from a running target system, we added further checks, some of which are even closer to the hardware. This means that our monitoring can, for example, recognize if an NVMe disk has not negotiated the usual data rate on the PCIe bus. At the other end of the spectrum, abstracting from the hardware, we have an increasing number of checks that monitor the state of whole clusters without depending on a specific host to query. Thanks to a solid baseline of measurements, we can then set threshold values so that problems are reliably identified without causing a lot of noise.
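To give an impression of what such a hardware-oriented check can look like, here is a minimal Python sketch. It assumes a Linux host, where the kernel exposes the negotiated and the maximum PCIe link parameters of a device as sysfs files (`current_link_speed`, `max_link_speed`, `current_link_width`, `max_link_width`). Comparing against the maximum rather than a configured "usual" value is a simplification for this example, and the function names are hypothetical; this is not the check we run in Zabbix.

```python
from pathlib import Path


def parse_gts(text):
    """Extract the numeric GT/s value from strings such as '8.0 GT/s PCIe'."""
    return float(text.split()[0])


def link_is_degraded(pci_dir):
    """Return True if speed or width is below what the device supports.

    `pci_dir` is the sysfs directory of the PCI device, for example the
    target of /sys/class/nvme/nvme0/device on a typical Linux system.
    """
    base = Path(pci_dir)

    def read(name):
        return (base / name).read_text().strip()

    speed_ok = parse_gts(read("current_link_speed")) >= parse_gts(read("max_link_speed"))
    width_ok = int(read("current_link_width")) >= int(read("max_link_width"))
    return not (speed_ok and width_ok)
```

A value that cannot be parsed (some kernels report an unknown speed) raises an exception here, which a real check would have to handle explicitly.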
Why we can sleep at night
Although on-call engineers at cloudscale.ch tend to sleep through the night, the redundancy mentioned above and carefully considered threshold values only partly explain this. Wherever an analysis during office hours is adequate, we have assigned a low severity level to the checks and configured Opsgenie so that nobody is woken up. Consistent follow-up is important here: even low-severity events and anomalies that occur during the day are investigated in a timely manner before they develop into a problem that necessitates getting up at night. If something ever has a greater impact, the same principle applies, and we almost always find a way to identify similar cases earlier or to avoid them completely in future.
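The routing principle described here can be illustrated with a small Python sketch. The severity names, the office hours and the three outcomes are assumptions made for the example; in practice this logic lives in the Opsgenie configuration rather than in code of our own.

```python
from datetime import datetime, time

# Hypothetical office hours, Monday to Friday.
OFFICE_START = time(8, 0)
OFFICE_END = time(18, 0)


def route(severity, when):
    """Decide what happens to an alert of the given severity at time `when`.

    Returns 'page' (wake the on-call engineer), 'notify' (handle now,
    during office hours) or 'defer' (queue until the next working day).
    """
    if severity == "high":
        return "page"
    in_office = when.weekday() < 5 and OFFICE_START <= when.time() < OFFICE_END
    return "notify" if in_office else "defer"


if __name__ == "__main__":
    print(route("low", datetime(2024, 1, 8, 3, 0)))   # Monday, 03:00
    print(route("high", datetime(2024, 1, 8, 3, 0)))
```

The important property is that a deferred alert is not dropped: it is still followed up on the next working day.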
On top of all this, there is one further, less technical aspect. Thanks to separate monitoring for our lab, new engineers can take the time they need to settle in without pressure, which enables them to quickly become aware of the things they need to pay attention to. And when resolving problems which cannot be completely avoided despite constant improvements, directly linked runbooks provide the required backing. If on-call engineers are woken at night, which tends to be the exception rather than the rule, it does not take long before they can go back to bed with confidence.
The reliable operation of our cloud infrastructure is key for many of our customers. It is based on the fact that we identify anomalies as early as possible, which allows us to avoid most problems before they have an impact on our customers. Our monitoring setup, which has grown and been consistently improved over the years, serves as our eyes and ears right into the furthest corners of our systems. At the same time, it contains the intelligence required to support and reduce the pressure on our engineers in their work.
You can sleep well (too)!
Your cloudscale.ch team