Handling Errors in Control Panel and API
Sometimes things do not work out, in IT just as in everyday life. The more complex the processes, the more likely it is that some component will not behave as initially expected. At cloudscale.ch, we care deeply about simple and consistent usability. For us, this also means being prepared for whatever may go wrong behind the scenes. After all, our interfaces should not only be intuitive to use, but also help you reach your goal as reliably as possible.
- What is under the hood of our cloud
- Examples of potential sources of error
- Pillars of error handling at cloudscale.ch
What is under the hood of our cloud
The first point of contact when using cloudscale.ch is our cloud control panel, which we developed in-house. This software, which is written in Python, provides you with the web interface and API to manage your servers and handles the billing process for services used. In the background, our control panel relies heavily on OpenStack. As one of the leading open source projects in this area, this cloud platform manages the physically available computing power, allocates IP addresses and configures internal networks between your servers.
Equally fundamental is Ceph, a distributed storage solution that is also maintained as open source and ensures the replicated and performant storage of your volumes and objects. However, the "smaller" building blocks in our setup are also essential, e.g. our DNS system, ExaBGP for the dynamic allocation of Floating IPs, and RabbitMQ, which acts as a kind of glue between our control panel and other involved systems.
Examples of potential sources of error
Much of what appears as one coherent action from a user perspective requires several separate steps in the background. In addition to the actual creation of a new virtual server, for example, a network port is created, an IP address assigned, a volume with the selected operating system provided and a reverse DNS entry set. No matter how well everything is tested, you can never completely rule out the possibility of encountering an unexpected error in any of the integrated software components.
So-called race conditions are another possible source of errors. To avoid inconsistencies, certain actions or parts thereof can only be executed one at a time; if the same step is also required by an action running in parallel, one of the two actions fails. Furthermore, some steps depend on additional conditions, e.g. that certain safety limits ("quotas") are observed.
Pillars of error handling at cloudscale.ch
At cloudscale.ch, the primary goal of error handling is to ensure that every action results in a usable state. We have therefore implemented rollbacks where appropriate. Suppose, for example, that an error in a sub-step would leave you with a server that does not work and that, because of a follow-up error, possibly cannot be deleted either. In that case, the rollback function comes into play: it reverses the sub-steps that have already been completed, so that a clean state is reached again despite the error.
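The rollback idea can be sketched in a few lines of Python. This is a simplified illustration under assumptions, not the code running in our control panel: the function `run_with_rollback`, the step names and the simulated failure are all hypothetical. Each sub-step is paired with an undo function, and when a sub-step fails, the steps already completed are undone in reverse order.

```python
def run_with_rollback(steps):
    """Run (name, do, undo) steps in order; undo completed ones on failure.

    Returns (success, log). Hypothetical helper for illustration only.
    """
    log = []
    completed = []  # (name, undo) pairs of steps that finished successfully
    for name, do, undo in steps:
        try:
            do()
        except Exception as exc:
            log.append(f"failed {name}: {exc}")
            # Reverse the completed sub-steps, most recent first.
            for done_name, done_undo in reversed(completed):
                done_undo()
                log.append(f"undone {done_name}")
            return False, log
        completed.append((name, undo))
        log.append(f"done {name}")
    return True, log


# Simulated resources belonging to a new server (assumed names).
resources = set()


def fail_volume():
    raise RuntimeError("volume backend unavailable")


steps = [
    ("port", lambda: resources.add("port"), lambda: resources.discard("port")),
    ("ip", lambda: resources.add("ip"), lambda: resources.discard("ip")),
    ("volume", fail_volume, lambda: resources.discard("volume")),
]

ok, log = run_with_rollback(steps)
```

After the simulated volume failure, `resources` is empty again: the port and the IP address were created and then removed, so no half-built server is left behind. A production version would also need to handle an undo step that fails itself, which this sketch leaves out.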
It goes without saying that it is even better to avoid failures in the first place. This is why we continuously monitor our systems for error messages and check in each individual case whether the error can be avoided in the future with a specific patch or other improvements. In addition, we have optimized transactions that could lead to race conditions so as to minimize the probability of actual parallel execution. Should a transaction nevertheless need to be aborted because it coincides with another parallel operation, the case is intercepted and the transaction is retried up to a defined number of times. In this way, the action chosen by the user may still be completed successfully.
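A bounded retry of this kind could look roughly as follows. Again, this is a sketch under assumptions rather than our production code: `ConflictError`, `retry_on_conflict` and the limit of three attempts are hypothetical choices for the example. Only the conflict case is retried; any other error is passed on immediately.

```python
class ConflictError(Exception):
    """Hypothetical error: the transaction collided with a parallel operation."""


def retry_on_conflict(action, max_attempts=3):
    """Call `action`, retrying on ConflictError up to `max_attempts` times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except ConflictError:
            # Give up once the defined number of attempts is used up.
            if attempt == max_attempts:
                raise


# Simulated transaction that collides twice before it goes through.
calls = {"count": 0}


def flaky_transaction():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConflictError("parallel operation in progress")
    return "committed"


result = retry_on_conflict(flaky_transaction, max_attempts=3)
```

Here the third attempt succeeds, so the caller never sees the two collisions. If every attempt collides, the last `ConflictError` is raised, and the action is reported as failed.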
User-friendliness has always been a main concern at cloudscale.ch. Even though the systems behind the scenes are complex and there is always potential for something to go wrong, the aim is for actions in our cloud control panel and API to lead to the desired result whenever possible. And even when this is not the case, nothing should prevent you from taking further or different actions.
Your cloudscale.ch team