OpenStack Upgrades: Open Heart Surgery

It is important to us that the software we use is not only well-tested but also up-to-date: this allows for reliable operation of our systems and ensures the prompt availability of security updates. With OpenStack at the heart of our cloud, the upgrade to the major release "Pike" represents a milestone that we have mastered with minimal impact on production operation.

The role our redundant setup plays in upgrades

Our system design for high availability also serves us well during upgrades. In this setup, all components are present in at least two instances, or at least three where a quorum is required. This not only protects us and our customers from unplanned downtime if an individual system fails, but also allows us to work through the systems sequentially during maintenance, so that the service as a whole keeps running without interruption.
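The sequencing described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical cluster model rather than any real OpenStack tooling: instances are taken offline one at a time, and each step is only allowed if the instances that stay online still satisfy the quorum.

```python
def rolling_upgrade_order(instances, quorum):
    """Return the instances in upgrade order, verifying that taking a
    single instance offline never drops availability below the quorum.
    Each instance is assumed to rejoin before the next one is taken down."""
    plan = []
    for instance in instances:
        remaining = len(instances) - 1  # all other instances stay online
        if remaining < quorum:
            raise RuntimeError(
                f"cannot upgrade {instance}: only {remaining} instance(s) "
                f"would remain, but the quorum is {quorum}"
            )
        plan.append(instance)
    return plan

# A three-node cluster with quorum 2 can be upgraded node by node,
# because two nodes always remain online.
print(rolling_upgrade_order(["ctl1", "ctl2", "ctl3"], quorum=2))
```

A two-node cluster with a quorum of two, by contrast, would be rejected: taking either node down breaks the quorum, which is exactly why quorum-based components run on at least three instances.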

Accordingly, the OpenStack components responsible for creating and managing the virtual servers were updated system by system. And since at least two instances of each component were always in operation, the management of the virtual servers remained available to our customers via the cloud control panel as well as through the API. In the case of the compute nodes on which the virtual servers run, an OpenStack upgrade would even be possible during production operation. However, a simultaneous upgrade of the underlying Linux system often requires a reboot. Even in that case, your virtual servers remain online without interruption, thanks to a prior live migration to another compute node.
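The drain-before-reboot step can be illustrated with a small sketch. This uses a simple in-memory placement model, not the real OpenStack API: every virtual server on the node is moved to another node first, so the reboot itself affects no running server.

```python
def drain_node(placement, node, target):
    """Move all VMs off `node` onto `target` (standing in for a live
    migration), leaving `node` empty and safe to reboot."""
    moved = list(placement.get(node, []))
    placement.setdefault(target, []).extend(moved)
    placement[node] = []
    return moved  # the VMs that were migrated away

placement = {"compute1": ["vm-a", "vm-b"], "compute2": ["vm-c"]}
migrated = drain_node(placement, "compute1", "compute2")
print(migrated)               # vm-a and vm-b now run on compute2
print(placement["compute1"])  # empty: compute1 can be rebooted safely
```

In the real system, the migration is a live migration, so the servers keep running while their memory and state are transferred to the target node.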

How OpenStack supports non-disruptive upgrades

OpenStack is a complex system of different components that interact with each other. It is not self-evident that these components (or, in a redundant setup, individual instances of them) can be updated one after the other: during this process, part of OpenStack's overall system is still in the old state, while the rest is already on the new version. The OpenStack project therefore deliberately ensures that each component also works with components still running the previous version. Only then can the system as a whole remain functional throughout the entire upgrade.
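One common way to achieve this kind of cross-version compatibility is version negotiation. The following is a simplified sketch of the idea, not OpenStack's actual implementation: a sender caps the message version at the lowest version any peer still understands, so old and new components keep interoperating mid-upgrade.

```python
def negotiated_version(peer_versions):
    """Pick the highest message version that every peer supports,
    i.e. the lowest version advertised by any peer."""
    return min(peer_versions)

def send(payload, peer_versions):
    """Wrap a payload in an envelope pinned to the negotiated version."""
    version = negotiated_version(peer_versions)
    return {"version": version, "payload": payload}

# One peer is still on the old release (1.0), the others are on 1.1:
msg = send({"action": "boot"}, peer_versions=[1.1, 1.0, 1.1])
print(msg["version"])  # 1.0 — the old peer can still decode the message
```

Once the last old instance has been upgraded, the negotiated version rises automatically and the new features become available everywhere.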

For the recent upgrade to OpenStack "Pike" we started, as usual, with comprehensive tests in our lab environment. We optimized our Ansible playbooks so that at least two instances of each OpenStack component remain available at all times. To make sure that the components interact as expected across version boundaries, we continuously created new virtual servers via the API – a potential problem would have surfaced in the lab at this point.
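Such a continuous check boils down to a simple loop. The sketch below uses a stub in place of the real server-creation API call: it repeatedly creates a server and records any attempt that fails, so a cross-version problem shows up immediately.

```python
def smoke_test(create_server, attempts):
    """Create `attempts` test servers in a row; return the indices of
    the attempts that failed."""
    failures = []
    for i in range(attempts):
        try:
            create_server(f"canary-{i}")
        except Exception:
            failures.append(i)
    return failures

# Stub that fails exactly once, e.g. while a component is mid-upgrade:
def flaky_create(name):
    if name == "canary-2":
        raise RuntimeError("API briefly unavailable")

print(smoke_test(flaky_create, attempts=5))  # [2]
```

In practice the loop would also clean up the test servers again and run against the real API; the point is that a failure is attributed to the exact moment in the upgrade at which it occurred.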

What we do to reduce the impact even further

In some cases, short interruptions (usually less than five minutes) of the cloud control panel and the API could not be avoided, e.g. when the configuration of a load balancer had to be adjusted to the new version of the OpenStack component behind it at the same time. While running virtual servers are unaffected by this, changes are not possible at such a moment, in particular moving a floating IP to another virtual server, which many of our customers use as a failover mechanism for highly available setups.

We are therefore working on limiting the necessary downtimes even further, for example by blocking only the affected operations while the control panel and the API as a whole remain available. During the upgrade to "Pike" we were already able to use such a mechanism: when major changes in OpenStack's own API temporarily prevented the scaling of volumes, we reflected precisely this in our interfaces and continued to allow all other actions.
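Conceptually, this amounts to a deny-list in front of the API dispatch. The following is a minimal sketch under that assumption (the action names and the dispatch layer are hypothetical, not our actual interface): exactly the affected operation is refused with a clear error, while every other call goes through.

```python
# Hypothetical set of operations disabled for the duration of the upgrade:
BLOCKED_DURING_UPGRADE = {"volume.resize"}

def dispatch(action, handler):
    """Refuse blocked actions with a 503; execute everything else."""
    if action in BLOCKED_DURING_UPGRADE:
        return {"status": 503,
                "error": f"{action} is temporarily disabled during maintenance"}
    return {"status": 200, "result": handler()}

print(dispatch("volume.resize", lambda: "resized"))  # status 503
print(dispatch("server.create", lambda: "created"))  # status 200
```

The advantage over a full maintenance window is that customers get an explicit, scoped error for one operation instead of losing the control panel and API entirely.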

Upgrading a system as complex as OpenStack takes several hours, and ensuring the best possible availability throughout this process is not trivial. We are already well positioned here, with redundant systems at every level. Where interruptions are nevertheless unavoidable, we try to keep them to a minimum – ideally, only a single operation needs to be blocked temporarily. And as always: test, test, test.

Seriously prepared,
Your team
