
Engineering Blog

Published on 4 December 2025 by

Object Storage: Stabilität und bessere Performance unter wachsender Last


Since the price reduction of our Object Storage in February, usage has grown significantly, including total storage consumption. This rapid growth surfaced some issues during load peaks and in our metrics pipeline.

RGW internals

Before diving into our load mitigation strategy, let's introduce some RADOS Gateway (RGW) internals. When an S3 request hits objects.<region>.cloudscale.ch, it travels through two major layers before reaching our Ceph storage cluster.

Frontend: The part of RGW that accepts client HTTPS requests, parses S3/Swift API calls, and places those requests into RGW's internal queue. It is essentially the "public-facing door".

Backend: The part of RGW that executes the actual operations against Ceph, such as reading objects, writing data, listing bucket contents, and performing metadata lookups. Backends are the "workers" that do the heavy lifting.

We run multiple frontends and backends to handle high loads. When a client sends a request, the path looks like this:

                   ┌─────────────────────┐
Client  ──https─►  │    RGW Frontend     │
                   │ (HTTP server layer) │
                   └──────────┬──────────┘
                              │
                              ▼
                   ┌─────────────────────┐
                   │     RGW Backend     │
                   │ (RADOS operations)  │
                   └──────────┬──────────┘
                              │
                              ▼
                   ┌─────────────────────┐
                   │        Ceph         │
                   │  (storage cluster)  │
                   └─────────────────────┘
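
For reference, this is roughly what a request entering this path looks like from the client side. A minimal sketch assuming the boto3 library; the bucket name and credentials are placeholders, and <region> must be replaced with an actual region:

    import boto3

    # Placeholder endpoint and credentials; replace <region> and the keys.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://objects.<region>.cloudscale.ch",
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    # This HTTPS request is accepted by an RGW frontend, executed by an RGW
    # backend against Ceph, and the result travels back the same way.
    response = s3.list_objects_v2(Bucket="example-bucket")
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])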

Load Handling Improvements

New customers bring new load patterns, and one such pattern put noticeable strain on our RGW backend processes. Loki, an increasingly popular log aggregation system, uses S3 both to store log data and to serve log queries. These queries fetch many objects concurrently to speed up retrieval, which shows up as sharp spikes on our frontends.
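
To illustrate the pattern (this is not Loki's actual code, just a sketch of a client fanning out many GETs at once), assuming boto3 and made-up bucket, key, and concurrency values:

    from concurrent.futures import ThreadPoolExecutor

    import boto3

    # Credentials are taken from the environment; replace <region> as needed.
    s3 = boto3.client("s3", endpoint_url="https://objects.<region>.cloudscale.ch")

    keys = [f"chunks/{i:08d}" for i in range(200)]   # hypothetical log chunks

    def fetch(key):
        # Each call is a single S3 GET; 32 of them run at the same time.
        return s3.get_object(Bucket="example-logs", Key=key)["Body"].read()

    # A single query can fan out into hundreds of parallel object reads,
    # arriving at our frontends as one sharp spike rather than a steady flow.
    with ThreadPoolExecutor(max_workers=32) as pool:
        chunks = list(pool.map(fetch, keys))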

While our services have been tuned for increasing amounts of traffic over the years, these tunings are never final, and new load patterns require additional traffic engineering. Very spiky peak loads could be absorbed by simply adding lots of backends, but that is neither economical nor sustainable: we want to target a certain base load, with some room for peaks, rather than provision our services for loads that are orders of magnitude larger than our average.

Perhaps counterintuitively, we achieved better peak-load behavior by limiting the number of requests we accept in the backend, instead of adding more backends to handle the increased load.

Instead of accepting whatever peaks we get, we now flatten the traffic curve by limiting the number of concurrent requests at every step (a minimal sketch of the idea follows the list below):

  • The backends accept only a certain number of concurrent requests.
  • Once the backends are at capacity, the frontend queues requests for them.
  • Once the frontend is at capacity, the kernel queues incoming requests for it.
  • Once the kernel is at capacity, requests may fail.
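
The following is a minimal sketch of the first two layers of this idea, illustrative only and not our actual RGW configuration; the limits and the handler are made up:

    import asyncio

    MAX_ACTIVE = 64        # hypothetical "backend" capacity
    MAX_WAITING = 256      # hypothetical "frontend" queue depth

    async def do_rados_work(request):
        await asyncio.sleep(0.01)          # stand-in for the real RADOS operations
        return 200

    async def handle(request, waiting, active):
        # Kernel-level queueing (the third layer) is not modelled here.
        if waiting.locked():
            return 503                     # both layers full: ask the client to slow down
        async with waiting:                # wait in the "frontend" queue ...
            async with active:             # ... for a free "backend" slot
                return await do_rados_work(request)

    async def main():
        active = asyncio.Semaphore(MAX_ACTIVE)
        waiting = asyncio.Semaphore(MAX_ACTIVE + MAX_WAITING)
        results = await asyncio.gather(*(handle(i, waiting, active) for i in range(1000)))
        print(results.count(200), "served,", results.count(503), "rejected")

    asyncio.run(main())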

As these queues fill, we are informed by our monitoring. So in practice, only the first limit is ever reached. When this happens, some interesting properties come into play:

  • We start returning 503 Slow Down to the clients that cause the spikes: This is the standard HTTP code used by S3 to signal to clients to back off.
  • We start preferring certain requests over others: for now we give priority to read over write requests, but we might also consider giving priority to clients with few concurrent requests over clients with many (for increased fairness). A toy sketch of this prioritization follows the list.
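
As a toy illustration of read-over-write prioritization (this is not RGW's actual scheduler, just a hypothetical priority queue):

    import heapq
    import itertools

    READ, WRITE = 0, 1                    # lower value = higher priority
    counter = itertools.count()           # tie-breaker keeps FIFO order per class
    queue = []

    def enqueue(kind, request):
        heapq.heappush(queue, (kind, next(counter), request))

    def dequeue():
        kind, _, request = heapq.heappop(queue)
        return request

    enqueue(WRITE, "PUT /bucket/a")
    enqueue(READ, "GET /bucket/b")
    enqueue(WRITE, "PUT /bucket/c")
    enqueue(READ, "GET /bucket/d")

    while queue:
        print(dequeue())                  # prints the two GETs before the two PUTs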

So far this has proven to be very successful in preventing spikes from overwhelming our infrastructure.

Handling RGW slow-downs in rgw-metrics

The rgw-metrics service periodically queries RGW for usage information about every bucket. This data is used in the Control Panel's Object Storage tab to display historical usage information and for billing. For a deeper explanation, see our engineering blog post Improving metrics collection for our Object Storage.
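
As an illustration of this kind of per-bucket usage query, here is a hedged sketch using the RGW admin ops API. It assumes the requests and requests_aws4auth packages, an admin user with bucket read capabilities, the default /admin endpoint prefix, and placeholder credentials; it is not our actual rgw-metrics code:

    import requests
    from requests_aws4auth import AWS4Auth

    # Placeholder admin credentials; the signing region and service depend on the setup.
    auth = AWS4Auth("ADMIN_ACCESS_KEY", "ADMIN_SECRET_KEY", "us-east-1", "s3")

    # The RGW admin ops API can return per-bucket statistics.
    response = requests.get(
        "https://objects.<region>.cloudscale.ch/admin/bucket",
        params={"bucket": "example-bucket", "stats": "True"},
        auth=auth,
        timeout=30,
    )
    response.raise_for_status()
    print(response.json())                # includes per-bucket usage statistics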

Normally RGW runs smoothly and rgw-metrics receives reliable responses. This had always been the case, so the system was never designed to handle transient slow-down responses.

When RGW became busy under heavy production load, it became less responsive and returned 503 Slow Down responses, which you may recognize from the section above. rgw-metrics interpreted these errors as hard failures. Unfortunately, this triggered alerts for the on-call team as the usage data collection service restarted repeatedly, and it reduced the number of collected metrics, lowering the resolution of the usage curves.

We now treat 503 Slow Down responses as temporary conditions: instead of failing immediately, the service backs off and retries after a longer interval. This allows the metrics service to recover automatically after high-load peaks and prevents waking on-call engineers in the middle of the night.
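
In essence, the collection loop now wraps its requests in retry logic along these lines; a simplified sketch with made-up retry parameters, not the actual rgw-metrics code:

    import time

    import requests

    def fetch_with_backoff(url, *, retries=5, base_delay=2.0, **kwargs):
        """Retry a request when RGW answers 503 Slow Down, with exponential backoff."""
        for attempt in range(retries):
            response = requests.get(url, timeout=30, **kwargs)
            if response.status_code != 503:
                response.raise_for_status()
                return response
            # Transient overload: wait increasingly long instead of failing hard.
            time.sleep(base_delay * 2 ** attempt)
        raise RuntimeError(f"still throttled after {retries} attempts: {url}")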

The core lesson: load-related optimizations don't only affect customers, they also impact our internal systems. Our metrics service was an unintended casualty of a change that otherwise improved production stability.

Outlook

We continue to monitor the system closely and have extended our tests to include more concurrency- and load-related scenarios, so that we can catch similar edge cases earlier and ensure a smooth experience for our users.


If you would like to share comments or corrections with us, you can reach our engineers at engineering-blog@cloudscale.ch.
