How to design a robust service

After resigning recently, while preparing my résumé and interviews, I reviewed the service incidents from my previous job. In several of them, one or more services were overwhelmed by peak traffic, sometimes triggered by database problems, sometimes by memory or other issues. At a deeper level, the services themselves were not robust enough: once a few nodes were overwhelmed, a retry storm followed and the failure cascaded. If those services sat on the main request path, the cascade could bring down the entire system.

Under pressure, many of our services behave as shown in the graph above. They serve normally until they reach their maximum load, but once traffic exceeds what the service can handle, its capacity to serve drops sharply. Even if traffic then falls back below the limit, or is cut all the way to zero, the service takes a long time to recover.

In some cases, even after waiting for the service to recover and switching the upstream traffic back, it is instantly overwhelmed again by a flood of manual or automatic retries.

If our backend services are so "fragile" that they collapse the moment they are overloaded and then take a long time to recover, stability work becomes impossible. A large application has thousands of services in its call chain, each with many nodes; we cannot monitor and guarantee the stability of every node and shield all of them from excess traffic. Nor is that a realistic goal, because unexpected local failures will always happen.

Therefore, we need the application as a whole to be robust, which means each service must have a certain degree of robustness of its own. For an individual service (standalone or distributed), this is neither a high bar nor hard to achieve.

Concretely, a "robust" service should behave as shown in the graph above under pressure. As load increases, output should increase roughly linearly. When external load exceeds the maximum the service can handle, then after a brief period of fluctuation the service should keep producing stable output no matter how high the load climbs, and that output should stay close to the theoretical maximum. In other words, even when overloaded, a "robust" service keeps delivering stable output and promptly rejects the requests that exceed its capacity.

Each node of a service needs metrics that reflect its load in real time. When a metric exceeds the node's capacity limit, requests beyond that capacity should be rejected. There also needs to be a feedback mechanism between server and client. "Client" here means not just the app, but any client of the service: typically the caller's SDK, whether it lives in another service or in the app itself, which observes the backend's load and adjusts its own strategy accordingly.

In simple terms, each node of a service needs two things: 1. It must know whether it is busy and whether it is close to its limit. 2. If it is busy, it must tell its clients, so they do not accidentally overwhelm it.
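As a minimal sketch of these two points, assuming a Go HTTP service: the wrapper below counts in-flight requests and, past a made-up threshold, answers with 503 instead of queueing. The threshold, route, and port are illustrative only.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// maxInFlight is a hypothetical threshold; in practice it would come from
// load testing the node (see the next section).
const maxInFlight = 200

var inFlight int64

// withLoadShedding wraps a handler with a simple "am I busy?" check:
// it counts requests currently being processed and rejects new ones
// once the counter exceeds the limit, instead of queueing them.
func withLoadShedding(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt64(&inFlight, 1) > maxInFlight {
			atomic.AddInt64(&inFlight, -1)
			// Tell the client we are busy rather than silently dropping
			// the request; 503 plays the role of RET_BUSY here.
			http.Error(w, "busy, back off", http.StatusServiceUnavailable)
			return
		}
		defer atomic.AddInt64(&inFlight, -1)
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", withLoadShedding(mux))
}
```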

Next, let's discuss these two points in detail: how to find metrics that represent the service load and how to negotiate with the client during overload.

Metrics Representing Service Load

A metric representing service load is one that, once it exceeds a certain threshold, causes the node's capacity to serve to deteriorate sharply and take a long time to recover.

We can use load testing to find reasonable QPS, TPS, and similar values on a standard server, and then enforce a peak limit with tools such as rate limiting and circuit breaking (a sketch follows the list below). Depending on the actual situation, the limit can be set anywhere within a range:

  1. If the service node can be horizontally scaled, the limit does not need to be precise. There is no need to chase the CPU or other resource limits; pick a conservative value based on the load-test results and squeeze out extra utilization later through mixed deployment or overselling.
  2. Take the request timeout into account; otherwise a request may be processed successfully yet be useless, because the client has already timed out and discarded it.
  3. Consider whether some machines also run auxiliary services such as monitoring agents, primary-backup synchronization, or heartbeats, and leave headroom in the threshold accordingly. More sophisticated services can even prioritize requests and keep a separate counter per priority.
  4. Some people wonder whether the threshold can be made adaptive, that is, adjusted dynamically according to the node's internal resource usage. Honestly, in most cases this is unnecessary: simpler mechanisms fail less often, and it is rarely worth squeezing the last bit of performance out of a single node. In an era when machines are cheap and plentiful, service stability is worth more than the machines.
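
To make the threshold concrete, here is a hedged sketch using Go's golang.org/x/time/rate token bucket; the load-tested ceiling of 1000 QPS and the 70% headroom are purely illustrative numbers, not recommendations.

```go
package main

import (
	"fmt"
	"net/http"

	"golang.org/x/time/rate"
)

const (
	// loadTestedQPS is the ceiling found by load testing a standard node;
	// the number here is purely illustrative.
	loadTestedQPS = 1000
	// headroom leaves room for timeouts, sidecars, heartbeats and
	// replication traffic, as argued in the list above.
	headroom = 0.7
)

// limiter admits at most loadTestedQPS*headroom requests per second,
// with a small burst allowance for momentary spikes.
var limiter = rate.NewLimiter(rate.Limit(loadTestedQPS*headroom), 50)

func handler(w http.ResponseWriter, r *http.Request) {
	if !limiter.Allow() {
		// Reject excess traffic promptly instead of letting it pile up.
		http.Error(w, "over capacity", http.StatusTooManyRequests)
		return
	}
	fmt.Fprintln(w, "ok")
}

func main() {
	http.HandleFunc("/work", handler)
	http.ListenAndServe(":8080", nil)
}
```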

Collaboration Between Client and Server

To be honest, I don't know much about the client side; this problem is really about refining the retry logic used when the backend is busy.

A client's retry strategy deserves careful design, yet it is easy to overlook. The most obvious strategy is to retry at a fixed interval; more thoughtful implementations grow the interval geometrically or along a Fibonacci sequence.
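
For illustration, a small sketch of both schedules mentioned above, with jitter added to the geometric one so that many clients do not retry in lock-step; the base interval and cap are made-up values.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// geometricBackoff returns the wait before the n-th retry (n starts at 0),
// doubling each attempt and capped at max.
func geometricBackoff(n int, base, max time.Duration) time.Duration {
	d := base << n // base * 2^n
	if d > max || d <= 0 {
		d = max
	}
	// Jitter spreads retries out so clients do not synchronize.
	return d/2 + time.Duration(rand.Int63n(int64(d/2)+1))
}

// fibonacciBackoff grows more gently: 1, 1, 2, 3, 5, 8, ... times the base.
func fibonacciBackoff(n int, base, max time.Duration) time.Duration {
	a, b := 1, 1
	for i := 0; i < n; i++ {
		a, b = b, a+b
	}
	d := time.Duration(a) * base
	if d > max {
		d = max
	}
	return d
}

func main() {
	for n := 0; n < 6; n++ {
		fmt.Println(n,
			geometricBackoff(n, 100*time.Millisecond, 10*time.Second),
			fibonacciBackoff(n, 100*time.Millisecond, 10*time.Second))
	}
}
```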

But if you think about it, retries solve two different problems. One is recovering from network packet loss or a momentary disconnection; the other is failing over from a broken or overloaded server node. For the former, the shorter the retry interval the better, ideally an immediate retry. For the latter, an interval that is too short easily triggers a retry storm and pushes the backend nodes even deeper into overload. The tragedy is that, from the client's point of view, the two cases are indistinguishable.

Therefore, the simplest solution is for a busy server node not to silently discard incoming requests but to answer them with RET_BUSY. When the client receives RET_BUSY, it should leave its normal retry loop and stretch the retry interval significantly to avoid fueling a retry storm. The server can go further and grade its responses: for example RET_RETRY, which asks the client to retry immediately, and RET_BUSY and RET_VERY_BUSY, which ask the client to lengthen the retry interval by different amounts. Many refinements along these lines are possible; try them according to your own engineering scenario.
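
A hedged sketch of how a client SDK might act on such return codes: the code names mirror the ones above, but the sleep durations and the call() stub are invented purely for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical return codes mirroring the ones described above.
const (
	RetOK       = 0
	RetRetry    = 1 // transient hiccup: retry immediately
	RetBusy     = 2 // node overloaded: back off noticeably
	RetVeryBusy = 3 // node severely overloaded: back off much longer
)

// call stands in for one RPC attempt; a real client would issue the request here.
func call() (code int, err error) {
	return RetBusy, nil // placeholder result
}

// callWithRetry adapts its retry interval to the server-reported load,
// instead of hammering a node that has already said it is busy.
func callWithRetry(maxAttempts int) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		code, err := call()
		switch {
		case err != nil:
			// Network-level failure: could be packet loss, retry quickly.
			time.Sleep(10 * time.Millisecond)
		case code == RetOK:
			return nil
		case code == RetRetry:
			continue // server explicitly asked for an immediate retry
		case code == RetBusy:
			time.Sleep(1 * time.Second)
		case code == RetVeryBusy:
			time.Sleep(5 * time.Second)
		}
	}
	return errors.New("gave up after retries")
}

func main() {
	fmt.Println(callWithRetry(3))
}
```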

So far, we have covered how to build a "robust" service. Essentially, it takes only very simple engineering: a service that can roughly recognize its own busyness, and a client that adapts its retries to the server's busyness. Together they ensure the service delivers controlled output no matter how much request pressure comes from outside.

In actual engineering practice, server nodes should be able to keep serving stably even when average CPU utilization exceeds 70% and instantaneous CPU utilization exceeds 90%, and in the test environment they should keep producing stable output under extreme pressure.

Can the above robustness be achieved without modifying the service itself?

Can this mechanism be implemented externally through monitoring and traffic degradation plans, so that the service itself does not need to be modified?

First of all, external monitoring and traffic-degradation plans are essential; they are the ultimate fallback for everything. But monitoring lags, and degradation is lossy and therefore must be applied with caution. That is why monitoring plus degradation is a tool for handling failures, not for making a service robust. Robustness has to be solved inside the service itself.

Which services need to be reformed?

The above mechanism can be implemented in RPC frameworks and is sufficient for most services that do not have high requirements.
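
For example, in a Go gRPC service this could plausibly live in a unary server interceptor, so that individual handlers need no changes; the in-flight limit below is an arbitrary illustrative value.

```go
package main

import (
	"context"
	"sync/atomic"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// loadSheddingInterceptor rejects requests once the number of in-flight
// calls exceeds maxInFlight, returning ResourceExhausted so the client
// SDK can back off (the gRPC analogue of RET_BUSY).
func loadSheddingInterceptor(maxInFlight int64) grpc.UnaryServerInterceptor {
	var inFlight int64
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (interface{}, error) {
		if atomic.AddInt64(&inFlight, 1) > maxInFlight {
			atomic.AddInt64(&inFlight, -1)
			return nil, status.Error(codes.ResourceExhausted, "node busy, please back off")
		}
		defer atomic.AddInt64(&inFlight, -1)
		return handler(ctx, req)
	}
}

func main() {
	// Services registered on srv inherit the protection without any
	// per-handler changes.
	srv := grpc.NewServer(grpc.UnaryInterceptor(loadSheddingInterceptor(200)))
	_ = srv
}
```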

Distributed storage systems demand more. Stability engineering for storage is the most complex and most fundamental: storage nodes cannot simply be restarted to make a problem go away, and beyond CPU there are memory, page cache, network throughput and interrupts, disk IOPS and throughput, and many other resource dimensions to consider. Moreover, after traffic is degraded, a storage system usually cannot recover quickly (it may need to wait for primary-replica synchronization, minor or major compaction, or flushing to disk). The robustness of an online storage system therefore has to account for many more factors.

In short: robustness matters, and it can be achieved with load metrics, cooperation between client and server, and, for the most demanding services, deeper reform of the services themselves.
