Building Stability from Microservices

This article introduces, from the two perspectives of "preventing stability risks" and "reducing the impact of failures", the common problems faced when building stability on top of microservices.

1. Preventing Stability Risks#

A microservices architecture makes each service more cohesive and allows it to iterate faster, but it also increases the complexity of service dependencies and therefore the difficulty of building stability. Complex as these dependencies are, they can be abstracted into the relationships between upstream services, the service itself, and downstream services. The main idea of preventing stability risks is to prevent risks in each of these three areas.

Figure: upstream services, the service itself (middle), and downstream services

1.1 Preventing Upstream Risks#

Rate limiting, input validation.

The common upstream risks to guard against are "traffic growth" and "input errors". Expected traffic growth can be evaluated in advance and appropriate response plans prepared; for unexpected traffic spikes, rate-limiting measures must be set up ahead of time.

The purpose of rate limiting is self-protection and isolating the blast radius. If core traffic ends up being limited, the impact can be assessed and then either capacity is expanded or the rate-limiting threshold is temporarily adjusted.
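As a minimal sketch of this idea (not from the original article), the token-bucket limiter below protects a single entry point; the class name and the thresholds in `main` are illustrative assumptions.

```java
// A minimal token-bucket limiter for self-protection at a service entry
// point. The class name and the thresholds are illustrative, not from the
// original article.
public class TokenBucketLimiter {
    private final double capacity;        // maximum burst size in permits
    private final double refillPerSecond; // steady-state permits per second
    private double tokens;                // currently available permits
    private long lastRefillNanos = System.nanoTime();

    public TokenBucketLimiter(double permitsPerSecond, double burst) {
        this.refillPerSecond = permitsPerSecond;
        this.capacity = burst;
        this.tokens = burst;
    }

    // Returns true if the request may proceed; false means the caller should
    // reject it (e.g. HTTP 429) or fall back to a degraded response.
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) / 1e9 * refillPerSecond);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Allow roughly 100 requests per second with a burst of 100.
        TokenBucketLimiter limiter = new TokenBucketLimiter(100, 100);
        System.out.println("first request allowed: " + limiter.tryAcquire());
    }
}
```

In practice a traffic-governance component would usually provide this, with thresholds managed centrally so they can be adjusted during an incident.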

"Input errors" commonly occur when there are no restrictions on range parameters. For example, if only one day of data is expected to be queried, but the request parameter is set to query one month of data, the database may not be able to handle the pressure and crash due to the lack of restrictions on the interface.

1.2 Preventing Downstream Risks#

Removing strong dependencies, degradation, testing weak dependencies, flow switching plans.

In the industry, dependencies that do not affect the core business process and system availability when exceptions occur are called weak dependencies, while those that do are called strong dependencies. The most direct way to prevent downstream risks is to remove strong dependencies.

  1. When designing the system, it is necessary to comprehensively analyze the strong and weak dependencies of the system. After the system is launched, online traffic can be collected and further analyzed to understand the dependency relationships.
  2. Legacy business logic needs to be reworked, making trade-offs among functionality, user experience, and stability. To protect stability, the downstream dependencies of core functions should be minimized, and non-core functions should be cut off when their downstream dependencies fail, so that the core functions remain available.

Degradation plans should be established for weak dependencies; open-source traffic-governance components such as Sentinel can be used for this. To ensure the plans execute efficiently, it is recommended to combine fault-tolerant business code with automatic circuit breaking.
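As a hand-rolled illustration of that combination rather than Sentinel's actual API, the sketch below wraps a call to a weak dependency with a simple consecutive-failure breaker and a fallback; the thresholds and names are assumptions.

```java
import java.util.function.Supplier;

// A deliberately simplified circuit breaker around a weak dependency.
// Thresholds, names, and the fallback are illustrative; a production system
// would normally rely on a governance component such as Sentinel instead.
public class WeakDependencyGuard<T> {
    private static final int FAILURE_THRESHOLD = 5; // consecutive failures to trip
    private static final long OPEN_MILLIS = 10_000; // stay open for 10 seconds

    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public synchronized T call(Supplier<T> dependency, Supplier<T> fallback) {
        // While the breaker is open, skip the dependency and degrade directly.
        if (openedAt != 0 && System.currentTimeMillis() - openedAt < OPEN_MILLIS) {
            return fallback.get();
        }
        try {
            T result = dependency.get();
            consecutiveFailures = 0;
            openedAt = 0;
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= FAILURE_THRESHOLD) {
                openedAt = System.currentTimeMillis(); // trip the breaker
            }
            return fallback.get(); // fault-tolerant business code: degrade, don't fail
        }
    }
}
```

A caller might wrap a recommendation lookup as `guard.call(() -> recommendClient.fetch(userId), List::of)`, so a failing weak dependency yields an empty list instead of breaking the core flow.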

The choice of degradation method depends largely on the business impact of degrading. Functions whose degradation has a significant impact should be degraded manually; functions with a smaller impact, or ones that can recover automatically and quickly, can use automatic degradation.
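A common way to implement manual degradation is a switch in a configuration center that operators flip during an incident. The sketch below assumes a hypothetical `ConfigClient`, since the article does not name a specific configuration system.

```java
// A manual degradation switch driven by configuration. ConfigClient is a
// hypothetical stand-in for whatever configuration center the system uses.
interface ConfigClient {
    boolean getBoolean(String key, boolean defaultValue);
}

class CommentSection {
    private final ConfigClient config;

    CommentSection(ConfigClient config) {
        this.config = config;
    }

    String render(long postId) {
        // Operators flip "comments.degraded" during an incident; the page
        // still renders, just without the non-core comment block.
        if (config.getBoolean("comments.degraded", false)) {
            return "<!-- comments temporarily unavailable -->";
        }
        return loadCommentsHtml(postId);
    }

    private String loadCommentsHtml(long postId) {
        return "<div>comments for post " + postId + "</div>";
    }
}
```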

It is necessary to regularly verify the governance of strong and weak dependencies. If the interfaces or services are relatively simple, unit testing can be used for verification. If there are many complex services, regular fault drills are needed to identify potential issues.
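As a minimal sketch of verifying a weak dependency with a unit test (JUnit 5 here; all class names are illustrative), the test stubs the downstream client to fail and asserts that the core path still returns a valid result.

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

// Verifies that a weak dependency (recommendations) failing does not break
// the core path (order detail). All class names here are illustrative.
class OrderServiceWeakDependencyTest {

    interface RecommendClient {
        java.util.List<String> recommend(long userId);
    }

    static class OrderService {
        private final RecommendClient recommendClient;

        OrderService(RecommendClient recommendClient) {
            this.recommendClient = recommendClient;
        }

        String orderDetail(long orderId, long userId) {
            String detail = "order-" + orderId; // core data
            try {
                detail += " recs=" + recommendClient.recommend(userId);
            } catch (RuntimeException e) {
                // weak dependency: swallow the failure, keep the core result
            }
            return detail;
        }
    }

    @Test
    void coreFlowSurvivesRecommendFailure() {
        OrderService service = new OrderService(userId -> {
            throw new RuntimeException("downstream unavailable");
        });
        assertEquals("order-42", service.orderDetail(42, 7));
    }
}
```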

For strong dependencies that cannot be removed, some methods can be considered to reduce risks, improve stability, and prevent major incidents.

  1. For MySQL, adding enough shards can reduce the impact of a single shard failure (see the routing sketch after this list).
  2. Establishing emergency response plans as a fallback, so that users still get a reasonable experience when the dependency fails.
  3. Prioritizing flow switching in the event of a failure in a single data center.
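As a minimal sketch of the sharding idea in item 1, the router below spreads users across 16 MySQL shards so that a single shard failure affects only roughly 1/16 of them; the shard count and the DSN naming scheme are illustrative assumptions.

```java
// Routes each user to one of N MySQL shards so that a single shard failure
// affects only roughly 1/N of users. The shard count and the DSN naming
// scheme below are illustrative assumptions.
public class ShardRouter {
    private static final int SHARD_COUNT = 16;

    // Stable mapping from user id to a shard index in [0, SHARD_COUNT).
    public static int shardIndex(long userId) {
        return Math.floorMod(Long.hashCode(userId), SHARD_COUNT);
    }

    // e.g. "jdbc:mysql://order-db-03/orders" for shard 3 (hypothetical naming).
    public static String dataSourceUrl(long userId) {
        return String.format("jdbc:mysql://order-db-%02d/orders", shardIndex(userId));
    }

    public static void main(String[] args) {
        System.out.println(dataSourceUrl(123456789L));
    }
}
```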

1.3 Preventing Self Risks#

Architectural risks, capacity risks, flow switching plans, standardized online changes, development and testing quality assurance.

Basic measures to avoid single points of failure include redundant deployment and active-active flow switching. Elastic cloud hosting and automatic scaling can reduce capacity risks, and periodic sentinel stress tests, end-to-end stress tests, and module-level stress tests can be used for capacity evaluation.

Looking at the frequent causes of online incidents, code changes and configuration changes account for the majority. Therefore, improving development and testing quality and strictly following change-management standards for going online are the keys to preventing risks in the service itself.

To improve development quality, developers need to build the habit of writing automated test cases with stability in mind. Although writing test cases adds time cost in the short term, it greatly improves testing efficiency and code quality in later iterations. Core business systems will inevitably keep iterating, so the long-term cost of writing test cases is acceptable.

2. Reducing the Impact of Failures#

Mistakes are inevitable for humans, so failures are unavoidable. In addition to preventing risks, we also need measures to reduce the impact of failures.

2.1 Self Interface Degradation#

Clarify the strong and weak dependencies of the core links, degrade interface capabilities.

As one link in the business chain, we need to clarify whether the upstream core links have strong or weak dependencies on our service. If upstream is only weakly dependent on us, we need to make sure the interfaces it relies on support interface degradation. If upstream is strongly dependent on us, we should push upstream to remove that strong dependency; if it cannot be removed, we need to consider alternative channels or other measures to reduce the impact on upstream, such as user-facing fault guidance messages or announcements.

In summary, we need to focus not only on the stability of our own service but also on how upstream depends on it, and establish response plans that reduce the impact of our failures on upstream. Note that the interface-capability degradation here is different from the dependency degradation discussed earlier: here we degrade our own service's capabilities to reduce the impact on upstream callers, whereas the earlier dependency degradation degrades our service's downstream dependencies.
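As an illustrative sketch of interface-capability degradation (all names are hypothetical), the endpoint below keeps serving upstream callers with a reduced but well-formed response when its full capability is unavailable.

```java
import java.util.List;

// Degrades our own interface's capability instead of failing the upstream
// caller outright: when the full result cannot be produced, return a reduced
// but well-formed response. All names here are illustrative.
public class ProfileEndpoint {
    public record Profile(long userId, String nickname, List<String> recentOrders) {}

    private volatile boolean degraded = false; // flipped by ops or a circuit breaker

    public Profile getProfile(long userId) {
        if (degraded) {
            // Degraded capability: core identity fields only, empty extras,
            // so upstream pages can still render.
            return new Profile(userId, lookupNickname(userId), List.of());
        }
        return new Profile(userId, lookupNickname(userId), lookupRecentOrders(userId));
    }

    private String lookupNickname(long userId) {
        return "user-" + userId;
    }

    private List<String> lookupRecentOrders(long userId) {
        return List.of("order-1", "order-2"); // stands in for an expensive query
    }
}
```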

2.2 Fault Perception and Localization#

Monitoring and alerting, fault root cause localization, emergency response procedures.

Core service indicators and business indicators need monitoring and alerting with coverage as close to 100% as possible. Coverage is only one aspect; the timeliness and accuracy of alerts matter just as much. Observable call chains, traceable logs, and visualized server performance are effective tools for fault perception and root-cause localization.

When building indicators, it is recommended to standardize metric naming and dimensions, which reduces the cost of understanding them and improves the efficiency of problem localization.

To improve the timeliness and accuracy of alerts on core indicators, it is recommended to focus monitoring in one direction in order to reduce maintenance costs: alert on business result indicators, and use process indicators to assist problem localization. The reason is that business process indicators are numerous, change frequently, and may span multiple systems, so they are scattered; business result indicators, by contrast, tend to converge.
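A minimal sketch of a standardized business result indicator, assuming Micrometer is on the classpath; the metric name and tag conventions are illustrative, not from the article.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

// Emits a standardized business *result* indicator (order creation outcome).
// Alerts would be configured on the failure ratio of this one metric, while
// the more scattered process indicators are kept only for drill-down.
public class OrderResultMetrics {
    private final Counter success;
    private final Counter failure;

    public OrderResultMetrics(MeterRegistry registry) {
        // Standardized naming convention: <domain>.<action>.result with a "result" tag.
        this.success = Counter.builder("order.create.result")
                .tag("result", "success").register(registry);
        this.failure = Counter.builder("order.create.result")
                .tag("result", "failure").register(registry);
    }

    public void record(boolean ok) {
        (ok ? success : failure).increment();
    }

    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        OrderResultMetrics metrics = new OrderResultMetrics(registry);
        metrics.record(true);
        metrics.record(false);
        System.out.println("success=" + metrics.success.count()
                + " failure=" + metrics.failure.count());
    }
}
```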
