
Handling Network Failures in the Cloud

May 9, 2020
~4 mins

I originally wrote this as a part of a submission to an online series/publication late last year. Now releasing it in the open.

In the era of Cloud Computing, network failures, and especially transient ones, are a given. They come in every form and originate from servers, switches, routers, load balancers, workers, connection pools, software applications, human error and, of course, the DNS. This means writing software applications for the cloud, in a distributed systems environment, requires an added degree of care and a resiliency mindset: a mindset that incorporates practices during software development which allow your applications to withstand these failures without disrupting customer experiences.

A common way to handle these failures is through the use of timeouts, retries with upper bounds, backoff and jitter. As a “cloud engineer” I take it a bit personally to introduce these practices anywhere I deal with network connectivity or similar communication over the Internet. Because things always go wrong in production.

A timeout, in simple terms, is the maximum time a connection or request is allowed to take, or to sit idle for. A lack of timeouts, combined with connectivity issues to a downstream service, often leads to increased latencies and resource starvation in servers. In such a scenario, a client and server are both waiting on something that may never complete, resulting in a non-optimal customer experience. It should become part of your DNA to put a timeout on every network call, so that the blast radius of an impact is contained to a partial degradation.

The software application should raise and return a distinct error when a timeout is breached, and the client should deal with it accordingly. Oftentimes that means retries.

Today, many modern applications and clients provide the ability to set timeouts on network calls. The hard part with timeouts is figuring out the time limit. What may work for a DNS resolution may not work for a database query. A general rule of thumb is to look at the past request durations (latency) of the service(s) involved and pick a figure high enough that normal requests fit comfortably within it, yet low enough that waiting on it does not degrade the customer experience. Consider this an iterative process. It may take a few iterations to settle on a figure that is sustainable (neither too high nor too low).
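To make this concrete, here is a minimal sketch in Python using the requests library; the endpoint and timeout values are made up for illustration and would be tuned against your own latency data:

```python
import requests

# Hypothetical endpoint; the timeout values are illustrative, not prescriptive.
URL = "https://api.example.com/health"

try:
    # (connect timeout, read timeout) in seconds -- derived from the
    # observed latency of the downstream service, not guessed.
    response = requests.get(URL, timeout=(3.05, 10))
    response.raise_for_status()
except requests.exceptions.Timeout:
    # Surface a distinct error so the caller can decide whether to retry.
    raise RuntimeError("request to downstream service timed out")
```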

Retries, as mentioned above, are a nice way to combat these failures. Because transient failures in the cloud are, by definition, short-lived, a subsequent call after a timeout often yields a success. When a client receives an error message or HTTP response indicating a timeout, it is the client’s responsibility to retry. Retries can be belligerent by nature: retries without an upper bound are a recipe for DDoS’ing your own systems. As a cloud engineer, unbounded retries should tingle your spidey senses. Additionally, putting retries in place for every network call in the stack may not be wise either. Pick your battles.
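As a sketch of what a bounded retry might look like, continuing the hypothetical requests example above (the attempt limit and the delay hook are assumptions for illustration):

```python
import time
import requests

MAX_ATTEMPTS = 3  # illustrative upper bound; never retry forever


def fetch_with_retries(url, delay_for=lambda attempt: 0.0):
    """Retry a GET on timeout, sleeping delay_for(attempt) seconds in between."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return requests.get(url, timeout=(3.05, 10))
        except requests.exceptions.Timeout:
            if attempt == MAX_ATTEMPTS:
                raise  # out of attempts -- let the caller degrade gracefully
            time.sleep(delay_for(attempt))
```

The delay_for hook is where the backoff strategies below would plug in.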

Backoff is a technique for performing retries gracefully, without overloading or burning out your backend systems. The simplest approach is to add a fixed delay between calls, also called linear backoff. While easy to implement and able to handle transient failures in the majority of cases, linear backoff does not scale so well when a downstream service is impaired for a prolonged period of time, because it continues to load the service at a fixed rate.
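Plugged into the hypothetical fetch_with_retries sketch above, a linear backoff is just a constant delay (the half-second figure is an assumption):

```python
FIXED_DELAY_SECONDS = 0.5  # illustrative; same wait before every retry


def linear_delay(attempt):
    # Constant pause between attempts -- simple, but keeps hitting an
    # impaired service at a fixed rate.
    return FIXED_DELAY_SECONDS


# fetch_with_retries("https://api.example.com/health", delay_for=linear_delay)
```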

Exponential backoff is a less aggressive form of backoff. As the name suggests, the duration between retries increases exponentially until the call succeeds or an upper retry limit has been hit. This is more graceful because it gives downstream servers room to recover instead of being overloaded into resource starvation.
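Sticking with the same sketch, an exponential backoff doubles the wait on every attempt, with a cap so the delay never grows unbounded (the base and cap values here are assumptions):

```python
BASE_DELAY_SECONDS = 0.5  # illustrative starting delay
MAX_DELAY_SECONDS = 30.0  # cap so the wait never grows unbounded


def exponential_delay(attempt):
    # 0.5s, 1s, 2s, 4s, ... capped at MAX_DELAY_SECONDS.
    return min(MAX_DELAY_SECONDS, BASE_DELAY_SECONDS * 2 ** (attempt - 1))


# fetch_with_retries("https://api.example.com/health", delay_for=exponential_delay)
```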

Backoff with jitter is another beneficial technique. While exponential backoff lets you spread retries out more scientifically than linear backoff, it still leaves backend systems open to synchronized bursts of requests on every retry cycle, leading to resource starvation or overload. To deal with this, we add jitter to the backoff strategy; in other words, we introduce randomness into the retry intervals. Instead of every client retrying on the same (exponential) schedule, each client retries at a slightly different time. This is especially beneficial when a large number of distributed clients are all coordinating with the same set of central backend systems.
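One common way to add jitter (“full jitter”) is to sleep for a random duration between zero and the exponential delay; a sketch building on the function above:

```python
import random


def exponential_delay_with_jitter(attempt):
    # Pick a random wait between 0 and the exponential delay, so a fleet
    # of clients does not retry in lockstep.
    return random.uniform(0.0, exponential_delay(attempt))


# fetch_with_retries("https://api.example.com/health",
#                    delay_for=exponential_delay_with_jitter)
```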

Last but not least, test and verify these settings in production :).
