Design for Failure in the Cloud. Actually, Everywhere.

By Avi Deitcher

2015 Mar 5

In one of our earlier discussions about cloud, an astute reader pointed out that one "downside" of public cloud, especially one like AWS, is that they make very few guarantees about your instances. While the system as a whole has service level agreements (SLAs), your particular instance does not. To quote:

"If your instances go down you're going to have to deal with it"

The underlying assumption, of course, is that you have better control over the level of availability of your particular instances and their underlying hardware, especially scheduled maintenance, when you control the entire environment rather than leaving it to a cloud provider like Amazon or Rackspace.

Let's look at a practical example.

You have an important business process running on a virtual server. Lo and behold, the server is showing a problem with memory or perhaps a disk. As a cautious and conscientious provider, you decide to deal promptly with the issue and switch it out. Why court serious problems?

What happens to your process running in a virtual instance in each case?

Private Hardware

You find an optimal time when customer usage is lowest and staff availability is highest, probably a Friday night
You inform customers and set up downtime, or prepare in advance to migrate customers to an alternate node
You set up pre- and post-tests to ensure everything migrates correctly
You bring the instances down, then the hypervisor, then the server
You replace the failing component
You bring it all back up
Everyone goes home to sleep

Cloud Provider

Maybe you are informed of the pending loss of your instance(s); perhaps not
Even if you are informed, you are not given an option to request a different time
The instance goes down, whether you are ready or not

La Différence

The key difference, of course, is control over the timing. Normally, cloud providers give plenty of warning before decommissioning the underlying hardware or software, which gives you time to move. (I have been getting warnings about the Heroku Platform-as-a-Service retiring client apps on their old Bamboo stack for months.) But the timing of scheduled maintenance, let alone unscheduled, cannot be left to the wishes of each cloud customer; the scheduling would become impossible.

When working in the cloud, you must assume that your instances will fail unpredictably.

In this respect, the commenter is 100% correct: you cannot plan for instance availability.

However, there is a deeper truth: this is very good.

Lack of guaranteed service levels for each instance forces us customers to architect our applications differently than before. We have to build them around the assumption of failure at an individual level, with resilience at the service level.

This is more than just "redundant": I have helped companies with a primary and backup data centre and failover procedures. It is even deeper than "resilient", although that is closer.

What you need is an architecture of "Design for Failure". D4F assumes that every instance of your application is ephemeral, and can disappear as a wisp without warning, but your system as a whole is available. My preferred name is "Ephemeral Design".
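To make the idea concrete, here is a minimal sketch in Python of a caller that treats every instance as disposable. It is an illustration only: the names `call_with_failover`, `make_instance`, `InstanceGone` and `AllInstancesFailed` are hypothetical, and the "instances" are plain functions standing in for real network services.

```python
import random


class InstanceGone(Exception):
    """Raised when a (simulated) instance has disappeared."""


class AllInstancesFailed(Exception):
    """Raised when no instance in the pool could serve the request."""


def call_with_failover(instances, request, rng=random):
    """Try the request on each instance, in random order, until one succeeds."""
    candidates = list(instances)
    rng.shuffle(candidates)  # Spread load; no instance is special.
    for instance in candidates:
        try:
            return instance(request)
        except InstanceGone:
            continue  # This instance vanished; move on to the next one.
    raise AllInstancesFailed(f"none of {len(candidates)} instances responded")


def make_instance(name, alive=True):
    """Build a stand-in for a service instance (assumption: a plain callable)."""
    def handle(request):
        if not alive:
            raise InstanceGone(name)
        return f"{name} handled {request}"
    return handle


pool = [make_instance("a", alive=False), make_instance("b"), make_instance("c")]
print(call_with_failover(pool, "ping"))
```

The caller never learns, or cares, that instance "a" is gone; the request is answered by whichever surviving instance is tried first. Only when every instance is down does the failure become visible at the service level.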

On the one hand, this can be unnerving for managers, architects and engineers used to "traditional" paradigms of architecture (funny how we call architectures that are barely ten years old, "traditional"). On the other hand, it leads to an overall better designed application and higher availability... at lower cost.

Which raises the question: should applications be designed this way even in private clouds or private hardware?

Ignoring for the moment the argument that private clouds and private hardware are fast becoming expensive solutions to very niche problems, applications that run well in the "unpredictable" cloud run much better in your own environment as well.

Microservices

Arguably the biggest infrastructure buzzword of late 2014 (after "Docker"/"containers"), microservices involves decomposing your application from a monolithic design, or from modules tied together, into individual components, each of which:

Runs independently
Communicates solely via a well-defined API, usually over the network
Maintains its own storage
Can be deployed, replaced or upgraded at will

By taking your application apart into individual services, you gain the ability to manage each at will. Each then becomes less complex than the whole, making it far easier to refactor into D4F.

Maintenance

With the assumption that your underlying service will disappear at any given moment - it is, after all, ephemeral - you no longer need to schedule maintenance. When the memory fails, just replace it. When the server is end-of-life, get rid of it.
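"Just replace it" can be sketched as a small reconciliation step: discard whatever is unhealthy and launch fresh instances until the pool is back at its desired size. This is a minimal Python illustration, not any provider's API; `reconcile`, `launch_instance` and the dictionary records are hypothetical stand-ins for a real orchestrator or auto-scaling service.

```python
import itertools

_ids = itertools.count(1)


def launch_instance():
    """Hypothetical launcher: returns a record describing a fresh instance."""
    return {"id": next(_ids), "healthy": True}


def reconcile(pool, desired, launch=launch_instance):
    """Drop unhealthy instances and launch replacements up to `desired`."""
    survivors = [inst for inst in pool if inst["healthy"]]
    while len(survivors) < desired:
        survivors.append(launch())
    return survivors


pool = reconcile([], desired=3)    # Initial deployment.
pool[0]["healthy"] = False         # The memory fails on one host.
pool = reconcile(pool, desired=3)  # No maintenance window: just replace it.
print(len(pool), all(inst["healthy"] for inst in pool))
```

Nothing in this loop asks when it is convenient to run. The same step handles a failed disk on a Tuesday afternoon and an end-of-life server, which is why the scheduling overhead disappears.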

The cost of scheduling maintenance adds up very quickly:

Labour to schedule with customers
Reduced customer satisfaction due to scheduled downtime - no matter what they say, customers hate downtime, even scheduled, even if contractually agreed
Weekend / evening labour time
Loss of productivity in the following days

Failure

Perhaps most importantly, when you do suffer a failure, the word "suffer" does not need to occur. Whether it is an instance, a hypervisor, or an entire physical system, it just doesn't matter. There is no panic, no rush, no soothing irritated customers. Just replace it and move on.

This does not mean that you cannot have a catastrophic failure. It does mean that:

Smaller failures no longer matter as much as they used to.
Catastrophic failures occur less frequently.
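A rough back-of-the-envelope calculation shows why. The Python sketch below assumes that instances fail independently of one another and that the service stays up as long as at least one instance is up; both are simplifying assumptions, and real failures are often correlated. The function name and the 95% figure are illustrative, not measurements.

```python
def service_availability(instance_availability, instances):
    """Probability that at least one of `instances` independent copies is up."""
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("instance_availability must be between 0 and 1")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    # The service is down only when every instance is down at the same time.
    return 1.0 - (1.0 - instance_availability) ** instances


for n in (1, 2, 3):
    print(n, round(service_availability(0.95, n), 6))
```

Under these assumptions, three instances that are each up only 95% of the time give a service that is up roughly 99.9875% of the time. The loss of any single instance becomes a non-event, and the whole service fails only when all of them fail together.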

Summary

Designing software for use in a cloud service that has systemwide, rather than instance-specific, SLAs requires a fundamentally different design than before. It requires Ephemeral Design.

While challenging, the design enables more resilient and nimble systems at a lower ongoing cost, while enabling new capabilities, like microservices, that increase the resilience and nimbleness even more. These benefits accrue even when running on your own hardware.

If you are planning on migrating to the cloud, ask if your architecture is built for the cloud. If you are already in the cloud and are struggling with service levels and costs, ask if your architecture is built for the cloud.

If the answer to either of these questions is no, or you are unsure, ask us for help; don't wait until your competitors are more nimble and lower-cost than you are.