Cloud Series Part 3: Activate the recovery plan

June 5, 2017 Lee Benning


This is the third and final article in our series on surviving major cloud outages. You can read Part 1 here and Part 2 here.

Adoption of cloud services is now the norm. As security concerns are overcome and the major vendors offer enhanced feature sets at attractive pricing, our customers are finding fewer reasons not to take the leap into the cloud. What can we advise our customers to do to ensure it's not a leap into the unknown? How can we reduce risk and restore confidence?

Things will break; even the most reliable services will experience an event of some kind. The key is in preparation and planning, to ensure the next cloud outage that hits our customers is only a hiccup, and not a potentially incapacitating heart attack. Many years ago, after a particularly hair-raising training sortie, an instructor of mine told me, “There is nothing to fear about the unexpected.” You have to expect things to go wrong from time to time. The key is to be well prepared to handle issues when they arise. 

As part of cloud strategy sessions with clients, be sure to cover the following areas to help reduce the impact of cloud outages.

Understand Your Environment

What systems are running in your clients' businesses? This is not specifically about the cloud or only cloud-based systems; it's worth doing for all company systems. In more complex environments, many inter-relationships have grown over time into an ecosystem of data links and exchanges. Just because a particular system is not living in the cloud doesn't mean a cloud outage won't affect it. This means carefully discovering and laying out a map of all your clients' systems, their components, details around data, user access, and much more.

What is the impact of each system if it becomes unavailable? Conduct a business impact analysis (BIA) for the organization. The goal here is to understand and rank the most critical apps and business processes. From there, understand what an outage would cost per hour or day if the system was not working as normal; this can often extend into other cost areas. When I performed such a study at an FMCG client some years back, we were surprised by the result. In that case, the costs of losing the ERP system for a day (lost production, ruined raw materials, standing charges for goods not transported, logistics costs for re-routing trucks, and fines from customers for missing urgently needed stock) amounted to £750,000 ($969,000) per day!

This type of planning will help the client understand which systems are critical to recover and how quickly, which in turn justifies the expense of protecting them and keeping them available.
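As a rough illustration, a BIA ranking can be sketched as a simple table of systems and outage costs. The system names and figures below are purely illustrative, not taken from any real client:

```python
# Minimal business impact analysis (BIA) sketch; systems and costs are made up.
from dataclasses import dataclass

@dataclass
class SystemImpact:
    name: str
    cost_per_day: float               # estimated cost of a one-day outage (GBP)
    max_tolerable_outage_days: float  # how long the business can run without it

systems = [
    SystemImpact("ERP", 750_000, 0.5),
    SystemImpact("email", 20_000, 2.0),
    SystemImpact("intranet wiki", 1_000, 10.0),
]

# Rank by outage cost: the most expensive systems to lose are recovered first.
ranked = sorted(systems, key=lambda s: s.cost_per_day, reverse=True)
for s in ranked:
    print(f"{s.name}: £{s.cost_per_day:,.0f}/day, "
          f"recover within {s.max_tolerable_outage_days} day(s)")
```

Even a table this simple makes the recovery order, and the budget conversation that follows, much easier.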

Understand Your Design Options and Limits for a DR Plan

So you have a BIA and a list of your critical systems. How can you mitigate a potential outage? Designing systems for availability is something of an art. It involves balancing the cost of standby systems and unused or warm cloud resources against the potential cost of losing access to a critical system and its impact on the business. This balancing act must be combined with an assessment of the probability of an extreme event occurring that could affect a client's systems.

If a system outage costs the company £1,000 ($1,300) per day, continuous protection of that system does not warrant a £100,000 ($129,000) annual spend, for example. The first step is to understand the service availability requirements. Next, understand the options for protecting the system and its data. Map those options to cloud provider services such as data replication, or back up to cloud storage in a different region from your production cloud. Consider using a different cloud provider altogether for backup data from your production environment.
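That balancing act can be expressed as a back-of-the-envelope calculation. The probability and outage-length figures below are purely illustrative:

```python
# Rough DR cost-benefit check: is the annual protection spend less than the
# expected annual loss from outages? All numbers here are illustrative.
def dr_worth_it(cost_per_day, p_outage_per_year, expected_days_down, annual_dr_spend):
    expected_annual_loss = p_outage_per_year * expected_days_down * cost_per_day
    return annual_dr_spend < expected_annual_loss

# A £1,000/day system with a 10% chance of a 2-day outage: expected loss is
# £200/year, so a £100,000/year protection spend is clearly not justified.
print(dr_worth_it(1_000, 0.10, 2, 100_000))    # False
# For a £750,000/day ERP, the same odds give a £150,000/year expected loss,
# so £100,000/year of protection pays for itself.
print(dr_worth_it(750_000, 0.10, 2, 100_000))  # True
```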

There are many options available today. Be careful to evaluate all the costs, both obvious and less obvious.

Don’t Put All Your Eggs in One Cloud Basket

Having a cloud strategy is great, but it also takes a lot of planning, coordination, continuous management, and optimization. As a result, many organizations like to find one cloud partner and work with them directly. There's nothing wrong with that as a strategy. However, think about the impact of losing that provider's service in your region, as with February's AWS outage. What if the recovery zones we set up within the very same cloud provider don't work? Or what if the network link to them becomes overwhelmed with DR or failover traffic as many clients feel the impact of an outage and invoke their failover plans?

New tools surrounding software-defined networking, network functions virtualization, and advanced load balancing help organizations stay agile. Furthermore, cloud control platforms like CloudStack, OpenStack, Eucalyptus, and Scalr all help extend cloud functionality and visibility and simplify management. It may be a valid strategy to create your clients' recovery strategy outside of their preferred provider. Maybe an on-premises or hybrid ecosystem makes sense.

Recovery options such as Disaster Recovery as a Service (DRaaS) can be leveraged, where you pay only if a disaster actually happens and the service actually gets used. When a client's primary cloud goes down, it's possible to fail over to a platform in a way that is completely transparent to the end user. Often, clients will say they don't care where the service comes from, as long as it meets SLA and cost requirements, and often compliance or data protection needs.
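A transparent failover ultimately comes down to health-checking endpoints and routing around the dead one. The sketch below assumes two hypothetical health-check URLs, one per provider; in practice the switch would usually happen at the DNS or load-balancer layer rather than in application code:

```python
# Hypothetical two-provider failover check; both URLs are placeholders.
import urllib.request

ENDPOINTS = [
    "https://primary.example.com/health",   # preferred cloud provider
    "https://standby.example.net/health",   # DRaaS / secondary provider
]

def pick_live_endpoint(endpoints, timeout=2):
    """Return the first endpoint that answers its health check, or None."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except OSError:
            continue  # provider unreachable: try the next one in the list
    return None
```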

Designing for and Operating in the Cloud is not a Single Event

One key factor for cloud services is to continuously evaluate options. Do not assume a design for system resilience from 12 months ago is still as relevant or cost-effective as it was when it was created. Evolve, and ensure your plan stays in line with your clients' business. In today's agile operating environment, a recovery plan should never be considered set in stone; it should keep evolving and adjusting. As soon as an application or system is updated or modified, check its failover and recovery strategy. As soon as new data sources are added, test the dependencies on other databases and workloads.

A broken link in the outage recovery plan chain can prolong the failure and create additional challenges and costs. Because of the importance of IT in many business operations, the recovery plan must coincide with infrastructure changes, updates, business initiatives, and evolving IT strategies.

Test the Plan from Time to Time and be Confident it Works

Test out the recovery plan. What good is a plan if it’s never tested? Test out your clients’ recovery systems to ensure proper failover. In fact, you should test out various failure scenarios — weather-related, malicious, and accidental events should all be planned for. It’s critical to test various components in failure mode to ensure you have proper failover capabilities. 

Here's the thing, though: you do not need to test recovery strategies on your production systems. Virtualization technology within the cloud enables you to clone and create 'production-like' silo test beds to check recovery of the most critical apps and data points. From there, you can tweak your plan and method. Often, the results of recovery testing can be used to optimize the production environment. You can even experiment with different providers to see which works best.
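One way to make those non-production drills repeatable is a tiny harness that injects a failure, runs the documented restore procedure, and checks the result against the recovery time objective (RTO). Everything below, scenario names included, is a hypothetical sketch:

```python
# Hypothetical recovery-drill harness for a cloned, non-production environment.
import time

def run_drill(scenario, inject_failure, restore, rto_seconds):
    """Inject a failure, run the restore procedure, and check it beats the RTO."""
    inject_failure()
    start = time.monotonic()
    restore()
    elapsed = time.monotonic() - start
    ok = elapsed <= rto_seconds
    print(f"{scenario}: restored in {elapsed:.1f}s (RTO {rto_seconds}s) -> "
          f"{'PASS' if ok else 'FAIL'}")
    return ok
```

The same harness can then be run once per scenario class the plan covers: weather-related, malicious, and accidental failures.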

This has costs, but again, they are relative to the cost of losing systems. Companies place more emphasis on recovery when they know the cost of an outage. Have data to support the argument.

And Finally…

The key point here is to, at the very least, convince the client to have some kind of plan in place to protect against a cloud service outage that could affect their business. Technology is not fail-safe, and the complexities of cloud operations add to the risks of adopting it in almost as many ways as they reduce risk and simplify operations. Even a platform offering 'six nines' of availability can still be down for around 30 seconds over the course of a year. Being prepared isn't an unaffordable luxury; what a business often can't afford is the cost of an outage to a critical service. Rather, it's an insurance policy, no different from the ones we buy to protect our homes from burning down or our cars from being stolen.
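For reference, an availability figure converts into yearly downtime with simple arithmetic:

```python
# Convert an availability figure into downtime per (non-leap) year.
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

def downtime_seconds_per_year(availability):
    return (1 - availability) * SECONDS_PER_YEAR

print(round(downtime_seconds_per_year(0.999999), 1))  # six nines -> 31.5 s/year
print(round(downtime_seconds_per_year(0.9999) / 60))  # four nines -> ~53 min/year
```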

The mindset is similar: it might not happen, but what if it does? In the world of cloud, be prepared for volatile and changing parameters. The better prepared you are around your critical systems, and the better you have designed them given the options available, the quicker you can recover.


Lee Benning

Lee is the Chief Cloud Architect for Leidos UK/Europe.
