2012 June 29 Outage:
No Excuse Storms Should Not Take Down Your Cloud. – InfoWorld – 2012 July 3
Amazon Cloud Goes Down Friday Night, Taking Netflix, Instagram and Pinterest With It. – Forbes – 2012 June 30
As David Linthicum observes in InfoWorld, data center operators need to practice disaster recovery scenarios. You might think that Amazon’s June 14th outage would have provided sufficient practice. Hypothetically, staff availability could have been one of Amazon’s problems in this most recent outage. With many roads blocked and power out, local employees not already on-site would have been unable to get on-site, to work remotely, or even to phone in once their mobile device batteries ran out. And one wonders whether the facility security systems were operating on backup power. (I had a relative arrive at San Francisco Airport just as the Loma Prieta earthquake struck. He was able to get to his Pacific Heights home, but, thanks to the building’s fancy security system, he could not get into it until the power was restored.)
As I noted in my post below on the June 14th outage, a good business continuity plan needs to account for all the resources needed to both withstand and recover from a disaster.
Anthony Wing Kosner makes another good point in his Forbes article: buyers of data center services need to ensure that their own customers and users are kept appropriately informed during an event that compromises performance and/or availability. In the postmortem analyses, Kosner and others observe that buyers need to take responsibility for distributing their resources to withstand a server, zone, or site outage – and that Amazon recommends such measures. Using the problems with home mortgages in the US as an analogy, perhaps the issue is not only a lack of full disclosure, but also the buyer’s difficulty in understanding the disclosure.
2012 June 14 Outage:
In commenting on Gretchen Curtis’ (Piston Computing) blog on the 2012 June 14 AWS outage, Matt Prigge (InfoWorld) echoes the need for disaster recovery and redundancy for ‘mission-critical infrastructure’, but then counters her argument against public cloud with recommendations for deploying redundant hardware and utilizing multiple availability zones and/or cloud providers regardless of whether the cloud is public or private. I view this outage a little differently: you get what you pay for, so if you put cost ahead of QoS, then a commodity cloud product is probably going to be your best choice – and you’ll have to put up with the outages. The tradeoffs are not between public and private, but rather between ‘better’ and ‘cheaper’ primary value propositions.
There are plenty of public cloud products offering a primary value proposition of ‘better’ – just as there were (and still are) plenty of hosting and traditional IT infrastructure outsourcing providers who offered high availability, including data centers with backup generators on-site to provide power redundancy – but they are more expensive than their commodity counterparts.
Talking about on-site backup generators, in my experience just having the generators on-site is insufficient. More than once I’ve seen the generators fail to take over power provisioning in a timely fashion. Many large data centers are located in hot climates: in the US, for example, there are data center clusters in Las Vegas (average July temperature is 104 degrees F), Sacramento (record high was 115 degrees F), Phoenix (average July temperature is 102 degrees F), and Reston in Northern Virginia (record high was 104 degrees F). In such locations, battery backup to the servers has a neutral to negative impact if an extended power outage has disabled the data center cooling. (The impact of battery backup is negative if the servers do not automatically and gracefully shut down when the ambient temperature is outside of the acceptable operating zone.)
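To make the parenthetical point concrete, here is a minimal sketch of the decision a server (or its management controller) might make while running on battery. The names ThermalPolicy and next_action, and the 50 to 95 degrees F operating range, are assumptions for illustration only; real limits come from the hardware vendor, and real shutdown logic lives in firmware or the operating system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThermalPolicy:
    # Hypothetical acceptable operating zone, in degrees F.
    min_ambient_f: float = 50.0
    max_ambient_f: float = 95.0


def next_action(ambient_f: float, on_battery: bool,
                policy: ThermalPolicy = ThermalPolicy()) -> str:
    """Decide what a server should do given ambient temperature and power source."""
    in_range = policy.min_ambient_f <= ambient_f <= policy.max_ambient_f
    if not in_range:
        # Outside the operating zone: shut down gracefully, regardless of
        # whether mains power or battery is keeping the server alive.
        return "shutdown"
    if on_battery:
        # Cooling is still adequate, so battery power is buying useful time.
        return "run_on_battery"
    return "run"


if __name__ == "__main__":
    # Cooling lost during an outage in a hot climate: battery does not help.
    print(next_action(ambient_f=104.0, on_battery=True))
```

Without the temperature check, the battery simply keeps the servers running, and heating the room, after the cooling has stopped.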
In the case of power, whether the data center is public or private, IT redundancy inside the data center is not going to help if there’s a loss of the single power supplier or power grid to the data center. Likewise, a loss of the single communications supplier or communications network to the data center cannot be overcome by IT redundancy (though there is less impact on the hardware). There are data centers which avoid these common SPOFs (single points of failure), but since most companies do not require this level of continuity for all their business activities, most data center providers either assume that their customers will, as needed, deploy local and geographic high availability solutions (such as a private dark fiber run to a dedicated space (e.g. a cage) in the (shared) data center), or offer solutions such as disaster recovery of a configuration at another site (in a data center on a different power and communications grid).
I recommend that risk mitigation of the data center infrastructure (including external services such as power, communications, and transportation (of labor and equipment that has to get on-site)) be considered in parallel to – but not completely independently of – risk mitigation of the IT operating within that data center. So weigh the tradeoffs between local application failover and server component redundancy to mitigate the risks of server failures, in parallel to the tradeoffs between a data center with (tested) on-site backup generators (or connection to two power grids) and geographic high availability (failover of the application to a data center on a different grid).
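One way to start that infrastructure review is to list each site’s external dependencies and look for any single supplier whose loss would take out every site at once. The sketch below assumes a simple, hypothetical inventory format (site name, then dependency kind, then the set of suppliers); the site, grid, and carrier names are invented for illustration, and it assumes every site lists every kind of dependency.

```python
def single_points_of_failure(sites):
    """Return (kind, supplier) pairs whose loss leaves no site operational.

    `sites` maps a site name to {dependency kind: set of suppliers}.
    A site survives the loss of a supplier only if it still has another
    supplier of the same kind (e.g. a second power grid).
    """
    suppliers = {
        (kind, supplier)
        for deps in sites.values()
        for kind, names in deps.items()
        for supplier in names
    }
    spofs = []
    for kind, supplier in sorted(suppliers):
        survivors = [
            name for name, deps in sites.items()
            if deps.get(kind, set()) - {supplier}
        ]
        if not survivors:
            spofs.append((kind, supplier))
    return spofs


if __name__ == "__main__":
    # Hypothetical deployment: two sites, different carriers, same power grid.
    deployment = {
        "primary": {"power": {"grid-A"}, "comms": {"carrier-X"}},
        "dr-site": {"power": {"grid-A"}, "comms": {"carrier-Y"}},
    }
    print(single_points_of_failure(deployment))
```

In this example the failover site protects against a carrier outage but not a grid outage, which is exactly the kind of gap that server-level redundancy cannot close.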
Risk mitigation of the business processes that are enabled by that IT should also be considered in parallel. The Red Cross cannot just fail over a regional data center, but must also fail over the affected portion of its blood collection and distribution networks (including the pickup and delivery routes taken by its trucks).