July 4, 2011

The 10 worst cloud outages (and what we can learn from them)

Sending your IT business to the cloud comes with risk, as those affected by these 10 colossal cloud outages can attest

As a concept, there's a lot to like about the cloud. Drop those bulky servers and get yourself a big, white hard drive in the sky. Someone else handles the upkeep and lets you put your data where you want it. Even the word "cloud" itself brings to mind a heavenly (if slightly fluffy) fantasy.


The reality is, of course, a mixed bag. What you gain in avoiding upkeep, you lose in control. And the security concerns are considerable. But nowhere is the nightmare as vivid as it is when your cloud service goes down.


Just ask any of the businesses affected by Amazon Web Services' high-profile outage in April. "We were pretty blown away," says Nick Francis, whose startup, Help Scout, had launched just one week prior to Amazon's problem. "We definitely weren't prepared."

Francis wasn't the only one caught off-guard. Big-name properties like Reddit and Foursquare fell flat when Amazon's cloud sputtered.

"The cloud has been sold as this magical thing that just works and is totally reliable," says Lew Moorman, chief strategy officer of Rackspace, a cloud provider that's seen its fair share of outages. "The truth is that buying through the cloud is another way of buying computing, and computing is inherently flawed. If you want to make sure those flaws don't hurt you, you have to plan ahead."

To help keep your business pain-free in the cloud, we offer these hard-earned lessons from 10 of the worst cloud storms the Web has weathered.

Colossal cloud outage No. 1: Amazon Web Services goes poof

Freeing yourself from network maintenance gruntwork is a chief selling point for doing business in the cloud. The downside? Standing by helplessly when your cloud vendor's routine configuration change grinds your business to a halt.

That is what many AWS customers experienced this past April, when Amazon's Northern Virginia data center suffered a glitch and -- to use the technical term -- went totally nutso.

The error started during a network upgrade, when a misrouted traffic shift sent a cluster of Amazon EBS (Elastic Block Store) volumes into a remirroring storm, as they sought out available boxes into which they could insert backups of themselves -- perverse, I know. That set off a series of events that ultimately took down much of the company's U.S. East Region.

That's the short version, anyway -- if you're interested in the full nitty-gritty, clear out 47 hours in your schedule and read Amazon's novel-length explanation.

The problems persisted for about four days. But while many businesses struggled, others such as Netflix took the storm in stride. The key to survival? Designing your systems with these types of failures in mind.

"Our architecture avoids using EBS as our main data storage service, and the SimpleDB, S3, and Cassandra services that we do depend upon were not affected by the outage," Netflix engineers wrote in their "Lessons Netflix Learned From the AWS Outage" blog post. Stateless services and multiple redundant hot copies of data across availability zones were key to sparing Netflix the pain of the AWS failure.
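For readers who want to see the idea in miniature, here is a rough Python sketch of what "redundant hot copies across availability zones" can look like from the caller's side. It is not Netflix's code and it does not call any real AWS API: the zone names, the `make_reader` helper, and the `ZoneUnavailable` exception are all hypothetical stand-ins. The point is only that a stateless caller can try each copy in turn and carry on when one zone is dark.

```python
from typing import Callable, Dict


class ZoneUnavailable(Exception):
    """Hypothetical error: a zone's copy of the data cannot be reached."""


def read_with_failover(key: str, replicas: Dict[str, Callable[[str], str]]) -> str:
    """Try each zone's replica in order and return the first successful read."""
    errors = {}
    for zone, reader in replicas.items():
        try:
            return reader(key)
        except ZoneUnavailable as exc:
            # Remember why this zone failed, then move on to the next copy.
            errors[zone] = str(exc)
    # Every copy failed: surface all the reasons at once.
    raise ZoneUnavailable(f"all zones failed for {key!r}: {errors}")


def make_reader(zone: str, data: Dict[str, str], healthy: bool = True) -> Callable[[str], str]:
    """Build a fake replica reader for one zone (illustration only)."""
    def reader(key: str) -> str:
        if not healthy:
            raise ZoneUnavailable(f"{zone} is down")
        return data[key]
    return reader


# The same data is kept "hot" in three made-up zones; one of them is down.
data = {"profile:42": "alice"}
replicas = {
    "zone-a": make_reader("zone-a", data, healthy=False),
    "zone-b": make_reader("zone-b", data),
    "zone-c": make_reader("zone-c", data),
}

print(read_with_failover("profile:42", replicas))
```

Because the caller keeps no state of its own, any instance in any surviving zone can serve the request, which is the property the Netflix engineers credit in their post. A real deployment would also have to worry about keeping the copies in sync, which this sketch ignores.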
