InfoStor: The Amazon EC2 Outage: Lessons Learned — by Russ Fellows

Thursday, May 19th 2011

By now most of us have heard about the Amazon EC2 outage that began on April 21st, 2011.  As usual, there will be finger pointing and blame in the weeks to come, and probably lawsuits.  In the end, some people and businesses will conclude that utilizing “The Cloud” is just too risky.  I believe this is the wrong conclusion to take away from this event, for several reasons.

The first is that Cloud services in general, and Amazon in particular, have been phenomenally successful over the past three years because they enable new business models that were previously impossible.  Companies both large and small are increasingly turning to Cloud based services to provide a significant portion of their IT infrastructure.  Some companies proudly state that they could not exist without the services that providers like Amazon offer.

The reason Amazon’s outage is big news is that so many companies now rely upon these types of services, and many have chosen Amazon specifically.  As of July 2009, Amazon boasted over 1,400 companies using its EC2 services for business critical operations.  Since that time, the number of companies using the Cloud has grown dramatically.

The second reason we shouldn’t blame cloud providers is that they are not responsible for how their services are used.  Some have argued that providers like Amazon should be aware of the services they are hosting and prevent “important” applications from using Amazon Web Services (AWS) to begin with.  Although this rationale seems appealing, it withers under scrutiny.

The same argument could be made for ISPs, telcos and even utility service providers.  Are the service providers to blame if those relying upon their services don’t adequately plan?  Should utilities refuse telephone or electric service to a hospital that has no contingency communication or power plans in case of a disaster?

Most people and most courts have decided that no, these entities are not responsible for the failure of others to plan for disasters.  Internet service providers are not responsible for misuse of data flowing across their networks, nor are telcos responsible for criminals utilizing telephones to communicate.

Finally, the most important reason why we shouldn’t be too quick to judge Amazon and other cloud providers is that this outage shows us once again that people are the critical link.  This and other outages show that there is no substitute for smart people and Disaster Recovery planning.

There are thousands of examples of companies utilizing Amazon EC2 that remained up during the most recent event.  Those that chose to use Amazon’s high availability features, such as automated failover and alternate availability zones, were able to continue operating.  Netflix, a high-profile company running exclusively on Amazon EC2, is one example, along with thousands of others that remained online.
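To make the idea of failover across availability zones concrete, here is a minimal sketch in Python.  It is not Amazon's API and not how any particular company implemented failover; the `Replica` type, the zone labels, the endpoint names and the `zone_a_is_down` health check are all hypothetical, chosen only to illustrate the pattern of routing traffic to a copy of the application in a zone that is still healthy.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Replica:
    """One copy of the application, running in a named availability zone."""
    zone: str       # hypothetical zone label, e.g. "zone-a"
    endpoint: str   # hypothetical address that clients would be sent to


def pick_healthy_replica(
    replicas: Iterable[Replica],
    is_healthy: Callable[[Replica], bool],
) -> Optional[Replica]:
    """Return the first replica that passes the health check, or None."""
    for replica in replicas:
        if is_healthy(replica):
            return replica
    return None


# Illustrative deployment: the same service running in two zones.
REPLICAS = [
    Replica(zone="zone-a", endpoint="app.zone-a.example.internal"),
    Replica(zone="zone-b", endpoint="app.zone-b.example.internal"),
]


def zone_a_is_down(replica: Replica) -> bool:
    """Stand-in health check that simulates an outage in zone-a."""
    return replica.zone != "zone-a"


if __name__ == "__main__":
    chosen = pick_healthy_replica(REPLICAS, zone_a_is_down)
    print(chosen.endpoint if chosen else "no healthy replica")
```

In a real deployment the health check would probe the service over the network and the routing decision would usually live in a load balancer or DNS layer.  The point of the sketch is only that a second replica in another zone gives the routing logic somewhere to go.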

Cloud computing does not alleviate the need for planning; instead it accentuates it.  The importance of IT architects, CTOs and CIOs only increases with the move to cloud computing.

This is not the first outage a service provider has suffered, nor will it be the last.  Data centers fail; this shouldn’t be news to anyone, least of all IT professionals.  The only difference is that a single event exposed many poorly designed applications at the same time.

To quote Amazon’s own design guidelines, “A well designed architecture built on top of EC2 keeps important information (databases, log files, etc) in easy to manage persistent and redundant data stores which can be snapshotted, duplicated, detached, and attached to new servers.”

The only real failure is the failure to plan for a local outage.  Amazon offers both geographic regions and availability zones within those regions.  Specifically in the North American geography, Amazon provides three zones in the East and three zones in the West.  The recent outage involved systems running in only one zone, within the US-East-1 region.
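Planning for a local outage can be reduced to a simple question: if any one zone disappears, is enough of the application still running?  The following Python sketch checks a deployment plan against that question.  The region and zone labels and the `zones_that_would_cause_outage` helper are hypothetical and exist only for illustration; they do not correspond to any Amazon tool.

```python
from collections import Counter
from typing import List, Tuple

# A placement is a (region, zone) pair; the labels used here are hypothetical.
Placement = Tuple[str, str]


def zones_that_would_cause_outage(
    placements: List[Placement],
    min_instances: int = 1,
) -> List[Placement]:
    """Return every zone whose loss leaves fewer than min_instances running.

    An empty result means the plan tolerates the loss of any single zone.
    """
    per_zone = Counter(placements)
    total = len(placements)
    return sorted(
        zone for zone, count in per_zone.items()
        if total - count < min_instances
    )


if __name__ == "__main__":
    single_zone = [("east", "zone-1"), ("east", "zone-1")]
    multi_zone = [("east", "zone-1"), ("east", "zone-2")]
    print(zones_that_would_cause_outage(single_zone))
    print(zones_that_would_cause_outage(multi_zone))
```

A plan with every instance in one zone fails the check no matter how many instances it contains, which is the situation many affected applications appear to have been in.  Raising `min_instances` models a service that needs more than one surviving instance to carry its load.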

Well-run companies have been designing high-availability systems for decades.  Amazon provides nearly all the tools needed to build fault-tolerant systems, but it is still up to the companies using these services to plan properly for contingencies and to implement an IT architecture that is resilient and fault tolerant.

The Amazon EC2 Outage — Final Thoughts

Lessons sometimes need to be learned the hard way.  This will certainly be a watershed moment for those who haven’t learned the importance of IT architecture and planning.  The outage at Amazon shows how rapidly companies of all sizes have adopted these services.

Cloud and other IT service providers enable new business models, while allowing existing businesses to streamline their operations.  In the new world, IT is no longer a capital-intensive function that must be provided in-house.  Cloud service providers enable choice and flexibility, but with those choices comes responsibility.

The fact remains that in order to deliver mission-critical IT, companies must find and retain talented IT personnel.  Dynamic businesses will increasingly rely upon IT and the professionals who can deliver these services.

In the end, it is the people who matter most; isn’t that always the lesson?
