Best Description of the Likely Cascading Failure That Took Out EC2

Let’s think of a failure mode here: Network congestion starts making your block storage environment think that it has lost mirrors, you begin to have resilvering happen, you begin to have file systems that don’t even know what they’re actually on start to groan in pain, your systems start thinking that you’ve lost drives so at every level from the infrastructure service all the way to “automated provisioning-burning-in-tossing-out” scripts start ramping up, programs start rebooting instances to fix the “problems” but they boot off of the same block storage environment.

You have a run on the bank. You have panic. Of kernels. Or language VMs. You have a loss of trust so you check and check and check and check but the checking causes more problems.

via On Cascading Failures and Amazon’s Elastic Block Store « Joyeur.

Closing in on 36 hours since this meltdown began, Amazon has still not been able to restore all of the EC2 instances and EBS volumes that were knocked offline in the #SkynetMassacre. This article is the best explanation of what most likely happened. And the scary part is that it will happen again. And again.

Sadly, there is not a lot to do but try to build enough redundancy into your systems to survive this sort of thing. But it is likely that the redundancy itself will bring about another meltdown at some point, since the machinery that detects and repairs failures is exactly what piled on here. Guess I’ll just need to keep thinking about how to deal with this sort of thing.
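
One small thing that can take the edge off a "run on the bank" is making sure automated checks and retries do not all fire in lockstep. What follows is only a rough sketch of capped exponential backoff with random jitter, not a description of anything Amazon actually runs; the function name `backoff_delays` and its parameters are made up for illustration.

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return one randomized wait (in seconds) per retry attempt.

    Hypothetical helper: the ceiling doubles with each attempt up to
    `cap`, and the actual wait is drawn uniformly between 0 and that
    ceiling so that many clients do not retry at the same instant.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    # Print the waits a health checker might use before re-probing storage.
    for i, delay in enumerate(backoff_delays(6)):
        print(f"retry {i}: wait {delay:.2f}s")
```

A health checker or reprovisioning script would sleep for each delay before trying again, so the checking backs off as trouble persists instead of ramping up.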

Using EC2 to (re)Distribute “Repurposed Virtualities”

They give everyone the power to create their own version of Windows and share it with others. Granted, that’s not the kind of thing too many non-techies, or even techies, wake up in the morning with an overwhelming desire to do. But why not? I’m still getting used to the idea of creating my own versions of Windows, haven’t even released anything yet. But since everything I’m building is open source, there’s no reason someone couldn’t take my package, make some changes, and then redistribute it with their customizations. Trust obviously becomes a pretty important issue here.

via Scripting News: Caprica and repurposed virtualities.

This sounds a lot like what already goes on in the EC2 community around public AMIs. If you take a look at the list of public AMIs (over 4,000 at this point) you’ll see that many are bundles of the OS with one or more application packages. For example, Drupal and Asterisk AMIs are easy to find. Most of the public AMIs are Linux-based, but nearly 300 use Windows as the core OS.

I’ve done this sort of thing myself: I built a Linux AMI starting from a RightScale base image, to which I added Apache, MySQL, PHP, Drupal, and more, all configured to work together. Once I saved the AMI, I had a prototype server that I could use to quickly scale up our web cluster. I’ve also shared the AMI with colleagues interested in getting started with Drupal.

So, having an AMI that contains Dave’s work is certainly doable and an excellent idea. I, for one, would certainly try it out.