Best Description of the Likely Cascading Failure That Took Out EC2

Let’s think of a failure mode here: Network congestion starts making your block storage environment think that it has lost mirrors, you begin to have resilvering happen, you begin to have file systems that don’t even know what they’re actually on start to groan in pain, your systems start thinking that you’ve lost drives so at every level from the infrastructure service all the way to “automated provisioning-burning-in-tossing-out” scripts start ramping up, programs start rebooting instances to fix the “problems” but they boot off of the same block storage environment.

You have a run on the bank. You have panic. Of kernels. Or language VMs. You have a loss of trust so you check and check and check and check but the checking causes more problems.

via On Cascading Failures and Amazon’s Elastic Block Store « Joyeur.

Closing in on 36 hours since this meltdown began, Amazon has still not been able to restore all of the EC2 instances and EBS volumes that were knocked offline in the #SkynetMassacre. This article is the best explanation of what most likely happened. And the scary part is that it will happen again. And again.

Sadly, there is not a lot to do but try to build enough redundancy into your systems to survive this sort of thing. But it is likely that the machinery that provides that redundancy is going to bring about another meltdown at some point. Guess I’ll just need to keep thinking about how to deal with this sort of thing.
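
One small thing that might help with the "check and check and check" part of the quote is making retries back off instead of hammering the same struggling storage. What follows is only a rough sketch of capped exponential backoff with random jitter, not anything Amazon or Joyent has said they run. The names `backoff_ceiling` and `check_with_backoff`, and the default values, are made up for illustration.

```python
import random
import time


def backoff_ceiling(attempt, base=1.0, cap=60.0):
    """Upper bound, in seconds, for the wait before retry `attempt` (0-based).

    Doubles with each attempt and is capped so waits never grow unbounded.
    `base` and `cap` are illustrative defaults, not recommended values.
    """
    return min(cap, base * 2 ** attempt)


def check_with_backoff(check, max_attempts=6, base=1.0, cap=60.0,
                       sleep=time.sleep, rng=random.random):
    """Call `check()` until it returns True or the attempts run out.

    After each failure, wait a random fraction of the growing ceiling so
    that many clients do not all re-check at the same instant.
    `sleep` and `rng` are injectable so the function can be tested
    without real waiting or real randomness.
    """
    for attempt in range(max_attempts):
        if check():
            return True
        # No point in waiting after the final failed attempt.
        if attempt < max_attempts - 1:
            sleep(rng() * backoff_ceiling(attempt, base, cap))
    return False
```

The idea is that each failed check makes the next one arrive later and at a less predictable time, so the checking itself adds less load to whatever is already in trouble. It does nothing about the underlying re-mirroring storm, of course.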

FDIC Leans on GA Banks to Straighten Up or Face Failure

Seven Georgia banks were issued the Federal Deposit Insurance Corp.’s strongest regulatory rebuke last month, according to an announcement Friday by the banking industry insurer. The banks, concentrated primarily in metro Atlanta, entered into cease-and-desist orders with the FDIC, an agreement that stipulates how the bank must overhaul its business, or face failure.

FDIC ups cease-and-desist orders in Ga. – Atlanta Business Chronicle:

In case anyone was wondering whether the economy has hit bottom yet: personally, I’d go with no, we’re still falling. And “hitting” the bottom may be the right way to put it, because it will be a big hit.
