This Netflix blog post starts off sounding like an April Fools’ prank:
We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient. We are excited to make a long-awaited announcement today that will help others who embrace this approach.
We have written about our Simian Army in the past and we are now proud to announce that the source code for the founding member of the Simian Army, Chaos Monkey, is available to the community.
Do you think your applications can handle a troop of mischievous monkeys loose in your infrastructure? Now you can find out.
But it turns out it’s not a prank, and it’s actually pretty neat. Chaos Monkey is a script that runs during standard business hours (to make sure people are in the office) and randomly kills instances in Netflix’s Auto Scaling groups on EC2. The point isn’t to cause headaches. Or maybe it is, but only in the short term. By forcing random failures when plenty of people are on hand and no one is being woken up in the middle of the night, Netflix has been able to find all the little bits of a highly available system that aren’t actually highly available: the type of things you normally only discover when everything goes catastrophically wrong.
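To make the idea concrete, here is a minimal sketch of that loop in Python. It is not Netflix’s code and it doesn’t talk to AWS: the group names, instance IDs, the 9-to-5 weekday window, and the `terminate` callback are all made up for illustration. It only shows the two rules described above: act only during business hours, and pick a random instance from each group.

```python
import random
from datetime import datetime


def is_business_hours(now, start_hour=9, end_hour=17):
    """True on weekdays between start_hour (inclusive) and end_hour (exclusive).

    The 9-to-17 window is an assumption for this sketch.
    """
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victims(groups, probability, rng):
    """Pick at most one random instance per group.

    Each non-empty group is hit with the given probability.
    """
    victims = {}
    for name, instances in groups.items():
        if instances and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


def run_once(groups, terminate, now, probability=1.0, rng=None):
    """Run one pass of the monkey; return {group name: terminated instance}.

    `terminate` is a hypothetical callback standing in for whatever
    actually kills an instance in your environment.
    """
    rng = rng or random.Random()
    if not is_business_hours(now):
        return {}  # nobody is in the office, so leave everything alone
    victims = pick_victims(groups, probability, rng)
    for instance_id in victims.values():
        terminate(instance_id)
    return victims


if __name__ == "__main__":
    # Hypothetical groups and instance IDs.
    groups = {"api": ["i-a1", "i-a2", "i-a3"], "web": ["i-w1", "i-w2"]}
    killed = []
    # 2024-01-01 10:00 is a Monday morning, so the monkey is allowed to act.
    run_once(groups, killed.append, datetime(2024, 1, 1, 10, 0))
    print(killed)
```

Passing `terminate` in as a callback means you can dry-run the selection logic with something harmless like `list.append` before wiring it up to anything that really shuts machines down.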
Can your infrastructure handle the Chaos Monkey?