There was a mighty interesting piece in yesterday’s Network World entitled “Three Lessons From Netflix On How to Live in the Cloud”. You could certainly think of worse people to learn such lessons from! Pretty much all of Netflix’s customer-facing services run in Amazon Web Services’ public cloud, which serves Netflix’s whopping 38 million members with literally billions of hours of streamed content every month.
Here are the author’s Netflix-derived recommendations:
- Create micro-services: “One Netflix goal is to create the smallest level of abstraction as possible for each application to minimize the effect of any downtime or service failure in the cloud. If this is done successfully, it drastically reduces the ‘blast radius’ of any cloud outage, says Tseitlin, who’s responsible for building out the company’s cloud and ensuring its reliability. For example, if Netflix’s personalization service goes down, then the company defaults to a more generic recommended movies list that will suggest the most popular titles, but not necessarily those personalized to the user. That minimizes the snowball effect of one service bringing down others.”
- Build in redundancy: “It’s one thing to have functionality of applications and services deployed to the cloud at granular levels, it’s another to scale it and ensure it works all the time. That’s why Netflix has horizontally scaled its service across the globe. Each service is deployed to at least three Availability Zones (AZ), which are isolated locations within Amazon’s cloud. AWS recommends deploying to at least two AZs for its service-level agreement (SLA) to kick in. Not only are Netflix services deployed to three AZs, but they are each scaled independently so that if an AZ fails then load balancers migrate traffic to the healthy AZ. In addition to scaling to multiple AZs, the entire Netflix service is replicated across two regions within Amazon’s cloud – both U.S. East and EU West – and replicated asynchronously. The idea is that if an entire region in Amazon’s cloud were to fail then the service would still be available.”
- Be resilient: “Even with monitoring and alerts that cover the entire operations of Netflix, failures will still happen. That’s why the company has built a platform for monitoring its service and fixing mistakes. The Simian Army is a series of open source tools that have been developed internally by Netflix that test the fault tolerance of the company’s operations. Chaos Monkey is one that randomly kills various services to test failure at the application layer. Chaos Gorilla is another that brings down an entire AZ to test for high availability. Chaos Kong is a service in development that Netflix hopes to use to eventually test an entire region shutting down. Tseitlin says that Netflix is so concerned with testing and monitoring that it jokingly refers to itself as a monitoring company that occasionally delivers movies.”
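
To make the first and third lessons a little more concrete, here is a minimal Python sketch of the fallback idea: if a personalization call fails, serve a generic list of popular titles instead. This is purely illustrative and not Netflix’s actual code. The names `PersonalizationService`, `ServiceUnavailable`, `recommendations_for` and `POPULAR_TITLES` are all made up for the example, and the `failure_rate` knob is only a crude stand-in for the kind of random failure that a tool like Chaos Monkey provokes.

```python
import random

# Hypothetical generic list served when personalization is unavailable.
POPULAR_TITLES = ["Popular Title A", "Popular Title B", "Popular Title C"]


class ServiceUnavailable(Exception):
    """Raised when a downstream service cannot be reached."""


class PersonalizationService:
    """Hypothetical stand-in for a per-user recommendation service."""

    def __init__(self, failure_rate=0.0, rng=None):
        # failure_rate is the probability that any single call fails.
        self.failure_rate = failure_rate
        self.rng = rng or random.Random()

    def recommend(self, user_id):
        # Fail at random to simulate an outage of this one service.
        if self.rng.random() < self.failure_rate:
            raise ServiceUnavailable("personalization is down")
        return [f"Pick for {user_id} #{i}" for i in range(1, 4)]


def recommendations_for(user_id, service):
    """Return personalized titles, or the generic popular list on failure."""
    try:
        return service.recommend(user_id)
    except ServiceUnavailable:
        # Degrade gracefully so the failure does not spread to the caller.
        return list(POPULAR_TITLES)


if __name__ == "__main__":
    healthy = PersonalizationService(failure_rate=0.0)
    broken = PersonalizationService(failure_rate=1.0)
    print(recommendations_for("user-1", healthy))
    print(recommendations_for("user-1", broken))
```

The point of the design is that the caller never sees the outage: the failure stays contained within the one service, which is the “blast radius” idea in miniature.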