US video streaming provider Netflix has built a full replica of its services in two data centres on either side of the mainland USA to ensure complete redundancy.
Netflix is a video-streaming service that has just about killed off the video rental business in the United States and is oft-cited as a model customer of cloud computing provider Amazon Web Services — so much so that AWS CTO Werner Vogels joked there should be drinking games designed around every time Netflix is cited on stage at AWS events.
Netflix has developed, and open sourced to the wider AWS community, some 35 tools that extend the Amazon Web Services platform for building and maintaining web-scale applications. Together they are known as the Netflix platform.
One of the main ways Netflix has extended the AWS platform is by adding features that aim to provide higher levels of availability. Netflix aims for four nines (99.99 percent) of availability - which means it should have no more than about 53 minutes of downtime a year.
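The 53-minute figure follows from simple arithmetic. The sketch below is illustrative only: it assumes a 365-day year, and the function name is invented for the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(nines: int) -> float:
    """Return the yearly downtime allowed by an availability of N nines."""
    unavailability = 10 ** -nines  # four nines -> 0.0001 (i.e. 0.01 percent)
    return MINUTES_PER_YEAR * unavailability


for nines in (2, 3, 4, 5):
    print(f"{nines} nines: {downtime_budget_minutes(nines):.2f} minutes/year")
```

Four nines works out to roughly 52.6 minutes a year, which rounds up to the 53 minutes quoted above.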
The company has been lauded for its technical feats, which include bots that randomly kill services to test and validate that the platform self-heals and remains resilient.
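The kill-and-recover idea can be illustrated with a toy simulation. This is not Netflix's code: the `ToyCluster` class, its method names and the zone labels are all hypothetical, and the "cluster" is just a dictionary of instance counts held in memory.

```python
import random


class ToyCluster:
    """In-memory stand-in for a service spread across availability zones."""

    def __init__(self, zones, instances_per_zone):
        self.target = instances_per_zone
        self.running = {zone: instances_per_zone for zone in zones}

    def kill_instance(self, rng):
        """Terminate one instance in a randomly chosen zone that still has one."""
        candidates = sorted(z for z, n in self.running.items() if n > 0)
        if not candidates:
            return None  # nothing left to kill
        zone = rng.choice(candidates)
        self.running[zone] -= 1
        return zone

    def kill_zone(self, rng):
        """Take out every instance in one randomly chosen zone."""
        zone = rng.choice(sorted(self.running))
        self.running[zone] = 0
        return zone

    def is_serving(self):
        """The service counts as up while any instance survives anywhere."""
        return any(n > 0 for n in self.running.values())

    def heal(self):
        """Hypothetical self-healing step: restore each zone to its target size."""
        for zone in self.running:
            self.running[zone] = self.target


rng = random.Random(7)  # fixed seed so the run is repeatable
cluster = ToyCluster(["zone-a", "zone-b", "zone-c"], instances_per_zone=3)

lost = cluster.kill_zone(rng)
print(f"killed {lost}; still serving: {cluster.is_serving()}")

cluster.heal()
print(f"after healing: {cluster.running}")
```

The point of such a test is the assertion rather than the kill itself: the experiment passes only if the service keeps serving while a zone is down and returns to full strength afterwards.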
The company first introduced Chaos Monkey, which randomly kills individual service instances, and later Chaos Gorilla, which takes out Netflix services across an entire availability zone to simulate routing issues or power failures.
"Our services are designed around two key principles: isolation, in that changes or outages in one region should not affect services in others; and redundancy, which simply means having more than one of everything - so we distribute services across availability zones," explained Ruslan Meshenberg, director of cloud platform engineering at Netflix.
But for all its efforts, failures still happen. On Christmas Eve 2012, an Amazon Web Services technician made a configuration error that impacted the load balancers managing traffic to AWS's US-East region.
Netflix's engineering team came together to discuss how to cope with the rare occasion on which an entire region (such as Australia, US-East or US-West) suddenly becomes unavailable.
"We've learned a lot from that experience - and we've re-architected our systems for cross-regional failover," Yury Izrailevsky, Netflix's vice president of cloud and platform engineering, said.
By July, Netflix had not only developed but also open sourced a solution, dubbed Isthmus, that aimed to handle any outages in elastic load balancers.
Meshenberg described Isthmus as a "tunnel to connect banks of Cassandra [database] replicas between the US-East and US-West regions."
In the case of a load balancing issue in one region, Isthmus would step in and force a DNS change to route all traffic to the unaffected region (rather than the usual geo-routing based on which side of the Mississippi River a user request originated from).
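The routing decision described here can be sketched in a few lines. This shows only the decision logic, not how Isthmus works internally: the function name, the region labels and the health map are assumptions made for the example, and a real system would act by changing DNS records rather than returning a string.

```python
def pick_region(user_is_east_of_mississippi: bool, healthy: dict) -> str:
    """Choose a serving region: geo-routing first, failover second.

    `healthy` maps a region label to True/False. The labels "us-east" and
    "us-west" are illustrative stand-ins for the two regions in the article.
    """
    preferred = "us-east" if user_is_east_of_mississippi else "us-west"
    fallback = "us-west" if preferred == "us-east" else "us-east"

    if healthy.get(preferred, False):
        return preferred  # normal case: serve from the nearer region
    if healthy.get(fallback, False):
        return fallback   # failover: send the traffic to the unaffected region
    raise RuntimeError("no healthy region available")


# Normal operation: an East Coast user is served from the east.
print(pick_region(True, {"us-east": True, "us-west": True}))

# Load balancer trouble in the east: the same user is sent west.
print(pick_region(True, {"us-east": False, "us-west": True}))
```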
Meshenberg's team was tempted to build similar tools for other components of the stack that could be prone to failure, but hesitated because of the time and effort it took to build Isthmus.
"Isthmus only works for elastic load balancing failures," he said. "The DNS layer could also be a point of failure, but we realised it is not worth developing a one-off fix for every service."
The company has instead deployed a full stack of services in both the US-East and US-West regions and built an active-active infrastructure for full regional resilience.
"We have introduced cross-regional DNS routing," Izrailevsky said. "In a stable state, customers on the East Coast would be served from Amazon's Virginia data centre and the West Coast from the Oregon data centre. If one region went down, we would simply spin up additional capacity in the other one."
Meshenberg today demonstrated to attendees at AWS's Re:Invent conference how, during failover testing of the new service, Netflix routed 18TB of inter-region traffic at 9Gbps and recorded over a million database reads and writes during the outage without losing any data.
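For a sense of scale, the two traffic figures can be related with a rough calculation. It rests on assumptions the presentation did not confirm: that 9Gbps was a sustained rate rather than a peak, and that the units are decimal (1TB = 10^12 bytes, 1Gbps = 10^9 bits per second).

```python
def transfer_seconds(terabytes: float, gigabits_per_second: float) -> float:
    """Time to move a volume of data at a constant rate, in decimal units."""
    bits = terabytes * 1e12 * 8            # bytes -> bits
    return bits / (gigabits_per_second * 1e9)


seconds = transfer_seconds(18, 9)
print(f"{seconds:.0f} seconds, or about {seconds / 3600:.1f} hours")
```

Under those assumptions the transfer would take 16,000 seconds, or a little under four and a half hours. If 9Gbps was only the peak rate, it would have taken longer.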
In tandem with the active-active project, Netflix has added yet more monkeys to its famed Simian Army, which may at a later date be open sourced. Among them is the Chaos Kong monkey that randomly simulates the outage of an entire region to test the resiliency of a service.
From a longer-term disaster recovery perspective, every data element on the Netflix network is replicated in three availability zones in each cloud region, and routine snapshots of the total storage pool are taken in Amazon S3 and stored both in another region and with a non-Amazon cloud provider.
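That policy can be written down as a simple check. The sketch below is an illustration of the rule as described, not a Netflix tool: the `DataCopy` type, the provider labels and the region and zone names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class DataCopy:
    kind: str                   # "replica" or "snapshot"
    provider: str               # "aws" or any other (hypothetical) provider label
    region: str
    zone: Optional[str] = None  # only meaningful for replicas


def meets_policy(copies: Iterable[DataCopy], home_region: str) -> bool:
    """Check the three conditions described above for one data element."""
    copies = list(copies)
    # Live replicas must sit in at least three distinct zones of the home region.
    replica_zones = {
        c.zone for c in copies
        if c.kind == "replica" and c.provider == "aws"
        and c.region == home_region and c.zone is not None
    }
    # One snapshot must live in a different region of the same provider...
    other_region = any(
        c.kind == "snapshot" and c.provider == "aws" and c.region != home_region
        for c in copies
    )
    # ...and one with a different provider altogether.
    other_provider = any(
        c.kind == "snapshot" and c.provider != "aws" for c in copies
    )
    return len(replica_zones) >= 3 and other_region and other_provider


copies = [
    DataCopy("replica", "aws", "us-east", "zone-a"),
    DataCopy("replica", "aws", "us-east", "zone-b"),
    DataCopy("replica", "aws", "us-east", "zone-c"),
    DataCopy("snapshot", "aws", "us-west"),
    DataCopy("snapshot", "other-cloud", "elsewhere"),
]
print(meets_policy(copies, home_region="us-east"))
```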
"With enough speed and scale, everything fails eventually," Meshenberg said. "But the good news is you can do something about it."
Brett Winterford travelled to Re:Invent as a guest of Amazon Web Services.