Netflix flicks to active-active operations

Bridges Amazon regions - and introduces Chaos Kong.

US video streaming provider Netflix has built out a full replica of its services across two data centres on opposite sides of the mainland USA to ensure absolute redundancy.

Netflix is a video-streaming service that has just about killed off the video rental business in the United States and is oft-cited as a model customer of cloud computing provider Amazon Web Services — so much so that AWS CTO Werner Vogels joked there should be drinking games designed around every time Netflix is cited on stage at AWS events.

Netflix has developed — and open sourced to the wider AWS community — some 35 tools that extend the Amazon Web Services platform for building and maintaining web-scale applications, known as the Netflix platform.

One of the main ways Netflix has extended the AWS platform is by adding features that aim to provide higher levels of availability. Netflix aims for four nines (99.99 percent) of availability, which means it should have no more than about 53 minutes of downtime a year.
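The downtime figure follows directly from the availability target. As a minimal sketch of the arithmetic (the function name is illustrative and a 365-day year is assumed), the yearly downtime budget for a given number of nines can be worked out like this:

```python
# Downtime budget implied by an availability target of N nines.
# Assumes a 365-day year; the function name is illustrative only.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(nines: int) -> float:
    """Return the allowed downtime per year, in minutes, for N nines."""
    unavailability = 10 ** -nines  # e.g. 4 nines -> 0.0001 (0.01 percent)
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(f"{n} nines: {downtime_budget_minutes(n):.2f} minutes per year")
```

Four nines works out to roughly 52.6 minutes a year, which rounds to the 53 minutes quoted above.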

The company has been lauded for its technical feats, which include bots that randomly kill services to test and validate that the platform self-heals and remains resilient.

The company first introduced Chaos Monkey, which randomly kills individual service instances, and later Chaos Gorilla, which kills off Netflix services across an entire availability zone to simulate routing issues or power failures.
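The idea behind these tools can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not Netflix's implementation: the fleet layout, zone names and function names are all hypothetical, and "killing" an instance simply removes it from a Python set.

```python
import random


def make_fleet(zones, per_zone):
    """Build a hypothetical fleet: zone name -> set of running instance ids."""
    return {z: {f"{z}-i{n}" for n in range(per_zone)} for z in zones}


def chaos_monkey(fleet, rng):
    """Terminate one randomly chosen instance; return its id, or None if none run."""
    candidates = sorted(i for instances in fleet.values() for i in instances)
    if not candidates:
        return None
    victim = rng.choice(candidates)
    for instances in fleet.values():
        instances.discard(victim)
    return victim


def chaos_gorilla(fleet, rng):
    """Terminate every instance in one randomly chosen zone; return the zone."""
    zone = rng.choice(sorted(fleet))
    fleet[zone].clear()
    return zone


def is_available(fleet, min_instances=1):
    """The service counts as up while enough instances survive in any zone."""
    return sum(len(instances) for instances in fleet.values()) >= min_instances


rng = random.Random(42)  # seeded so a test run is repeatable
fleet = make_fleet(["zone-a", "zone-b", "zone-c"], per_zone=2)
killed_instance = chaos_monkey(fleet, rng)
killed_zone = chaos_gorilla(fleet, rng)
print(killed_instance, killed_zone, is_available(fleet))
```

Because the hypothetical service runs in three zones, it stays available after losing one instance and then one whole zone, which is the property such tests are meant to validate.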

"Our services are designed around two key principles: isolation, in that changes or outages in one region should not affect services in others; and redundancy; which simply means having more than one of everything - so we distribute services across availability zones," explained Ruslan Meshenberg, director of cloud platform engineering at Netflix.

But for all its efforts, failures still happen. On Christmas Eve 2012, an Amazon Web Services technician made a configuration error that impacted the load balancers managing traffic to AWS's US-East region.

It resulted in an outage for a variety of customers - most crucially, Netflix, demand for which peaks around holiday periods. The multi-hour outage was embarrassing and damaging.

Netflix's engineering team came together to discuss how to cope with the rare occasion on which an entire region (such as Australia, US-East or US-West) suddenly becomes unavailable.

"We've learned a lot from that experience - and we've re-architected our systems for cross-regional failover," Yury Izrailevsky, Netflix's vice president of cloud and platform engineering, said.

By July, Netflix had not only developed but also open sourced a solution, dubbed Isthmus, that aimed to handle any outages in elastic load balancers.

Meshenberg described Isthmus as a "tunnel to connect banks of Cassandra [database] replicas between the US-East and US-West regions."

In the case of a load balancing issue in one region, Isthmus would step in and force a DNS change to route all traffic to the unaffected region (rather than the usual geo-routing based on which side of the Mississippi River a user request originated from).
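The routing decision described here can be sketched in a few lines. The code below is an illustration under assumptions, not how Isthmus or any DNS service is implemented: the region identifiers, the "east"/"west" labels for a user's side of the Mississippi, and the function names are all hypothetical.

```python
# Hypothetical region identifiers for the two regions named in the article.
REGIONS = ("us-east", "us-west")


def home_region(user_side: str) -> str:
    """Usual geo-routing: users east of the Mississippi go east, others west."""
    return "us-east" if user_side == "east" else "us-west"


def route(user_side: str, healthy: set) -> str:
    """Return the region that should serve this user.

    Prefer the user's home region; if it is unhealthy, send the traffic
    to the other region, provided that one is healthy.
    """
    preferred = home_region(user_side)
    if preferred in healthy:
        return preferred
    fallback = [r for r in REGIONS if r != preferred and r in healthy]
    if not fallback:
        raise RuntimeError("no healthy region available")
    return fallback[0]


if __name__ == "__main__":
    print(route("west", {"us-east", "us-west"}))  # normal geo-routing
    print(route("west", {"us-east"}))             # failover to the other region
```

In a real deployment the health signal and the DNS update would come from the provider's own tooling; here both are reduced to a set of healthy region names.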

Meshenberg's team was tempted to build similar tools for other components of the stack that could be prone to failure, but hesitated on the basis of the time and effort it took to build Isthmus.

"Isthmus only works for elastic load balancing failures," he said. "The DNS layer could also be a point of failure, but we realised it is not worth developing a one-off fix for every service."

The company has instead deployed a full stack of services in both the US-East and US-West regions and built an active-active infrastructure for full regional resilience.

"We have introduced cross-regional DNS routing," Izrailevsky said. "In a stable state, customers on the East Coast would be served from Amazon's Virginia data centre and the West Coast from the Oregon data centre. If one region went down, we would simply spin up additional capacity in the other one."

Meshenberg today demonstrated to attendees at AWS' Re:Invent conference how — during failover testing of the new service — Netflix routed 18TB at 9Gbps of inter-region traffic and recorded over a million database reads and writes during the outage without losing any data.

In tandem with the active-active project, Netflix has added yet more monkeys to its famed Simian Army, which may at a later date be open sourced. Among them is the Chaos Kong monkey that randomly simulates the outage of an entire region to test the resiliency of a service.

From a longer-term disaster recovery perspective, every data element on the Netflix network is replicated in three availability zones in each cloud region. Routine snapshots of the total storage pool are taken in Amazon S3 and stored both in another region and with a non-Amazon cloud provider.

"With enough speed and scale, everything fails eventually," Meshenberg said. "But the good news is you can do something about it."

Brett Winterford travelled to Re:Invent as a guest of Amazon Web Services.
