US video streaming provider Netflix has built out a full replica of its services between two data centres on either side of mainland USA so it can ensure absolute redundancy of its services.

Netflix is a video-streaming service that has just about killed off the video rental business in the United States and is oft-cited as a model customer of cloud computing provider Amazon Web Services — so much so that AWS CTO Werner Vogels joked there should be drinking games designed around every time Netflix is cited on stage at AWS events.
Netflix has developed, and open sourced to the wider AWS community, some 35 tools that extend the Amazon Web Services platform for building and maintaining web-scale applications; the suite is collectively known as the Netflix platform.
One of the main ways Netflix has extended the AWS platform is by adding features that aim to provide higher levels of availability. Netflix aims for four nines of availability (99.99 percent uptime), which works out to no more than about 53 minutes of downtime a year.
The company has been lauded for its technical feats, among them bots that randomly kill services to test and validate that the platform self-heals and remains resilient.
The company first introduced Chaos Monkey to randomly kill off Netflix services, and later Chaos Gorilla to take out an entire availability zone, simulating routing issues or power failures.
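Netflix's own Simian Army is written in Java, but the core idea is simple enough to sketch. The snippet below is a minimal illustration, not Netflix's implementation; it assumes boto3 credentials and a hypothetical "chaos-eligible" tag marking instances that are fair game.

```python
import random
import boto3

# Minimal Chaos Monkey-style sketch: terminate one random running instance
# from a pool tagged as eligible. Tag key/value and region are hypothetical.
ec2 = boto3.client("ec2", region_name="us-east-1")

reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:chaos-eligible", "Values": ["true"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]

if instances:
    victim = random.choice(instances)
    print(f"Chaos: terminating {victim}")
    ec2.terminate_instances(InstanceIds=[victim])
```

The point of the exercise is the same regardless of implementation: if losing a random instance (or, for Chaos Gorilla, a whole availability zone) breaks the service, the service was never resilient to begin with.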
"Our services are designed around two key principles: isolation, in that changes or outages in one region should not affect services in others; and redundancy; which simply means having more than one of everything - so we distribute services across availability zones," explained Ruslan Meshenberg, director of cloud platform engineering at Netflix.
But for all its efforts, failures still happen. On Christmas Eve 2012, an Amazon Web Services technician made a configuration error that impacted the load balancers managing traffic to AWS's US-East region.
It resulted in an outage for a variety of customers - most crucially, Netflix, demand for which peaks around holiday periods. The multi-hour outage was embarrassing and damaging.
Netflix's engineering team came together to discuss how to cope with the rare occasion that an entire region (such as Australia, US-East or US-West) was suddenly made unavailable.
"We've learned a lot from that experience - and we've re architected our systems for cross-regional failover," Yury Izrailevsky, Netflix's vice president of cloud and platform engineering said.
By July, Netflix had not only developed but also open sourced a solution, dubbed Isthmus, that aimed to handle outages in elastic load balancers.
Meshenberg described Isthmus as a "tunnel to connect banks of Cassandra [database] replicas between the US-East and US-West regions."
In the case of a load balancing issue in one region, Isthmus would step in and force a DNS change to route all traffic to the unaffected region (rather than the usual geo-routing based on which side of the Mississippi River a user request originated from).
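The article doesn't detail Isthmus's internals, but the failover step it describes, overriding the usual geo-routing so everything resolves to the healthy region, can be sketched as a Route 53 record change. The hosted zone ID, record name and load balancer target below are placeholders, not Netflix's configuration.

```python
import boto3

# Sketch of a DNS failover: repoint a record that normally geo-routes
# traffic so that all requests resolve to the healthy region's ELB.
# Zone ID, record name and target DNS name are hypothetical.
route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z123EXAMPLE",
    ChangeBatch={
        "Comment": "Fail all traffic over to us-west-2",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com.",
                    "Type": "CNAME",
                    "TTL": 60,
                    "ResourceRecords": [
                        {"Value": "prod-elb.us-west-2.elb.amazonaws.com"}
                    ],
                },
            }
        ],
    },
)
```

A low TTL on the record is what makes this kind of switch take effect quickly; clients holding cached answers keep hitting the failed region until their TTL expires.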
Meshenberg's team was tempted to build similar tools for other components of the stack that could be prone to failure, but hesitated given the time and effort it took to build Isthmus.
"Isthmus only works for elastic load balancing failures," he said. "The DNS layer could also be a point of failure, but we realised it is not worth developing a one-off fix for every service."
The company has instead deployed a full stack of services in both the US-East and US-West regions and built an active-active infrastructure for full regional resilience.
"We have introduced cross-regional DNS routing," Izrailevsky said. "In a stable state, customers on the East Coast would be served from Amazon's Virginia data centre and the West Coast from the Oregon data centre. If one region went down, we would simply spin up additional capacity in the other one."
Meshenberg today demonstrated to attendees at AWS' Re:Invent conference how, during failover testing of the new service, Netflix routed 18TB of inter-region traffic at 9Gbps and recorded over a million database reads and writes during the simulated outage without losing any data.
In tandem with the active-active project, Netflix has added yet more monkeys to its famed Simian Army, which may be open sourced at a later date. Among them is Chaos Kong, which randomly simulates the outage of an entire region to test the resiliency of a service.
From a longer-term disaster recovery perspective, every data element on the Netflix network is replicated in three availability zones in each cloud region, and routine snapshots of the total storage pool are taken in Amazon S3 and stored both in another region and with a non-Amazon cloud provider.
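The article doesn't say how those snapshots are shipped between regions, but as a rough illustration, copying a backup object from an S3 bucket in one region to a bucket in another looks like this. Bucket and key names are hypothetical.

```python
import boto3

# Sketch: replicate a backup object from a bucket in us-east-1 to a bucket
# in us-west-2. Bucket and key names are placeholders.
s3_west = boto3.client("s3", region_name="us-west-2")

s3_west.copy_object(
    Bucket="example-backups-us-west-2",
    Key="cassandra/snapshots/2013-11-01/keyspace1.tar.gz",
    CopySource={
        "Bucket": "example-backups-us-east-1",
        "Key": "cassandra/snapshots/2013-11-01/keyspace1.tar.gz",
    },
)
```

Note that copy_object handles objects up to 5GB; larger snapshots would need a multipart copy, for example boto3's managed `copy` transfer.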
"With enough speed and scale, everything fails eventually," Meshenberg said. "But the good news is you can do something about it."
Brett Winterford travelled to Re:Invent as a guest of Amazon Web Services.