Dev like Netflix: Top tips from the world's savviest engineers

Monkeys, canaries and "virtually" blowing stuff up.

Video streaming giant Netflix operates on a staggering scale.

At its peak, it accounts for 37 percent of all US internet traffic.

Its cloud-based operations autoscale up and down at a rate equivalent to NASA’s Pleiades supercomputer - 250,000 CPU cores every two days.

It runs 100,000 EC2 instances in the AWS public cloud. Its AWS bill alone is 800 million lines long - running into the hundreds of gigabytes of data every month.

Its 86 million customers in 190 countries watch 150 million hours of streaming video every day.

Netflix’s army of developers work at the extremes of what is possible in global-scale software engineering, and have built a reputation as some of the best at high-velocity development and the open source tools that support it.

They were out in force at AWS’ Re:Invent summit in Las Vegas last month. Here are some of the tips they had to share.

Don’t have a data centre

Netflix has just unplugged its last data centre, the final phase in a huge migration to the cloud that began back in 2008. The company has since become one of the biggest consumers of AWS public cloud services.

Dave Hahn, one of Netflix’s senior engineers, said turning its back on running data centres had allowed Netflix to focus solely on things that make the service better for its customers.

“Better selection algorithms. Better content. Better device support," he said.

“None of these are helped by running data centres.”

Never assume (blow things up instead)

According to Hahn, it is critical for Netflix to know at all times that it can respond to whatever might happen, before 86 million angry customers turn to social media to complain about a sudden entertainment shortage.

“I hate to break it to you but you are going to lose instances,” he said.

“So go turn one off. Watch how your application behaves."

This ethos has produced one of Netflix’s most famous open source tools - Chaos Monkey.

Netflix runs Chaos Monkey 24 hours a day, seven days a week in its production environment. The tool races around automatically shutting down random AWS instances to see whether its applications can cope.
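The idea can be tried on a small scale without any cloud account. The following is a minimal sketch in Python, not Chaos Monkey's actual code: the fleet is a plain list of made-up instance names, and the health check simply assumes the service copes as long as a minimum number of instances survive.

```python
import random


def terminate_random_instance(instances, rng=random):
    """Remove and return one randomly chosen instance from the list."""
    if not instances:
        raise ValueError("no instances left to terminate")
    victim = rng.choice(instances)
    instances.remove(victim)
    return victim


def service_is_healthy(instances, min_required):
    """Toy health check (an assumption): enough instances are still running."""
    return len(instances) >= min_required


# Hypothetical fleet; these are illustrative names, not real AWS identifiers.
fleet = ["i-a", "i-b", "i-c", "i-d"]
victim = terminate_random_instance(fleet, random.Random(42))
healthy = service_is_healthy(fleet, min_required=3)
print(f"terminated {victim}; service healthy: {healthy}")
```

A real experiment would replace the list with calls to a cloud provider and the toy check with the application's own monitoring.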

Hahn said it had only caused the company one serious headache in three years of constant operation and tens of millions of terminations.

The dev team has more recently added a bigger, more imposing member to the Chaos jungle.

Chaos Kong “virtually” blows up whole Amazon regions for Netflix. It runs once a month in production at the push of a button.

“There are times where you have been watching Netflix and we have evacuated a whole region,” Hahn said.

His colleague Andrew Glover called the Chaos suite “the definition of not assuming reliability”.

“We make no assumptions that if our service gets shot in the head that it is actually going to keep working - so we test it all the time in production," Glover said.

The team has now added ChAP - its Chaos Automation Platform - to the menagerie.

ChAP assesses how latency can affect applications by spinning up identical clusters running new code and rerouting a portion of Netflix traffic to both. It can then study what happens when it makes adjustments in one cluster, compared to the real-life control.
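A rough way to picture the control-versus-experiment comparison is sketched below in Python. It is an illustration only, not how ChAP is built: the request latencies, the 500ms timeout and the 250ms of injected delay are all invented numbers, and a request is assumed to fail when its total latency exceeds the timeout.

```python
def error_rate(latencies_ms, timeout_ms, injected_ms=0):
    """Fraction of requests whose latency plus injected delay exceeds the timeout."""
    if not latencies_ms:
        raise ValueError("need at least one request")
    failures = sum(1 for latency in latencies_ms if latency + injected_ms > timeout_ms)
    return failures / len(latencies_ms)


# Hypothetical per-request latencies (ms) for the same slice of traffic.
sample = [80, 120, 150, 200, 240, 310, 400, 90, 110, 130]

control = error_rate(sample, timeout_ms=500)                      # untouched cluster
experiment = error_rate(sample, timeout_ms=500, injected_ms=250)  # latency injected
print(f"control={control:.0%} experiment={experiment:.0%}")
```

Because both clusters see equivalent traffic, any gap between the two figures can be attributed to the adjustment rather than to normal variation.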

Velocity is better than uptime

All Netflix’s engineers have access to its production environment, and are allowed to code in whatever language they want, using whatever platforms they want.

“All of us at Netflix have probably done something that has broken something in production,” Glover said.

But integral to the organisation’s culture is putting the pace of innovation ahead of the risk that something might go wrong.

“We will not slow down our engineers from deploying or trying things by waving the uptime flag,” Hahn said.

The result is that Netflix makes roughly 400 significant changes to its production environment every day.

“Netflix considers speed to be a strategic advantage in the business we’re in," Hahn said.

“We hire smart people, we expect them to do smart things - and then we get out of their way."

Automate everything

Glover’s top tip is to turn everything into an automated deployment pipeline.

“Anytime a task is done more than once, consider that an opportunity to make it a pipeline,” he said.

It is easy to overlook one-off infrastructure management tasks or occasional updates, which seem easier to just complete than to codify in a repeatable process, he said.

But unless you try to catch them all you risk creating “towers of knowledge” limited to one or two people who have expertise in a particular area - and no-one to turn to when those individuals are unavailable.

“By codifying this automation in the pipelines, anyone can run them with the benefit of consistency," Glover said.
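As an illustration of what codifying a task might look like, here is a minimal Python sketch of a pipeline runner. It is not Netflix's tooling: the stage names, the `context` dictionary and the version string are hypothetical, and each stage is assumed to return True on success.

```python
def run_pipeline(stages, context):
    """Run each (name, function) stage in order, stopping at the first failure.

    Returns the list of completed stage names and the name of the failed
    stage, or None if every stage succeeded.
    """
    completed = []
    for name, stage in stages:
        if not stage(context):
            return completed, name
        completed.append(name)
    return completed, None


# Hypothetical stages for illustration only.
def build(ctx):
    ctx["artifact"] = f"app-{ctx['version']}"
    return True


def run_tests(ctx):
    return "artifact" in ctx


def deploy(ctx):
    ctx["deployed"] = ctx["artifact"]
    return True


context = {"version": "1.2.3"}
stages = [("build", build), ("run_tests", run_tests), ("deploy", deploy)]
done, failed = run_pipeline(stages, context)
print(done, failed, context["deployed"])
```

Once the steps live in code rather than in one person's head, anyone can rerun them and get the same sequence every time.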

Be ready for bad code

Despite all of its toolsets and best practice standards, the Netflix team acknowledges that sometimes it will deploy bad code into production.

Occasionally this code will even get past Automated Canary Analysis (ACA) - Netflix’s “last line of defence” that channels a small amount of traffic to the last deployed version of software and the soon-to-be-deployed new version simultaneously. It compares their performance and gives the new code a score out of 100 in terms of how it behaves.

Engineers can build gates into Netflix’s deployment platform - called Spinnaker - that automate a go or no-go threshold based on these ACA scores.
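The article does not say how the score is calculated, so the Python sketch below makes up a simple rule purely for illustration: the score is the percentage of metrics on which the canary is no more than 10 percent worse than the baseline, with lower values treated as better. The metric names, values and thresholds are all hypothetical.

```python
def canary_score(baseline, canary, tolerance=0.10):
    """Score 0-100: share of metrics where the canary is at most `tolerance`
    (relative) worse than the baseline. Lower metric values count as better."""
    if not baseline:
        raise ValueError("no metrics to compare")
    passed = sum(
        1 for name, base in baseline.items()
        if canary[name] <= base * (1 + tolerance)
    )
    return round(100 * passed / len(baseline))


def gate(score, threshold):
    """Automated go or no-go decision based on the canary score."""
    return "go" if score >= threshold else "no-go"


# Hypothetical measurements from the old version and the new one.
baseline = {"error_rate": 0.010, "p99_latency_ms": 420.0, "cpu_util": 0.55, "mem_util": 0.60}
canary = {"error_rate": 0.0105, "p99_latency_ms": 610.0, "cpu_util": 0.56, "mem_util": 0.61}

score = canary_score(baseline, canary)
decision = gate(score, threshold=80)
print(score, decision)
```

Here the latency regression fails one of four checks, so a gate set at 80 blocks the release while a more lenient gate set at 70 would let it through.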

And even if this goes sideways, Spinnaker features a one-click rollback button that instantly reverts to the old version of software until the new version passes health checks.
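The principle behind a fast rollback is that the previous version is never thrown away. A minimal Python sketch, assuming a toy `Deployment` class invented here rather than anything from Spinnaker's API:

```python
class Deployment:
    """Tracks the live version and the one it replaced (a hypothetical model)."""

    def __init__(self, version):
        self.active = version
        self.previous = None

    def deploy(self, version):
        """Make `version` live and remember the version it replaced."""
        self.previous, self.active = self.active, version

    def rollback(self):
        """Swap back to the version that was live before the last deploy."""
        if self.previous is None:
            raise RuntimeError("no earlier version to roll back to")
        self.active, self.previous = self.previous, self.active


service = Deployment("v41")
service.deploy("v42")   # the new code turns out to be bad
service.rollback()      # a single call restores the old version
print(service.active, service.previous)
```

Keeping the new version on hand after the swap means it can be promoted again once it passes its health checks.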

“As much customer testing as you can do, occasionally bad code still gets out. When that happens we want to make sure it is as easy to shift back to the good code as fast as possible so we don’t ruin the customer experience,” Hahn said.

"Bad things will happen, you can’t stop them. We try to limit how long it takes to fix them," Glover said.

Blameless post-mortems

Nearly all of the wisdom in Netflix’s deployment practice comes from deep-dive sessions that analyse failures while eschewing finger-pointing.

“We conduct blameless post-mortems so we can learn from mistakes and not repeat them," Glover said.

“Many of [our] best practices ... have come right out of these post-mortems." 

Paris Cowan travelled to AWS Re:Invent as a guest of Amazon Web Services
