Video streaming giant Netflix operates on a staggering scale.
At its peak, it accounts for 37 percent of all US internet traffic.
Its cloud-based operations autoscale up and down at a rate equivalent to NASA’s Pleiades supercomputer - 250,000 CPU cores every two days.
It runs 100,000 EC2 instances in the AWS public cloud. Its AWS bill alone is 800 million lines long - running into the hundreds of gigabytes of data every month.
Its 86 million customers in 190 countries watch 150 million hours of streaming video every day.
Netflix’s army of developers work every day at the extremes of what is possible in global-scale software engineering, and have built a reputation as some of the best at high-velocity development and the open source tools that support it.
They were out in force at AWS’ re:Invent summit in Las Vegas last month. Here are some of the tips they had to share.
Don’t have a data centre
Netflix has just unplugged its last data centre, the final phase in a huge migration to the cloud that began back in 2008. The company has since become one of the biggest consumers of AWS public cloud services.
Dave Hahn, one of Netflix’s senior engineers, said turning its back on running data centres had allowed Netflix to focus solely on things that make the service better for its customers.
“Better selection algorithms. Better content. Better device support,” he said.
“None of these are helped by running data centres.”
Never assume (blow things up instead)
According to Hahn, it is critical for Netflix to know at all times that it can respond to whatever might happen, before 86 million angry customers turn to social media to complain about a sudden entertainment shortage.
“I hate to break it to you but you are going to lose instances,” he said.
“So go turn one off. Watch how your application behaves.”
This ethos has produced one of Netflix’s most famous open source tools - Chaos Monkey.
Netflix runs Chaos Monkey 24 hours a day, seven days a week in its production environment. The tool races around automatically shutting down random AWS instances to see whether Netflix’s applications can cope.
Hahn said Chaos Monkey had caused the company only one serious headache in three years of constant operation and tens of millions of terminations.
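The idea of switching off random instances and watching what happens can be illustrated with a toy simulation. The sketch below is not Netflix's Chaos Monkey and does not call any AWS API. The `Fleet` class, its `heal` method and the `unleash_monkey` function are hypothetical names invented for this example, and the fleet lives entirely in memory.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Fleet:
    """Hypothetical in-memory stand-in for a group of cloud instances."""
    desired: int
    instances: set = field(default_factory=set)
    _next_id: int = 0

    def launch(self) -> str:
        # Give every new instance a fresh, never-reused identifier.
        instance_id = f"i-{self._next_id:04d}"
        self._next_id += 1
        self.instances.add(instance_id)
        return instance_id

    def terminate(self, instance_id: str) -> None:
        self.instances.discard(instance_id)

    def heal(self) -> None:
        # Replace lost instances until the fleet is back at its desired size.
        while len(self.instances) < self.desired:
            self.launch()


def unleash_monkey(fleet: Fleet, rng: random.Random) -> str:
    """Terminate one randomly chosen instance and return its id."""
    victim = rng.choice(sorted(fleet.instances))
    fleet.terminate(victim)
    return victim


fleet = Fleet(desired=5)
fleet.heal()  # initial launch of five instances
rng = random.Random(42)
for _ in range(3):
    victim = unleash_monkey(fleet, rng)
    still_serving = len(fleet.instances) > 0  # is any capacity left?
    fleet.heal()
    print(f"terminated {victim}, still serving: {still_serving}, "
          f"fleet size: {len(fleet.instances)}")
```

In a real environment the interesting part is what this toy leaves out: whether the application itself keeps answering requests while the replacement instance comes up.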
The dev team has more recently added a bigger, more imposing member to the Chaos jungle.
Chaos Kong “virtually” blows up whole Amazon regions for Netflix. It runs once a month in production at the push of a button.
“There are times where you have been watching Netflix and we have evacuated a whole region,” Hahn said.
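What evacuating a region means for the regions left standing can be shown with some simple arithmetic. The sketch below is an illustration only. The region names, the traffic shares and the rule that survivors absorb load in proportion to their existing share are all assumptions made for this example, not details Netflix has described.

```python
from typing import Dict


def evacuate(traffic: Dict[str, float], region: str) -> Dict[str, float]:
    """Return traffic shares after moving one region's load to the survivors.

    The evacuated load is split in proportion to each surviving region's
    existing share (an assumption made for this sketch).
    """
    if region not in traffic:
        raise KeyError(f"unknown region: {region}")
    survivors = {name: share for name, share in traffic.items() if name != region}
    remaining = sum(survivors.values())
    if remaining == 0:
        raise ValueError("no surviving region can absorb the traffic")
    total = remaining + traffic[region]
    # Scale every survivor up so the shares again add up to the original total.
    return {name: share * total / remaining for name, share in survivors.items()}


# Hypothetical region names and traffic shares, for illustration only.
shares = {"region-a": 0.5, "region-b": 0.3, "region-c": 0.2}
after = evacuate(shares, "region-a")
for name, share in after.items():
    print(f"{name}: {share:.0%}")
```

With these made-up numbers, losing the busiest region doubles the load on each of the other two, which is the kind of headroom a monthly exercise would put to the test.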
His colleague Andrew Glover called the Chaos suite “the definition of not assuming reliability”.
“We make no assumptions that if our service gets shot in the head that it is actually going to keep working - so we test it all the time in production,” Glover said.
The team has now added ChAP - its Chaos Automation Platform - to the menagerie.
ChAP assesses how latency can affect applications by spinning up identical clusters running new code and rerouting a portion of Netflix traffic to both. It then makes adjustments in one cluster and compares the results with the other, which serves as a real-life control.
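The control-versus-experiment comparison described above can be sketched as a small simulation. This is not ChAP itself. The latency range, the 300ms timeout, the 2 percent sample rate and every function name below are assumptions chosen for illustration.

```python
import random
from typing import List, Tuple


def serve(rng: random.Random, injected_ms: float, timeout_ms: float = 300.0) -> bool:
    """Simulate one request; return True if it finished before the timeout."""
    base_ms = rng.uniform(50.0, 250.0)  # made-up baseline latency
    return base_ms + injected_ms <= timeout_ms


def success_rate(results: List[bool]) -> float:
    """Fraction of requests that succeeded."""
    return sum(results) / len(results)


def run_experiment(requests: int, sample_rate: float,
                   injected_ms: float, seed: int = 1) -> Tuple[float, float]:
    """Send a sample of traffic to two clusters and add latency to one of them."""
    rng = random.Random(seed)
    control: List[bool] = []
    experiment: List[bool] = []
    for _ in range(requests):
        if rng.random() >= sample_rate:
            continue  # most traffic stays on the normal production path
        # Split the sampled traffic evenly between the two clusters.
        if rng.random() < 0.5:
            control.append(serve(rng, injected_ms=0.0))
        else:
            experiment.append(serve(rng, injected_ms=injected_ms))
    return success_rate(control), success_rate(experiment)


control_rate, experiment_rate = run_experiment(
    requests=100_000, sample_rate=0.02, injected_ms=100.0)
print(f"control success rate:    {control_rate:.1%}")
print(f"experiment success rate: {experiment_rate:.1%}")
```

Because both clusters see the same kind of traffic at the same time, any gap between the two rates can be attributed to the injected latency rather than to the time of day or the mix of requests.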
Velocity is better than uptime
All Netflix’s engineers have access to its production environment, and are allowed to code in whatever language they want, using whatever platforms they want.
“All of us at Netflix have probably done something that has broken something in production,” Glover said.
But integral to the organisation’s culture is putting the pace of innovation ahead of the risk that something might go wrong.
“We will not slow down our engineers from deploying or trying things by waving the uptime flag,” Hahn said.
The result is that Netflix makes roughly 400 significant changes to its production environment every day.
“Netflix considers speed to be a strategic advantage in the business we’re in,” Hahn said.
“We hire smart people, we expect them to do smart things - and then we get out of their way.”