Australian-raised engineer Joel Pobar is part of a team that, every two weeks, "performs heart surgery" on the software infrastructure that runs Facebook.com.
Pobar's job is to incrementally improve the HipHop Virtual Machine (HHVM) - a programming language virtual machine built by Facebook, akin to Java's JVM or Microsoft's .NET runtime.
Crudely speaking, HHVM was built to ensure Facebook's servers wouldn't melt under the load of millions of simultaneous users. It took two years and three competing teams to build, and even today is never entirely 'stable' in the sense that the project to improve it never ends.
Pobar's team updates HHVM at least once every two weeks, and the virtual machine is more or less completely overhauled every 12 months.
"Imagine," Pobar described for iTnews' IT manager readers, "building a new version of Java every two weeks, and deploying that to every one of your servers."
Pobar - who previously developed software for both Sun Microsystems and Microsoft in what were arguably their halcyon days - had never before experienced the demands and agility of Facebook's dev and test process. This week he returns to Australia to share his insights into how local organisations might learn from the experience.
His presentation, "Move fast and ship things", will be heard for the first time today during the Melbourne leg of the YOW developer conference.
"I'm here to bring home what I've learned about shipping software at lightning speed," he told iTnews in the weeks prior to the event.
Speed culture
HHVM is but one tool of many that helps Facebook's software developers ship code up to twice a day to tens of thousands of servers across several data centres.
Changes to Facebook.com tend to be small and incremental, so much so that at times thousands of users are piloting them unawares. But always, Facebook's team is observing and measuring to determine which features should be rolled out to its billion-plus users and which should quietly disappear.
Facebook boasts one of the best resourced software teams in the world, and some of the dev and test tools it has developed in-house could arguably be successful products in their own right. Indeed, several, like HHVM, have been released on GitHub as open source projects. But in Pobar's view, the speed at which Facebook improves its online services is as much a question of culture as resources.
First, he said, Facebook management don't micro-manage the development process or prescribe how new features should be built, but instead give consistent, clear messages to the team about where they would like the product to move.
"They are accountable for moving the needle in the right direction for the company," he said. "It is always clearly articulated how our infrastructure goals translate into Facebook's vision."
Facebook also tolerates a degree of failure. Goals are set high enough that - as often as 50 percent of the time - developers might miss them.
Further, project teams are at times set up to attempt diverging solutions to a stated problem, all in the knowledge that the slowest of them at any stage will be asked to peel off to work on something more impactful. The problem HHVM was built to solve was also attacked by two competing projects - the authors of which would feel no shame about conceding defeat in the face of a superior option.
To reinforce this culture, every two to three months developers down tools for a hackathon and are asked to build a prototype for a new idea from the ground up. Many of these ideas make it into production. Software engineers might also be plucked out of one project and placed in another for short bursts of activity, simply to keep an open mind to different ways of approaching a problem.

"We embrace experimentation," Pobar said. "In order to do experiments, you need fast, effective feedback loops. You need to not only tolerate but embrace failure.
"I remember being scared of that concept at first - by the second day of boot camp I was told I will ship code this week to 600 million users. That is a frightening thing to hear."
The tools for speed
Facebook invests significant resources in its developers.
No expense is spared on the test and dev servers - the last batch boasted 24 cores, 144GB of RAM and 2.7TB of FusionIO flash storage.
Jobs are scheduled on a simple Facebook application which is "about as lightweight" as possible.
"There is no defined process - it is organic and up to the team," Pobar said.
As developers write a feature on these local dev machines, they can test its impact on the latest version of Facebook.com (literally using latest.facebook.com in the browser) in real-time by simply pressing the F5 key. HHVM dynamically recompiles the page on the fly.
When a developer wants to ship a change (a 'differential' or just 'diff'), the new code is shared with peers in the engineering team for testing and feedback. If it's generally agreed to be a sound change, the code is committed to Facebook's trunk, ready for the weekly push of code into production every Tuesday.
In the two days leading up to that Tuesday "push", other Facebook employees can run the changes on latest.facebook.com and file bug reports, check for errors in the logs, and provide other feedback before any member of the public sees the code in production.
But while code might run fine in dev and test, it might not run as soundly under the onslaught of millions of users. Facebook has developed two key A/B testing tools and processes to see how changes might fare in the real world.
The most formidable of them is Perflab, which lets developers replay the load of 24 hours of user traffic against Facebook.com, with the new change up and running, in a fraction of that time.
"Perflab takes a snapshot of the last 24 hours of traffic from real users on Facebook.com, anonymises it, and throws it at 1000 machines," Pobar said. "You basically replay the last 24 hours of traffic on Facebook with your new diffs.
"We measured that 24 hours of traffic and 1000 servers was what was required for a test to give us statistically relevant data. To test a change on a globally-distributed application, you need to capture the curve of traffic as each of the US, Asia and Europe hits their usage peaks. Users in Asia, for example, tend to spend more time on mobile phones rather than PC browsers - you need to test your change against a wide range of request types. We also use a mixture of machine [server] types to replicate how our data centre might react to the change - perhaps it performs better for x percentage of them, worse for y percent. Performance improvement or not, if the new HHVM you are testing can tolerate the PerfLab traffic, that's a very good indicator it can work in production. If your diff is bad, PerfLab is going to tell you."
Pobar said Perflab can currently replay 24 hours of Facebook traffic in between 45 and 60 minutes, but that the team is working to reduce that window further over time.
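Facebook has not published Perflab's internals, but the technique Pobar describes - capture a day of production requests, anonymise them, then replay the same traffic against a candidate build and a baseline - can be sketched roughly as follows. This is a minimal illustration rather than Facebook code: the log format, hostnames and function names (anonymise, replay, compare) are assumptions made for the example.

```python
# Illustrative sketch only -- not Facebook's Perflab. It shows the general shape
# of the technique: replay captured, anonymised traffic against two builds and
# compare their latency.
import hashlib
import statistics
import time
import urllib.request


def anonymise(request_line):
    """Replace the user identifier in a logged 'path|user_id' entry with a stable hash."""
    path, _, user_id = request_line.rpartition("|")
    return path, hashlib.sha256(user_id.encode()).hexdigest()[:12]


def replay(captured_requests, host):
    """Send each captured request to the target host and record its latency."""
    latencies = []
    for line in captured_requests:
        path, _anon_user = anonymise(line)
        start = time.monotonic()
        try:
            urllib.request.urlopen(f"http://{host}{path}", timeout=10).read()
        except OSError:
            continue  # failed requests simply drop out of the latency sample
        latencies.append(time.monotonic() - start)
    return latencies


def compare(baseline_host, candidate_host, captured_requests):
    """Replay the same traffic against both builds; return the relative median latency change."""
    base = statistics.median(replay(captured_requests, baseline_host))
    cand = statistics.median(replay(captured_requests, candidate_host))
    return (cand - base) / base  # positive means the candidate build is slower


if __name__ == "__main__":
    # Hypothetical capture file: one "path|user_id" entry per logged request.
    with open("captured_traffic.log") as f:
        sample = [line.strip() for line in f if line.strip()]
    delta = compare("baseline.example", "candidate.example", sample)
    print(f"median latency change: {delta:+.1%}")
```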
An even craftier tool is 'Gatekeeper', which allows an engineer or a team to pilot a new feature with a specific geography, demographic or group before rolling it out to a billion active users.
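Facebook hasn't detailed Gatekeeper's implementation here, but the gating idea Pobar outlines - a feature is shown only to users who match its configured rules, such as a pilot country or a small percentage rollout - might look something like the sketch below. The Gate class, its fields and the 'new_timeline' feature are hypothetical.

```python
# Illustrative sketch only -- not Facebook's Gatekeeper. A feature gate allows a
# user through if they match a pilot country, a named group, or a deterministic
# percentage-based rollout bucket.
import hashlib
from dataclasses import dataclass, field


@dataclass
class Gate:
    feature: str
    countries: set = field(default_factory=set)   # e.g. {"NZ"} for a pilot
    groups: set = field(default_factory=set)      # e.g. {"employees"}
    rollout_percent: float = 0.0                  # 0-100, stable per user

    def allows(self, user_id: str, country: str, groups: set) -> bool:
        if self.countries and country in self.countries:
            return True
        if self.groups & groups:
            return True
        # Hash user id + feature name so each user lands in a stable bucket per feature.
        digest = hashlib.sha256(f"{self.feature}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self.rollout_percent


# Pilot a hypothetical "new_timeline" feature in New Zealand, plus 1% of everyone else.
new_timeline = Gate("new_timeline", countries={"NZ"}, rollout_percent=1.0)

print(new_timeline.allows("user-42", country="NZ", groups=set()))  # True: pilot country
print(new_timeline.allows("user-42", country="AU", groups=set()))  # depends on the 1% bucket
```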
"We have all sorts of tools to determine how new features are performing - lots of logging and graphs. But the ultimate test is whether people are actually using it," Pobar said.
"New Zealand is often the country of choice to turn features on and experiment - it is a mini-subset of the world in a lot of ways. We collect data on use, share it with the team and ask ourselves: are we moving the needle in the right direction? Then its rinse, repeat, iterate."