NAB deploys Chaos Monkey to kill servers 24/7

Engineers allowed full night's sleep.

The National Australia Bank has deployed the Netflix-developed 'Chaos Monkey' tool on a 24/7 basis to give its website development team some relief from needing to respond to server emergencies outside of work hours.

The application was developed by Netflix to constantly test the resiliency of its Amazon-based infrastructure. It randomly kills servers within the architecture to make sure the system can compensate for the failure.
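The idea of randomly killing servers and checking that the system recovers can be illustrated with a toy simulation. This is only a sketch, not Netflix's tool or NAB's setup: `ServerPool`, `chaos_monkey_step` and `run_experiment` are hypothetical names, and the "servers" are just strings in memory rather than real cloud instances.

```python
import itertools
import random


class ServerPool:
    """Toy stand-in for a self-healing group of web servers (hypothetical)."""

    def __init__(self, desired: int):
        self.desired = desired
        self._ids = itertools.count(1)
        self.servers = set()
        self.heal()

    def heal(self) -> None:
        # Launch replacements until the pool is back at its desired size.
        while len(self.servers) < self.desired:
            self.servers.add(f"web-{next(self._ids)}")


def chaos_monkey_step(pool: ServerPool, rng: random.Random) -> str:
    """Kill one randomly chosen server and return its name."""
    victim = rng.choice(sorted(pool.servers))
    pool.servers.remove(victim)
    return victim


def run_experiment(rounds: int = 5, desired: int = 3, seed: int = 0) -> ServerPool:
    """Repeatedly kill a random server, then check that capacity recovers."""
    rng = random.Random(seed)
    pool = ServerPool(desired)
    for _ in range(rounds):
        chaos_monkey_step(pool, rng)
        pool.heal()
        # The resiliency check: the pool must be back at full strength.
        assert len(pool.servers) == pool.desired
    return pool
```

Calling `run_experiment(rounds=10, desired=4)` kills ten servers one after another and returns a pool that still holds four. In a real deployment the healing step would be performed by the cloud platform, not by the test itself.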

NAB migrated the public-facing areas of its website to the AWS public cloud in September last year.

Speaking at the Amazon Web Services Sydney summit today, the bank's head of digital and online channel services, David Broeren, said the effort was aimed as much at staff resiliency as IT resiliency.

"There are tens of billions of dollars that go through the bank every day, it is a very stressful job, so if there is anything I can do to make that job easier I will," he said.

Chaos Monkey runs directly on the production environment, which Broeren said is the only way to get the full effect of the tool.

"We have it going 365 days a year, 24/7. It is running now - it could be killing a server as we speak."

Joining the NAB menagerie is the 'Bees with Guns' load testing tool, which Broeren and his team use in their development environment to ensure new releases can cope with "brute force" caused by spikes in demand.

The AWS cloud alerting tool then triggers an automatic scaling out of resources available to the website to deal with the increase.

"From there it's pretty simple, you take the bees away and Amazon tethers us back to where we started."
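The scale-out and scale-back behaviour Broeren describes can be sketched as a simple sizing rule. This is an illustration under stated assumptions, not the AWS API or NAB's configuration: the function name `desired_servers`, the per-server capacity of 100 requests per second, and the minimum and maximum pool sizes are all made up for the example.

```python
import math


def desired_servers(
    requests_per_sec: float,
    capacity_per_server: float = 100.0,  # assumed capacity, illustrative only
    minimum: int = 2,                    # assumed floor for the pool
    maximum: int = 20,                   # assumed ceiling for the pool
) -> int:
    """Return how many servers a toy autoscaler would run for a given load."""
    needed = math.ceil(requests_per_sec / capacity_per_server)
    # Clamp to the configured bounds so the pool never vanishes or runs away.
    return max(minimum, min(maximum, needed))


# Baseline traffic, a simulated load-test spike, then the spike removed.
baseline = desired_servers(150)
under_load = desired_servers(1250)
after_test = desired_servers(150)
```

With these assumed numbers the pool grows from 2 servers to 13 while the load test runs, then returns to 2 once the extra traffic is taken away.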

The new tools have allowed NAB to remove the monitoring thresholds that would flash orange when servers began to struggle, and cause phones to start ringing at all hours of the day.

"Autoscale, plus Chaos Monkey, actually takes something that would traditionally be a high severity incident - that is the loss of a server - and turns it into a [much less worrying] information incident."

"It has allowed us to give that time back and that is the investment into a resilient workforce," he said. "We have given our people back a quality of life that they didn't have."
