Westpac staff voluntarily switched off ATM, EFTPOS and online banking services yesterday morning, iTnews can reveal, to avert a potentially far more severe outage.
The bank’s ATM, EFTPOS and online banking services were cut yesterday morning after an air conditioning unit failed at Westpac’s Ryde (Sydney) data centre. The fault was first noticed at 5am.
ATM and EFTPOS services were back online by 11am, but online banking wasn't available until 4:30pm.
Whilst Westpac won’t be able to provide a post-incident report until next week, a spokesman for the company today explained to iTnews why engineers made the agonising choice to switch off the services.
Upon discovering the cooling fault at 5am, IT engineers at the data centre faced three options: leave the servers and storage operating at dangerous temperatures, which could have resulted in a far more serious meltdown; execute the bank’s business continuity plan and shift workloads to another facility; or switch the machines off until the air conditioning unit could be replaced.
The first option could have exposed Westpac to days or weeks of outages and the potential for data corruption or lost data.
The second option, switching to a secondary disaster recovery facility, was deemed to take too long.
The Westpac spokesman said engineers considered that it would take far less time to switch off the machines, wait for a third party to swap out the cooling units, and then reboot. (The building is owned by Mirvac and the IT infrastructure is outsourced to IBM.)
The right call in the wrong situation?
The key question for Westpac’s board: why would its disaster recovery plan take so long to execute?
iTnews has discussed the build of ‘active-active’ data centre configurations, in which ‘warm’ servers in secondary facilities can take on workloads from production systems in less time than the five-plus hours Westpac took to bring EFTPOS and ATM services back online, or the eight-plus hours it took to restore online banking.
Varghese Jacob, designer of data centres for many blue-chip Australian companies, stressed that the industry "expects disaster recovery rollover times to be fast - a matter of a few minutes or hours."
"It shouldn't be quicker to shut down and reboot," he said.
Whilst Jacob can't speak for Westpac, he said organisations often don't regularly test the business continuity plans they have in place.
In this case, Westpac’s engineers are likely to have made the right call. But they would have good cause to turn around to the bank’s management and ask why it hadn’t put some of its $4 billion in profits into the best business continuity money can buy.
Surely availability is second only to security among the bank’s priorities.