Westpac: Quicker to reboot than press DR alarm

 

Why Westpac made the right call to switch off services.

Westpac staff voluntarily switched off ATM, EFTPOS and Online Banking services yesterday morning, iTnews can reveal, to avert a potentially far more severe outage.

The bank’s Automatic Teller Machine, EFTPOS and Online banking services were cut yesterday morning after the failure of an air conditioning unit at Westpac’s Ryde (Sydney) data centre, first noticed at 5am.

ATM and EFTPOS services were back online by 11am, but online banking wasn't available until 4:30pm.

Whilst Westpac won’t be able to provide a post-incident report until next week, a spokesman for the company today explained to iTnews why engineers made the agonising choice to switch off the services.

Upon discovering the cooling fault at 5am, IT engineers at the data centre faced three options: leave the servers and storage operating at dangerous temperatures, which could have resulted in a far more serious meltdown; execute the bank’s business continuity plan and shift workloads to another facility; or switch the machines off until the air conditioning unit could be replaced.

The first option could have exposed Westpac to days or weeks of outages and the potential for data corruption or lost data.

The second option, switching to a secondary disaster recovery facility, was deemed to take too long.

The Westpac spokesman said engineers considered that it would take far less time to switch off the machines, wait for a third party to swap out the cooling units and reboot (the building is owned by Mirvac and the IT infrastructure is outsourced to IBM).

The right call in the wrong situation?

The key question for Westpac’s board: why would its disaster recovery plan take so long to execute?

iTnews has discussed the build of ‘active-active’ data centre configurations, in which ‘warm’ servers in secondary facilities can take on workloads from production systems in far less time than the five-plus hours Westpac took to bring EFTPOS and ATM services back online, or the eight-plus hours it took to restore online banking.

Varghese Jacob, designer of data centres for many blue-chip Australian companies, stressed that the industry "expects disaster recovery rollover times to be fast - a matter of a few minutes or hours."

"It shouldn't be quicker to shut down and reboot," he said.

Whilst Varghese can't speak for Westpac, he said organisations often fail to test their business continuity plans regularly.

In this case, Westpac’s engineers are likely to have made the right call. But they would have good cause to turn around to the bank’s management and ask why it hadn’t put some of its $4 billion profit into the best business continuity money can buy.

Surely availability is secondary only to security in terms of the bank’s priorities.

Copyright © iTnews.com.au . All rights reserved.


Comments: 8
daver
May 6, 2011 7:40 AM
The second last paragraph is the punch line. You only get robust enough and "fast" DR infrastructure with the right amount of time, technology and ultimately money put aside for it. Oh, and how many CRAC units do they have, or should they have, and of what type? Let's not forget UPS quantity and capacity!
Danielrollston
May 6, 2011 10:02 AM
With the right control system on things like chillers, remote monitoring of key parameters as part of a vendor maintenance plan could have foreseen any upcoming issues before they happened... I wonder what control system and other equipment they had in there?
RaTTyRaTT
May 6, 2011 10:32 AM
Actually, I noticed the ATM network up by 11am, but the internet banking was back online by 1pm. I paid some bills then :-)

Still, I'm very happy with the response from Westpac, crappy situation - but good handling. If people can handle a short outage like that - then the world is not going to hell as fast as I thought it was. (living in a gimmie, gimmie society...LOL!!!)

Reading between the lines, I would surmise that the 'failure of an AC unit' was probably not the single cause of this - just what is publicly being released. My guess is systematic failure of multiple components - to cause the kind of heat imbalance they are stating would have occurred.
I've seen such things downplayed before by others, to avoid a PR disaster, etc. Worst was when a sparky once sliced the cable (silly bugger) on the wrong side of the UPS circuit - which killed the entire DC power environment. (mind, questions were raised why no redundancy existed there... but that's Govt. for ya!)
It was downplayed to state it was a UPS failure, blamed on the installer - quietly swept under the carpet & things moved on. (New DC later - still same crap... LOL!!!)
RaTTyRaTT
May 6, 2011 10:36 AM
Ironically, heat loads from systems these days are actually higher, if you look at some of the testing that has been done by vendors regarding their equipment. The sweet spot (holy grail) has always been around 22 - 24C; however, I have seen numbers around 28 - 31C and still no impact on functionality.

Mind, the density that Westpac probably has - would push above 40C I would guess. That is also because of SAN storage mostly, blades dump a lot of heat - but nothing that can't be dissipated over time. SAN storage just runs HOT. (Still remember the wonderful warm feeling in the dead of winter, standing behind the SAN racks - greatest place to be during -7C nights & 0 - 6C days...)

:-)
Bob
May 6, 2011 2:09 PM
Westpac would have thousands of branches and ATMs and these would access the data centre by a network along with links to other financial institutions. Invoking a DR plan for a major bank would be a big call.

If you are going to pull the plugs and move them to another site, that's going to take a long time, even assuming the DR centre was ready to go. At some stage in the future you are also going to need to bring it all back to the main centre, resulting in another outage.

A disaster is more like a complete loss of the facility, such as an earthquake or explosion destroying it, where you are not coming back. A disaster is not someone putting the wrong milk in your latte.
laticslad
May 6, 2011 3:17 PM
In today's age, this is completely unacceptable. Neither Facebook nor Google would be unavailable because of a DC air conditioner problem; they would have seamless switchover to other DCs. For a major bank in Australia that has just released a $4 billion profit not to have such a robust infrastructure is quite frankly unforgivable.
umbria
May 6, 2011 3:18 PM
Brett is right - the IT staff had no other choice, and the blame falls squarely on the board for not agreeing to fund hot-swappable DR facilities for all customer-facing live services. When the failed site is available again, data already updated at the DR site is rolled forward to the offline site, and when they are mirrored again, they return to the normal, redundant operating condition. It is unforgivable that Westpac does not have this arrangement in place. The same applies to all large companies with customer-facing facilities, but especially banks.

RaTTyRaTT and Bob, Westpac's greed left customers all over the world standing at petrol bowsers with no cash to pay for fuel already pumped, and at airport counters unable to pay for flights. It left mothers with full shopping trolleys unable to pay. And online banking was offline for the entire business day, from 0600 to 1630, which would have left many customers without a window in their day to move money to cover direct debits.

This was a major, major failure, which parallel running data centres automatically prevent. Shame.
pameacs
May 7, 2011 8:33 AM
Poor management, so many businesses don't have effective DR plans. Technology today has some excellent products, no matter what vendor, to allow it, and if your vendor can't supply them you need to be moving to one who can. I remember one of the big banks or stock traders affected directly by 9/11 was operational with partial services in their DR in a few hours and was fully operational there in a few more. It was a case study I think IBM trotted out for a while after, they may still do. Yes it can be done, no it doesn't have to be hellishly expensive if you have good architects who don't have personal allegiances to a vendor and are willing to do their homework. It does have to be tested routinely. This makes them effective.
Imagine if pilots did DR like some organisations do. "Hold on, we have engine number 4 out. Ahh, just shut it down. OK, now we have number three out. Look, let's just shut them all down just to be on the safe side." Hmmmmm. Pilots have training and operations manuals to handle DR and so should Westpac. Let's face it, they are one of 4 businesses that have had record profits for the last ten years consecutively, they have no excuse.
Then again this is probably another reason for not outsourcing, loss of control of important business systems.

Comments have been disabled for this article.
 
 
 