At 1:00am this past Tuesday, technicians at Melbourne IT-owned hosting company WebCentral detected a fault on what iTnews now understands to have been an IBM SAN (storage area network) controller.
It was the first in a series of system failures that tested the company’s capacity to deal with a crisis.
In the aftermath of the incident, Melbourne IT chief technical officer Glenn Gore has been left the job of explaining what went wrong.
Gore conceded that Melbourne IT failed to adequately communicate with customers and that a 72-hour outage is simply not good enough.
But he insisted there was a good reason it took so long to restore service.
“We might have looked a bit slow in our response, but the integrity of customers’ email was absolutely the top priority. There was a lot of checking and rechecking of data,” he said. “To the best of our knowledge, we haven't lost any mail.”
The outage, blow-by-blow
WebCentral technicians note that one of the controllers on a storage array has failed, creating instability in the mail platform.
The SAN in question supports a number of services, none larger than WebCentral’s shared email, and is storing some 20 terabytes of data.
Gore refused to divulge which vendor supplied the array that failed, but iTnews has since learned it was an IBM array.
Gore insisted it was not the same SAN blamed for a WebCentral web hosting outage a month earlier. That system is hosted in a separate data centre altogether, he said.
The SAN is operational but not behaving properly. As users come online to check their mail, the mail platform (made up of eight front-end processing servers and ten back-end services connecting to some ten terabytes of mail data) cannot keep up.
By 9:30am customers are noticing problems with accessing their email.
The underlying storage unit fails. WebCentral appoints a “Critical Incident Manager” and an incident team to deal with the problem.
WebCentral technicians spend most of Tuesday focused on recovering the SAN, with the help of staff from the SAN vendor in question.
“It was effectively offline,” Gore says. “It took all day to get the SAN to a recoverable state.”
A flood of customer calls washes in, reporting connection issues.
Melbourne IT’s communication teams record a message informing customers of the problem, which is set to automatically play when customers call in. But the volume of calls is so large the recorded message feature stops working.
Melbourne IT estimates that only 25 per cent of callers ever heard the message.
WebCentral technicians work into the evening and are able to restore the SAN without having to revert to back-up data.
A morning shift of technicians arrives at 6:30am to relieve the night shift, which works through until between 8:00am and 9:00am to ensure a smooth changeover before the morning rush of email.
The team start the mail system servers back up to look at the file system and see what state it is in. All the servers mount the file systems that hold the mail data successfully.
As the morning load of email comes on, the mail platform begins suffering from data corruption issues. Some of the back-end services are crashing and coming back online with data corruption errors.
Fearing the potential for data loss, WebCentral technicians take the system offline and begin analysing the file systems to check on the integrity of the data.
With the system offline, customers again have no access to mail.
By midday, WebCentral has lost 50 per cent of its 10 mail stores due to corruption.
The sheer size of the data stores in question made checking the integrity of the data a long and laborious process, Gore said.
“Your best case scenario would be for each of the three integrity checks [required] to take two to three hours per message store. You run the integrity check once without modification, which is two to three hours. You run it again with modifications enabled: another two to three hours. Then you run it for a third time to make sure the modification hasn't caused problems.”
“We had to run each of these checks five and six times per message store to make sure we cleared the corruption,” Gore said. “We didn't bring the servers back online until we could guarantee there was no corruption in a given mail store.”
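As a rough illustration of the arithmetic Gore describes, the sketch below totals the checking time for a single message store. It assumes the passes run one after another, and it reads "five and six times" as five to six passes in total per store; that reading is an assumption, not something Gore confirmed, and the function name is purely illustrative.

```python
def check_hours(passes_min, passes_max, hours_min=2, hours_max=3):
    """Return (lowest, highest) total hours for sequential integrity passes
    on one message store, given two to three hours per pass by default."""
    return passes_min * hours_min, passes_max * hours_max


# Best case quoted by Gore: three passes per store.
best_case = check_hours(3, 3)

# Assumed reading of the actual incident: five to six passes per store.
reported = check_hours(5, 6)

print(f"Best case: {best_case[0]} to {best_case[1]} hours per store")
print(f"Reported:  {reported[0]} to {reported[1]} hours per store")
```

On those assumptions, the best case works out to six to nine hours per store and the reported case to 10 to 18 hours. If each of the three checks was itself repeated five or six times, the totals would be three times higher.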
These data integrity checks continue into the night and Thursday morning.
By the time the morning rush hits on Thursday, WebCentral is confident the SAN and mail stores are 100 per cent back online.
At 9:00am customers are able to access some of the backlog of emails, but by 9:30am connection issues set in from the large number of customers trying to access accounts.
Gore claims WebCentral maintains 50 per cent excess capacity on its SAN and server processing capacity to “soak up spam attacks” and the like. But with two days of queued emails coming back online, the system can’t handle the load.
“Mail systems were trying to deliver two days of queued mail,” he said. “Plus we had customers trying to access their mail on clients.”
The mail platform behaves inconsistently up until lunchtime.
By 12:40pm there is a “big improvement in the behaviour of the mail platform.”
WebCentral engineers make a few tweaks to the routing on their network to improve speeds. But there are still intermittent errors – due mostly to the volume of queued mail.
“By this time customers had been down a couple of days,” Gore said. “Many had changed the settings on their client looking for a workaround.”
WebCentral attempted to advise a small subset of customers to revert their settings to normal to begin receiving mail again.
Technicians spent the remainder of Thursday night making minor changes and communicating with customers.
Gore and his team believe there to be no further problems with the mail platform, but hold off officially calling it resolved until well into the afternoon, paying close attention to how it handles the morning load.
Customers report the system is working as per normal.