Telstra executives have opened up about the causes of three major outages that hit Telstra users in the past two months in an effort to regain the trust of the telco's vast customer base.
Australia's largest telco was forced to offer two 'free data' days to compensate users for three significant outages since the start of the year.
Yesterday's free data day saw customers slam the network to download a record 2686 TB in the 24-hour period, 46 percent more than the amount of data consumed during the February free data day.
Speaking to the CommsDay Summit today, and later to a media briefing at Telstra HQ, the telco's chief operations officer Kate McKenzie said the company was making every effort to ensure such incidents do not occur again.
She said the telco's initial review into the individual matters found the outages were not related, although two resulted from problems processing the mass registration of mobile devices.
McKenzie stressed that at no time did the Telstra network suffer a system-wide failure.
"[However] each of these events did impact varying numbers of our customers, and we are working to ensure this does not happen again," she said.
What went wrong?
On the morning of February 9, one of the signalling nodes used to manage Telstra's 3G and 4G data sessions and voice calls on its mobile network developed a fault, McKenzie said. The disruption was attributed at the time to an "embarrassing" human error.
"With evidence of increasing degradation of the health of the node and potential service risk, a decision was taken to isolate the node from the network - a standard operating procedure for such an event," McKenzie said.
The node was removed from the network at around 12:30pm, about an hour and a half after the problems were first spotted, but further problems soon arose.
"Due to processes not being followed properly, the subsequent node restart initiated incorrectly," McKenzie said.
"This meant that 15 percent of all mobile devices connected through this node needed to re-register when establishing a new voice call or data session."
The mass re-registration of affected mobile devices overloaded other mobile signalling nodes, meaning customers were unable to make new voice calls or access data.
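The overload described here, where many devices re-register at once and swamp the nodes handling them, can be illustrated with a toy model. The sketch below is not Telstra's system: the function name, the capacity figures and the rule that throughput collapses once offered load exceeds capacity are all assumptions chosen for illustration. A per-interval cap on admitted registrations is included only as a contrast case.

```python
def ticks_to_register(devices, capacity, admit_limit=None, max_ticks=1000):
    """Toy model of a re-registration storm (all figures hypothetical).

    devices     -- number of devices that must re-register
    capacity    -- registrations the node can complete per tick when healthy
    admit_limit -- optional cap on registration attempts admitted per tick
    Returns the number of ticks until every device is registered,
    or None if that does not happen within max_ticks.
    """
    pending = devices
    for tick in range(1, max_ticks + 1):
        # Without a cap, every pending device retries on every tick.
        offered = pending if admit_limit is None else min(pending, admit_limit)

        if offered <= capacity:
            completed = offered
        else:
            # Assumed overload behaviour: useful throughput shrinks in
            # proportion to how far the offered load exceeds capacity.
            completed = capacity * capacity // offered

        pending -= completed
        if pending == 0:
            return tick
    return None


# 100,000 devices against a node that can handle 1,000 registrations per tick.
unthrottled = ticks_to_register(100_000, 1_000)
throttled = ticks_to_register(100_000, 1_000, admit_limit=1_000)
print("unthrottled:", unthrottled)
print("throttled:", throttled)
```

In this model the capped run finishes steadily because the node is never pushed past its capacity, while the uncapped run spends most of its effort on attempts that fail and are retried.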
The telco decided to prioritise voice services over data services to get customers back online as quickly as possible, McKenzie said, claiming most affected data services were restored by 1pm. All services were restored at around 2:30pm, she said.
The employee responsible for the outage is still working for the company, McKenzie said.
"We're not into victimising people. We understand in the heat of the moment, the right decisions aren't always made," she said.
The next big outage - on March 17 - saw customers unable to make 2G, 3G and 4G voice calls or access data from around 6pm. Around 50 percent of Telstra users were affected.
The issue occurred when a significant number of international roaming customers were unexpectedly disconnected from the Telstra network. Domestic customers then followed.
The initial trigger was an international cable fault that disconnected parts of Telstra's signalling and caused many IP addresses to be readvertised, network chief Mike Wright said.
McKenzie said automated attempts to reconnect all the affected users at the same time overloaded the database used to register devices, as in the previous outage.
"[We] limited the volume of 4G signalling for the devices reconnecting to the network, and configuration changes were made in the mobile network to speed up recovery. These changes reinstated network stability," she said.
Just five days later on March 22, a number of Telstra mobile, IP telephony and NBN voice customers were unable to make or receive voice calls for several hours in the morning, particularly around Victoria and Tasmania.
McKenzie said the incident only affected around 3 percent of customers.
The issue stemmed from a card failure in a media gateway in Victoria, preventing certain calls from getting through.
Telstra has commenced a wide-ranging review of its network, led by McKenzie and drawing on the help of "external experts" from around the world, as well as network partners Cisco, Ericsson and Juniper.
"We have already progressed short to medium term actions to improve resilience and robustness in the network," McKenzie said.
"Changes have been implemented to increase the capacity and path diversity of critical signalling channels, and a temporary layer of traffic management protection has been added to minimise the impact of events like that we saw in March and February."
Within the next few days the telco will augment capacity in its home location register - which manages customer subscription data - by adding a blade processor to help minimise the impact of mass re-registration of devices.
McKenzie pointed to the record 2686 TB of data that was consumed in its free data day on Sunday to indicate that Telstra's network was capable of handling strain.
"It's the re-registration of devices that is the problem," she said.
Immediate findings from the review will see the telco introduce configuration changes and new rules for interfacing, among other short and longer term changes.