Telstra missed a transmission controller card failure amid a sea of network alarms in early May, meaning it also missed early warning signs of problems with Triple Zero emergency call routing.
A post-mortem of the Triple Zero outage on May 4 - which was initially blamed by Telstra on a lightning strike - shows a far more complex chain of events led to the outage.
The origins of the incident have been traced to a controller card failure in a transmission device located in a Victorian exchange.
The failure triggered an alarm at 3.46am on May 3; however, it was one of “approximately 26,000 alarms present in the alarm system, 13,000 of which were rated as critical”.
Although that volume of alarms was considered “typical” by Telstra, it appears the sheer number - combined with the appearance of only “intermittent” transmission issues in and out of the exchange - caused the card failure alarm to be overlooked.
However, as the post-mortem states, “the ‘loss of communications’ meant that no subsequent alarms were presented to Telstra network management staff in the event of any further failure in the transmission equipment, because that equipment was disconnected from the monitoring system.”
The post-mortem reveals that Telstra experienced a series of other problems with its network over the course of the following 24 hours, but did not recognise the root cause as being the still-overlooked card failure.
In that time - and unbeknownst to Telstra - ACT emergency services started experiencing problems with Triple Zero call routing.
The Triple Zero problems worsened from 2.05am on May 4, when a pit fire occurred near Orange in regional NSW. The fire, combined with the card problem, led to widespread PSTN routing problems.
It was only at 7.30am on May 4 - about 28 hours after the card failure alarm was missed - that “Telstra identified the potential significance of the Link 1 transmission controller card failure that commenced on 3 May.”
“Actions were initiated to replace the controller card, which was suspected of preventing Link 1 from carrying PSTN traffic inbound from NSW, the ACT and Queensland to the Melbourne Triple Zero call centre, and outbound to NSW, the ACT and Queensland from the Melbourne Triple Zero call centre,” the post-mortem, published by the Department of Communications, shows.
Full Triple Zero connectivity was restored at around midday on May 4.
However, the large-scale problems meant only 5148 out of 12,224 calls placed with Triple Zero actually reached a Triple Zero operator. Others hung up or failed to connect.
The aftermath has seen Telstra make a series of upgrades to its network monitoring, and commit to other actions through a court-enforceable undertaking with the Australian Communications and Media Authority (ACMA), which was also released today.
“Telstra has implemented improvements to its alarm monitoring systems and service monitoring dashboards to increase the accuracy and timeliness in detecting events that may impact emergency call services,” the undertaking states.
“These improvements include using data analytics to provide increased capabilities in monitoring and to identify patterns in network events that may impact emergency calls when there is a lack of root cause parent alarms.
“[They also include] improvements to the emergency calls dashboard to enhance the visibility of events which may impact emergency calls and to minimise display latency.”
The government report indicated that the analytics system put in by Telstra is capable of providing “synthetic critical alarms arising from the collation and correlation of multiple lower order alarms”.
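As a rough illustration of what correlating lower order alarms into a synthetic critical alarm could look like, the sketch below raises one critical alarm when a single device produces several non-critical alarms within a short window. It is not Telstra's implementation: the `Alarm` type, the `synthesize_critical` function, the 10-minute window and the threshold of three alarms are all assumptions made for the example.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    device: str       # hypothetical device identifier
    severity: str     # "minor", "major" or "critical"
    timestamp: float  # seconds since an arbitrary start point


def synthesize_critical(alarms, window=600.0, threshold=3):
    """Return synthetic critical alarms for devices that raise at least
    `threshold` non-critical alarms within `window` seconds.

    The window and threshold values are illustrative assumptions only.
    """
    recent = defaultdict(deque)  # device -> timestamps of recent lower order alarms
    synthetic = []
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        if alarm.severity == "critical":
            continue  # already critical, nothing to correlate
        stamps = recent[alarm.device]
        stamps.append(alarm.timestamp)
        # Discard alarms that have fallen out of the correlation window.
        while stamps and alarm.timestamp - stamps[0] > window:
            stamps.popleft()
        if len(stamps) >= threshold:
            synthetic.append(Alarm(alarm.device, "critical", alarm.timestamp))
            stamps.clear()  # avoid re-raising for the same burst
    return synthetic


alarms = [
    Alarm("link-1", "minor", 0),
    Alarm("link-1", "major", 100),
    Alarm("link-2", "minor", 150),
    Alarm("link-1", "minor", 200),
    Alarm("link-1", "minor", 5000),
]
result = synthesize_critical(alarms)
```

In this example only `link-1` crosses the threshold, so a single synthetic critical alarm is produced for it at the time of its third alarm. A production system would presumably also correlate across devices and network topology, which this sketch does not attempt.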
It also said Telstra had “automated tickets of work to initiate incident restoration activities”, aimed at preventing a repeat of the 24-hour delay in assigning the job of replacing the card.
Telstra also updated software on all its core routers and uplifted its incident management and stakeholder communications plans in response.
“This was the first serious disruption to the Triple Zero service in more than 50 years,” Communications Minister Mitch Fifield said in a statement.
“With the measures the government is putting in place, Australians can feel confident the service will have greater safeguards in times of need.”
The Department of Communications said it is “also in discussion with Telstra to implement a new IP platform to facilitate next generation Triple Zero capabilities, as well as Advanced Mobile Location (AML) to provide more accurate location information by automatically sending coordinates to Triple Zero.”