Telstra outage falls to procedural error

 

Analysis: Were the proper procedures in place?

Network engineers have speculated that router configuration and route-limiting procedures on Telstra equipment were to blame for network outages that affected millions of services yesterday.

Seen as one of Australia's worst in recent years, the 35-minute outage saw Telstra routers incorrectly redirect traffic from its own subscribers, and from those of its wholesale customers, through ISP Dodo's routers, where it effectively hit dead ends.

Millions were affected on Telstra's domestic networks as well as those operated by iiNet, Optus, the four major banks and multiple large enterprises.

NBN Co, also a peering partner on the affected router, did not respond to questions at time of writing.

Dodo CEO Larry Kestelman confirmed the outage was the result of "a hardware issue with a Cisco border router" on his company's network which attempted to change the way it routed traffic for its subscribers to the internet.

It is thought Dodo effectively "advertised the entire internet", a routing table of approximately 400,000 prefixes. These routes were accepted by Dodo's bandwidth wholesaler, Telstra, and propagated to all other Telstra customers accessing the internet through the telco's AS1221 network.

One observer, Michael Keating, explained that "any traffic destined to go overseas, was in fact effectively being routed back towards Dodo".

"This meant that Telstra lost the ability to communicate data overseas, the local network became saturated with data and became unstable, and anyone using Telstra for international capacity suddenly stopped working," he said.

Dodo's Kestelman yesterday said that, "in normal circumstances, this would not result in a network outage.

"However, it appears that these routes were accepted by Telstra and propagated to Telstra's downstream customers rather than Telstra simply filtering the routes.

"This caused major issues for Telstra and its customers which should have been avoided."

Though much of the blame for yesterday's outage has been laid on Dodo's network issues, network engineers told iTnews that Telstra should share it for failing to filter or limit the number of routes Dodo purported to provide for access to the internet.

Engineers said that, under best practice, Telstra would limit the number of prefixes Dodo could advertise to the network, using a configuration feature that has been available on Cisco and other routers for more than ten years.
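The prefix limit the engineers describe can be illustrated with a toy model. The Python sketch below is not router configuration and does not reflect Telstra's or Dodo's actual setup; the class name `BgpSession`, the limit of 50 and the generated test prefixes are all assumptions made for illustration. It shows the general idea only: once a customer announces more prefixes than agreed, the session is torn down and its routes are discarded rather than passed on.

```python
from dataclasses import dataclass, field


@dataclass
class BgpSession:
    """Toy model of one customer BGP session with a maximum-prefix limit.

    Hypothetical illustration only; real routers implement this in
    vendor-specific configuration, not in Python.
    """
    peer: str
    max_prefixes: int
    accepted: set = field(default_factory=set)
    established: bool = True

    def receive(self, prefix: str) -> bool:
        """Accept one advertised prefix, or shut the session if the limit is exceeded."""
        if not self.established:
            return False
        is_new = prefix not in self.accepted
        if is_new and len(self.accepted) >= self.max_prefixes:
            # Limit exceeded: discard everything learned from this peer
            # and close the session so nothing is propagated further.
            self.accepted.clear()
            self.established = False
            return False
        self.accepted.add(prefix)
        return True


def simulate_announcements(session: BgpSession, count: int) -> BgpSession:
    """Feed `count` made-up /24 prefixes into the session until it shuts down."""
    for i in range(count):
        prefix = f"10.{i // 256}.{i % 256}.0/24"  # synthetic test prefixes
        if not session.receive(prefix):
            break
    return session


if __name__ == "__main__":
    normal = simulate_announcements(BgpSession("customer", max_prefixes=50), 20)
    leak = simulate_announcements(BgpSession("customer", max_prefixes=50), 400)
    print("normal:", normal.established, len(normal.accepted))
    print("leak:  ", leak.established, len(leak.accepted))
```

In this model a customer that normally announces a few dozen prefixes is unaffected, while one that suddenly announces hundreds loses its session before the routes reach anyone else.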

Bad practice

Geoff Huston, chief scientist at the Asia Pacific Network Information Centre (APNIC), said Telstra used to employ prefix filters but he could not ascertain whether they were still in use on any of the routes.

Recent practice at Telstra and other Australian carriers had been to rely on administrative processes and "trust" in operating the Border Gateway Protocol (BGP), the routing protocol at the centre of yesterday's issues.

"I suspect that we're not as careful as we should be with the use of routing databases - in fact Australia is pretty bad in its use of routing databases," he said.

"Certainly ten years ago, we did use customer-level filters. Something like Dodo couldn't have happened then."

Vocus managing director James Spenceley said such filters remain common practice among other Australian carriers but did not appear to have been in use, at least in Dodo's case.

"Filtering has been absolutely mandatory best practice since the 90s," he said. "We've all invested our time and money to make sure we can do it."
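The customer-level filtering that Huston and Spenceley describe can be sketched in the same spirit. The Python example below assumes a hypothetical list of prefixes registered to a customer in a routing database (the addresses are reserved documentation ranges, not Dodo's real allocations) and checks each announcement against it. The function names and the /24 maximum length are illustrative assumptions, not any carrier's actual policy.

```python
import ipaddress


def build_filter(registered_prefixes):
    """Parse the prefixes a customer has registered (hypothetical data)."""
    return [ipaddress.ip_network(p) for p in registered_prefixes]


def is_permitted(prefix, allowed, max_len=24):
    """Return True if `prefix` falls inside a registered block.

    `max_len` is an assumed policy: announcements more specific than
    this are rejected even when they sit inside a registered block.
    """
    net = ipaddress.ip_network(prefix)
    if net.prefixlen > max_len:
        return False
    return any(
        net.version == block.version and net.subnet_of(block)
        for block in allowed
    )


def filter_announcements(announced, allowed):
    """Split announcements into those passed on and those dropped."""
    accepted = [p for p in announced if is_permitted(p, allowed)]
    rejected = [p for p in announced if not is_permitted(p, allowed)]
    return accepted, rejected


if __name__ == "__main__":
    # Documentation-only address ranges standing in for a customer's space.
    allowed = build_filter(["203.0.113.0/24", "198.51.100.0/24"])
    announced = ["203.0.113.0/24", "192.0.2.0/24", "0.0.0.0/0"]
    accepted, rejected = filter_announcements(announced, allowed)
    print("accepted:", accepted)
    print("rejected:", rejected)
```

With a filter of this kind, a customer announcing routes for address space it has not registered would have those routes dropped at the upstream provider's edge instead of propagated to other customers.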

A post-incident report is expected from Dodo and Telstra in coming days. It is likely to reveal the extent to which each party was to blame for the outage, and whether any changes to administrative processes will be recommended to prevent a repeat.

A Telstra spokesman confirmed "steps were taken [Thursday] to add additional protection to our core network.

"A full review of Telstra’s network protection mechanisms is being undertaken," he said.

Some of the service providers affected by yesterday's outage said they would look to diversify peering arrangements in future to minimise the impact on their respective customer bases.

Huston urged Australian providers to become more practised in the use of routing databases in the meantime.

"Maybe we'd all be better off at protecting ourselves from these kinds of slip-ups which, let's face it, always will happen sooner or later," he said.

Update: Added updated Telstra statement and fixed minor inaccuracies.

Copyright © iTnews.com.au . All rights reserved.

