UberGlobal finds root of Friday outages

Ry Crozier | Jun 28, 2011 1:59 PM
Kit changes expose network design issue.

Web hosting firm UberGlobal has isolated the root cause of several outages late last week to the configuration of a spanning tree in its network.

The company – which owns brands including AussieHQ and Jumba – suffered three outages last Friday that affected customers in Sydney and Canberra.

A preliminary assessment suggested a piece of customer equipment that was moved on the evening of 23 June was the likely cause.

But continuing investigations uncovered a network design issue that was exposed by several equipment changes, according to a final post-incident report released today.

Specifically, the company moved a four-year-old load balancer to a different part of its environment late on 23 June.

The shift was part of a wider initiative at UberGlobal to segregate its burgeoning enterprise business from the network elements used by its smaller business/consumer web hosting brands.

When originally installed, the load balancer had been configured to act as the root bridge of a spanning tree for a small number of virtual LANs.

Spanning tree protocol is used to prevent loops in networks with redundant links by keeping only one active path between any two devices. The piece of kit at the top of the tree, from which those paths are calculated, is called the root bridge.
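As a rough illustration of how a root bridge is chosen, standard spanning tree elects the device with the lowest bridge ID, comparing a configurable priority first and the MAC address as a tie-breaker. The sketch below uses invented device names, priorities and MAC addresses; none of them come from UberGlobal's report.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class BridgeId:
    # Fields are compared in order: priority first, then MAC address.
    priority: int  # lower value wins the election
    mac: str       # tie-breaker; lower value wins


def elect_root(bridges: dict[str, BridgeId]) -> str:
    """Return the name of the device with the lowest bridge ID."""
    return min(bridges, key=lambda name: bridges[name])


# Hypothetical devices and values, for illustration only.
bridges = {
    "core-switch": BridgeId(4096, "00:1a:2b:00:00:01"),
    "access-switch": BridgeId(32768, "00:1a:2b:00:00:02"),
    "load-balancer": BridgeId(32768, "00:1a:2b:00:00:03"),
}

root = elect_root(bridges)
print(root)
```

Setting a low priority on a device, as in the hypothetical core switch here, is how an administrator nominates it as root rather than leaving the choice to MAC addresses.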

In isolation, moving the load balancer did not cause an issue, nor did the addition of a new IBM BladeCenter to the environment a day later.

But both forced recalculations of spanning trees in UberGlobal's management network. It was the second recalculation – when the BladeCenter was introduced – that led to the outage.

After the recalculation, the load balancer moved on 23 June could not 'see' a legitimate root in its new location, so it attempted to elect itself to the role.
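The self-election described above follows from how spanning tree devices settle on a root: each starts by assuming it is the root and gives way only when it hears a better (lower) bridge ID. A minimal sketch of that rule, with made-up bridge IDs written as (priority, MAC) tuples, is below; it is a simplified model and not a reconstruction of UberGlobal's network.

```python
def believed_root(own_id, heard_ids):
    """Return the bridge ID this device accepts as root.

    The device considers its own ID alongside every root ID it has
    heard advertised, and picks the lowest.
    """
    return min([own_id, *heard_ids])


# Hypothetical bridge IDs, for illustration only.
legitimate_root = (4096, "00:1a:2b:00:00:01")
load_balancer = (32768, "00:1a:2b:00:00:03")

# While the legitimate root is visible, the load balancer defers to it.
before = believed_root(load_balancer, [legitimate_root])

# If it hears no advertisements from that root, it claims the role itself.
after = believed_root(load_balancer, [])

print(before == legitimate_root, after == load_balancer)
```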

Other devices in the tree that could see the load balancer and the legitimate root started receiving conflicting messages from the two boxes.

The messages started to loop between affected devices in the tree, defeating the purpose of the spanning tree, which exists to prevent loops. An error-disable feature on UberGlobal's Cisco access switches recognised the loop error and shut down the affected ports.