Peer-to-peer communications provider Skype is considering new mechanisms to automatically update users to the latest versions of its software in the wake of a 24-hour outage suffered in the lead-up to Christmas.
The company has also promised to invest in an infrastructure upgrade.
Skype's services went down for 24 hours from December 22 into December 23 after a series of cascading faults upset the delicate balance of its P2P-delivered service.
Skype CIO Lars Rabbe has posted a blog entry explaining the root cause of the problem and how Skype intends to prevent it happening again.
Rabbe reported that a cluster of support servers responsible for Skype's offline instant messaging had become overloaded, which in itself would not have proven problematic.
Skype clients awaiting messages from the overloaded servers simply received a delayed response.
The real issue for users was an undiscovered bug in one version of the Skype for Windows client (version 5.0.0152), which could not process these delayed responses, causing the client to crash.
Skype estimated that some 50 percent of its subscribers were using this version of the client, and 40 percent of those clients crashed just as Skype entered its peak usage time.
Between 25 and 30 percent of Skype's 'supernodes', clients that act as a phone book to redirect requests between other users, were among those that crashed.
"A supernode is important to the P2P network because it takes on additional responsibilities compared to regular nodes, acting like a directory, supporting other Skype clients, helping to establish connections between them and creating local clusters typically of several hundred peer nodes per each supernode," Rabbe said.
"Once a supernode has failed, even when restarted, it takes some time to become available as a resource to the P2P network again. As a result, the P2P network was left with 25–30% fewer supernodes than normal. This caused a disproportionate load on the remaining available supernodes."
The pressure on remaining supernodes was considerable, with around one in five Skype clients attempting to reconnect to the network simultaneously. Traffic loads on the Skype network, Rabbe said, were around 100 times higher than usual.
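The "one in five" figure follows from the two percentages Skype gave: roughly half of all clients ran the affected version, and 40 percent of those crashed. A quick check of that arithmetic, with a hypothetical helper name chosen purely for illustration:

```python
def crashed_share(share_on_version: float, crash_rate: float) -> float:
    """Fraction of ALL clients that crashed, given the fraction running the
    affected version and the fraction of those that crashed."""
    return share_on_version * crash_rate


# Figures reported by Skype: 50 percent on the affected version, 40 percent of those crashed.
share = crashed_share(0.50, 0.40)
print(f"{share:.0%} of all clients, or about one in {round(1 / share)}")
```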
"Supernodes have a built-in mechanism to protect themselves and to avoid adverse impact on the systems hosting them when operational parameters do not fall into expected ranges," he said. "We believe that increased load in supernode traffic led to some of these parameters exceeding normal limits, and as a result, more supernodes started to shut down. This further increased the load on remaining supernodes and caused a positive feedback loop, which led to the near complete failures that occurred a few hours after the triggering event."
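The feedback loop Rabbe describes can be illustrated with a toy model. The sketch below is not Skype's algorithm: the function name, the node counts, the fixed total load and the rule that the share of nodes shutting down scales with the overload are all assumptions made for illustration. It only shows how losing a modest share of supernodes can tip the rest into a complete collapse.

```python
def simulate_cascade(total_nodes, initial_failed, total_load,
                     capacity_per_node, max_rounds=50):
    """Return the number of live supernodes at the start and after each round.

    Toy assumptions: total load stays constant and is spread evenly over the
    live nodes; when per-node load exceeds capacity, a fraction of nodes
    proportional to the overload shuts down to protect itself.
    """
    alive = total_nodes - initial_failed
    history = [alive]
    for _ in range(max_rounds):
        if alive == 0:
            break  # nothing left to fail
        load_per_node = total_load / alive
        if load_per_node <= capacity_per_node:
            break  # the remaining nodes can cope; the network is stable
        overload = load_per_node / capacity_per_node - 1.0
        # At least one node shuts down per overloaded round, at most all of them.
        shed = min(alive, max(1, int(alive * min(1.0, overload))))
        alive -= shed
        history.append(alive)
    return history


# Hypothetical network: 1,000 supernodes normally running at 80 percent of capacity.
print(simulate_cascade(1000, 100, 800.0, 1.0))  # 10 percent lost: stays stable
print(simulate_cascade(1000, 300, 800.0, 1.0))  # 30 percent lost: cascades
```

In this model the network absorbs the loss of 10 percent of its supernodes, but a 30 percent loss pushes the survivors over capacity, and each round of protective shutdowns raises the load on those that remain until none are left.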
Skype attempted to build new, larger supernodes (the preposterously named mega-supernodes) on the fly to absorb some of the additional load, and also disabled group video calling to ease traffic, but was unable to match the scale of the problem.
The outage highlights the risk associated with distributed systems: Skype simply had no control over the many distributed nodes around the world that mesh together to deliver its services.
Rabbe recommended users download the latest version of the Skype for Windows client.
"We will also be reviewing our processes for providing 'automatic' updates to our users so that we can help keep everyone on the latest Skype software," he said.
The company was also "reviewing our testing processes to determine better ways of detecting and avoiding bugs which could affect the system," he said.
"We know how much you rely on Skype, and we know that we fell short in both fulfilling your expectations and communicating with you during this incident. Lessons will be learned and we will use this as an opportunity to identify and introduce areas of improvement to our software, further assess and invest in capacity and stability, and develop better processes for outage recovery and communications to our user base."