A short disruption to the power supply in code-hosting service GitHub's data centre was behind the outage last week that left its services unavailable to users.
In a post-mortem of the incident, Scott Sanders of GitHub's engineering department said the interruption in power meant just over a quarter of the repository's servers and several network devices rebooted.
He said the front-end load balancers and application servers were unaffected, but could not serve requests because the backend systems were down.
Initially, GitHub thought it was once again the target of a large distributed denial of service (DDoS) attack, as happened in August last year, because an increase in connection attempts pointed to a network problem.
As a result, GitHub spent time early on during the incident bringing up its DDoS defences, before it realised no such attack was taking place.
Customers meanwhile were none the wiser, as GitHub's ChatOps team communications systems ran on the servers that rebooted. This led to an eight-minute delay before the status.github.com service indicator was set to red, notifying users that the site was down.
Several factors conspired to slow down restoration of service for GitHub.
Because the affected servers used a specific type of hardware and spanned many racks and rows in the data centre, several members of GitHub's Redis in-memory data store cluster were inaccessible.
Because GitHub had mistakenly added a hard dependency on the Redis cluster being fully available in the boot path of its application code, the applications failed to start.
GitHub had to urgently repair the servers that would not boot in order to restore the Redis clusters and allow the application processes to restart, Sanders said.
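The post-mortem does not describe GitHub's actual fix, but the general lesson is to avoid hard-failing the boot path when a backend such as Redis is briefly unreachable. A minimal sketch of the alternative, in illustrative Python (the function names and the stub are hypothetical, not GitHub's code), is a bounded retry with backoff that lets the application fall back to a degraded start instead of crashing:

```python
import time

def wait_for_backend(ping, retries=5, delay=0.01):
    """Retry a backend health check with exponential backoff instead of
    failing the whole boot when the backend is temporarily down."""
    for attempt in range(retries):
        if ping():
            return True          # backend reachable, boot normally
        time.sleep(delay * (2 ** attempt))
    return False                 # caller can boot in degraded mode

# Stub standing in for a Redis PING: down for the first two attempts,
# then back up -- mimicking a cluster whose members are still rebooting.
state = {"calls": 0}
def flaky_ping():
    state["calls"] += 1
    return state["calls"] > 2

booted = wait_for_backend(flaky_ping)   # succeeds once the stub recovers
```

The key design choice is that the return value is a signal, not an exception: the application can start serving whatever does not need the cluster, rather than refusing to start at all.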
"One group of engineers split off to work with the on-site facilities technicians to bring these servers back online by draining the flea power [residual current] to bring them up from a cold state so the disks would be visible," he wrote.
A firmware bug in the servers that rebooted meant they were unable to see their disks after power came back, Sanders said. GitHub will update the firmware on the servers in question and update its toolset to open tickets when firmware updates become available.
Sanders said GitHub does not believe it is possible to fully prevent large parts of its infrastructure from losing power, but the code repository will take steps to ensure recovery is fast and reliable, and to mitigate the negative impact on users.
Improved cross-team communications "would have shaved minutes off the recovery time", Sanders said, and better messaging to users about incidents will also be implemented as part of a number of improvements GitHub plans in order to reduce the effect of catastrophic outages.