
Ironically, by testing if they could safely shut down the reactor during a power failure, the operators triggered an unstable, and ultimately explosive, reactor configuration. Despite heroic efforts by the reactor staff to control the situation, the resulting radiation leak caused the evacuation of hundreds of thousands of people and much-debated health effects for years to come.
This sort of failure mode affects a wide range of systems. In September 1993, an Airbus A320 was landing in Warsaw during appalling weather. When the aircraft hit the runway and the crew attempted to engage reverse thrust to slow down, the A320's safety systems prevented it. The aircraft overshot the runway and two people were killed.
To prevent the obviously dangerous practice of engaging reverse thrust while the plane is airborne, the A320 was programmed to check that it was on the runway first. One of the checks was wheel rotation; unfortunately, in this case the plane was aquaplaning on the wet runway, so reverse thrust was locked out.
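As a rough illustration of the kind of interlock involved (the names and the threshold below are invented for this sketch, not Airbus's actual logic), the check amounts to something like this in C:

    #include <stdbool.h>

    /* Illustrative sketch only: invented names and threshold. */
    struct sensors {
        double wheel_speed_kts;   /* measured wheel rotation speed */
        bool   weight_on_wheels;  /* landing-gear compression switch */
    };

    /* Only permit reverse thrust when the aircraft appears to be firmly
     * on the runway. If the wheels are aquaplaning they may not spin up,
     * so the very check meant to add safety can lock out reverse thrust
     * exactly when the crew needs it most. */
    bool reverse_thrust_permitted(const struct sensors *s)
    {
        return s->weight_on_wheels && s->wheel_speed_kts > 72.0;
    }

The check is perfectly sensible on its own terms; the trouble is that it adds a new way for the system to fail.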
In safety systems research it soon becomes clear that there is a careful balancing act between the benefit of additional safety and backup systems and the extra risk that comes with increased system complexity. Each time you add another check and balance, you also increase the chance that something will go wrong.
The same logic applies to the world of computer security. Every time you add an error trap or input validation routine as a security check, you increase the complexity of the software. If you think this is a purely academic threat, review the literature on format-string vulnerabilities and consider how many error messages use sprintf or its relatives.
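To make that concrete, here is a minimal C sketch (the function names are hypothetical, not taken from any particular product) of an error-reporting routine that, in trying to report a problem helpfully, hands possibly attacker-influenced text to sprintf as the format string:

    #include <stdio.h>

    /* Unsafe: the message (which may contain attacker-supplied input,
     * e.g. a filename or username) is used as the format string, so any
     * conversion specifiers such as %x or %n in it are interpreted. */
    void log_error_unsafe(const char *msg)
    {
        char buf[256];
        sprintf(buf, msg);                 /* format-string vulnerability */
        fprintf(stderr, "error: %s\n", buf);
    }

    /* Safe: the untrusted data is passed as an argument, never as a format. */
    void log_error_safe(const char *msg)
    {
        fprintf(stderr, "error: %s\n", msg);
    }

The "extra" diagnostic code written to make failures easier to handle is precisely where the new hole appears.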
Much has been written about the risk management required when deploying software patches, since each patch can itself cause system problems. This isn't restricted to direct effects either: a recent batch of Microsoft patches triggered a global failure of the Skype internet telephone network. The fault was caused by a huge number of machines "hanging up", rebooting and knocking at the door to come back in.
Less well discussed is the risk management of bug fixes in the first place: should bugs always be fixed? Of course, there's a large collection of bugs for which the answer is clearly "yes, and quickly". However, in the security business in general, and the cryptology world in particular, we specialise in particularly esoteric risks.
Recently, this exact problem came up with a bug reported in the OpenSSL toolkit, possibly the most commonly used crypto software on the planet. The vulnerability involved a "side-channel" attack, in which the attacker deduces information about secret data by watching what happens elsewhere on the machine (cache performance, CPU usage and so on).
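The OpenSSL flaw itself concerned low-level cache and branch behaviour, but the general idea behind side channels is easy to sketch. The following C fragment (illustrative only, with hypothetical names; it is not the OpenSSL code in question) shows a classic timing side channel, an early-exit comparison whose running time reveals how far a guess matches a secret, alongside a constant-time version that does not:

    #include <stddef.h>

    /* Leaky: bails out at the first mismatch, so the running time tells
     * an attacker how many leading bytes of the guess were correct. */
    int check_mac_leaky(const unsigned char *secret,
                        const unsigned char *guess, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            if (secret[i] != guess[i])
                return 0;
        }
        return 1;
    }

    /* Constant-time: always touches every byte, so the timing no longer
     * depends on where the first mismatch occurs. */
    int check_mac_constant_time(const unsigned char *secret,
                                const unsigned char *guess, size_t len)
    {
        unsigned char diff = 0;
        for (size_t i = 0; i < len; i++)
            diff |= secret[i] ^ guess[i];
        return diff == 0;
    }

Exploiting this sort of leak generally requires the attacker to run measurements close to the victim, which is the crux of the argument that follows.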
Discussing the issue on his blog (www.links.org/?p=245), OpenSSL guru Ben Laurie pointed out that to take advantage of this, the bad guy needs to be on the same machine, and by that stage all bets are off. Commenting that releasing a patch is not free, in terms of either developer effort or risk, he suggested it may be time to stop fixing such "theoretical" bugs.
Whether Laurie is right is not really the question; the real point is whether we can persuade the market that sometimes it is better not to rush to fix every bug.
Even the failure of safety systems isn't all bad, though. Following the mass evacuation of Chernobyl and the surrounding "hot zone", the local wildlife has flourished. The most significant advances in aviation safety often follow particularly tragic accidents. Engineer and author Henry Petroski put it best when he said we should beware the lure of success and listen to the lessons of failure. In software security, perhaps we should sometimes ignore them both.
- Nick Barron is a security consultant. He can be contacted at nikb@virus.org.