An interim report into the failure of the UK National Air Traffic Services (NATS) flight control systems has pinpointed the root cause of the mishap as a server failure caused by incorrect software programming.
The system failure caused chaos in British airspace, with 120 flights cancelled and hundreds more delayed during December, one of the busiest travel times of the year.
A programming error caused the System Flight Server (SFS) to shut down, the report by an independent inquiry panel found.
In particular, a look-up table for "Atomic Functions", the unique identifiers used to ensure the SFS supplies correct data to air traffic control workstations, was incorrectly coded with a maximum capacity of 151 entries rather than the correct 193, the report noted.
As a controller workstation entered a watching mode of operation, the SFS checked that this command was valid. According to the report, that task involved creating a list of active Atomic Functions.
Since 153 Atomic Functions were in use at the time of the incident, two more than the incorrectly coded limit of 151, the primary SFS decided it had exceeded the maximum system capacity, a situation that should not occur.
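The capacity check described in the report can be illustrated with a minimal sketch. This is not the SFS code, which has not been published; the function and constant names are hypothetical, and only the figures 151, 193 and 153 come from the report.

```python
# Illustrative sketch only: names are hypothetical, figures are from the report.
CODED_CAPACITY = 151     # limit the report says was coded into the table
CORRECT_CAPACITY = 193   # limit the report says should have been coded


class CapacityExceeded(Exception):
    """Raised when the active list would not fit in the look-up table."""


def build_active_list(active_ids, capacity):
    """Collect the active Atomic Functions, refusing to exceed capacity."""
    active = list(active_ids)
    if len(active) > capacity:
        raise CapacityExceeded(
            f"{len(active)} active entries, capacity {capacity}"
        )
    return active


in_use = range(153)  # 153 Atomic Functions were in use, per the report

try:
    build_active_list(in_use, CODED_CAPACITY)
    outcome = "accepted"
except CapacityExceeded:
    outcome = "rejected"
```

With the limit coded as 151 the check fails, while the same 153 entries would have fitted comfortably under the intended limit of 193.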
The SFS is believed to run on an IBM ESA/390 mainframe, which is programmed to shut down to avoid supplying corrupt data to controller workstations when a fault like the above happens.
When the secondary fail-over SFS attempted to take over to continue operations, the workstation command to enter watching mode was replayed and the same error occurred.
With both SFS operations channels down, the entire air traffic control system failed, and flights at UK airports were prevented from taking off as a safety precaution.
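Why the fail-over offered no protection can be shown with a short simulation. It is a sketch under stated assumptions rather than a model of the real system: the Channel class, the command format and the run_with_failover helper are all hypothetical, and the only behaviour taken from the report is that a channel shuts down on the fault and that the same command is replayed on the next channel.

```python
# Illustrative simulation only: class and function names are hypothetical.
class ChannelDown(Exception):
    """Raised when a channel shuts itself down."""


class Channel:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.up = True

    def handle(self, command):
        # Shut down rather than risk supplying corrupt data.
        if command["active_functions"] > self.capacity:
            self.up = False
            raise ChannelDown(self.name)
        return f"{self.name} accepted {command['name']}"


def run_with_failover(channels, command):
    """Try each channel in turn, replaying the same command after a failure."""
    for channel in channels:
        try:
            return channel.handle(command)
        except ChannelDown:
            continue  # the identical command goes to the next channel
    return None  # every channel is down


primary = Channel("primary", capacity=151)
secondary = Channel("secondary", capacity=151)
command = {"name": "enter watching mode", "active_functions": 153}

result = run_with_failover([primary, secondary], command)
```

Because both channels carry the same limit and receive the same command, the secondary fails for exactly the reason the primary did, leaving result as None and both channels down.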
The report said air traffic controllers continued to have up-to-date radar images of aircraft, and could communicate with them.
However, they had no electronic assistance to predict, monitor and detect conflicts between aircraft, and could not use the system to coordinate transfers of flights between sectors. Instead, controllers relied on their experience and expertise, and used phones to coordinate aircraft.
The proximate cause of the failure was a combination of a latent SFS software defect, likely present since the program was coded in the 1990s, and a system change late last year that increased the number of Atomic Functions, the report said. The failure was triggered when a workstation was put into watching mode.
The inquiry will look into the design of the system and investigate why the software problem was not handled where it occurred and instead triggered the shutdown of the SFS channels. It will also examine why the system is designed to automatically replay commands upon SFS failure, an action that caused the back-up channel to shut down as well.