A faulty line card and an incorrectly configured secondary switch have been outed as the root cause of a nationwide outage of customs systems at Australian international airports earlier this year.
The major outage, which forced Border Force officers to resort to manual processing for more than six hours, took place on April 29, creating long delays for inbound and outbound overseas travellers.
Cargo, traveller and other border systems were all affected by the outage, rendering automated processing mechanisms such as arrivals and departures smartgates unusable.
In its review of the incident, released via freedom of information laws last week, the Department of Home Affairs revealed faulty hardware had been responsible for the outage.
“The cause of the incident was identified as a hardware failure, specifically a line card on network distribution switch 1 at [redacted] data centre,” the department said.
“To restore services, the faulty card was removed and minor patch configurations performed.”
However, the department said the issues, which were “isolated to IBM”, were compounded by a “configuration issue [that] prevented failover to a second switch”.
This secondary issue was identified by IBM in its attempt to fail over to the secondary switch, which should have occurred automatically.
“This failover did not occur successfully because the static route was missing on the edge switch to route traffic to and from distribution switch two,” IBM said in its post-implementation review.
Once this had been identified, the department said “switch one was restarted again and services were moved from the failed line card to another working line card on switch one, which involved physically moving cables from the failed line card to a working line card on switch one”.
“When relocation of the cables was complete, traffic started to flow on switch one. This resolved the issue and allowed traffic to flow over the WAN,” the major incident review states.
An “emergency change” was also required to find a permanent solution for the secondary switch after another major incident was declared.
“IBM have prepared a risk assessment and recommendation to action the replacement of faulty hardware and complete a test of the failover,” the report states.
The department also said “a restart of JVM’s [Java Virtual Machines] for [redacted]” was separately required for the traveller and cargo systems, as were “some local reboots of smartgates ... to trigger connections to the network”.
Another nationwide outage affecting just departures smartgates on July 15 is also detailed in the bundle of FOI documents, though the root cause “is not yet known”.
The outage of the Travel and Immigration Processing System (TRIPS), which resulted in the unavailability of “expected movement data”, caused delays in passenger processing for more than six hours.
The department said the issue was attributed to an “authorised change ... that caused an ICT border mainframe communications device (BROKER) to fail, causing [redacted] processing to queue at the mainframe”.
However, at the time of the major incident report, the department said the root cause of the failure was unknown.
“Root cause from the BROKER failure is not yet known, however, logs from the BROKER have identified that there was some communication issue between the Adabas processes and/or between the client/server,” the major incident report states.
“There is ongoing dialogue between IBM and the vendor, Software AG, who have advised that there is a fix that has been released in June for this BROKER issue.
“Problem Management investigations are continuing.”