Travel industry IT giant Amadeus has boosted its investments in data centre redundancy in the wake of uncharacteristic system outages over the past two years.
Amadeus provides outsourced IT solutions and a reservation engine for some 100 airlines, including Qantas and Virgin in Australia, and considerably more hotel, cruise ship and rail operators.
The company was founded in the 1990s by Lufthansa, Air France, Iberia and SAS and competes with the likes of Sabre, Galileo and Navitaire.
Amadeus has long touted a 99.9 percent availability record, dating back to when it first began offering services from its Germany-based data centre in the early 1990s.
But this record has been savaged in 2011 and 2012.
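For a sense of scale, a 99.9 percent availability figure leaves very little room for downtime. The short calculation below is only an illustration of that arithmetic; the function name and the choice of a 365-day year and a 30-day month are assumptions, not anything Amadeus has published.

```python
# Illustrative arithmetic only: how much downtime a given availability allows.
# The 365-day year and 30-day month are simplifying assumptions.

def allowed_downtime_hours(availability_percent: float, period_hours: float) -> float:
    """Return the hours of downtime permitted over a period at a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return (100.0 - availability_percent) / 100.0 * period_hours

HOURS_PER_YEAR = 365 * 24
HOURS_PER_MONTH = 30 * 24

per_year = allowed_downtime_hours(99.9, HOURS_PER_YEAR)
per_month_minutes = allowed_downtime_hours(99.9, HOURS_PER_MONTH) * 60

print(f"{per_year:.2f} hours of downtime per year")
print(f"{per_month_minutes:.1f} minutes of downtime per month")
```

On those assumptions the budget works out to under nine hours a year, or roughly three quarters of an hour a month.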
This week iTnews took a tour of Amadeus' data centre in Erding, Germany — touted as the largest civil data centre in Europe — to discover how the company has responded to the outages.
A complex challenge
A look at the sheer number of transactions Amadeus processes on any given day gives some insight into the technical challenge it faces to keep systems up.
Amadeus’ systems cope with a peak of over a billion queries a day, leading to 3.7 million complete travel bookings. As new airlines in growth markets like Asia sign on to use the community-based system, those numbers grow steadily larger.
Prior to the growth of the internet, when Amadeus customers were predominantly travel agents, most queries to the system resulted in a transaction.
But over time, as consumers have increasingly been empowered to check and book travel directly over the web, the ratio of queries to bookings has skyrocketed. Customers are searching not just on availability but also on price and other factors.
The ratio has grown from 27 queries for every booking in 1996 to 270 queries per booking today.
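The peak volumes quoted above are consistent with that ratio. The sketch below simply divides the reported figures; the names are illustrative, and "over a billion" is taken as exactly one billion, which is an assumption.

```python
# Figures quoted in the article. "Over a billion" is rounded down to exactly
# one billion here, which is an assumption made for the sake of the arithmetic.
PEAK_QUERIES_PER_DAY = 1_000_000_000
BOOKINGS_PER_DAY = 3_700_000

def queries_per_booking(queries: int, bookings: int) -> float:
    """Return the number of queries issued for each completed booking."""
    if bookings <= 0:
        raise ValueError("bookings must be positive")
    return queries / bookings

ratio_today = queries_per_booking(PEAK_QUERIES_PER_DAY, BOOKINGS_PER_DAY)
growth_since_1996 = 270 / 27  # ratio quoted for today versus 1996

print(f"{ratio_today:.0f} queries per booking")
print(f"{growth_since_1996:.0f}x growth since 1996")
```

In other words, the system now does about ten times as much search work for each booking it completes as it did in 1996.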
This puts an incredible strain on IT resources. Every aspect of Amadeus’ IT footprint — be it server, storage or network — has had to remain at the bleeding edge to cope with this growth.
Managing that change process has become the primary obsession at Amadeus’ huge processing centre.
Downtime is clearly a subject of considerable embarrassment and anxiety for executives and engineers at the expansive facility. Outages are antithetical to both Amadeus’ considerable investments in technology and its culture of strict control processes.
Every one of the 500+ staff members at the Erding facility, for example, has been trained in ITIL processes, including those in administrative and support functions.
These control processes underpin the 4000 IT changes Amadeus makes per month — be it small component upgrades in Amadeus’ reservations software, the patching of operating systems or physical changes like a network connection.
“Amadeus has a sophisticated change process,” said Matthias Koll, infrastructure manager at Amadeus' data centre.
“Move a piece of equipment without telling anybody and you can assume it’s your last day at Amadeus.”
Any change is tested on standby systems before going into production, and standby systems are regularly spun into operation during maintenance cycles to ensure they’ll be working when relied on.
And still, as Murphy’s Law dictates, outages occur.
“You can prepare for most situations,” Koll said.
“You can take the most scientific approach, but still if something can happen, something will.”
One outage in 2011 was blamed on the failure of a primary disk. A secondary back-up disk immediately spun up to take the load, but then suffered the same degradation.
Last month's short outage was found to be caused by a bug in the Linux operating system.
Similarly, Amadeus thoroughly tested the introduction of a new switch in the data centre in January, only to discover a bug in Cisco's operating system after the switch was in production.
“In a data centre this large, there is a limit to how much you can test,” Koll said.
“You test a new piece of equipment in the context of a few customers, and it looks good. But then in production there are 3000 customers every minute.”
The company is making considerable investments in new tools to keep systems available.
Amadeus unveiled a 9.3 percent, €16.3 million ($AU18.97 million) increase in net indirect costs in its half-year report last week, attributed specifically to a “higher investment in our data centre in Erding to ensure a sustained level of maximum reliability”.
The company has invested in a 'multi-peak' architecture that “groups individual clients into distinct isolation groups in order to avoid large impacts to all clients in case of scheduled or non-scheduled down-times”, a company spokesman told iTnews.
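The general idea of isolation groups can be sketched in a few lines. The example below is a hypothetical illustration, not Amadeus' actual design: the number of groups, the hash-based assignment and the client names are all assumptions. It shows only that when clients are partitioned, an outage in one group leaves the clients in the other groups untouched.

```python
import hashlib
from collections import defaultdict

NUM_GROUPS = 4  # hypothetical; the article does not say how many groups exist

def isolation_group(client_id: str, num_groups: int = NUM_GROUPS) -> int:
    """Map a client to a group using a stable hash, so the assignment is
    the same on every run and on every machine."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_groups

def affected_clients(clients, failed_group, num_groups=NUM_GROUPS):
    """Return only the clients whose isolation group has failed."""
    return [c for c in clients if isolation_group(c, num_groups) == failed_group]

# Made-up client identifiers standing in for airlines on the platform.
clients = [f"airline-{i:03d}" for i in range(100)]

groups = defaultdict(list)
for client in clients:
    groups[isolation_group(client)].append(client)

hit = affected_clients(clients, failed_group=0)
print(f"{len(hit)} of {len(clients)} clients affected by an outage in group 0")
```

A real deployment would presumably assign clients deliberately, for instance by size or region, rather than by hash; the hash is used here only to keep the sketch short and deterministic.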
Further, the company is determined to learn something from every outage.
A dedicated post-mortem room is attached to the operations centre for a detailed breakdown of what went wrong after any incident — regardless of whether customer operations were impacted.
The facility is also audited three times a year by an external company to make sure everything is technically up to scratch.
But even when the organisation is as technically prepared as it can be for an incident, processes can be sharpened.
Sometimes a post-incident report might recommend that certain employees be located closer to certain controls. One outcome of such a report is that Amadeus' operations centre now has a central podium where a “conductor” can steer the response to a crisis with clear visibility and authority.
“We always learn something,” Amadeus' spokesman said.
“What you learn is how to cope with unexpected and exceptional circumstances.”