Qantas is using infrastructure-as-code and operations-as-code to run at scale in the cloud with a core platform team of just 12.
The airline used the recent AWS Summit in Sydney to reveal some of the practices it has put in place over the five years it has been running cloud-based workloads, and the results those practices have helped it achieve.
“At any one time at Qantas, we're running about 4000 instances and that's a mixture of multiple instance types and multiple operating system types,” Qantas cloud services lead Steven Tyson said.
“To date, we've launched about 2.7 million instances and we've got a significant spend across hundreds of accounts.”
Tyson said Qantas used a blue-green architectural approach to get code changes into production.
Under a blue-green model, the production environment is cloned. Live traffic runs through one environment while changes are introduced to the other, and then traffic is cut across to the updated environment.
“We perform on average 5500 blue-green releases a month over about 250 applications,” Tyson said.
“All deployments are done leveraging pure infrastructure-as-code, and we have a custom in-house abstraction layer which looks after the DNS switch to do the blue-green for us.”
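Tyson did not detail the in-house abstraction layer, but where Route 53 hosts the DNS, a blue-green cut-over can come down to a single record change submitted as a change batch. The fragment below is an illustrative sketch only - the record names and targets are hypothetical, not Qantas':

```json
{
  "Comment": "Cut production traffic across from blue to green",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [
          { "Value": "green.app.example.com" }
        ]
      }
    }
  ]
}
```

A change batch in this form is submitted with `aws route53 change-resource-record-sets`; the low TTL keeps the window during which resolvers hold the old answer short.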
The numbers themselves provide useful context around how much Qantas has been able to achieve with a small cloud platform team.
The cloud environment is known internally as QCP - Qantas Cloud Platform.
Qantas used the same conference to announce plans to migrate workloads currently run on mainframe and midrange infrastructure into the cloud by 2021 - meaning the QCP team’s workload isn’t likely to decrease any time soon.
“At 12 people, looking after such a large cloud fleet is no mean feat,” Tyson said. “We do that with some really strong practices in infrastructure-as-code, and also operations-as-code.”
Specifically, Tyson cited two key factors in Qantas’ operational success in the cloud to date.
“Qantas do cloud at a massive scale and there are two things they've done to be so successful with such a small team,” he said.
“One, they've democratised the use of cloud within Qantas for application owners and developers to deploy their own applications when they want, and two, they've ensured that 100 percent of all deployments are done leveraging declarative infrastructure-as-code.
“Everything goes through a CI/CD pipeline and we ensure that all environments are exactly the same.”
In instances where Qantas ran into “issues” scaling its use of AWS, Tyson said that an AWS service called Systems Manager had proven useful.
Systems Manager could help companies look after large and complex cloud fleets by pooling resource and application management information in a single place, according to AWS.
“Before Systems Manager, we had no real way to govern who was doing what in our EC2 fleet,” Tyson said.
Within Systems Manager, the QCP team is using Parameter Store as “a central source of truth [for] all sorts of keys, values and different pieces of data”.
“You can configure your cloud resources to query that at deployment time,” Tyson said.
“Some examples of things you can keep in the Parameter Store [include] RDS [relational database service] passwords, connection strings, a particular proxy if you need to get to certain internet endpoints, API keys, passwords.
“The benefit of having that in one central location like the Parameter Store means that some of those sensitive values aren't kept inside your repositories, so it should help increase your security levels.”
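One documented way to query Parameter Store at deployment time is a CloudFormation dynamic reference, which resolves a parameter when the stack is deployed rather than storing the value in the template or repository. The fragment below is a sketch, not Qantas' template - the `/qcp/...` parameter paths and resource properties are assumed for illustration:

```json
{
  "Resources": {
    "AppDatabase": {
      "Type": "AWS::RDS::DBInstance",
      "Properties": {
        "Engine": "postgres",
        "DBInstanceClass": "db.t3.medium",
        "AllocatedStorage": "20",
        "MasterUsername": "{{resolve:ssm:/qcp/app/db-user}}",
        "MasterUserPassword": "{{resolve:ssm-secure:/qcp/app/db-password}}"
      }
    }
  }
}
```

The `ssm-secure` form resolves a SecureString parameter without ever exposing the plaintext in the template, so the secret lives only in Parameter Store.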
The team set “very granular permissions” around the values in the Parameter Store that particular applications could access.
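Qantas did not publish its policies, but in IAM this kind of granularity is typically expressed by scoping the parameter-read actions to a path prefix per application. A hypothetical sketch - the region, account ID and parameter path are illustrative:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOwnAppParametersOnly",
      "Effect": "Allow",
      "Action": [
        "ssm:GetParameter",
        "ssm:GetParameters",
        "ssm:GetParametersByPath"
      ],
      "Resource": "arn:aws:ssm:ap-southeast-2:111122223333:parameter/qcp/app-a/*"
    }
  ]
}
```

Attached to an application's instance role, a policy in this shape lets that application read only its own subtree of the parameter hierarchy.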
It was also possible to see “a breakdown of who's been doing what inside your Parameter Store”.
“And using [AWS] CloudWatch events, you can have things that happen in your Parameter Store be injected dynamically into a central SIEM where as and when important things happen, a security analyst could be informed of what's going on,” Tyson said.
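Parameter Store emits a "Parameter Store Change" event through CloudWatch Events for exactly this purpose. A rule pattern along the following lines - a sketch, with the forwarding target to the SIEM configured separately on the rule - would match creates, updates and deletes:

```json
{
  "source": ["aws.ssm"],
  "detail-type": ["Parameter Store Change"],
  "detail": {
    "operation": ["Create", "Update", "Delete"]
  }
}
```

The rule's target (a Lambda function or delivery stream, for example) is what actually pushes the matched events into the SIEM for an analyst to act on.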
Tyson said the QCP team is using another feature of Systems Manager, called Documents - essentially “declarative pieces of code ... you can attach to an EC2 instance, and it'll do a certain admin function for you on that ‘box’.”
“Within Qantas, we don't really like baking users onto servers so we try not to - and in some cases, we have servers that might not have any users on them at all. These might be servers that are part of an immutable model where they're inside an ASG,” Tyson said.
An ASG or Auto Scaling Group is a collection of EC2 instances configured to scale automatically as load increases.
However, Tyson noted, “there are odd occasions when human beings do need access to these boxes, and we have a 'break glass' access system to enable human access.”
“Leveraging [Systems Manager] Run Command and Documents, a human being can enable RDP and SSH access on an EC2 instance and add a new local administrative user to start a session to that box,” he said.
“Once they're done [with] whatever forensic work they need against that server, they can log off that box, and then run a subsequent workflow to delete those credentials.
“That means that credentials are added on-demand for human access, and again, it can be integrated into a SIEM so that as and when human access is enabled, an analyst could be informed.”
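Qantas did not share its break-glass documents, but the Linux side of such a workflow could be expressed as an SSM Command document along these lines. The document name, user handling and commands below are illustrative assumptions only; a Windows equivalent would use `aws:runPowerShellScript` to enable RDP, and a matching clean-up document would remove the user again:

```json
{
  "schemaVersion": "2.2",
  "description": "Break-glass sketch: add a temporary local admin user",
  "parameters": {
    "userName": {
      "type": "String",
      "description": "Temporary local user to create"
    }
  },
  "mainSteps": [
    {
      "action": "aws:runShellScript",
      "name": "addBreakGlassUser",
      "inputs": {
        "runCommand": [
          "useradd -m {{userName}}",
          "usermod -aG sudo {{userName}}"
        ]
      }
    }
  ]
}
```

A document like this is executed against a specific instance via Run Command (`aws ssm send-command --document-name ...`), which is also what produces the audit trail that can be fed to a SIEM.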
Setting a patch baseline
Tyson said the QCP team is also exploring another component of Systems Manager - Patch Manager.
Patch Manager “automates the process of patching managed instances with both security related and other types of updates”, according to AWS documentation.
“We're currently using Patch Manager with maintenance windows in a very small MVP [minimum viable product] at Qantas, and this is a fantastic tool,” Tyson said.
“Leveraging this, you can set a patch baseline on an AWS account and say 'this is the minimum acceptable patch baseline for an instance in this AWS account', and then you can configure your EC2 instances to scan themselves against that baseline and tell you if they're compliant or not.
“In the event they're not compliant, you can leverage maintenance windows to say 'Patch this instance at this time. And when you do it, stop this daemon, install patches, change config file, reboot, restart daemon' - all through declarative infrastructure-as-code.
“So it's very powerful and we're going to keep on using that tool, especially in some of our more long-lived instances.”
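In CloudFormation, a baseline of that kind can be declared as an `AWS::SSM::PatchBaseline` resource. The fragment below is a sketch under assumed values - the name, patch group and approval rules are illustrative, not Qantas' configuration:

```json
{
  "Resources": {
    "QcpLinuxBaseline": {
      "Type": "AWS::SSM::PatchBaseline",
      "Properties": {
        "Name": "qcp-linux-baseline",
        "OperatingSystem": "AMAZON_LINUX_2",
        "PatchGroups": ["qcp-long-lived"],
        "ApprovalRules": {
          "PatchRules": [
            {
              "ApproveAfterDays": 7,
              "ComplianceLevel": "CRITICAL",
              "PatchFilterGroup": {
                "PatchFilters": [
                  { "Key": "CLASSIFICATION", "Values": ["Security"] },
                  { "Key": "SEVERITY", "Values": ["Critical", "Important"] }
                ]
              }
            }
          ]
        }
      }
    }
  }
}
```

Instances tagged into the `qcp-long-lived` patch group would then scan themselves against this baseline, with a maintenance window scheduling the actual patch run.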