ANZ Banking Group is set to “heavily” embrace site reliability engineering in the way it operates its forthcoming connected banking platform, now being built under the ANZx transformation.
The transformation’s general manager of technology Chris Venter revealed that he - and other ANZx staff - visited Google’s Sunnyvale campus, where site reliability engineering (SRE) was “a big part of [the] discussion.”
As a discipline, SRE traces its roots to Google in the early 2000s. It uses software to resolve operational challenges such as “availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning”, according to Google documentation.
“My team and myself recently went to Sunnyvale and we spent some time with Google, and a big part of our discussion was on Site Reliability Engineering,” Venter said.
“We got to learn a little bit about how Google does Site Reliability Engineering, and it's a concept we're going to be embracing heavily as we build out our capabilities.”
The bank also met with Google developer advocate Kelsey Hightower on his return to Australia last month. ANZ had first hosted Hightower nine months earlier, when he spent time with the bank’s engineering squads “learning about our technology landscape and some of our engineering practices”.
Hightower said he had seen improvements in ANZ’s engineering and cloud maturity in the nine months between visits.
“Nine months ago, there were a lot of ideas, diagrams, graphs and ‘what we want to do’ or ‘what are the best practices’,” Hightower said in comments recorded at ANZ that were revealed this week.
“This time around, there's a lot more live demos. That's a big leap between ‘we have this idea’ to ‘here's what we've done’.
“I think you've [ANZ] gone from asking for the best practices to actually practicing [them].”
Hightower said ANZ’s challenge would be to “take the cultural aspects” of how Google approached SRE and to apply them in a banking context.
However, he noted that ANZ appeared to “have the foundation pieces in play” to do so.
ANZ said in recruitment advertisements this week that engineers “within the ANZx Platforms SRE squad ... will be responsible for building and running the cloud platforms of Australia's largest financial institution.”
“You will also be responsible for designing and writing software to improve the reliability of ANZ services,” the advertisement reads.
“Don't let that scare you though, we're doing things differently for the cloud.
“You will apply SRE principles to ensure we monitor and measure what is important, you will automate our platforms and services for reliable operations at scale, including automation of our security and governance controls.
“You will also assist development teams to apply the same principles to achieve quality in their releases and will be a subject matter expert on running services well on the cloud, sharing this experience with engineers across the organisation.”
'Stop the train'
Hightower said one mark of a good SRE team was having “a little bit of authority to stop the train” - essentially, the ability to call a temporary halt to activities while a performance issue is fixed.
“As a bank, the highest level SLA that you have is the customer experience: can the customer buy things? Can they see their transactions? Can they transfer money?” Hightower advised.
“When that experience is disrupted by the technology behind it, it's your SRE team that understands that part of the business and has the monitoring and metrics in place to say, 'Hey, we have to stop the train'.
“Maybe you have to pause releases for a little while until you focus on reliability. Maybe you have latency targets that affect the experience.
“Then maybe there's going to be software engineering that goes specifically to keep the experience bar where you've promised your customer.”
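One common way SRE teams formalise a "stop the train" decision is an error budget: the share of failures a service-level objective (SLO) tolerates over a window, with releases paused once that share is spent. The Python sketch below is illustrative only and does not describe ANZ's or Google's actual tooling. The journey names, the 99.9 percent target and the request counts are all hypothetical assumptions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SloWindow:
    """Request counts for one customer journey over a rolling window."""
    journey: str            # hypothetical name, e.g. "transfer_money"
    total_requests: int
    failed_requests: int


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    Assumes 0 < slo_target < 1, e.g. 0.999 for "99.9% of requests succeed".
    """
    if window.total_requests == 0:
        return 1.0  # no traffic, so no budget has been spent
    allowed_failures = (1.0 - slo_target) * window.total_requests
    return 1.0 - window.failed_requests / allowed_failures


def should_pause_releases(windows: Iterable[SloWindow],
                          slo_target: float = 0.999) -> bool:
    """'Stop the train' if any journey has used up its error budget."""
    return any(error_budget_remaining(w, slo_target) <= 0 for w in windows)


if __name__ == "__main__":
    # Hypothetical figures for two customer journeys.
    windows = [
        SloWindow("view_transactions", total_requests=100_000, failed_requests=50),
        SloWindow("transfer_money", total_requests=100_000, failed_requests=150),
    ]
    for w in windows:
        print(w.journey, round(error_budget_remaining(w, 0.999), 3))
    print("pause releases:", should_pause_releases(windows))
```

In this sketch the check is made per customer journey rather than per server, which mirrors Hightower's point that the highest-level SLA is the customer experience. A latency target could be gated the same way by counting slow requests as failures.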
Hightower also said one of the individual traits he looked for in a site reliability engineer was an appreciation for “the boring aspects of this job”.
“Reliability can sometimes be a very boring thing,” he said.
“It's not always about building new features or changing a technology because it's new.
“It's really understanding what reliability means, and doing your very best to defend it.
“So the qualities I look for are: if you observe that the system is getting slow, do you have the ability to go as deep as possible … [and] do whatever it takes to get to that true root cause analysis, step back and present that data to the business and to the team, and then follow that up with either custom tooling, improving the libraries that are being used, or maybe writing a little bit of code.
“That's the philosophy that I'm looking for is someone that can identify a problem, own it to completion and back it up with data so that it becomes the culture.”