Macquarie's Banking & Financial Services (BFS) division has created eight “north stars” that are intended to guide the work of its 600-plus IT engineers through to 2025.
Chief engineer for wealth management in BFS, Daniel Coleman, named three of the eight “north stars” during a presentation to last month’s All Day DevOps conference.
A north star is a metric used in business to describe a long-term goal or ideal vision being targeted for achievement.
One of the north stars was revealed by BFS CTO Jason O’Connell at a Google Cloud summit in September, when he suggested Macquarie wouldn’t stop at having 100 percent of its workloads on infrastructure-as-a-service, but would instead chase a fully cloud-native environment on software-as-a-service (SaaS) or platform-as-a-service (PaaS).
It was not clear at the time that the PaaS/SaaS goal was part of a broader set of technology ambitions.
Coleman revealed this to be one of the “north stars” for engineering at BFS.
“There’s about eight north stars that we’ve got that we’re trying to achieve by 2025,” Coleman said.
“For us, it’s meant setting some aspirational targets for teams ... that highlight where we want to get to in 2025.
“An example is we’ve been pushing quite hard to get onto public cloud and we’ll be roughly around 100 percent on public cloud early next year. We’re actually challenging even our thinking there and saying we actually want to be 100 percent on PaaS or SaaS.”
Coleman went on to describe two additional “north stars” around automated testing and event streaming at scale.
“We’ve been challenging the way we’ve been doing testing,” Coleman said.
“An example of one of our north stars is we actually feel that we could automate 100 percent of our testing or get close to [that].”
BFS has dropped several clues this year about its focus on event-driven architectures, both through event brokerage upgrades and through an event-driven approach to cloud management.
Coleman indicated the event-driven architectural “north star” was about moving from batch to real-time processing.
“An architectural example of our north stars is we actually want to get to 100 percent real time [processing] or utilise event streaming technology,” he said.
“That can be quite challenging in a bank where there’s traditional pieces that are mostly batch or bulk processing, and we’re trying to move away from those things.”
Coleman said the north stars are intended to act as “lofty targets” or “aspirations” that engineers can keep “in the forefront of their minds” and use to guide their work.
A DevOps foundation
The north stars essentially offer fresh guidance to the bank’s engineers, and layer on top of an earlier set of common DevOps principles that BFS engineers adopted as part of a cultural transformation, which has further influenced their work over the past few years.
“Every business needs to define its own approach to DevOps,” Coleman said.
“There’s no ‘one size fits all’ approach. For us, this meant agreeing on core principles that we wanted our engineering teams to have at the front of their minds so that we could build a strong DevOps culture.”
The core DevOps principles at Macquarie BFS are: own things end-to-end and break down silos; accept that failure happens, and detect, recover and adapt fast; implement small changes frequently; automate routine tasks; measure everything; and share knowledge and be open to learning.
'Own it end-to-end and break down silos'
The bank broke down silos in part by flattening its organisational structure and making everyone in it an “engineer” by title.
“We wanted to bring people from different specialisations closer together and minimise the handoffs between teams to improve autonomy and velocity, so one of the first steps on our journey was we moved to an organisation structure that was flat and had an open culture where everyone is an engineer,” Coleman said.
“We moved away from specialist roles.
“Engineering isn’t just someone who writes code - a developer; it includes people who specialise in architecture, user interface design, testing, business analysis and production support. It’s a person who designs and builds things, so by definition we are all engineers and that was the mantra that we took forward on our DevOps journey.
“This cultural change to a flat and open culture allowed us to break down silos in our organisation and teams, but it also enabled team members to have more career opportunities along the way which was actually the biggest win.”
Coleman said that while “moving everyone to the role of engineer … sounds quite drastic, it was actually taken quite well” by staff.
“The reason for that is because we really explained well the drivers for this and where we wanted to be down the track … [and we] really told people what was in it for them,” Coleman said.
“As an example, to talk a bit more about what being an engineer is in our organisation, we used the ‘T shaped model’ for designing our roles.
“What that is is an engineer is likely to have a deep expertise, that’s the vertical of the T, and depth in a number of different things e.g. software development, programming, but then they’ve got other skills they can use like testing, QA, security, ops, or business analysis.
“What we did is explain that T-shaped model well and the opportunities that would come for our engineers over time.”
Coleman said that engineers were afforded skills development opportunities to aid the transition.
One outcome is that more engineers on staff are able to handle a broader variety of tasks.
“For example, to move one of our databases to the cloud, in the past that might be limited to just a database administrator, [whereas] what we did was all engineers were able to pick up any aspect of that project,” Coleman said.
“We actually saw really good outcomes where team members were putting up their hand to pick up tasks that they didn’t quite have the knowledge but they wanted to learn, and then over time it actually enabled us to move quite quickly because those team members were able to share their learnings, pick up broader tasks and get diversity in the type of things they were doing.”
Also covered by this principle, Coleman said BFS wanted its engineers "to take strong ownership of their full stack [from] top to bottom, even if they didn’t directly own or manage the services they depended on."
"In banking this can be challenging as there’s many external dependencies and systems, so we created space for teams to address some of the challenges directly and own them," he said.
"[It also] means our engineers are empowered to make their own architecture decisions and accountable for these."
As part of this approach, Coleman said that BFS had adopted a mantra "that production is the number one feature, and our teams are empowered to prioritise production items ahead of new features", which he said had led to substantial uptime gains.
"Our CIO summarises it well - no one calls our call centres for a new feature or function. No one gets on the phones to complain how come this [feature] isn't here. They call if something isn’t working, if something’s broken, if they’re trying to do their day to day financial business or they’re trying to put a trade in and it’s not working, or they’re trying to send money to a family member or whatever it is," Coleman said.
"Those are the things that cause us a toil and obviously frustration for our customers.
"It’s taken really explaining that to our business stakeholders and everyone is aligned to that, and it’s actually created a really good prioritisation model for teams to look at.
"Anything that’s going to cause a production issue or something that’s not quite right in prod that needs to be fixed or maybe one day if not addressed can cause an issue, or if there’s something that’s slow and causes a bit of frustration or potential calls, that needs to go to the top.
"Engineers really like that because they want these things solved because what usually happens is they get support tickets or they get calls in the nights, which impacts their weekends.
"[It's also worked well for] our business - we don’t like to get complaints."
'Failure happens - Detect, recover and adapt fast'
Coleman said that Macquarie BFS has set up a “truly blameless culture” internally, a common pursuit of site reliability engineering (SRE) and DevOps teams generally.
“Through creating a truly blameless culture at Macquarie, our BPMs - blameless post mortems - have a strong level of engagement as everyone is keen to learn from failure, and we arrive at stronger outcomes with teams challenging how we can improve,” he said.
“We spent time bringing BPMs into our business and creating a safe space for teams. All teams were trained on facilitating BPMs and the principles.
“BPMs are now automatically created and teams are sent reminders to implement the actions.”
Coleman said teams were also supported with realistic service-level indicators (SLIs) and objectives (SLOs), with automated systems and bots designed to detect when thresholds for performance aren't met.
“We utilised bot frameworks and set up automated incident creation along with ways for teams to acknowledge [incidents],” Coleman said.
“Incident announcements and the automation piece allowed us to share instantaneously with our business ongoing or inflight incidents so that our call centres, our relationship managers and technical teams could be on the front foot working with customers.”
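The detection loop Coleman describes can be illustrated with a minimal Python sketch. It is not Macquarie's implementation: the `Slo` and `Incident` types, the success-rate indicator, and the announcement format are all hypothetical names chosen for illustration, and a real bot would post to chat and paging systems rather than return a string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slo:
    """Hypothetical service-level objective on a success-rate indicator."""
    service: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


@dataclass
class Incident:
    """Hypothetical incident record that a team member can acknowledge."""
    service: str
    observed: float
    target: float
    acknowledged_by: Optional[str] = None

    def acknowledge(self, engineer: str) -> None:
        self.acknowledged_by = engineer


def success_rate(good: int, total: int) -> float:
    """SLI: fraction of successful requests. No traffic is treated as healthy."""
    return 1.0 if total == 0 else good / total


def check_slo(slo: Slo, good: int, total: int) -> Optional[Incident]:
    """Create an Incident when the indicator falls below the objective."""
    observed = success_rate(good, total)
    if observed < slo.target:
        return Incident(slo.service, observed, slo.target)
    return None


def announce(incident: Incident) -> str:
    """Format the message a bot might share with business and technical teams."""
    return (
        f"[INCIDENT] {incident.service}: success rate "
        f"{incident.observed:.2%} is below the {incident.target:.2%} objective"
    )
```

Keeping the threshold check separate from the announcement means the same incident record can be acknowledged by an engineer and later feed a post mortem, which is one plausible way to connect the automation steps mentioned in the talk.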
'Implement small changes frequently'
On making small changes frequently, Coleman said that engineers focused on “test and learn” and gathering faster feedback.
“One example of implementing small changes frequently is our experimentation platforms, which allow teams to trial production features, to learn from feedback and to build the best products we can for our customers,” he said.
“We have a number of dark launch environments where teams can safely build in lab environments which is where our own staff can be the first to trial new products.
“This is followed by a beta which is by invitation, and we have around 1000 customers utilising our beta environment where they can provide feedback in real-time in order for us to improve the product before wider rollout, followed by safe release to production following a canary release pattern.
“Through this process and setup we are releasing over 100 times a day on our digital platform.”
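The staged rollout described here (staff first, then invited beta customers, then a canary slice of all customers) can be sketched as a simple exposure check. This is an assumption-laden illustration in Python, not BFS's experimentation platform: the `Stage` names, the `can_see_feature` function and the hash-based bucketing are all hypothetical.

```python
import hashlib
from enum import Enum


class Stage(Enum):
    """Hypothetical rollout stages, ordered from narrowest to widest audience."""
    LAB = 1      # staff only
    BETA = 2     # staff plus invited customers
    CANARY = 3   # the above plus a percentage of all customers
    GENERAL = 4  # everyone


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in 0..99 so canary membership is sticky."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def can_see_feature(user_id: str, is_staff: bool, is_beta_invitee: bool,
                    stage: Stage, canary_percent: int = 0) -> bool:
    """Decide whether a user is exposed to a feature at the given stage."""
    if stage is Stage.GENERAL or is_staff:
        return True
    if stage is Stage.LAB:
        return False
    if is_beta_invitee:
        return True
    if stage is Stage.BETA:
        return False
    # CANARY: expose only users whose bucket falls under the current percentage.
    return bucket(user_id) < canary_percent
```

Hashing the user ID rather than picking users at random keeps each customer's experience consistent between requests while the canary percentage is raised.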
BFS is backing this with considerable investment into continuous delivery (CD), which Coleman described as “a first class concern as it reduces our risk and frees up teams to focus on business value.”
“Continuous delivery is one of the largest enablers for improving cycle time and productivity,” he said.
“Further to this, we identified a need to streamline the engineering experience, providing the best experience we could for our engineers.
“With 600-plus engineers we needed to make sure we have the best tooling and make it easily discoverable for all to use, so we created a dedicated team focused on this.
“This enabled us to be more efficient with the goal of increasing developer joy. There’s no better feeling for an engineer than having a computer that just works when it comes time to start a project or make a change.”
'Automate routine tasks'
On automating routine tasks, Coleman said BFS had pursued a “pipelines-as-code” initiative where it effectively “provided CI/CD for [its] CI/CD pipeline.”
“The concept of CI/CD for CI/CD was really set up to address a problem that we started to see where a lot of engineers were building continuous delivery pipelines for their technology stacks, and with 100 teams within BFS in Macquarie we saw multiple pipelines getting created and team members kind of doing much of the same thing,” Coleman said.
“What we wanted to do was bring that together.”
Coleman said that BFS had also looked to automate patching and updating code libraries. “Our teams are able to utilise a bot to ... patch environments across their fleet,” he said.
“This enables them to stay up-to-date with the latest security advice and ensure that all our environments are patched to the right levels.”
'Measure everything'
On measuring everything, Coleman said the bank already backed data-driven approaches “to everything” in its operations.
“All teams should know how their applications are performing and the key metrics about their platforms,” Coleman said.
“Having this data allows engineers to make informed decisions and guide the product performance and reliability, which are all critical for an excellent customer experience.”
Coleman said that monitoring and log metrics for upstream and downstream systems are open for examination by any engineer.
This had led to performance improvements in business- and customer-facing applications that were recognised by the heads of different business units.
Coleman said that the principle of measuring everything also “extends beyond just driving improvements.”
“It has allowed us to perform multicloud migrations with confidence,” he said.
“In this example our teams set up a multicloud Kubernetes platform. The investment in monitoring and logging has allowed us to move any amount of traffic between AWS and Google Cloud Platform dynamically. We are able to move across a small percentage of traffic, monitor it and dial up or down based on the feedback that we were receiving in real-time. Utilising this data allowed us to quickly validate and learn.
“Our logging and monitoring was also used for comparison monitoring of the two environments. We looked across multiple layers from the database to the microservices to the API gateway and everything in between, and we looked at differences across the HTTP metrics, network latency and application error logs over time.
“This allowed us to perform the large-scale migration of our core platforms with ease and with zero incidents, which again is quite an amazing achievement.”
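The "move a small percentage, monitor, dial up or down" loop can be sketched in a few lines of Python. The sketch assumes that every metric is "lower is better" (latency, error rate) and that traffic is shifted in fixed steps; the function names, the 5 percent tolerance and the 10-point step are illustrative assumptions, not details from the talk.

```python
from typing import Dict, List


def regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                tolerance: float = 0.05) -> List[str]:
    """Names of metrics where the candidate environment is worse than the
    baseline by more than `tolerance` (relative). Lower values are better."""
    worse = []
    for name, base in baseline.items():
        value = candidate.get(name)
        if value is not None and value > base * (1 + tolerance):
            worse.append(name)
    return sorted(worse)


def next_weight(current: int, baseline: Dict[str, float],
                candidate: Dict[str, float], step: int = 10) -> int:
    """Percentage of traffic to send to the new environment next.

    Dials down by `step` if any metric regressed, otherwise dials up,
    always staying within 0..100.
    """
    if regressions(baseline, candidate):
        return max(0, current - step)
    return min(100, current + step)
```

For example, with a baseline of `{"p95_latency_ms": 100, "error_rate": 0.01}` and a candidate reporting `{"p95_latency_ms": 104, "error_rate": 0.02}`, latency is within tolerance but the error rate is not, so a current weight of 20 percent would be dialled back to 10.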
'Share knowledge and be open to learning'
“In a team where there’s over 600 engineers, more often than not another team or engineer is facing the same problem as you, so sharing knowledge saves time but it also enhances learning and understanding,” Coleman said.
“We [also] ran an internal cross-skilling and professional development program for all our engineers called ‘Expand your T’, which is a reference to being a T-shaped engineer. This is led 100 percent by our engineers.”
Coleman added that Macquarie had also set up an internal cloud and DevOps ‘guild’ that had seen over 1000 staff so far obtain cloud certifications in AWS and GCP.