Bunnings Group has undergone a transformation of the incident response mechanisms around its APIs and website, with the aim of reducing the time taken to identify and resolve problems.
Principal developer Eric Chapman told the PagerDuty Summit that Bunnings was “seeking observability” into its technology stack as well as ways to “visualise [its] product health”.
The retailer also wanted to “grow a culture of ownership” internally, where the engineer responsible for a product, or for making a change to it, could be quickly notified if and when a system issue arose.
“It's safe to say we have a lot of product changes, and a lot of new lines of dependencies between products are emerging all the time,” Chapman said.
Bunnings first set about improving incident response for the APIs that its website relies on.
Incident management as a discipline had always existed internally, but there were technical gaps in the way it was run.
“Our team really does come together to solve problems. Team members from all areas will assist in whatever means are at their disposal to help address an incident,” Chapman said.
“We have a lot of logging, we have a lot of people available to support if anything should go wrong, but we had no easy way to alert them to an issue without somebody actually reviewing the logs and understanding them and who to actually call.
“What we were seeking to do is to reduce the time it takes for an incident to reach the correct engineer.”
Bunnings also wanted a way to map interdependencies between systems, measure mean time to identify (MTTI) and mean time to resolution (MTTR), and keep its “ITSM [IT service management] and APM [application performance management] tools up-to-date with the latest information in regards to an incident.”
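As a rough illustration of the two metrics, the sketch below computes MTTI and MTTR from a list of incident timestamps. It is not Bunnings' or PagerDuty's implementation: the `Incident` record, the function names and the sample timestamps are all hypothetical. It also assumes both metrics are measured from the moment the incident started, which is only one of several definitions in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    started: datetime     # when the fault began
    identified: datetime  # when the correct engineer was notified
    resolved: datetime    # when service was restored


def mean_minutes(durations):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def mtti_minutes(incidents):
    """Mean time to identify, measured from incident start (assumed definition)."""
    return mean_minutes([i.identified - i.started for i in incidents])


def mttr_minutes(incidents):
    """Mean time to resolution, measured from incident start (assumed definition)."""
    return mean_minutes([i.resolved - i.started for i in incidents])


# Made-up sample data, purely for illustration.
t0 = datetime(2024, 1, 1, 9, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=40)),
    Incident(t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=80)),
]

print(mtti_minutes(incidents))  # 15.0
print(mttr_minutes(incidents))  # 60.0
```

Shortening the time it takes for an alert to reach the right engineer shows up directly in the first figure, and indirectly in the second.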
“The original solution to that was to create alerts within our ServiceNow [ITSM] platform, stating who is to be called for that alert, but that's not very scalable,” Chapman said.
“We went out to market and came across PagerDuty, and after completing the usual corporate procurement dance, we received the keys and got straight to work.”
Bunnings has created “tight integrations” between PagerDuty and other systems used for incident response.
The retailer uses PagerDuty to help “correlate issues by recognising that there might be a similar issue at the same time” as the one that triggered an alert and an investigation.
“It really also assists in narrowing down the correct engineer [to send the problem to],” Chapman said.
He added that “PagerDuty has helped [Bunnings] become a lot more confident in our services and provided us with a truth of source from an engineering lens on a technical service and its status.”
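The idea of correlating issues that occur “at the same time” can be sketched as simple time-window grouping. The example below is an assumption-laden illustration, not a description of how PagerDuty correlates alerts: the `correlate` function, the five-minute window, the service names and the timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def correlate(alerts, window=timedelta(minutes=5)):
    """Group (timestamp, service) alerts by time proximity.

    An alert joins the current group if it arrives within `window` of the
    previous alert; otherwise it starts a new group. This is a naive
    heuristic chosen for illustration only.
    """
    groups = []
    for alert in sorted(alerts, key=lambda a: a[0]):
        if groups and alert[0] - groups[-1][-1][0] <= window:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


# Made-up alerts: two close together, one much later.
t0 = datetime(2024, 1, 1, 9, 0)
alerts = [
    (t0, "pricing-api"),
    (t0 + timedelta(minutes=30), "search-api"),
    (t0 + timedelta(minutes=3), "website"),
]

for group in correlate(alerts):
    print([service for _, service in group])
```

In this sketch the two alerts raised three minutes apart land in one group, so a single engineer could be paged for both rather than two engineers investigating separately.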
Chapman said it was “still … early days” for PagerDuty, with three squads within engineering and the service desk making use of the tool to date.
However, he flagged the possibility of further expansion in the use of the tool.
“For the future, we're really going to start rolling this out across the other teams outside of engineering and service desk over the coming months,” Chapman said.