The Australian Bureau of Statistics (ABS) rebuilt its incident response ahead of the 2021 Census, deploying a new technology platform to assist with escalation, status updates and post-incident reporting.
Assistant director for internet services Stephen Wellington told PagerDuty’s Connect virtual summit late last month that the ABS had moved “from a pretty low maturity” in the way it dealt with on-call responses to unfolding incidents.
The bureau deployed PagerDuty, a software-as-a-service incident response platform, about 18 months ago and has configured it both for the upcoming Census and to improve the way internal incidents are handled.
“Before we had PagerDuty we were very much an organisation where on-call alerts would just go out via text messages to our on-call responders,” Wellington said.
“There was no visibility that they’d seen the alert or that they were working on it.”
In addition, an on-call responder was only alerted to incidents in their immediate area of responsibility: they might be alerted to a server failure, but not to a network failure that was more likely to be the root cause.
The ABS has already rebuilt the digital services running the 2021 Census, and Wellington added that incident response needed to be similarly rebuilt to avoid a repeat of the handling of the 2016 Census.
The 2016 Census was taken offline by a series of denial-of-service attacks and then kept offline by behind-the-scenes incident response failures.
Wellington said the outage and fallout from the 2016 Census highlighted “very public failings”.
“That’s led realistically … to five years of gradual improvement, and particularly being better at incident response,” he said.
“I think that one of the key learnings was just that we weren’t prepared enough. We hadn’t practiced what to do in an incident.
“It wasn’t clear when things did go wrong exactly what had to happen, who had to talk to who, and of course [with] a big public incident like that you’re straight away into the political side of things, into ministers’ offices. [You need] clear lines of what does need to happen in that situation.
“Those are the things I think we’ve really learned is [to] practice, to get that process right, and to be able to trigger those processes quickly.”
In the heat of a major incident, “the last thing you’re thinking about is going back to an incident plan and leafing through pages to see ‘who do I need to talk to now, who am I supposed to notify, what’s the next step in this process’,” he said.
The bureau is using ‘response plays’ in PagerDuty to map out and automatically activate a response.
Response plays are “packages of incident actions that can be applied at any time to an incident with just a single button click”, according to the vendor.
“This is the way to allow us to really quickly notify a bunch of stakeholders when we do have an incident,” Wellington said.
“A lot of our plans have a list of people at the start - these are the people that you need to notify.
“If you’re going through that the old-fashioned way, you’re stepping through the plan, you’re making phone call after phone call, and by the time you get through all those phone calls, you’re 15 minutes in.
“With a response play, we can have that all pre-defined - in this type of incident we need to notify these people and we’ve got the list, and so whoever’s running the incident can trigger off a response play, and those people get notified.
“So we’ve taken that 15 minutes to even get the right people onto a bridge to start talking about it down to a minute or two.”
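The idea Wellington describes, a pre-defined notification list attached to a type of incident and triggered in one step, can be illustrated with a minimal Python sketch. This is not PagerDuty's API or the bureau's configuration: the `ResponsePlay` and `Incident` classes, the `run_play` function and the `notify` callback are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ResponsePlay:
    """Hypothetical model: a named, pre-defined list of people to notify."""
    name: str
    stakeholders: list[str]


@dataclass
class Incident:
    """Hypothetical model: an incident that records who has been notified."""
    title: str
    notified: list[str] = field(default_factory=list)


def run_play(incident: Incident, play: ResponsePlay,
             notify: Callable[[str, str], None]) -> list[str]:
    """Notify every stakeholder in the play once, skipping anyone already notified."""
    for person in play.stakeholders:
        if person not in incident.notified:
            notify(person, incident.title)  # e.g. send an SMS, call or chat message
            incident.notified.append(person)
    return incident.notified
```

Whoever is running the incident would call `run_play` once with the play that matches the incident type, rather than working down a call list by hand; running the same play twice does not notify anyone a second time.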
ABS has also applied PagerDuty to incident response for its own internal, non-Census IT systems.
Wellington said that response plays had a live test last year when the ABS suffered “a fairly major storage incident that affected a lot of systems and took a good couple of days to resolve”.
“We were able to make use not just of response plays but [also] stakeholder notifications to keep updates on an incident going as we worked through that resolution,” he said.
“For me in my incident management role, as well as coordinating the response and making sure that the tech people were in and working on the problem, I was keeping that communication channel to my executive and they were feeding it up into our organisational emergency and crisis team, so I was able to keep posting those updates in PagerDuty and have those notifications go straight through to our exec as to what ... the status was.”
Wellington said that PagerDuty’s logs were also useful when the ABS pulled together a post-incident report.
“Later on when we were wanting to put together a post-incident review and a report on what had happened, we had all that information timestamped in that log in the incident in PagerDuty,” he said.
“That’s the sort of space we want to be in when we get to Census.
“Obviously where we really want to be is we have no incidents and we don’t ever have to go through it for real, but we’ve practiced our incident response a lot before then.”
From an internal perspective, Wellington said the PagerDuty platform had created some competition between teams around how quickly their on-call responders acknowledged an issue.
It also helped the ABS mix up on-call response schedules, for example, to ensure that the same person was not always on call in the week when monthly patches were applied.
“[Patch cycles] can cause a storm of alerts, and while we try and manage it with maintenance windows and things like that, it certainly is one of those ones where we do consciously try to keep an eye on, because if you have a team of four people rotating a week of on-call, and they’re in the same order, it seems to be the same person on every month when patching comes around,” Wellington said.
“That’s definitely something we try and manage, even around trying to keep those schedules a bit mixed up if we can, just so that there isn’t that same person getting hit every month type of scenario.”
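The scheduling problem can be shown with a short Python sketch. It assumes, as a simplification, that patching falls on every fourth week and that the team has four members; the function names and the shifting scheme are illustrative only and are not how PagerDuty or the ABS builds its schedules.

```python
def fixed_rotation(team: list[str], week: int) -> str:
    """Same order every cycle: the person on call depends only on week modulo team size."""
    return team[week % len(team)]


def shifted_rotation(team: list[str], week: int) -> str:
    """Shift the starting person by one each full cycle.

    Everyone still takes exactly one week per cycle, but a task that
    recurs once per cycle lands on a different person each time.
    """
    cycle = week // len(team)
    return team[(week + cycle) % len(team)]


team = ["Ana", "Ben", "Chloe", "Dev"]   # hypothetical four-person team
patch_weeks = [0, 4, 8, 12]             # assumption: patching every fourth week

fixed_on_patch = [fixed_rotation(team, w) for w in patch_weeks]
shifted_on_patch = [shifted_rotation(team, w) for w in patch_weeks]
```

Under these assumptions the fixed order puts the same person on call for every patch week, while the shifted order spreads the four patch weeks across all four team members.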
Wellington added that PagerDuty had also enabled the ABS to manage systems that were particularly “noisy” when it came to generating alerts that didn’t necessarily require an action.
He said alert rules and delays on sending alerts had been set up “so we only get those alerts from the load balancers going out to an on-call person if it really is broken”.
“They’re the sort of things where there’s a momentary drop of a couple of seconds, it doesn’t actually have a [service] impact, but it would send an alert,” he said.
“That’s definitely something we’ve used PagerDuty to try and filter that noise.”
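The effect of a delay on alerts can be sketched in a few lines of Python. This is an illustration of the general technique only, not PagerDuty's rule syntax or the bureau's settings: the `should_page` function, the sample format and the 30-second threshold used below are all assumptions.

```python
def should_page(samples: list[tuple[float, bool]], delay_seconds: float) -> bool:
    """Decide whether to page, given time-ordered (timestamp_seconds, is_healthy) checks.

    Pages only if the service has been continuously unhealthy for at least
    delay_seconds; any healthy check in between resets the clock.
    """
    down_since = None
    for timestamp, healthy in samples:
        if healthy:
            down_since = None          # recovered: forget the earlier drop
            continue
        if down_since is None:
            down_since = timestamp     # first unhealthy check in this run
        if timestamp - down_since >= delay_seconds:
            return True
    return False


# A momentary drop of a couple of seconds versus a sustained failure.
blip = [(0, True), (1, False), (3, False), (4, True)]
outage = [(0, True), (10, False), (20, False), (45, False)]

page_for_blip = should_page(blip, delay_seconds=30)
page_for_outage = should_page(outage, delay_seconds=30)
```

With a 30-second delay, the short drop never reaches the on-call person, while the sustained failure does.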