Telstra is continuing to expand on a program of work aimed at boosting the resilience of its mostly internally-focused hybrid cloud infrastructure, and improving its incident response.
The telco is using a mix of cloud-native tools and PagerDuty’s digital operations management software, including event intelligence, for the work, IT cloud and automation technology leader Danilo Gonzalez told PagerDuty’s APAC Summit this week.
Gonzalez’s cloud enablement and infrastructure team is focused on “helping internal - and some external - customers to take advantage of cloud technologies, so they can leverage all of the benefits”.
“It is our responsibility internally to help them thrive in a secure, reliable and compliant manner, from development phase to the production phase,” he said.
“Our role in Telstra is to deliver well-engineered and resilient products and services. Easier management is critical to achieve this goal.”
Telstra runs a multi-cloud setup comprising multiple public clouds and an undisclosed number of on-premises clouds.
“When we started looking at the multi-cloud strategy as a whole, we saw that we needed visibility across [the clouds],” Gonzalez said.
That visibility was required, in part, to support a broader emphasis within Telstra on improving the resilience of its platforms and systems.
“In our team, we have the responsibility of enabling efficient ways to detect, respond and analyse platform incidents,” Gonzalez said.
“We needed greater visibility on the platform events that required intervention - or immediate intervention - from our DevOps teams, and this was critical to meeting the resiliency goals.
“When we started understanding the resilience that needed to be built in a multi-phased and multi-cloud approach, that's where we decided we needed an incident response tool … to gather all this information and act upon it.”
Gonzalez said Telstra settled on PagerDuty because it enabled faster detection and escalation of “high severity events, based on a subset of rules that we can define”.
The tool also came with out-of-the-box integrations with the monitoring, application performance, CI/CD and instant messaging tools used by the cloud enablement and infrastructure team, and with the clouds themselves.
Telstra began using event intelligence - a capability the vendor added to PagerDuty in mid-2018 - when the feature was enabled automatically.
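As an illustration of what escalation "based on a subset of rules that we can define" could look like, the following is a minimal Python sketch. It does not use PagerDuty's API and does not reflect Telstra's configuration; the rule names, event types and severity labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]  # predicate over a raw event
    severity: str


# Hypothetical rules; the article does not disclose the actual rule set.
RULES = [
    Rule(
        "public-bucket",
        lambda e: e.get("type") == "storage.acl_changed" and e.get("public") is True,
        "critical",
    ),
    Rule("node-down", lambda e: e.get("type") == "node.unreachable", "error"),
]


def classify(event: dict) -> str:
    """Return the severity of the first matching rule, or 'info' if none match."""
    for rule in RULES:
        if rule.matches(event):
            return rule.severity
    return "info"


def should_page(event: dict) -> bool:
    """Only high-severity events escalate to an on-call responder."""
    return classify(event) in {"critical", "error"}
```

Under these assumptions, an event that matches no rule is recorded as informational and never pages anyone, which is the behaviour that keeps low-value events away from responders.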
“We found that we were using event intelligence when we suddenly stopped seeing a spike of the alerts that we were receiving,” Gonzalez said.
“It was a feature that was automatically enabled for us, and that's why we saw [the sudden drop].”
He later said that the tool culled the number of actionable events his team received a month from 200,000 to about 4000.
“We have decreased the mean time to respond from hours to just a handful of minutes,” he said.
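To put the quoted event volumes in proportion, the short Python calculation below works out the size of the reduction. The two monthly figures come from Gonzalez's comments; the function name is only illustrative.

```python
def noise_reduction(before: int, after: int) -> float:
    """Percentage of events removed between two monthly volumes."""
    return (before - after) / before * 100


monthly_before = 200_000  # actionable events per month before, as quoted
monthly_after = 4_000     # "about 4000" per month after, as quoted

reduction = noise_reduction(monthly_before, monthly_after)
print(f"{reduction:.0f}% fewer actionable events")
print(f"roughly 1 in {monthly_before // monthly_after} events remains")
```

On those figures, the drop is about 98 percent, or roughly one actionable event remaining for every 50 received previously.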
Gonzalez described several use cases for PagerDuty.
One of the first was security compliance, where the tool is used to “detect security-related events and configuration changes within our cloud platforms”.
“We use an in-house security in-depth implementation using cloud native tools and PagerDuty,” Gonzalez said.
“PagerDuty was central in this. With this capability, now we are able to detect issues in near real time and act upon potential security-related activities in our cloud platforms.
“With event intelligence, we were able to minimise noise and get to the root cause faster.
“As an example, we track required configuration drifts in near real-time with PagerDuty.
“Event intelligence groups all these events and gives us the location of the assets from the event scrubbing so we can tackle those faster.”
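The grouping Gonzalez describes can be pictured with a simple key-based example. The Python sketch below is an assumption-laden illustration, not PagerDuty's grouping algorithm, which the article does not describe; the field names (asset, check, ts) and the sample asset identifiers are hypothetical.

```python
from collections import defaultdict


def group_events(events):
    """Group raw events by (asset, check) so each group can become one alert."""
    groups = defaultdict(list)
    for event in events:
        key = (event["asset"], event["check"])
        groups[key].append(event)
    return dict(groups)


# Hypothetical configuration-drift events from two cloud assets.
events = [
    {"asset": "vpc-a/sg-1", "check": "open-port", "ts": 1},
    {"asset": "vpc-a/sg-1", "check": "open-port", "ts": 2},
    {"asset": "vpc-b/bucket-9", "check": "public-acl", "ts": 3},
]

alerts = group_events(events)
for (asset, check), members in alerts.items():
    print(f"{check} on {asset}: {len(members)} event(s)")
```

Here three raw events collapse into two alerts, each carrying the location of the affected asset, which is the general effect the quote attributes to event intelligence.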
Telstra cloud platform engineer Afthab Abubacker said PagerDuty also has a role in a closed loop automation use case.
“As part of closed loop automation, what we plan to do is use PagerDuty's aggregation capabilities to aggregate events from different sources, do an event correlation and filter interesting events,” he said.
Events will either be remediated using automated scripts or added to a backlog, where a member of the team can analyse them further.
“That's something that we have thought about and we just started working on,” Abubacker said.
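The closed loop Abubacker outlines has four steps: aggregate, correlate, filter, then remediate or backlog. A minimal Python sketch of that flow follows. It is a toy model under stated assumptions: correlation is reduced to counting repeats of the same host and event type, and the event sources, event types and remediation table are hypothetical rather than anything Telstra has described.

```python
from collections import Counter

# Hypothetical mapping from event type to an automated fix.
REMEDIATIONS = {
    "disk.full": lambda e: f"rotated logs on {e['host']}",
}


def closed_loop(sources, min_count=2):
    """Aggregate events, keep repeated ones, then remediate or backlog them."""
    merged = [e for source in sources for e in source]               # aggregate
    counts = Counter((e["host"], e["type"]) for e in merged)         # correlate
    interesting = [k for k, n in counts.items() if n >= min_count]   # filter
    actions, backlog = [], []
    for host, etype in interesting:
        fix = REMEDIATIONS.get(etype)
        if fix:
            actions.append(fix({"host": host, "type": etype}))       # automated script
        else:
            backlog.append((host, etype))                            # left for a person
    return actions, backlog


monitoring = [
    {"host": "web-1", "type": "disk.full"},
    {"host": "db-1", "type": "latency.high"},
]
cloud_logs = [
    {"host": "web-1", "type": "disk.full"},
    {"host": "db-1", "type": "latency.high"},
    {"host": "web-2", "type": "cpu.high"},
]

actions, backlog = closed_loop([monitoring, cloud_logs])
print(actions)
print(backlog)
```

In this example, the event seen only once is filtered out, the repeated disk event is fixed automatically, and the repeated latency event, which has no known fix, lands in the backlog for analysis.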
Gonzalez said another use case being explored would help to improve the IT asset supply chain.
“We call [this] dynamic active inventory, and we do this by performing a single-touch IT asset discovery and configuration in a secure way,” he said.
“This will help us enable IT services in remote locations faster and more reliably, optimising the time to market and resiliency of our services.
“How PagerDuty comes into the mix [is that] it will help us manage incidents related to this capability and act upon these events.
“I see event intelligence helping us save time on analysis by providing insights on the goal.”
Telstra cloud and container engineering technology leader Vikram Nair added that PagerDuty will, in future, “play a bigger role” in Telstra’s operational IT activities.
“What we see in future is that PagerDuty will be very much a key central solution for developing hybrid digital infrastructure management capabilities within our teams,” Nair said.
“It means there's a live feed of performance counters, logs, alarms, and events coming from different network elements, which will all get aggregated and ingested into … an orchestration manager.
“What this orchestration manager will then do is make some dynamic decisions to move the workloads from a location to an alternative location, or redirect traffic to an alternate site so as to ensure that the service uptime and SLAs are met.
“To build this whole capability, PagerDuty will be a central key building block wherein we'll have full programmatic access to the events, alerts and drive intelligence out of this whole solution.”
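The kind of dynamic decision Nair attributes to the orchestration manager can be sketched as a simple placement rule. The Python example below is purely illustrative: the health metrics, the error-rate threshold and the site names are assumptions, and the article gives no detail on how such a component would actually be built.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SiteHealth:
    name: str
    error_rate: float     # fraction of failed requests in the latest window
    capacity_free: float  # fraction of spare capacity


def choose_site(current: str, sites: list[SiteHealth], max_error_rate: float = 0.01) -> str:
    """Stay on a healthy site; otherwise move to the healthy site with most spare capacity."""
    by_name = {s.name: s for s in sites}
    if by_name[current].error_rate <= max_error_rate:
        return current
    healthy = [s for s in sites if s.error_rate <= max_error_rate]
    if not healthy:
        return current  # no better option; leave the decision to a human responder
    return max(healthy, key=lambda s: s.capacity_free).name


# Hypothetical sites and readings.
sites = [
    SiteHealth("site-a", error_rate=0.05, capacity_free=0.5),
    SiteHealth("site-b", error_rate=0.001, capacity_free=0.3),
    SiteHealth("site-c", error_rate=0.002, capacity_free=0.6),
]

print(choose_site("site-a", sites))
print(choose_site("site-b", sites))
```

With these readings, a workload on the degraded site-a would be moved to site-c, while one already on the healthy site-b would stay put. A production system would also need to weigh factors such as data locality and the cost of moving, which this sketch ignores.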