Fashion retailer The Iconic is no longer running quality assurance as a separate function within its software development process, having shifted QA responsibilities directly onto developers.
All existing quality assurance workers were recently moved into engineering roles within the business.
“We decided: we’ve got all these [developers] who are [coding] every day, and they’re testing their own work - we don’t need a second layer of advice on it,” head of development Oliver Brennan told the New Relic FutureStack conference in Sydney today.
“It just makes people lazy.”
The shift to making product teams “responsible for everything” was partly facilitated by a migration to the public cloud two years ago.
Product teams are now fully cross-functional and include “everything they need”, from developers to designers, data engineers, business analysts, and product owners.
But the ultimate responsibility for quality assurance of what's created lies with the software developers.
Such a move has obvious potential to create problems should a developer drop the ball. To minimise the impact of any unforeseen issues on customers, The Iconic introduced feature toggles, which allow developers to turn off troublesome functionality without having to deploy new code.
Every new feature that goes into production must now sit behind one of these toggles, which dictates whether a user is served the new or old version of the feature in question.
The error rates between the new and old versions are then monitored for any discrepancies.
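The mechanism described above can be sketched roughly as follows. This is a minimal, generic feature-toggle illustration, not The Iconic's actual implementation; the class and function names are hypothetical, and a production system would read the rollout percentage from configuration rather than code:

```python
import random

class FeatureToggle:
    """Minimal in-memory feature toggle: serves the new code path to a
    configurable percentage of requests and the old path to the rest."""

    def __init__(self, name, rollout_percent=0):
        self.name = name
        self.rollout_percent = rollout_percent  # 0 means everyone sees the old version

    def is_enabled(self):
        # Randomly bucket each request according to the rollout percentage.
        return random.random() * 100 < self.rollout_percent

def new_image_pipeline():
    return "new"

def old_image_pipeline():
    return "old"

def render_product_page(toggle):
    # The toggle decides which implementation serves the request, so
    # disabling the new version is a config change, not a redeploy.
    if toggle.is_enabled():
        return new_image_pipeline()
    return old_image_pipeline()
```

Because both code paths stay deployed, flipping the toggle off instantly routes all traffic back to the old version while the team investigates.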
“We’re [also] in the process of rolling out new tools that say 'we’ve released this feature, slowly increase the percentage [of usage], and if the error rate is increasing, roll back to the old feature so we can fix the errors,'” Brennan said. The capability is built on New Relic's monitoring platform.
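The gradual-rollout logic Brennan describes amounts to a simple control loop: compare error rates between the two versions, increase exposure while they match, and drop back to zero when they diverge. A rough sketch, with hypothetical function names and an assumed absolute error-rate tolerance:

```python
def should_roll_back(old_errors, old_requests, new_errors, new_requests,
                     tolerance=0.01):
    """Signal a rollback when the new version's error rate exceeds the
    old version's by more than `tolerance` (an assumed threshold)."""
    old_rate = old_errors / old_requests if old_requests else 0.0
    new_rate = new_errors / new_requests if new_requests else 0.0
    return new_rate > old_rate + tolerance

def next_rollout_percent(current, old_stats, new_stats, step=10):
    """Step the rollout percentage up, or back to 0 if errors climb.
    `old_stats` and `new_stats` are (errors, requests) tuples."""
    if should_roll_back(*old_stats, *new_stats):
        return 0  # serve everyone the old version while errors are fixed
    return min(100, current + step)
```

In practice the statistics would come from the monitoring platform's API rather than be passed in by hand, but the decision rule is the same.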
While Brennan is no fan of “people breaking things”, he argues moving fast is more beneficial for customers.
He cited a recent bungle from October 11, when a product team rolled out a change to The Iconic’s image processing and "broke everything"; it took the site almost an hour to recover.
“But it’s not the end of the world. It’s the recovery that’s important,” Brennan argued.
“Nothing’s wrong with breaking things - customers benefit more if we move fast. In some industries you can’t afford any downtime … We are lucky, we are an e-commerce retailer, we deploy probably 20 to 30 times a day. And if our site is down now, people will generally come back later.”
Dealing with alert overload
Another new area of responsibility for The Iconic’s software developers is monitoring: the trade-off for allowing devs to write in their language of choice is that they must also manage the monitoring of the features they build.
As an online fashion retailer, The Iconic is heavily reliant on monitoring to ensure its customers can access the site whenever they choose, especially during peak sales periods.
As a result the firm has moved heavily into proactive - rather than reactive - monitoring, investing in much of the New Relic suite of products alongside other tools.
Its synthetic monitoring tool - a simulation of typical user behaviour or navigation through a website - is the “most important script in the entire stack”, Brennan said.
“If this script fails it triggers a P1 [priority one] and alerts everyone. It essentially goes through the whole site - through the homepage, select your product, checkout, sign in, purchase your product: we have to make sure you can get through to the confirmation page.”
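The Iconic's production check runs as a scripted browser journey in New Relic's synthetic monitoring; the skeleton below is only a generic illustration of the idea, with made-up paths and a fake HTTP client standing in for a real browser script:

```python
class SyntheticCheck:
    """Walk the critical purchase path end to end; any failed step is
    treated as a priority-one (P1) incident."""

    def __init__(self, fetch):
        self.fetch = fetch  # callable: path -> HTTP status code

    def run(self):
        # Mirror the journey Brennan describes: homepage, select a
        # product, sign in, checkout, confirmation page.
        journey = ["/", "/product", "/signin", "/checkout", "/confirmation"]
        for path in journey:
            if self.fetch(path) != 200:
                return ("P1", path)  # the failed step triggers a P1 alert
        return ("OK", None)
```

The value of checking the whole journey, rather than pinging each page in isolation, is that it catches failures that only appear when state carries across steps, such as a cart that empties itself at sign-in.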
New Relic synthetic monitoring is just one of a series of tools The Iconic uses to ensure its website is performing, and combined they generate around half a billion stored events - creating something of a data headache for the IT team.
Brennan calls it “alert fatigue”.
“You have to make sure to sort through the data and work out what is relevant,” he said.
“This has been a big problem for us recently. We had so many alerts, and it gets to a point where my email is getting flooded, the New Relic app and all the other apps we use are going off - it’s a big problem.”
The Iconic has implemented two approaches to work around this.
First, it created two engineering streams dedicated to internal and external customers, which has meant only one person per stream ever needs to be on call, on a one-week rotation.
“They are responsible for the alerts. So the rest of the team can step back,” Brennan said.
On top of that, the retailer is currently in the process of aggregating all its alerts following a “whiteboarding process” that identified how many monitoring platforms it actually had.
“There were way too many. But we need them all,” Brennan said.
“We’re trying to use tools to pump all of our data and metrics into one system and trigger the incidents at the right times.
“So we’re saying if it’s a P3 or a P4 [alert], send it to the team that needs to deal with it for later - the engineer on call doesn’t care about it, it shouldn’t wake them up in the middle of the night. If it’s a P2 or a P1 that’s a more serious incident and we make sure the team knows about it.
“No-one wants to be woken up in the middle of the night. The Iconic only serves Australia and New Zealand, so if error rates start at 4am that probably doesn’t need to wake everyone up. If everything crashes and burns, that’s a different story.”
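The routing policy Brennan outlines reduces to a small decision function over severity and time of day. A sketch, with the business-hours window and return labels being illustrative assumptions rather than The Iconic's actual policy:

```python
def route_alert(priority, hour):
    """Route an alert by severity and local hour (0-23).
    Thresholds here are assumed for illustration."""
    if priority == "P1":
        return "page-on-call"  # everything has crashed: always wake someone
    if priority == "P2" and 7 <= hour < 23:
        return "page-on-call"  # serious, but overnight it can wait for morning
    return "team-queue"        # P3/P4: queued for the owning team to handle later
```

Feeding all monitoring platforms into one such router is what lets a single on-call engineer per stream cover the rest of the team.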
Chat tools also form part of this simplification: Brennan said he found that moving alerts out of email and into collaboration platforms like HipChat and Slack has made it less likely they’ll be ignored.
“ChatOps is your best friend. A lot of alerts come via email, and you can end up with an inbox full of stuff. And no-one reads it," he said.
“[But we found that] if you integrate this into HipChat or Slack or whatever, people glance at the alerts and don’t ignore them so much. So we have put a lot of stuff into HipChat, and we will [do so] with Slack [which] we’re moving to.”
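Wiring alerts into a chat platform is typically done with an incoming webhook that accepts a small JSON payload. A minimal sketch for Slack-style webhooks; the channel name and message format here are assumptions, not The Iconic's configuration:

```python
import json

def slack_alert_payload(priority, summary, channel="#alerts"):
    """Build a JSON payload for a Slack incoming webhook.
    An HTTP POST of this body to the webhook URL posts the message."""
    return json.dumps({
        "channel": channel,
        "text": f"[{priority}] {summary}",
    })
```

Posting the payload to the team's webhook URL (omitted here) puts the alert where people are already looking, which is the whole point of the ChatOps approach Brennan describes.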