AWS Sydney outage prompts architecture rethink

Customers consider multi-region redundancy.

Last night's outage at an Amazon Web Services Sydney availability zone is prompting some of AWS' biggest local customers to reconsider their architectures to mitigate future damaging downtime.

AWS has built its brand on reliability as well as flexibility and cost, but yesterday's storms in Sydney showed that even the public cloud powerhouse isn't immune to nature.

Big-name web properties spent Sunday night scrambling after the bad weather fried hardware in one of Amazon's Sydney data centres, sending EC2 instances and EBS volumes in one of its availability zones offline and creating problems for other AWS services including Elasticsearch and internal DNS.

API call failures in the affected availability zone also meant that those hosted there were unable to fail over elsewhere, despite having multi-zone redundancy in place for such events.

However, some fared better than others.

The likes of Carsales, Domain, The Iconic, Domino's and REA Group were among a laundry list of major players affected by the outage. (AWS' popularity in Australia forced the company to build two of its own data centres in the city after it outgrew co-lo space just 18 months following its local launch.)

Domain, The Iconic and Domino's experienced extended downtime, whereas Carsales and REA Group saw minimal impact.

Carsales' use of its own native APIs rather than AWS' offering, and the fact that it hosts parts of its site in Azure, meant the company escaped with a slower, but still fully functional, site for a short time.

The trade-off for this more architecturally tricky arrangement is slightly more cost and planning ahead of time, Carsales CIO Ajay Bhatia said.

"The only sure way not to have an outage is not to be online, but second best is to have a balanced plan with a bit of luck," Bhatia said.

"One thing about Carsales is with our model, for example, with dealers where we only charge them when consumers send leads so any outage means we can't charge, so it is super important that we minimise outages."

REA Group has both multi-zone and multi-region failover in place. It deploys to two availability zones simultaneously, so it wasn't impacted by the API difficulties.

It got away with just one lost web page, which was hosted in a single availability zone, and a wobbly Android app, because the IT team immediately reloaded onto another zone and controlled its elastic load balancing to stop its sites going back to the struggling data centre.
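The idea of keeping traffic away from a struggling zone can be pictured with a minimal sketch. It is not REA Group's tooling and makes no AWS API calls; the `Target` class, the `routable_targets` function, the instance names and the zone labels are all hypothetical, chosen only to show the filtering step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """A backend instance behind a load balancer (hypothetical model)."""
    instance_id: str
    zone: str


def routable_targets(targets, drained_zones):
    """Return the targets that should keep receiving traffic.

    Any target in a zone listed in `drained_zones` is left out of rotation.
    """
    return [t for t in targets if t.zone not in drained_zones]


# Hypothetical fleet spread evenly over two zones.
fleet = [
    Target("web-1", "zone-a"),
    Target("web-2", "zone-a"),
    Target("web-3", "zone-b"),
    Target("web-4", "zone-b"),
]

# Operators mark zone-a as drained. It stays out of rotation until someone
# removes it from this set, even if its instances start answering again.
drained = {"zone-a"}

active = routable_targets(fleet, drained)
print([t.instance_id for t in active])  # ['web-3', 'web-4']
```

Keeping the drained set under manual control, rather than re-admitting a zone as soon as it responds, is one way to stop sites drifting back to a data centre that is still unstable.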

"Multi AZ and ultimately, multi-region, with some smart architecture for deployment is key to cloud resilience today - [as is] having a team of world-class engineers manage the impacts in real time," REA CIO Nigel Dalton said.

"We learned a lot. Power failure is a tough event for anyone to suffer, and we have an A-team of engineers. Others will be learning different, tougher lessons about good AZ management."

Going global?

The impacted enterprises iTnews spoke to said they were now looking at how to shore up their infrastructure against another similarly damaging outage.

But the events of last night don't appear to have deterred them from jumping into bed with a single cloud vendor - rather, they're now looking at redundancy across geographic regions.

"There are more lessons for us. Hopefully [this] will make us better from here like I am sure [it is] with many companies. That is the benefit of such outages - it makes you think what you could do better," Bhatia said.

"... multi region is more important than it was a day ago. I am careful not to make a decision yet though without looking into the full picture that the team must provide now."

Domain CTO Mark Cohen said it was "very very likely" last night's problems would change how his team structured its use of AWS.

"We have a post mortem today and we'll be looking at a couple of plans of attack."

Domain is heavily embedded with AWS, making a multi-cloud environment somewhat difficult. Cohen expects the IT shop will move to a multi-region architecture that makes more use of tools like Chaos Monkey.

Cloud specialist Jeff Waugh said multi-region would be a much more attractive proposition for many organisations than using several vendors.

"You could go multi-region or multi-cloud, but I think multi-region makes a lot more sense - it's a much easier proposition if you're using the same technology stack and then if something terrible happens to all of Sydney you can fail over to Singapore," he said.

"Hopefully this will make people a bit more introspective about how they structure their architecture."
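The Sydney-to-Singapore failover Waugh describes comes down to a simple rule: serve from the preferred region while it is healthy, otherwise move to the next one on the list. The sketch below is an illustration only, not any company's implementation. The `choose_region` function and the health dictionaries are hypothetical, and in practice the health data would come from monitoring rather than hard-coded values. The region codes are AWS' identifiers for Sydney (`ap-southeast-2`) and Singapore (`ap-southeast-1`).

```python
def choose_region(preferred, health):
    """Return the first region in `preferred` that is reported healthy.

    `health` maps region code -> bool. A region missing from the map is
    treated as unhealthy. Returns None if no region is available.
    """
    for region in preferred:
        if health.get(region, False):
            return region
    return None


# Sydney first, Singapore as the fallback.
REGION_ORDER = ["ap-southeast-2", "ap-southeast-1"]

# Hypothetical health snapshots.
normal = {"ap-southeast-2": True, "ap-southeast-1": True}
sydney_down = {"ap-southeast-2": False, "ap-southeast-1": True}

print(choose_region(REGION_ORDER, normal))       # ap-southeast-2
print(choose_region(REGION_ORDER, sydney_down))  # ap-southeast-1
```

The selection rule is the easy part. The cost and planning the CIOs mention lies in keeping data and deployments current in the second region so that the fallback is actually usable.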

Copyright © iTnews.com.au. All rights reserved.