Real estate giant REA Group made it through the recent Amazon Web Services Sydney availability zone outage relatively unscathed thanks to a multi-region and multi-availability zone cloud architecture.
Earlier this month one of AWS' Sydney availability zones went under after bad weather triggered a failure in the company's uninterruptible power supply (UPS) setup.
The outage sent some of Australia's biggest web properties scrambling when EC2 instances and EBS volumes in the AZ became unreachable and other services including Elasticsearch, APIs and internal DNS experienced flow-on problems.
REA Group, a heavy user of AWS services, was one of those affected, but managed to get away with only a broken ad server, one offline web app, a wobbly Android application and slightly slower response times for some services.
" ... while we weren’t totally unaffected, it was overall a satisfying outcome," senior technical lead Jeremy Burton said.
Being prepared ... and lucky
While the outage has prompted many to reconsider their cloud architecture, REA Group says designing for failure - coupled with "a bit of luck" - helped it weather the storm.
REA's production systems are deployed in a multi-availability zone setup by default. Its most critical systems - as well as those like Redshift that don't offer multi-AZ options - have been architected to run across multiple regions, specifically in Frankfurt and Sydney.
The IT team operates independent copies of the systems that interact with REA's master data store in each region for eventual consistency, Burton said.
"The only thing that will be common is the source of the data," he wrote.
"In this way, if one region has problems, the other is totally unaffected."
API clients can talk cross-region if local copies aren't available, Burton said, using a combination of AWS Route 53 latency-based routing and Route 53 health checks.
This approach kicked in during the recent Sydney AZ outage - "one of our services automatically flipped over to our European region when some of its instances had problems," Burton said.
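To illustrate the idea, the behaviour Burton describes can be modelled roughly as "send each client to the lowest-latency region that is currently passing its health check". The Python sketch below is a simplified model of that selection rule only. It is not Route 53's actual algorithm and not REA's configuration; the `RegionEndpoint` class, the `pick_endpoint` function, the region labels and the latency figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RegionEndpoint:
    """One regional copy of a service (hypothetical model)."""
    name: str          # illustrative label, e.g. "sydney"
    latency_ms: float  # latency as measured from the client's location
    healthy: bool      # outcome of the most recent health check


def pick_endpoint(endpoints: List[RegionEndpoint]) -> Optional[RegionEndpoint]:
    """Return the lowest-latency endpoint that passes its health check.

    Returns None when no endpoint is healthy. This is a simplification
    chosen for the sketch, not a statement about how Route 53 behaves.
    """
    candidates = [e for e in endpoints if e.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda e: e.latency_ms)


if __name__ == "__main__":
    # Assumed figures for a client located in Australia.
    endpoints = [
        RegionEndpoint("sydney", latency_ms=20.0, healthy=True),
        RegionEndpoint("frankfurt", latency_ms=300.0, healthy=True),
    ]
    print(pick_endpoint(endpoints).name)  # nearest healthy region

    # Simulate the local region failing its health check.
    endpoints[0].healthy = False
    print(pick_endpoint(endpoints).name)  # traffic moves to the other region
```

In this model an Australian client is normally served from the nearer region and is only sent overseas, at the price of higher latency, while the local copy is failing its checks.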
Additionally, continuing to host some of its core systems within its in-house data centre and deploying static assets straight to S3 helped REA avoid severe downtime.
"S3 by its nature is more durable than an EC2 instance, and more likely to survive an AZ failure," Burton said.
"It’s multi-AZ by default, and while the events of the weekend have shown that just being multi-AZ isn’t necessarily enough to be resilient to an AZ failure, the S3 service held up well."
Deep pockets required
However, be prepared to see your infrastructure costs double when adopting a multi-region approach, Burton warned.
"It takes well-architected systems to function under eventual consistency, and to be decoupled in a way that allows redundancy in appropriate parts of the infrastructure," he said.
"Making your infrastructure immutable comes at some automation cost.
"And in some cases, it’s just not worth it. Either the SLAs don’t indicate a need for multi region, or the system isn’t critical enough to justify the engineering or infrastructure expense."