One of the key questions we’ve always asked of a cloud provider is whether the service comes with an enterprise service level agreement (SLA).
An SLA offers users a guarantee, backed by financial penalties, that the service will be available for a given percentage of the month or year.
It has long been considered an ultimate measure of the maturity of a hosting or cloud service to offer an SLA of ‘four nines’ (available 99.99 percent of the time, or no more than about 53 minutes of downtime per year) or better yet ‘five nines’ (99.999 percent of the time, or downtime of less than 5.3 minutes per year).
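For reference, the downtime budgets behind those ‘nines’ follow from simple arithmetic. This is a sketch assuming a 365-day year with no allowance for scheduled maintenance:

```python
# Downtime budget implied by an availability percentage,
# assuming a 365-day year.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes(availability: float) -> float:
    """Maximum minutes of downtime per year at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability)

for label, a in [("three nines", 0.999),
                 ("four nines", 0.9999),
                 ("five nines", 0.99999)]:
    print(f"{label}: {downtime_minutes(a):.2f} minutes/year")
```

Three nines works out to roughly 526 minutes (almost nine hours) of allowable downtime a year, four nines to about 53 minutes, and five nines to about 5.3 minutes.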
Very few cloud providers actually hit anything like five nines, but if the penalty imposed on the service provider cuts them deep enough, it’s at least an indication to the customer that they take availability seriously.
Curiously, in recent interviews with both foreign and domestic providers of cloud computing, I’ve been told that many customers don’t realistically expect their service provider to achieve these desired service levels.
Many customers, I am told, would rather not pay the additional cost a service provider is likely to charge for five nines -- otherwise, they would still be paying the premiums that traditional hosting companies charged.
Instead, customers view infrastructure-as-a-service as a commodity building block from which they can build their own availability and manage uptime themselves.
I recently asked Lance Crosby, CEO of US provider SoftLayer, whether the company continued to exclude outages of less than 30 minutes from its definition of downtime, as revealed in iTnews’ first ‘Cloud Cover’ report into contractual terms.
Crosby assured me SoftLayer had “tightened up” the practice, but that SLAs were not as critical as I may have imagined.
“When we went out with a four nines SLA, people started asking, why offer four nines for computing-on-demand? There is a whole new generation of developers and sys admins in the world that are scaling low and wide.”
By “scaling low and wide”, Crosby refers to the practice of consuming commodity virtual server instances across multiple availability zones or providers.
“We’ve had customers say: I don’t want managed services on my cloud environment,” he said. “If a cloud instance fails, I kill it and reboot. I start from another image and add the workload.
“They are saying to us: knock the cost off that SLA and I will solve the problem myself through multiple boxes, pods or data centres. Give me three nines and I’ll make it five nines myself,” Crosby said.
“It’s like the cloud is considered 'beta'; they expect outages and expect things to go wrong, and work around it rather than expecting it to be better.”
Virtual machines are so commoditised, Crosby said, that it’s cheaper to spin up a new one than guarantee the uptime of an existing one.
“The closest analogy I can make is with our hardware vendors - for hard drives, for example,” he said. “Hard drives are of a nominal value. We’re driving down the cost of hosting, so we tell the manufacturers: we don’t need a warranty.
"If it fails, we throw it away and get a new one. Because by the time we box up a failed drive and send it to the manufacturer, we are already losing money.
"It’s the same mentality people working with cloud have. They’d rather have those dollars back to create their own level of redundancy.”
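Crosby’s claim that customers can turn three nines into five through redundancy checks out mathematically, provided instance failures are independent of one another. A minimal sketch (the figures are illustrative):

```python
# Combined availability of n redundant instances, assuming failures
# are statistically independent. Real outages are often correlated
# (e.g. a zone-wide failure), so treat this as an upper bound.

def combined_availability(per_instance: float, n: int) -> float:
    """Probability that at least one of n independent instances is up."""
    return 1 - (1 - per_instance) ** n

# Two "three nines" instances in separate zones:
print(combined_availability(0.999, 2))
```

Two independent three-nines instances yield 0.999999, six nines, comfortably beyond the five-nines target; in practice, correlated failures across a single provider pull the real figure back down, which is why customers spread workloads across pods, data centres or providers.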
Peter James, managing director of Australian public cloud computing provider Ninefold, told iTnews that although customers have never explicitly asked for “a cheaper price for lower availability”, they tend to view any single virtual machine as a building block from which availability can be built.
The great benefit of cloud computing, he noted, is that customers can choose to pay more for availability features above and beyond that base virtual machine as they see fit, and better yet from a transparent price list.
Without these value-added services in place, a customer relying on a single virtual machine from a single provider is likely to come spectacularly unstuck should their cloud provider suffer a technical failure.
Kristopher Sheather, founder of Canberra-based cloud provider Cloud Central, has seen this play out in practice, to his regret.
Cloud Central suffered a ten-day storage failure earlier this year.
The problem was an incompatibility between the storage software and additional disks deployed to expand storage capacity. It resulted in two drives from a single mirror pair degrading simultaneously and in further unexpected errors, the company reported on its status blog.
In accordance with its SLA, Cloud Central stumped up a 30 percent rebate for affected customers on the following month’s bill. But that did not appease those customers that had not set up their applications to fail over to another provider.
Sheather takes full responsibility for the outage, but notes that “the key thing going forward is to make customers aware of their responsibility versus the cloud provider’s responsibility.
“You shouldn’t put all your eggs in one basket,” he said. “You should have your own DR [disaster recovery] strategies in place.”
What does this mean for the SLA?
Keith Price, director of Black Swan Consulting, notes that an SLA simply offers the customer a rebate – it can’t actually guarantee that availability targets will be met. A discount on future services won’t help recover losses from a long outage.
The SLA isn’t any less relevant, he argues, it just needs to be framed by a new question.
Price recommends that customers ask any potential cloud provider to state their historical availability in minutes before signing up.
Over time, customers will seek out services to independently verify that data.
In the meantime, customers need to embrace new deployment models to ensure availability in a cloud world – using load balancers and multiple virtual machines, availability zones and/or data centres for any given application. Stay tuned to next week’s Cloud Cover to find out how it’s done.
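In the meantime, the core idea behind that deployment model, failing over to a healthy instance in another zone, can be sketched in a few lines. The endpoint names and the health probe here are hypothetical stand-ins for whatever load balancer or health-check service a customer actually runs:

```python
# Client-side failover sketch: return the first endpoint, across
# availability zones, that passes a health check. Endpoint names and
# the is_up probe are illustrative, not any provider's real API.

def first_healthy(endpoints, is_up):
    """Return the first endpoint whose health check passes."""
    for endpoint in endpoints:
        if is_up(endpoint):
            return endpoint
    raise RuntimeError("no healthy endpoint in any zone")

# Example: zone-a is down, so traffic fails over to zone-b.
zones = ["vm.zone-a.example.com", "vm.zone-b.example.com"]
print(first_healthy(zones, is_up=lambda e: "zone-a" not in e))
```

A real deployment would replace the lambda with an HTTP health check and put the selection logic in a load balancer or DNS layer rather than the client.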
Lia Timson contributed to this story.