Cloud outages happen – deal with it.

Yesterday, Tuesday 28th February 2017, Amazon’s S3 service in US-EAST-1 failed – or, in Amazon’s own words, experienced “increased error rates”. Whatever you want to call it, it interrupted services across the Internet for many users, from Docker to Slack, from IoT to email. The fact it affected so many services wasn’t a fault of Amazon’s – it was a fault of the decision makers, developers and architects at the companies using Amazon’s services.

In a world where magical things happen in IT, where cool startups attract investors by the millions, you can almost forgive developers for giving little forethought to the possibility of an outage – especially from a provider that offers so many variations of “9s” in the SLAs across its different services. However, established companies that should know better leave a bitter taste in my mouth, and make me wonder what brought them to run important services (to the point that there are paying customers expecting a “just works” experience) in the cloud in the first place, without an understanding of what to expect when things go wrong. Didn’t we all read the memo on “design for failure”? It has long been a staple of our IT diet, and it became even more important with cloud services, where applications should tolerate latent connections, noisy neighbours and unknown patch schedules. So how, in 2017, are we still suffering outages because a single entity in the whole cloud ecosystem failed?

Architects, developers and people who pay the bills: it is your responsibility to ensure services for your company are available, regardless of where those services run. Cloud is there to allow you to consume services in an easy, digestible way. It potentially saves you time and money, compared with a world where cables and server specs took up most of the whiteboard, and where developers couldn’t care less whether they were getting an HP or a Dell box. It’s just a place where code runs, right?

Half-right.

The fact is, behind the rich veneer of cloud, there are still computers, cables and people. There is still physics connecting one part of the Internet to the other. There are still physical servers sitting between your end-users and your pieces of code. That’s not to say you need to know what is going on behind the scenes – it’s more of an appreciation that someone, outside of your IT department, has solved the long-standing problem of giving you the services you need in a timely manner; giving you access to components without having to get a signature on a purchase order and wait for hardware to be racked and stacked. Learn to appreciate that cloud meets you halfway and that you have choices to make. Those choices revolve around architecture, costs and operations.

Going all in with public cloud is not a natural yes. For many businesses it is a no-brainer; for others it is a longer journey; for some, it just isn’t a decision they need to make. After all, despite what the Clouderati want to impose, everyone gets a vote, whether you agree with the end result or not. There are numerous public cloud success stories on the Internet, but that success wasn’t because those companies ticked the cloud box – it is because they employ savvy people to guide them to the best place for running their services: architects who understand that a bridge is only as strong as its weakest point; developers who solve problems through their end-users’ eyes, and not because some cool tech gets them to an even cooler conference; and key business people (from leaders to HR and Finance) who realise that cloud is a multi-faceted engine that needs the right fuel to function as intended.

Public cloud only works when you understand its weaknesses. Regions cloudwash over the detail that there are still datacenters involved. So why did we assume we no longer need to think about putting services into multiple locations? Cloud doesn’t solve this for you. The ability to do it is there, but developers and architects – you have to make it work yourself: understand what you have to play with, understand what the limitations are, and most of all, understand what happens in the event of a disaster. I suspect people are having these conversations today, after doing an S3 outage test in public yesterday.

I also suspect people are hitting the AWS Cost Calculator hard today. AWS has various tools that allow you to keep costs down on their platform, but decisions around spot instances or reserved instances require forethought and planning. Crossing regions – with variations in costs, or further up-front reserved instance costs – changes your budgets. If the cost of running multi-region AWS was judged greater than the cost of a “few minutes of outage”, then that decision forced people to run from a single place. Is it time to review, like an insurance premium, what an extended outage actually costs your business?

Of course, it isn’t all down to cost. What if there was an assumption that services would seamlessly run from another region? There are companies probably scratching their heads today, wondering why their services went down. We’re only human; we’re still fallible. There are a number of cogs in providing services to end-users: the cloud service provider is only one part of that puzzle.
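
As a concrete illustration of “making it work yourself”, here is a minimal sketch of one building block – enabling S3 cross-region replication with boto3. The bucket names and IAM role ARN are hypothetical placeholders, both buckets must already exist, and replication alone only covers the data; reads during an outage still need your application to know about the second region.

```python
import boto3

# Replication requires versioning on both source and destination buckets.
for bucket, region in [("example-src", "us-east-1"), ("example-dst", "eu-west-1")]:
    boto3.client("s3", region_name=region).put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

# Replicate every object written to the source bucket into another region,
# so a regional S3 outage doesn't take the only copy of the data with it.
s3 = boto3.client("s3", region_name="us-east-1")
s3.put_bucket_replication(
    Bucket="example-src",
    ReplicationConfiguration={
        # Hypothetical IAM role that grants S3 permission to replicate.
        "Role": "arn:aws:iam::123456789012:role/example-replication-role",
        "Rules": [
            {
                "ID": "replicate-all",
                "Prefix": "",  # empty prefix = all objects
                "Status": "Enabled",
                "Destination": {"Bucket": "arn:aws:s3:::example-dst"},
            }
        ],
    },
)
```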

US-EAST-1 is an interesting place – it’s the region that tends to catch fire like this. It is the default region when you sign up for AWS services, as well as the oldest. How many times have you been itching to get started the moment your credit card was accepted? Before you know it, you’re demonstrating the value of the cloud to decision makers and the adoption rate increases. At what point will you consider anything more than the single-region, blank canvas you’ve been painting in? I daresay businesses outside of the US (where end-users are not US-centric) have this hand forced early, but for others it is an afterthought until it is pointed out.
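
For what it’s worth, escaping the default is a one-line decision in most SDKs. A small boto3 sketch (the bucket name is hypothetical) – the point is simply that the region becomes an explicit, reviewable choice rather than whatever the sign-up flow handed you:

```python
import boto3

# Chosen deliberately, not inherited from the default (historically US-EAST-1).
PRIMARY_REGION = "eu-west-1"

s3 = boto3.client("s3", region_name=PRIMARY_REGION)
s3.create_bucket(
    Bucket="example-deliberate-bucket",  # hypothetical name
    CreateBucketConfiguration={"LocationConstraint": PRIMARY_REGION},
)
```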

Running in a public cloud, such as AWS, is extremely attractive, but many complex decisions lead to alternative choices such as multi-cloud (AWS, Azure and Google Cloud Platform), or variations that include hybrid (a mix of on-premises and public cloud). If people are not considering multi-region, then they have to make decisions around multi-cloud. Multi-cloud removes the dependency on a single service provider. AWS services are not infallible, else there would be 100% uptime SLAs on all of them. The fact there are not implies failure at some point, no matter how big or small: writing a number on an SLA does not automatically mean the code or operational procedures are airtight. Multi-cloud, therefore, is about spreading risk.

Multi-cloud is made possible when tools and development are abstracted away from the specifics of a particular cloud service provider. For example, scripts written directly against one provider’s APIs will not work against a provider that doesn’t share the same syntax – a sketch of this abstraction follows below. Moving to a multi-cloud world unlocks you from the clutches of a single provider. In a non-cloud world, we used to call that lock-in. We cloudwash over this when thinking about AWS, because it is so far up and to the right in the Gartner Magic Quadrant(tm) that we think it’s OK to be there. If I was paying the bills, I’d worry what that kind of long-term tie-in might mean for a smaller voice in the cloud world.

This isn’t to spread FUD about AWS – far from it. Savvy companies are realising the benefits of multi-cloud, and simply not putting all their eggs in one basket. If anything, it gives ultimate choice and flexibility, and that should be highly encouraged. It attracts diverse talent. It reduces reliance on specific CSP services. It could even attract cost savings through negotiations, once providers know a business is using multiple clouds – each trying to win one over the other for PR gain.
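
As a sketch of what “abstracted away from the specifics” can look like in practice, here is how the same object upload might be written with Apache Libcloud, a library that speaks several providers’ storage APIs through one interface. All credentials and container names are placeholders, and this is one possible abstraction layer, not the only one:

```python
from libcloud.storage.types import Provider
from libcloud.storage.providers import get_driver

def upload_report(provider, key, secret, container_name, path):
    """Upload a file via whichever storage provider we are given.

    The calling code never mentions S3, Azure Blobs or Google Cloud
    Storage directly, so switching (or doubling up) providers is a
    configuration change, not a rewrite.
    """
    driver = get_driver(provider)(key, secret)
    container = driver.get_container(container_name)
    return driver.upload_object(path, container, object_name="report.csv")

# Same call, different clouds – placeholder credentials throughout.
upload_report(Provider.S3, "aws-key", "aws-secret", "example-reports", "report.csv")
upload_report(Provider.AZURE_BLOBS, "account", "az-key", "example-reports", "report.csv")
upload_report(Provider.GOOGLE_STORAGE, "gcs-key", "gcs-secret", "example-reports", "report.csv")
```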

A variation on multi-cloud is hybrid – where portions of services run as a private cloud in datacenters with fixed, controlled resources, alongside public cloud services where they make sense. It is a mistake – an embarrassing one – to pitch the likes of OpenStack alone as a way of combatting the S3 outage. If you’re considering enhancing current datacenter operations, looking to revert an otherwise ambitious datacenter exit strategy while still requiring on-premises resources for efficient delivery of services, or even planning a datacenter exit strategy, OpenStack should be high on your list of technologies to help realise that plan. OpenStack complementing VMware or more legacy infrastructure is immensely popular as a way of unlocking cloud services to enhance internal IT. However, utilising OpenStack on-premises doesn’t, by itself, solve a problem like the S3 outage, because the S3 outage affected people who didn’t think multi-region. OpenStack deployed across multiple datacenters is the counter-argument, though consideration still needs to be given to the same lock-in questions.

Most successful implementations of OpenStack have end-users already versed in tools that abstract away from the underlying infrastructure: a key tenet of multi-cloud. So, granted that my experience and wages come indirectly from OpenStack, I see massive potential for established enterprises wanting to use cloud tools to speed up delivery of services and development, while taking a lower-risk approach to cloud, by using OpenStack as an alternative platform that immediately unlocks both on-premises and public cloud. It isn’t a completely solved problem. Cloud meets you halfway. The other half is filled by the same people who would solve any cloud-resiliency problem: architects, developers and decision makers.
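
To show the same abstraction reaching into a private cloud, here is a hypothetical Libcloud compute sketch pointed at an OpenStack endpoint. The auth URL, project and credentials are placeholders for whatever your deployment exposes; the shape of the calls, not the values, is the point:

```python
from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver

# The same driver-based pattern as the storage example, but aimed at an
# on-premises OpenStack cloud. All values below are placeholders.
OpenStack = get_driver(Provider.OPENSTACK)
driver = OpenStack(
    "demo-user",
    "demo-password",
    ex_force_auth_url="https://keystone.example.com:5000",
    ex_force_auth_version="3.x_password",
    ex_tenant_name="demo-project",
)

# List what the private cloud offers; the calls are the same shape as they
# would be against a public provider, which is exactly the tenet above.
for size in driver.list_sizes():
    print(size.name, size.ram)
for image in driver.list_images():
    print(image.name)
```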

AWS outages will happen. Azure outages will happen. OpenStack outages will happen. Deal with it, because your customers will deal with those situations too – by voting with their feet.
