Azure: Design for failure (Why?)

July 21, 2016 00:22

When applications are moved to the cloud (in this case, to Azure), people often assume that their applications are now highly available and up all the time. Most Azure services have an SLA of 99.95%, which reinforces this perception of high availability.

Let’s consider a business scenario. Assume we have an online shopping application deployed on Azure as Platform as a Service (PaaS). The application consists of the following Azure services:

Azure Web App
SQL Azure
Blob storage

The conceptual architecture diagram of the application looks like this:

Assume the application is deployed in the ‘North Central US’ Azure region.

What is wrong with this deployment? At first glance, everything should work fine and the application should be available 99.95% of the time. Here are a few things to consider:

99.95% uptime applies to each individual service.
- This means the web app can be down for X hours and the SQL database service can be down for Y hours. The two will not necessarily be down at the same time; they can be unavailable at different times.
- So, even though each Azure service stays within its 99.95% SLA, your application as a whole might have more downtime!
What if your web application has edge-case failures?
- For example, your application has a memory leak and, in some edge cases, it gets restarted; after the restart it takes a few minutes to start serving requests.
- Such scenarios add to application downtime.
- Consider such issues happening in the busy season: they give your users a bad experience and may leave a bad impression of your application.
Application deployments
- Releasing new features or bug fixes to production can make the application unavailable for the duration of the deployment.
- Depending on how often you deploy, that time adds up to application downtime.
Unexpected Azure issues may also result in application downtime.
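To make the first point concrete, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes every request needs all three services, that the services fail independently of each other, and that each one delivers exactly the 99.95% figure used in this post. The function names are made up for illustration.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24


def composite_availability(slas):
    """Availability of an app that needs every service up at the same time.

    Assumes the services fail independently, so availabilities multiply.
    """
    return prod(slas)


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Assumption: the 99.95% figure from this post applies to each service.
services = {
    "Azure Web App": 0.9995,
    "SQL Azure": 0.9995,
    "Blob storage": 0.9995,
}

combined = composite_availability(services.values())
print(f"One service:    {downtime_hours_per_year(0.9995):.2f} hours/year")
print(f"Whole app:      {combined:.4%} available")
print(f"Whole app:      {downtime_hours_per_year(combined):.2f} hours/year")
```

Under these assumptions the application as a whole lands at roughly 99.85%, which is about three times the yearly downtime allowed for any single service.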

For the reasons above, designing for failure is important!

Let’s design our application (the sample scenario above) for failure.

First step

Find the failure points. In this scenario, we have 4 failure points:

Azure Web App
SQL Azure
Blob storage
Application bugs

Second step

Find solutions for handling failures.

Azure Web App

Solution 1

You can run multiple instances of the web app in the same region, that is, run the web app in scaled-out mode. You can run an Azure Web App on up to 10 instances within the same region (the exact limit depends on the pricing tier).

Issue with this approach
- If the Azure region has downtime, your application will have downtime too.

Solution 2

Deploy the application in multiple regions and manage traffic routing using Traffic Manager.

Deploy the application to another region (the sister region). In our scenario, deploy the web application to the South Central US region.
Use Traffic Manager to route users to the appropriate instance of the web application.
Now, even if the North Central US region is having issues, user requests will be served from the South Central US region. I will explain how to configure Traffic Manager below.

SQL Azure

Enable Geo-Replication for the SQL Azure database.
The database is now actively geo-replicated to the secondary region. The geo-replicated (secondary) database is read-only.
If the primary region is having trouble, promote the database in the secondary region to primary (read/write) by stopping geo-replication, then connect the application to the database in the secondary region.
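Switching the application to the secondary database is quicker if it does not require a new deployment. The following is a minimal sketch of one way to do that, assuming the application keeps both connection strings and reads a setting that says which one is active. The server names, the database name and the `ACTIVE_DB_REGION` setting are hypothetical, and credentials are left out on purpose.

```python
import os

# Hypothetical connection strings; server and database names are invented.
CONNECTION_STRINGS = {
    "primary": "Server=tcp:shop-ncus.database.windows.net,1433;Database=shop;",
    "secondary": "Server=tcp:shop-scus.database.windows.net,1433;Database=shop;",
}


def active_connection_string(settings=None):
    """Return the connection string for the region currently marked active.

    `settings` is any mapping; it defaults to the process environment.
    ACTIVE_DB_REGION is a hypothetical setting an operator flips to
    "secondary" after promoting the secondary database to read/write.
    """
    if settings is None:
        settings = os.environ
    region = settings.get("ACTIVE_DB_REGION", "primary")
    if region not in CONNECTION_STRINGS:
        raise ValueError(f"Unknown database region: {region!r}")
    return CONNECTION_STRINGS[region]


print(active_connection_string({"ACTIVE_DB_REGION": "secondary"}))
```

The switch is deliberately manual here: until geo-replication is stopped the secondary is read-only, so pointing the application at it automatically would break writes.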

Blob Storage

Choose ‘Read-Access Geo Redundant’ as your blob storage replication mode so that your blob storage content is actively geo-replicated to the secondary region.
In case of a failure in the primary region, you can connect applications to the secondary region. Note that the secondary endpoint is read-only.
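For reads, the fallback to the secondary region can be sketched in a few lines. With read-access geo-redundant storage, the secondary endpoint uses the account name with a `-secondary` suffix. Everything else below is an assumption made for illustration: the account, container and blob names are invented, and the blob is assumed to be publicly readable so that a plain HTTP GET works without authentication.

```python
from urllib.request import urlopen


def blob_urls(account, container, blob_name):
    """Build the primary and secondary (read-only) URLs for a blob."""
    primary = f"https://{account}.blob.core.windows.net/{container}/{blob_name}"
    secondary = (
        f"https://{account}-secondary.blob.core.windows.net/{container}/{blob_name}"
    )
    return primary, secondary


def http_get(url, timeout=5):
    """Fetch a URL and return the response body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def read_blob(account, container, blob_name, fetch=http_get):
    """Read a blob from the primary region, falling back to the secondary.

    Any network or HTTP error on the primary (urllib raises OSError
    subclasses for these) triggers one attempt against the secondary.
    """
    primary, secondary = blob_urls(account, container, blob_name)
    try:
        return fetch(primary)
    except OSError:
        return fetch(secondary)


# Hypothetical names, shown only to illustrate the URL shapes.
for url in blob_urls("shopstore", "product-images", "logo.png"):
    print(url)
```

Because geo-replication is asynchronous, a blob read from the secondary may be slightly older than the copy in the primary region.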

After applying the solutions above, the new application architecture looks like this:

Traffic Manager

Traffic Manager is used to manage routing of application users between multiple Azure regions. Traffic Manager can be configured in 3 different modes:

Failover mode
Performance mode
Round robin mode

In our case, we will use Failover mode.

Traffic Manager needs a ping URL to detect whether the application is available. One ping URL can be configured per region.
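What the ping URL returns is up to the application. Below is a minimal sketch of such an endpoint using only the Python standard library. It answers with HTTP 200 when the application considers itself healthy and 503 otherwise. The `/ping` path, the port and the two check functions are hypothetical placeholders; a real application would put its own dependency checks there.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database():
    """Hypothetical placeholder: return True if the database answers."""
    return True


def check_blob_storage():
    """Hypothetical placeholder: return True if blob storage answers."""
    return True


CHECKS = {"database": check_database, "blob storage": check_blob_storage}


def health_status(checks):
    """Run every check; return (HTTP status code, short text body)."""
    failed = []
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception:
            healthy = False  # a crashing check counts as a failure
        if not healthy:
            failed.append(name)
    if failed:
        return 503, "unhealthy: " + ", ".join(failed)
    return 200, "ok"


class PingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/ping":
            self.send_error(404)
            return
        status, body = health_status(CHECKS)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port=8080):
    """Create (but do not start) an HTTP server exposing /ping."""
    return HTTPServer(("", port), PingHandler)
```

Calling `make_server().serve_forever()` starts the endpoint. Keeping the checks cheap matters, since the URL is probed continuously.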

If Traffic Manager sees ping failures from the primary region (for a certain time, ~120 seconds), it automatically diverts traffic to the next (secondary) region. When it sees that the primary region is up again, Traffic Manager diverts traffic back to the primary region.
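The failover behaviour described above can be illustrated with a small simulation. This is not Traffic Manager's implementation, only a sketch of the idea: the threshold of 4 consecutive failed probes is an assumed value, chosen so that probes 30 seconds apart add up to the ~120 seconds mentioned above, and the class and function names are invented.

```python
from dataclasses import dataclass

# Assumed values: 4 failed probes, 30 seconds apart, is roughly 120 seconds.
FAILURE_THRESHOLD = 4


@dataclass
class Endpoint:
    name: str
    priority: int           # lower number = preferred region
    failed_probes: int = 0  # consecutive failed pings


def record_probe(endpoint, succeeded):
    """Update the consecutive-failure counter after one ping."""
    endpoint.failed_probes = 0 if succeeded else endpoint.failed_probes + 1


def is_healthy(endpoint):
    return endpoint.failed_probes < FAILURE_THRESHOLD


def pick_endpoint(endpoints):
    """Failover mode: the healthy endpoint with the best priority wins."""
    healthy = [e for e in endpoints if is_healthy(e)]
    if not healthy:
        return None  # simplification: nothing healthy to route to
    return min(healthy, key=lambda e: e.priority)


primary = Endpoint("North Central US", priority=1)
secondary = Endpoint("South Central US", priority=2)
regions = [primary, secondary]

for _ in range(FAILURE_THRESHOLD):
    record_probe(primary, succeeded=False)
print("During outage:", pick_endpoint(regions).name)

record_probe(primary, succeeded=True)
print("After recovery:", pick_endpoint(regions).name)
```

The simulation shows both halves of the behaviour: traffic moves to the secondary region only after the failures persist, and moves back as soon as the primary answers pings again.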

You can find more information about Traffic Manager in the Azure documentation.

Validation time

Let’s revisit the failure scenarios:

Azure Web App is facing issues
- Traffic Manager will redirect traffic to the secondary region
- Once the primary region is back, Traffic Manager will route traffic back to the primary region
- So, we are covered
SQL Azure service is facing issues
- Stop the replication
- Change the database connection string to point to the database in the secondary region
- So, we are covered
Blob Storage is facing issues
- Change the connection string to use blob storage in the secondary region
- So, we are covered
Application issues
- This is the same scenario as the first one (Azure Web App is facing issues)
- Traffic Manager will address this issue, as long as the ping URL reflects the failure
Application deployments
- You can deploy the application to one region at a time and so avoid application downtime during deployments too

After designing for failure, you will notice that application uptime increases drastically! The application is now resilient to Azure failures as well as to application failures.
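As a rough illustration of how large the improvement can be, the sketch below compares one region with two regions. It is an optimistic estimate under stated assumptions, not a measurement: each service is taken to be 99.95% available, failures are assumed independent (including between regions), and the time needed to detect a failure and switch over is ignored. The helper names are made up for this example.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def in_series(*availabilities):
    """All components must be up: availabilities multiply."""
    return prod(availabilities)


def in_parallel(*availabilities):
    """At least one component must be up: unavailabilities multiply."""
    return 1 - prod(1 - a for a in availabilities)


# Assumption: web app, database and blob storage at 99.95% each.
single_region = in_series(0.9995, 0.9995, 0.9995)

# Assumption: two identical regions that fail independently.
two_regions = in_parallel(single_region, single_region)

for label, availability in [
    ("Single region", single_region),
    ("Two regions", two_regions),
]:
    downtime = (1 - availability) * MINUTES_PER_YEAR
    print(f"{label}: {availability:.5%} available, ~{downtime:.0f} min/year down")
```

In practice the gain is smaller than this estimate suggests, because failover detection (the ~120 seconds above) and the manual database and storage switches all take time.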

Moving applications to Azure –> Think about Designing for Failure!

Tags: Azure
Categories: Azure

Sudarshan's Blog
