A customer asked us to document our disaster recovery plan, and we figured it would be useful to document it for other customers as well, and to provide an example for other websites that might want to take a modern approach to these topics.
As with most modern apps, most of our disaster recovery is actually built into the AWS services we use and the code we write. If you are not familiar with AWS, please see https://aws.amazon.com/ for a review of the features used by over 1 million customers. In short, we believe AWS provides the best hosting environment, with the highest availability and security available. In addition, it allows us to expand and handle any load at a fraction of the cost of running our own data centers, or even having our servers hosted at someone else's data center. In other words, it allows us to focus on building a great app and leave data center operations to the experts.
Let’s review the highlights of our implementation below:
1. We have a highly available infrastructure on Amazon AWS. You can read more about it in our RunSignUp Scalability & Availability Whitepaper. We walk through each of the tiers below, followed by sections describing other parts of our design, our monitoring, and our plan. Our normal infrastructure can handle over 2,000 registrations per minute and can be expanded to handle 50,000 registrations per minute within 7 minutes.
1.A. Internet Facing. We use redundant NGINX servers to take all requests. We always run 2 m3.large servers in separate AWS Availability Zones (AZs). These are deployed behind Route53, which provides automatic failover as well as a variety of other high-availability features you can read about in the links. Availability Zones are separate data centers with separate Internet connections, power, and servers, so one AZ can go down without affecting the other zones. For example, in 2013 one of the Availability Zones went down and sites like Airbnb, Vine, and others were affected; we were not, because all of our traffic shifted automatically to the remaining AZ.
We have a system that allows us to expand the number and location of these servers, as shown below. Note that we could make these changes in the AWS Console, but to improve the consistency of how we make changes we have developed our own UI within our system.
1.B. Web Tier. Like the Internet tier, we run multiple servers in multiple AZs and can easily add to or change the configuration. Our standard configuration today is 4 c4.2xlarge servers split between 2 availability zones.
The NGINX front-end servers automatically load balance between these servers, and automatically fail over in the event that one or more of them goes down. Note that for security purposes the web servers are on our own private network and are not connected directly to the Internet.
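To make the failover behavior concrete, an NGINX upstream block along these lines behaves the way we describe – note that the addresses, ports, and thresholds below are illustrative placeholders, not our production values:

```nginx
# Hypothetical sketch of load balancing with failover across web servers
# in two Availability Zones. All values here are illustrative.
upstream web_tier {
    server 10.0.1.10:8080 max_fails=3 fail_timeout=30s;  # AZ one
    server 10.0.1.11:8080 max_fails=3 fail_timeout=30s;  # AZ one
    server 10.0.2.10:8080 max_fails=3 fail_timeout=30s;  # AZ two
    server 10.0.2.11:8080 max_fails=3 fail_timeout=30s;  # AZ two
}

server {
    listen 80;
    location / {
        proxy_pass http://web_tier;
        # retry the next upstream server on connection errors or 5xx responses
        proxy_next_upstream error timeout http_502 http_503;
    }
}
```

With `max_fails` and `fail_timeout`, NGINX temporarily stops sending traffic to a server that keeps failing, which is the automatic failover described above.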
1.D. Cache Tier. We use two levels of cache – one in the web servers themselves and a redundant set of data and session memcached servers. The cache lets us fetch data without having to go to the database server, and memcached is the open-source industry standard used by companies like Facebook to optimize performance.
Much like the Internet and Web tiers, we run redundant servers in different availability zones. We are currently running 8 memcached servers for data and 8 for session. Each set of 8 is split between two AZ’s for redundancy.
This provides a highly redundant environment, and is also a key reason why our website is so fast.
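The two-level lookup described above can be sketched as follows – this is a minimal illustration, with a stub standing in for a real memcached client (e.g. pymemcache), and all names here are illustrative rather than our actual code:

```python
# Sketch of a two-level cache: a per-web-server local cache in front of a
# shared memcached tier, falling back to the database only on a double miss.

class MemcachedStub:
    """Stands in for a real memcached client for illustration purposes."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)  # None on a miss, like real clients
    def set(self, key, value):
        self._data[key] = value

class TwoLevelCache:
    def __init__(self, shared):
        self.local = {}       # level 1: memory on this web server
        self.shared = shared  # level 2: the memcached tier

    def get(self, key, loader):
        if key in self.local:             # fastest path: local hit
            return self.local[key]
        value = self.shared.get(key)      # next: shared memcached
        if value is None:
            value = loader(key)           # last resort: the database
            self.shared.set(key, value)   # populate the shared tier
        self.local[key] = value           # populate the local tier
        return value

shared = MemcachedStub()
cache = TwoLevelCache(shared)
result = cache.get("race:42", lambda k: {"name": "5K Fun Run"})
```

Once a value is in either level, subsequent requests never touch the database, which is why most page loads are so fast.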
1.E. Database Tier. We use Amazon RDS in its highly available configuration. This means we use replication to enhance availability and reliability for production workloads. The Multi-AZ deployment option means we run our databases with high availability and built-in automated failover from the primary database to a synchronously replicated secondary database in case of a failure. In other words, there is always a real-time copy of our database in another availability zone.
We currently run our primary database on a db.m2.2xlarge. We also run a Read Replica and a Shard on separate db.m3.large instances. These are running in Multi-AZ mode, so there are redundant systems in another AZ.
We have been a beta test site for the new Aurora database service. The performance is outstanding, and the availability features are also impressive. Each 10GB chunk of our database volume is replicated six ways across three Availability Zones. Amazon Aurora storage is fault-tolerant, transparently handling the loss of up to two copies of data without affecting database write availability and up to three copies without affecting read availability. Amazon Aurora storage is also self-healing: data blocks and disks are continuously scanned for errors and replaced automatically.
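As a side note on how a Read Replica like the one mentioned above typically gets used, read/write splitting can be sketched as follows – the endpoint names are hypothetical placeholders, not our real hosts, and this is an illustration rather than our actual routing code:

```python
# Illustrative read/write splitting between a primary and a read replica.
# Endpoint names are hypothetical placeholders.
PRIMARY = "primary.db.example.internal"
REPLICA = "replica.db.example.internal"

def pick_endpoint(sql: str) -> str:
    """Route plain SELECTs to the replica; send everything else to the primary."""
    tokens = sql.strip().split(None, 1)
    verb = tokens[0].upper() if tokens else ""
    return REPLICA if verb == "SELECT" else PRIMARY
```

Offloading read-only queries this way keeps load off the primary, which is reserved for writes and transactional work.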
1.F. Payment Services. We use 3 different payment services on the back end – Braintree Gateway, Braintree Marketplace, and Vantiv. Each of these processes billions of dollars per day and has a highly available infrastructure. This gives us some disaster recovery potential in the case that one service is down for a significant period of time: customers would have to create alternate payment accounts, but we would be able to continue processing on a per-customer basis once the customer authorized the new payment gateway appropriately.
1.G. Other Services. We use SendGrid for our email marketing service. They are the largest back-end email service, with over 180,000 customers and a highly available infrastructure. For TXT messaging, we use CRMText and Twilio. These are configured to hold messages in a queue in the case of an outage.
2. Backup. There are various levels of backup we have designed and implemented.
2.A. Race Participant Emergency Data. Starting 1 week before the race date for each race, we take a CSV snapshot of the participant data, store it on Amazon S3, and send an email to the race director. This ensures that if something catastrophic happened to our system, your race would still have the basic data available. S3 is simply the most highly durable and available service for storing data.
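A snapshot step like the one above can be sketched as follows – the field names and bucket are illustrative placeholders, and the S3 upload is shown as a comment since it needs AWS credentials:

```python
# Sketch of serializing participant records to a CSV snapshot for archiving.
# Field names are illustrative, not our actual schema.
import csv
import io

def participants_to_csv(participants):
    """Serialize a list of participant dicts to a CSV string."""
    fields = ["bib", "first_name", "last_name", "emergency_contact"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    for p in participants:
        writer.writerow(p)
    return buf.getvalue()

snapshot = participants_to_csv([
    {"bib": 101, "first_name": "Jane", "last_name": "Doe",
     "emergency_contact": "555-0100"},
])
# The result would then be stored durably, e.g. with boto3:
# s3.put_object(Bucket="example-race-snapshots", Key="race-42.csv", Body=snapshot)
```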
2.B. Database Backups. In addition to the RDS high availability, we take snapshots of the database every 8 hours and store them on S3. This would allow us to recreate the entire database quickly on another MySQL database server if RDS ever became unavailable. We also use these snapshots occasionally to create databases on our test servers (we run multiple test servers on AWS so that each developer has their own test environment, which keeps developer productivity high).
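An 8-hour snapshot schedule like this can be expressed as a single cron entry driving the AWS CLI – the instance identifiers below are hypothetical placeholders, not our real names:

```
# Hypothetical crontab entry: snapshot the RDS instance every 8 hours.
# (In crontab syntax, % must be escaped as \%.)
0 */8 * * *  aws rds create-db-snapshot --db-instance-identifier example-primary --db-snapshot-identifier "example-$(date +\%Y\%m\%d\%H)"
```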
When we move to Aurora, it will provide additional capabilities. Amazon Aurora's backup capability enables point-in-time recovery for our instances, which allows us to restore our database to any second during the retention period, up to the last five minutes. Our automatic backup retention period will be thirty-five days. Automated backups are stored in Amazon S3, which is designed for 99.999999999% durability. Amazon Aurora backups are automatic, incremental, and continuous, and have no impact on database performance.
2.C. Code and Configuration Backups. We use GitHub as our shared code repository, and all of our developers have the complete code base on their laptops. We have scripts that can recreate the entire environment within hours – this is essentially what we do when we create a new test environment or when new developers join the team. So even if our entire environment were wiped clean (say, a nuclear blast in Virginia or a major solar flare EMP), we would be able to recreate it in another AWS region in another part of the globe, as long as one of our East Coast developers was able to get to the Internet. If all AWS data centers around the globe were down, then our backup plans would be in severe trouble, since we rely on a number of their services.
3. Monitoring and Corrective Action. Of course, in spite of the high availability design of RunSignUp, things go wrong at times (see the example that caused 5 minutes of downtime in 2014). This section covers how we know something is wrong and how we fix it.
3.A. AWS Console and Alerts. Amazon offers us a variety of alerts and integrated monitoring tools that send us emails or text messages.
3.B. New Relic. This has been a very important tool for us to get alerts, monitor the site and gain insight on how we can improve performance. We wrote about it a while ago.
3.C. Nagios. We use this monitoring tool to drive some of the automated repair in our environment. For example, it gives us the fastest insight into our NGINX servers, and we have automated scripts that will replace servers that are having issues. We recently did the same for the memcached servers, which is useful since we have 16 of them: if one fails or is not performing well, we automatically take it down and replace it.
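The detect-and-replace logic behind this can be sketched as a simple decision function – this is a minimal illustration with an assumed failure threshold, not our actual Nagios handler:

```python
# Sketch of the "replace a misbehaving server" decision made after each
# health-check cycle. The threshold is an illustrative assumption.

FAILURE_THRESHOLD = 3  # consecutive failed checks before we replace a server

def next_action(consecutive_failures: int, healthy: bool):
    """Return (new failure count, action) after one health-check cycle."""
    if healthy:
        return 0, "ok"               # recovered: reset the counter
    failures = consecutive_failures + 1
    if failures >= FAILURE_THRESHOLD:
        return 0, "replace"          # spin up a fresh server, retire this one
    return failures, "wait"          # transient blip: keep watching

# Three consecutive failures trigger a replacement:
state, action = next_action(0, healthy=False)
state, action = next_action(state, healthy=False)
state, action = next_action(state, healthy=False)
```

Requiring several consecutive failures avoids replacing a server over a single transient timeout while still reacting within a few check intervals.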
Ah, the power of the Cloud!