DDoS – What a Distributed Denial of Service Attack Looks Like

We had a DDoS attack on RunSignUp today. It lasted from about 4:00 PM until 4:45 PM, when we succeeded in cutting it off. It averaged over 1,000 requests per second.


The attack was probing for vulnerabilities such as SQL injection. It slowed the average response time on our website by about half a second, from about 2.7 seconds to about 3.2 seconds.

We will continue to watch closely for additional attacks and do everything we can to mitigate any delays or issues. We are fortunate that our Amazon AWS infrastructure is so scalable and high performing, and that it alerts us when these types of issues occur. We also know that Amazon is working with us on addressing these types of bad actors on the Internet.

Sunday Afternoon Errors

UPDATE: We have made two changes to our system to try to improve this situation if it happens again. First, we have adjusted the settings on our New Relic Monitoring tool to try to catch this type of error so we can respond more quickly than hearing about errors from users. Second, we have added a new queue that will save data before sending the transaction for payment. This will allow us to recover transactions more easily in the event something like this happens again.
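The second change can be illustrated with a minimal sketch. This is not our production code: the table name, the status values, and the `charge_card` and `complete_registration` callables are all hypothetical, and SQLite stands in for whatever queue is actually used. The idea is only that the registration data is saved durably before the card is charged, so a charged-but-incomplete transaction can be found and recovered later.

```python
import json
import sqlite3
import uuid


def open_queue(path=":memory:"):
    """Open (or create) the hypothetical queue table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pending_registrations ("
        "id TEXT PRIMARY KEY, payload TEXT NOT NULL, status TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def _set_status(conn, entry_id, status):
    conn.execute(
        "UPDATE pending_registrations SET status = ? WHERE id = ?",
        (status, entry_id),
    )
    conn.commit()


def checkout(conn, registration, charge_card, complete_registration):
    """Save the data first, then charge, then complete the registration."""
    entry_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO pending_registrations VALUES (?, ?, 'saved')",
        (entry_id, json.dumps(registration)),
    )
    conn.commit()

    charge_card(registration)            # hypothetical payment call
    _set_status(conn, entry_id, "charged")

    complete_registration(registration)  # hypothetical; may fail mid-way
    _set_status(conn, entry_id, "completed")
    return entry_id


def recoverable(conn):
    """Entries that were charged but never completed."""
    rows = conn.execute(
        "SELECT id, payload FROM pending_registrations WHERE status = 'charged'"
    ).fetchall()
    return [(entry_id, json.loads(payload)) for entry_id, payload in rows]
```

If the server fails between the charge and the completion step, the entry stays in the `charged` state, and `recoverable()` returns the saved data so the registration can be finished without asking the participant to re-enter anything.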

One of our servers had a hardware failure on Sunday afternoon. Unfortunately, this created a situation where the credit card transactions were sent and completed, but the registration was not completed in the system. There were a total of 363 transactions with this error out of 6,506 total transactions that completed yesterday.

We are processing refunds for all of those transactions this morning, although people may not see them for a day or two depending on their credit card company.

People using RunSignUp saw one of two problems: an error when trying to log into their account, or, after entering their credit card number and submitting the final page, a confirmation page that never appeared. If a participant got a confirmation page or email, or you see them in your Race Director Participant Report, then their transaction completed successfully.

We apologize for this error. Our monitoring system, New Relic, was supposed to pick up errors like this, but did not. We are doing further investigation, and will be adding some extra monitoring and logging to try to catch errors like this much earlier and automate backup processes. We will report on the final changes we are making to the system when they are implemented.

Infrastructure Improvement Summary

We try to improve our infrastructure in Q1 of each year to make sure we do not build up technical debt and stay on top of the most current trends in technology. We have four goals when we look at this each year:

  • Improve Availability – reduce the chance our systems will go down, and in the event of real issues, recover in as automated and quick a way as possible.
  • Improve Speed – it is well known that a tenth of a second is meaningful on e-commerce sites. We want to make your race registration as fast as possible to give your participants a great experience and get them through checkout as quickly as possible.
  • Improve Security – make sure your data and participant data are secure.
  • Improve Scalability – make sure if there is a crush of people signing up for races, that we handle the load without performance loss.

We have made a wide variety of changes over the 2 months we have been working on this. Probably the most visible and significant is the 0.5 second improvement in page speed:

[Screenshot: page load time, January 11-17, 2016]

[Screenshot: page load time, March 14-20, 2016]

Remember, this is a blended rate across desktop and mobile (about 52% mobile phone access). While a half second difference does not seem like a lot, studies have shown it can make a 7% difference in conversion (Reference 1, Reference 2). And perhaps more importantly, page speed can affect your Google ranking, so RunSignUp speed is one of the reasons why race websites on RunSignUp rank so highly.

Here is a list of the significant improvements we have made:

  • Aurora Database Upgrade – This is probably the most important upgrade we made since it improved speed, availability, and scalability. The chart on the right shows the decrease in time for some of the database calls that we make – dropping from an average of 40 milliseconds (0.040 seconds) to less than 10 milliseconds.
  • Hardware Upgrades – we upgraded a number of our hardware components to faster, more modern equipment. This was largely responsible for the reduction in the web server response time shown on the right.
  • Optimized Page Load Time – we made a number of changes, like optimizing jQuery library downloads, reducing CSS size, loading Facebook asynchronously, replacing the AddThis share components with better share options, and more. This led to drops in page load time, which is especially important for mobile users on slower connections.
  • Optimized Database Backups – the move to Aurora gives us database backup on a per minute basis. In addition, we added the capability to store permanent snapshots of the database.
  • Improved Availability for Batch Jobs – we created a better mechanism for running our routine batch jobs in the event the main batch job server is down.
  • Failover for Read Replica Database – this provides higher reliability in the event a read replica database becomes unavailable.
  • Software Upgrades and Logging Improvements – we updated to current versions of our core software components and added better long term storage for logs.
  • Upgraded Load Balancer Error Handling – we now detect issues in the load balancers better, and have a smoother failover capability, so users should see no disruption in service if there are issues at this tier.
  • Improved Participant Report Performance – we made changes to the database and queries to optimize one of the most commonly used reports – the Participant Report. On fast connections you can now see sub-second response on reports even when you have over 20,000 participants.
  • Security Improvements – we will not talk publicly about these.

We see our investment in modern infrastructure as a core value we can provide to races (along with processing money efficiently and providing features to improve your race). Most of these improvements are beyond the capabilities of most races (and even most race registration companies), but they are critical for providing the best platform for you and your participants.

Updated Database Backups

As part of our big infrastructure upgrade, we have improved our processes for database backups that are in addition to the AWS Aurora automated backups. The Aurora backups are good from a reliability point of view, with features like point-in-time recovery. However, the backup is only good for 35 days and the time to recover is fairly long.

We have implemented a weekly full database backup and then incremental updates every half hour, which are all stored permanently. This will allow us to go back to any point in time if that is ever necessary. It will also allow us to deploy a new instance much more quickly if there is ever a disaster that we need to recover from.
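As a rough illustration of how a restore from this kind of backup chain works, the sketch below picks the files needed to reach a target time: the most recent full backup at or before the target, followed by every incremental taken after that full backup and up to the target. The function name and the use of bare timestamps to represent backups are assumptions made for the example, not a description of our actual tooling.

```python
from datetime import datetime


def backups_needed(fulls, incrementals, target):
    """Return (base_full, ordered_incrementals) needed to restore to `target`.

    `fulls` and `incrementals` are iterables of datetimes marking when each
    backup was taken (a hypothetical representation for this sketch).
    """
    candidates = [f for f in fulls if f <= target]
    if not candidates:
        raise ValueError("no full backup at or before the target time")
    base = max(candidates)
    # Apply incrementals in time order, starting after the base full backup.
    chain = sorted(i for i in incrementals if base < i <= target)
    return base, chain
```

With half-hour incrementals, a restore built this way lands on the last incremental at or before the requested time, so in this sketch the restored state can be up to 30 minutes older than the target.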

Improved Background Jobs Availability

There are a number of processes that run on a scheduled basis in a big system like RunSignUp – these are called cron jobs. They are typically run on one server in most environments, and that is how we did things until today.

As part of our infrastructure upgrade, we have added the ability for these cron jobs to run on any server in case the primary server is not available. This is another small improvement in our infrastructure that you will never notice (but are lucky to not have to worry about)!
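One common way to let any server pick up a scheduled job is a shared lease: whichever server holds the lease runs the job, and another server can take over once the lease expires. The sketch below shows that general technique under stated assumptions. It is not necessarily how our system does it, and `LeaseStore` is a hypothetical in-memory stand-in for a shared store such as a database row.

```python
import time


class LeaseStore:
    """In-memory stand-in for a shared lease store (hypothetical)."""

    def __init__(self):
        self._leases = {}  # job name -> (holder, expires_at)

    def try_acquire(self, job, holder, ttl, now=None):
        """Take or renew the lease if it is free, expired, or already ours."""
        now = time.time() if now is None else now
        current = self._leases.get(job)
        if current is None or current[1] <= now or current[0] == holder:
            self._leases[job] = (holder, now + ttl)
            return True
        return False


def run_if_leader(store, job, holder, ttl, task, now=None):
    """Run `task` only if this server currently holds the lease for `job`."""
    if store.try_acquire(job, holder, ttl, now):
        task()
        return True
    return False
```

While the primary server keeps renewing its lease, the other servers skip the job. If the primary goes down, its lease expires and the next server that tries to acquire it runs the job instead.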

Failover for Read Replica Database

Yes, we know these infrastructure updates are kind of boring. But we want to document them.

This one covers a loss of connection to the read replica database(s). The system now automatically fails over to the master database. We had two instances of this happening over the past year and wanted to handle it in a smoother, automated way.

We also added a Read Replica Lag Detector. If we detect that the read replica is more than 5 minutes behind in updates from the main database, then we fail over to the main database and await recovery.
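The routing decision described above can be sketched in a few lines. The function and constant names are hypothetical, and treating an unknown lag value as a reason to fail over is an assumption of this example rather than something stated about our implementation.

```python
MAX_LAG_SECONDS = 5 * 60  # the 5 minute threshold described above


def choose_read_target(replica_reachable, replica_lag_seconds):
    """Return "replica" or "master" for the next read query."""
    if not replica_reachable:
        # Lost connection to the read replica: fail over to the master.
        return "master"
    if replica_lag_seconds is None:
        # Assumption: if lag cannot be measured, play it safe.
        return "master"
    if replica_lag_seconds > MAX_LAG_SECONDS:
        # Replica is too far behind: read from the master until it recovers.
        return "master"
    return "replica"
```

For example, `choose_read_target(True, 12)` keeps reads on the replica, while `choose_read_target(True, 400)` and `choose_read_target(False, 0)` both send reads to the master.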

More Performance Improvements

We continue to do our annual big infrastructure update. Improvements to date have decreased average page load time across all devices (over 50% are mobile phones) from about 3.2 seconds to 2.85 seconds. While that does not sound like a lot, it added up to 980,000 seconds (11 days and 8 hours) less wait time last week across the 2.8 million page views on RunSignUp.
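The wait time figure comes from multiplying the per-page saving by the weekly page views. A short check of that arithmetic, using only the numbers quoted above (the variable names are just for illustration):

```python
page_views = 2_800_000   # page views last week
before_seconds = 3.2     # average page load time before
after_seconds = 2.85     # average page load time after

# Round to whole seconds to avoid floating point noise in the product.
saved_seconds = round((before_seconds - after_seconds) * page_views)

days, remainder = divmod(saved_seconds, 86_400)  # 86,400 seconds per day
hours = remainder // 3_600                       # whole hours left over

print(f"{saved_seconds:,} seconds = {days} days and {hours} hours")
# prints "980,000 seconds = 11 days and 8 hours"
```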

Here are the latest updates:

  • Switched from Google’s Content Delivery Network to CloudFlare’s CDN for jQuery libraries. This reduces the time it takes to load these libraries into browsers.
  • Combined <link> tags for Google Fonts to improve performance.
  • Reordered our HEAD script block to optimize performance.
  • Changed the AddThis share components to load after the page has loaded. We will likely replace these in the coming days to further optimize performance.
  • Updated Facebook to load asynchronously.
  • Cleaned up old CSS from our CSS files to decrease load size on each page by 20-25%.
  • Moved some database queries from the main database to a shard.

These, combined with other improvements have dropped our average page load time