One of the early disaster recovery mechanisms we built into RunSignUp is saving a full list of your participants in a CSV file once a day starting the week before your race. We send race directors an email link when this is first created each year. In the event RunSignUp is not available (knock on wood, we had zero downtime last year and only 8 minutes of planned downtime total in 2015), you can click this link to get access to your participant information.
We have added a link to this in the RunSignUp dashboard, which is useful in case you misplace your email or want to practice your own disaster recovery procedures. The report can be found at the bottom of Reports > Participants > Summary:
If you check more than 7 days prior to your race, there will be no file:
If you check 7 or fewer days before your race, you will see a link:
Amazon, the leading cloud provider, had some major issues today. The screenshot on the right shows their status page around 1PM today – and RED is not good. Even USA Today and CNN reported the 4-5 hour outage this afternoon:
“People reported outages and delays on services like Slack, Trello, Sprinklr, Venmo and even Down Detector”
This is the last day of the month – a day that is usually pretty big in the race industry because a number of races have price increases. About 25,000 people signed up for races today and RunSignUp processed over $1 Million in transactions. If we had been down like many other sites, it would have been a very bad day for us and our customers.
The reality is that no technology is perfect – even a huge company like Amazon goes down. And while we may have had a perfect track record last year, I am quite sure we will have problems at some point in the future.
BUT, we do spend a lot of time and money and talent trying to figure out how to make our platform reliable and redundant and available. Some of that expertise and investment paid off today for us and our customers. It won’t always be perfect, but our investments in technology are meaningful. And we take a great deal of pride in them and our amazing development team.
We had a DDoS attack on RunSignUp today. It lasted from about 4:00PM until 4:45PM, when we succeeded in cutting it off. It averaged over 1,000 requests per second.
The attack was looking for vulnerabilities like SQL injection. This slowed down the average response time on our website by about half a second, from about 2.7 seconds to about 3.2 seconds:
We will continue to watch closely for additional attacks and do everything we can to mitigate any delays or issues. We are fortunate that our Amazon AWS infrastructure is so scalable and high performing and alerts us when these types of issues occur. We also know that Amazon is working with us on addressing these types of bad actors on the Internet.
UPDATE: We have made two changes to our system to try to improve this situation if it happens again. First, we have adjusted the settings on our New Relic Monitoring tool to try to catch this type of error so we can respond more quickly than hearing about errors from users. Second, we have added a new queue that will save data before sending the transaction for payment. This will allow us to recover transactions more easily in the event something like this happens again.
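The idea behind the second change can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our production code: the table name `pending_registrations`, the status values, and the `charge_card` and `complete_registration` callbacks are all hypothetical, and SQLite stands in for whatever durable queue is actually used.

```python
import json
import sqlite3
import uuid


def open_queue(path=":memory:"):
    """Open a durable queue (hypothetical schema, SQLite as a stand-in)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pending_registrations ("
        "id TEXT PRIMARY KEY, payload TEXT NOT NULL, status TEXT NOT NULL)"
    )
    db.commit()
    return db


def set_status(db, entry_id, status):
    db.execute(
        "UPDATE pending_registrations SET status = ? WHERE id = ?",
        (status, entry_id),
    )
    db.commit()


def checkout(db, registration, charge_card, complete_registration):
    """Save the registration first, then charge, then complete it."""
    entry_id = str(uuid.uuid4())
    # Step 1: persist the data before any money moves.
    db.execute(
        "INSERT INTO pending_registrations VALUES (?, ?, 'saved')",
        (entry_id, json.dumps(registration)),
    )
    db.commit()
    # Step 2: send the transaction for payment (hypothetical callback).
    charge_card(registration)
    set_status(db, entry_id, "charged")
    # Step 3: finish the registration (hypothetical callback).
    complete_registration(registration)
    set_status(db, entry_id, "done")
    return entry_id


def recoverable(db):
    """Registrations that were charged but never completed."""
    rows = db.execute(
        "SELECT payload FROM pending_registrations WHERE status = 'charged'"
    ).fetchall()
    return [json.loads(payload) for (payload,) in rows]
```

If a server fails between the payment and the registration step, the saved row still holds everything needed to finish the registration later, so the charge does not have to be refunded.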
One of our servers had a hardware failure on Sunday afternoon. Unfortunately, this created a situation where the credit card transactions were sent and completed, but the registration was not completed in the system. There were a total of 363 transactions with this error out of 6,506 total transactions that completed yesterday.
We are processing refunds for all of those transactions this morning, although people may not see them for a day or two depending on their credit card company.
People using RunSignUp saw one of two errors: either an error when trying to log into their account, or the confirmation page never showing after they submitted the final page with their credit card number. If a participant got a confirmation page or email, or you see them in your Race Director Participant Report, then their transaction completed successfully.
We apologize for this error. Our monitoring system, New Relic, was supposed to pick up errors like this, but did not. We are doing further investigation, and will be adding some extra monitoring and logging to try to catch errors like this much earlier and automate backup processes. We will report on the final changes we are making to the system when they are implemented.
We try to improve our infrastructure in Q1 of each year to make sure we do not build up technical debt and stay on top of the most current trends in technology. We have four goals when we look at this each year:
- Improve Availability – reduce the chance our systems will go down, and in the event of real issues, recover in as automated and quick a way as possible.
- Improve Speed – it is well known that a tenth of a second is meaningful on e-commerce sites. We want to make your race registration as fast as possible to give your participants a great experience and get them through checkout as quickly as possible.
- Improve Security – make sure your data and participant data are secure.
- Improve Scalability – make sure if there is a crush of people signing up for races, that we handle the load without performance loss.
We have made a wide variety of changes in the 2 months we have been working on this. Probably the most visible and significant is the 0.5 second improvement in page speed:
January 11-17, 2016:
Remember, this is a blended rate across Desktop and Mobile (about 52% Mobile Phone access). While a half second difference does not seem like a lot, studies have shown it can make a 7% difference in conversion. Reference 1, Reference 2. And perhaps more importantly, page speed can affect your Google Ranking – so RunSignUp speed is one of the reasons why race websites on RunSignUp rank so highly.
Here is a list of the significant improvements we have made:
- Aurora Database Upgrade – This is probably the most important upgrade we made since it improved speed, availability, and scalability. The chart on the right shows the decrease in time for some of the database calls that we make – dropping from an average of 40 milliseconds (0.040 seconds) to less than 10 milliseconds.
- Hardware Upgrades – we upgraded a number of our hardware components to faster, more modern equipment. This was largely responsible for the reduction in the web server response time shown on the right.
- Optimized Page Load Time – we made a number of changes, like optimizing jQuery library downloads, reducing CSS size, loading Facebook asynchronously, replacing the AddThis share widget with better share options, and more. This led to drops in page load time, which is especially important for mobile users on slower connections.
- Optimized Database Backups – the move to Aurora gives us database backup on a per minute basis. In addition, we added the capability to store permanent snapshots of the database.
- Improved Availability for Batch Jobs – we created a better mechanism for running our routine batch jobs in the event the main batch job server is down.
- Failover for Read Replica Database – Providing higher reliability in the event a read replica database becomes unavailable.
- Software Upgrades and Logging Improvements – we updated to current versions of our core software components and added better long term storage for logs.
- Upgraded Load Balancer Error Handling – we now detect issues in the load balancers better, and have a smoother failover capability, so users should see no disruption in service if there are issues at this tier.
- Improved Participant Report Performance – we made changes to the database and queries to optimize one of the most commonly used reports – the Participant Report. On fast connections you can now see sub-second response on reports even when you have over 20,000 participants.
- Security Improvements – we will not talk publicly about these.
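To make the read replica failover item in the list above a little more concrete, here is a minimal Python sketch of the general pattern: try each database endpoint in order and move on when one is unreachable. It is an illustration only. The function name, the endpoint names, and the `run_query` callback are hypothetical, and this post does not describe how our actual failover is implemented.

```python
def query_with_failover(endpoints, run_query):
    """Run a read query against the first endpoint that responds.

    `endpoints` is an ordered list (for example, replicas first and the
    primary database last). `run_query` is a hypothetical callback that
    raises ConnectionError when an endpoint is unavailable.
    """
    last_error = None
    for endpoint in endpoints:
        try:
            return run_query(endpoint)
        except ConnectionError as exc:
            # This endpoint is down; remember why and try the next one.
            last_error = exc
    raise ConnectionError("all database endpoints failed") from last_error
```

Listing the primary database last means reads keep working even if every replica is unavailable, at the cost of extra load on the primary.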
We see our investments in modern infrastructure as a core value (along with processing money efficiently and providing features to improve your race) that we can provide to races. Most of these improvements are beyond the capabilities of most races (and even most race registration companies), but they are critical for providing the best platform for you and your participants.
As part of our big infrastructure upgrade, we have improved our processes for database backups that are in addition to the AWS Aurora automated backups. The Aurora backups are good from a reliability point of view, with features like point-in-time recovery. However, the backup is only good for 35 days and the time to recover is fairly long.
We have implemented a weekly full database backup and then incremental updates every half hour, which are all stored permanently. This will allow us to go back to any point in time if that is ever necessary. It will also allow us to deploy a new instance much more quickly if there is ever a disaster that we need to recover from.
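As a rough illustration of how a restore from this kind of backup set works, the Python sketch below picks the newest full backup at or before a target time and then the incremental updates that have to be applied on top of it. The function name and the use of plain timestamps to represent backups are assumptions made for the example. In this simplified form, a restore lands on the last half-hour increment at or before the target.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def restore_plan(full_backups, incrementals, target):
    """Return (base_full_backup, incrementals_to_apply) for a target time.

    `full_backups` and `incrementals` are lists of datetimes marking when
    each backup was taken (a simplification for this sketch).
    """
    fulls = sorted(full_backups)
    # Index of the first full backup taken strictly after the target.
    i = bisect_right(fulls, target)
    if i == 0:
        raise ValueError("no full backup at or before the target time")
    base = fulls[i - 1]
    # Apply, in order, every incremental taken after the base and up to the target.
    steps = [t for t in sorted(incrementals) if base < t <= target]
    return base, steps
```

For example, with weekly full backups and half-hourly incrementals, restoring to 1:10AM on the day of a full backup needs that full backup plus only the 12:30AM and 1:00AM incrementals.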
There are a number of processes that run on a scheduled basis in a big system like RunSignUp – these are called cron jobs. They are typically run on one server in most environments, and that is how we did things until today.
As part of our infrastructure upgrade, we have added the ability for these cron jobs to run on any server in case the primary server is not available. Another small improvement in our infrastructure that you will never notice (but are lucky to not have to worry about)!
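One common way to let scheduled jobs run on any server is a shared lease: whichever server holds the lease runs the job, and another server can take over once the lease expires. The Python sketch below illustrates that general technique only. The post does not say which mechanism we use, and the `LeaseStore` class, the server names, and the 60-second lease length are all hypothetical. The in-memory object stands in for a shared store, such as a database row, that every server can see.

```python
import time


class LeaseStore:
    """In-memory stand-in for a shared store visible to all servers."""

    def __init__(self):
        self.owner = None
        self.expires_at = 0.0

    def try_acquire(self, server, ttl, now):
        """Take or renew the lease if it is free, already ours, or expired."""
        if self.owner in (None, server) or now >= self.expires_at:
            self.owner = server
            self.expires_at = now + ttl
            return True
        return False


def run_if_leader(store, server, job, ttl=60.0, now=None):
    """Run `job` only on the server that currently holds the lease."""
    now = time.time() if now is None else now
    if store.try_acquire(server, ttl, now):
        job()
        return True
    return False
```

A real shared store would need to make the acquire step atomic (for example, with a conditional update) so two servers cannot both win the lease at the same moment.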