Our Scalability project has succeeded! We beat our goal of processing 50,000 registrations in 10 minutes, completing the task in only 7 minutes. This makes RunSignUp the most scalable registration system in the world, able to meet the needs of the largest races. Combined with our robust Participant Management features for bib exchange, event transfers, automated refunds and wait lists, RunSignUp is now the best registration and results system available, surpassing older systems like Active.com.
Our project was inspired by the Broad Street race in Philadelphia this past spring, where we struggled to register ourselves for the race (on a different registration system). Users faced timeouts, unresponsive pages and uncertainty as the system took over 5 hours to process 30,000 registrations.
Over the course of this year, we have methodically improved performance and implemented mechanisms that give us very robust capabilities. This has yielded benefits for every race: we are now twice as fast as Active.com, and the stability and availability of our service have been improved to industry-leading levels.
Here are the gory details:
We tested a user going through a multi-step process that included most of our features, to simulate a “worst case” scenario:
- Race Information Page
- SignUp Page
- User Information & Event Selection Page
- Team joining Page
- Giveaway selection Page
- Donation Page
- Store Page
- Payment Information Page
- Confirmation Page
To accomplish this test, we had to build a test suite that emulated a browser. The test suite includes a scripting component that lets us mimic what a user would type in and see what the resulting page is. It allows us to set the number of browsers to be emulated, the number of servers used to run the browsers, and a variable wait time to emulate the time it takes a user to type in information on a page (we set this to 15-60 seconds, chosen randomly for each user). The test also collected all of the information on each of the 450,000 page submits and replies, checking for errors and calculating response time. The test servers all fed their information back to a central server, which collated it into a summary form that includes the graphs on this blog page.
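The core loop of such a test client might look roughly like the Python sketch below. It is not our actual test suite: `submit_page` is a hypothetical callable standing in for the browser emulator, the record format is made up for illustration, and placing the random pause between pages (rather than before the first one) is an assumption.

```python
import random
import time
from statistics import mean

# The nine pages of the registration flow listed above.
PAGES = [
    "Race Information", "SignUp", "User Information & Event Selection",
    "Team joining", "Giveaway selection", "Donation", "Store",
    "Payment Information", "Confirmation",
]

def simulate_runner(submit_page, rng, sleep=time.sleep, min_wait=15, max_wait=60):
    """Walk one emulated runner through every page and record timings.

    submit_page(page) is a hypothetical callable that performs the request
    and returns True on success; it stands in for the real browser emulator.
    """
    results = []
    for index, page in enumerate(PAGES):
        if index > 0:
            # Emulated typing time before moving on to the next page.
            sleep(rng.uniform(min_wait, max_wait))
        start = time.perf_counter()
        ok = submit_page(page)
        elapsed = time.perf_counter() - start
        results.append({"page": page, "ok": ok, "seconds": elapsed})
    return results

def summarize(all_results):
    """Collate per-page records into a page count, error count and average time."""
    flat = [record for runner in all_results for record in runner]
    return {
        "pages": len(flat),
        "errors": sum(1 for record in flat if not record["ok"]),
        "avg_seconds": mean(record["seconds"] for record in flat),
    }
```

Passing `sleep` in as a parameter makes it possible to dry-run the script without actually waiting 15-60 seconds per page.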
We run on the Amazon Cloud, which makes it easy and inexpensive to add and deploy servers. We tried a number of configurations, but most of our tests were run with the following number of servers and sizes:
- 20 m3.2xlarge instances, each emulating 2,500 runners.
- 4 h1.4xlarge Load Balancers running NGINX.
- 250 m1.large Apache Webservers.
- 1 m1.large RDS Server running MySQL with multi-zone auto-failover
- 2 m1.large RDS Read-Replicas running MySQL
- 4 c1.medium memcached servers for session cache
- 4 c1.medium memcached servers for data cache
This was distributed across two Amazon Availability Zones to help maximize availability in the event of an Amazon data center going down. We were also able to achieve similar results with 42 Webservers on the m3.2xlarge servers sized at 26 EC2 units with 30 GB of memory each.
Other Amazon services were also critical to this effort:
- Amazon SQS: This queueing system was the key to eliminating the database bottleneck, allowing us to share connections to the database and have an orderly flow through the system.
New Relic’s management capabilities provide us with monitoring and deep insight into the system, allowing us to identify bottlenecks in our configurations and in our code.
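To illustrate the queueing idea behind the SQS item above, here is a minimal Python sketch. It is not our production code: the standard-library `queue.Queue` is an in-process stand-in for Amazon SQS, and `write_to_db` and `num_workers` are hypothetical names chosen for the example. The point is only that a small, fixed pool of workers drains the queue, so the database sees a bounded number of connections no matter how many registrations arrive at once.

```python
import queue
import threading

def run_registration_queue(registrations, write_to_db, num_workers=4):
    """Funnel many registrations through a small, fixed pool of writers.

    queue.Queue is an in-process stand-in for Amazon SQS, and write_to_db
    is a hypothetical callable representing the work done over one shared
    database connection.
    """
    pending = queue.Queue()
    for registration in registrations:
        pending.put(registration)

    processed = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                registration = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker exits
            result = write_to_db(registration)
            with lock:
                processed.append(result)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed
```

With `num_workers=1` the registrations are written strictly in arrival order; with more workers, throughput goes up while the number of simultaneous database writers stays capped at `num_workers`.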
The results are impressive and could have been much higher if we had eliminated the human wait time we built into the test to simulate a runner typing in information like their name and which giveaway to choose. It took just 7 minutes to process 50,000 runner registrations. We processed over 64,000 web pages per minute. Well over 400,000 runners could be registered in an hour.
The average wait time for the confirmation page was 10.5 seconds. For other pages, the average wait time was 1.1 seconds, ranging from a high of 1.6 seconds for the first Race Info page to a low of 0.84 seconds for the donation page. The maximum wait time for a page to respond was 25 seconds for the confirmation page and 26 seconds for the Race Info page. The average runner took 5 minutes and 2 seconds to register, completing all 9 pages from getting Race Info to getting a Confirmation.
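For anyone who wants to check the throughput figures, the arithmetic can be sketched in a few lines of Python. The inputs (50,000 runners, 9 pages each, 7 minutes) come from the test described above; the function names are just for illustration, and the hourly figure assumes the same rate could be sustained for a full hour.

```python
RUNNERS = 50_000
PAGES_PER_RUNNER = 9
TEST_MINUTES = 7

def pages_per_minute(runners, pages_each, minutes):
    """Total page submits divided by the duration of the test."""
    return runners * pages_each / minutes

def runners_per_hour(runners, minutes):
    """Extrapolate the measured registration rate to one hour."""
    return runners / minutes * 60

print(round(pages_per_minute(RUNNERS, PAGES_PER_RUNNER, TEST_MINUTES)))  # 64286
print(round(runners_per_hour(RUNNERS, TEST_MINUTES)))                    # 428571
```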
We will be publishing data on results searching (for when runners, friends and family look up times after a race completes) in a separate blog post. We will also be publishing a complete white paper including some of our source code as well as our configuration and tuning parameters in detail in the next few weeks.
Large Race Offering
If you have a large race (over 5,000 runners, or one that sells out very quickly), we now offer this scalability as a service. We can do this in a test mode (testing RunSignUp as well as competing services such as Active.com), for the opening of registration and for posting results. We charge a set-up fee of $5,000 plus $25 per EC2 Unit per day. For example, if you have 15,000 people registering immediately upon race opening, we would recommend 75 m1.large servers (4 EC2 Units each, 200 runners each), for a one-day cost of $5,000 + $100*75 = $12,500. If you want us to pre-test the configuration, or to benchmark it against another registration service, we would charge the same amount.
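The pricing example above can be worked through with a short Python sketch. The rates, the 4 EC2 Units per m1.large and the 200 runners per server come from the paragraph above; the function names and the `days` parameter are illustrative assumptions, and the example assumes a single day of service.

```python
import math

SETUP_FEE = 5_000            # dollars, one-time
RATE_PER_UNIT_PER_DAY = 25   # dollars per EC2 Unit per day
M1_LARGE_UNITS = 4           # EC2 Units per m1.large server
RUNNERS_PER_M1_LARGE = 200   # runners handled per m1.large server

def servers_needed(runners, runners_per_server=RUNNERS_PER_M1_LARGE):
    """Round up so every runner has capacity."""
    return math.ceil(runners / runners_per_server)

def large_race_cost(servers, days=1, units_per_server=M1_LARGE_UNITS):
    """Set-up fee plus the per-unit daily rate for every server."""
    return SETUP_FEE + RATE_PER_UNIT_PER_DAY * units_per_server * servers * days

servers = servers_needed(15_000)
print(servers, large_race_cost(servers))  # 75 12500
```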
Note that large races that need to handle huge loads for people searching results can also make use of this service, as well as our notification service. This is a common problem for large races that RunSignUp now solves!