Webinar: Continuous Code Improvement Series | How to Improve with AI-Assisted Workflows Register Here Today!
April 11th, 2014 • By Brian Rue and Cory Virok
Yesterday from about 2:30pm PDT until 4:55pm PDT, we experienced a service degradation that caused our customers to see processing delays up to about 2 hours. While no data was lost, alerts were not being sent and new data was not appearing in the rollbar.com interface. Customers instead would see alerts notices on the Dashboard and Items page about the delay.
We know that you rely on Rollbar to monitor your applications and alert you when things go wrong, and we are very sorry that we let you down during this outage.
The service degradation began following some planned database maintenance, which we had expected to have no significant impact on service.
We store all of our data in MySQL in a master-master/active-passive configuration. Yesterday we needed to add partitions to our largest table - a routine procedure. Normally, this process takes about 15 minutes, during which time customers experience small delays in data processing. This process generally goes unnoticed by customers. However, this time something caused the database to load new data extremely slowly which, in turn, caused the outage.
2:29pm - The planned maintenance was complete.
2:40pm - It became apparent that new data was being loaded and processed very slowly.
~2:50pm - 3pm
~3:15pm - We identified a the slowest portion of the slow worker, which happened to be unused. We removed this code and deployed to all workers.
We have two open questions:
We have some theories as to why the data loaders slowed down so much but we are not sure. It could have been the amount of concurrent processes trying to load data into the same table. It could also have been something about the disk layout or cache on the new active master. We plan to investigate serializing loads in general and/or slowly ramping up loads after maintenance in the future.
To determine why our databases became out of sync, we wrote a script to tell us the exact moment when they diverged. Once it completes we will find the coordinates in the new active master’s binlogs that correspond with the point in time where the databases became out of sync, then restart replication on the passive master using those coordinates.
We take downtime very seriously and we want to be as transparent as possible when it happens.
We are sorry for the degradation of service and we are working on making sure it doesn’t happen again. If you have any questions, please don’t hesitate to contact us at firstname.lastname@example.org