Yesterday, from about 2:30pm PDT until 4:55pm PDT, we experienced a service degradation that caused our customers to see processing delays of up to about two hours. While no data was lost, alerts were not being sent and new data was not appearing in the rollbar.com interface. Instead, customers saw alert notices on the Dashboard and Items pages about the delay.
We know that you rely on Rollbar to monitor your applications and alert you when things go wrong, and we are very sorry that we let you down during this outage.
The service degradation began following some planned database maintenance, which we had expected to have no significant impact on service.
The Planned Maintenance
We store all of our data in MySQL in a master-master/active-passive configuration. Yesterday we needed to add partitions to our largest table - a routine procedure. Normally, this process takes about 15 minutes, during which time customers experience small delays in data processing, and it generally goes unnoticed. This time, however, something caused the database to load new data extremely slowly, which, in turn, caused the outage.
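The rolling pattern for this kind of maintenance can be sketched as follows. This is a simplified illustration, not our actual tooling: the `run_sql` helper, host names, table name, and partition clause are all hypothetical stand-ins.

```python
# Sketch of rolling partition maintenance in an active-passive MySQL pair.
# `run_sql`, the host names, and the ALTER statement are hypothetical.

ALTER = "ALTER TABLE item_occurrences ADD PARTITION (PARTITION p_next VALUES LESS THAN (1000000))"

def run_sql(host, statement, log):
    """Placeholder for executing a statement against a MySQL host."""
    log.append((host, statement))

def add_partitions(active, passive, log):
    # 1. Run the blocking ALTER on the passive master first, where it
    #    cannot delay live traffic.
    run_sql(passive, ALTER, log)
    # 2. Promote the freshly-altered passive master to active and
    #    repoint writers at it.
    log.append((passive, "-- promoted to active"))
    # 3. Run the same ALTER on the old active (now passive) master.
    run_sql(active, ALTER, log)
```

The key property is that the long-running ALTER never executes on the database currently serving live writes, which is why the procedure normally produces only small processing delays.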
2:06pm - We began the planned maintenance by promoting the passive master to be the new active master.
2:29pm - The planned maintenance was complete.
All connections to our old active master were closed and the new active master was getting new data and processing it.
2:40pm - It became apparent that new data was being loaded and processed very slowly.
We turned off our data loaders to decrease any contention in the database.
2:41pm - We began profiling the slow worker.
2:47pm - We tested a theory that a single recurring item was causing most of the slow processing.
~2:50pm - We noticed that ping times to the new active master were 1-5 milliseconds - an order of magnitude slower than normal.
2:52pm - We turned off replication from the passive to the active database which seemed to help loading data, but not by much.
~2:50pm - 3pm - The ALTER to add partitions completed on the passive database.
- Up to this point, there had been no connections to the passive database.
- We decided to switch back to the previous active master, but quickly reverted after finding that the passive database was missing a significant number of rows.
- MySQL replication reported 0 seconds behind.
- It was unclear how the passive database could report being fully caught up while still missing data.
~3:15pm - We identified the slowest portion of the slow worker, which happened to be unused code. We removed it and deployed to all workers.
- This got processing back to normal speeds and allowed us to begin catching back up.
~3:30pm - We turned the data loaders back on.
4:30pm - The worker responsible for making new occurrences appear in the interface was caught up, but notifications were still delayed.
4:55pm - Everything was caught back up and we were back to 0 delay.
We have two open questions:
- Why did the data loaders slow down when we switched to the new active master?
- How did the databases get out of sync?
We have some theories as to why the data loaders slowed down so much, but we are not yet certain. It could have been the number of concurrent processes trying to load data into the same table, or something about the disk layout or cache on the new active master. Going forward, we plan to investigate serializing loads in general and/or slowly ramping loads back up after maintenance.
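One shape that ramp-up could take is stepping loader concurrency back up gradually after a failover instead of resuming at full parallelism. This is a sketch of the idea, not our actual loader code; the parameter values are illustrative.

```python
def ramped_concurrency(start=1, maximum=16, factor=2):
    """Yield loader concurrency levels, multiplying until the normal maximum.

    Resuming at full parallelism after maintenance piles concurrent INSERTs
    onto a cold database; ramping lets caches warm up before full load.
    """
    level = start
    while level < maximum:
        yield level
        level = min(level * factor, maximum)
    yield maximum

# Levels a loader supervisor would step through after a failover:
levels = list(ramped_concurrency(start=1, maximum=16))  # [1, 2, 4, 8, 16]
```

A supervisor would hold at each level until load metrics (query latency, replication lag) look healthy before advancing to the next.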
To determine why our databases became out of sync, we wrote a script to tell us the exact moment when they diverged. Once it completes we will find the coordinates in the new active master’s binlogs that correspond with the point in time where the databases became out of sync, then restart replication on the passive master using those coordinates.
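A script like that can binary-search for the divergence point by comparing checksums of row ranges on both sides. The sketch below works over in-memory row lists to show the idea; a real version would checksum ranges with queries (e.g. MySQL's `CHECKSUM TABLE` or Percona's pt-table-checksum) rather than fetching rows.

```python
import hashlib

def prefix_checksum(rows, end):
    """Checksum of the first `end` rows (stand-in for a ranged checksum query)."""
    h = hashlib.sha256()
    for row in rows[:end]:
        h.update(repr(row).encode())
    return h.hexdigest()

def first_divergence(active_rows, passive_rows):
    """Binary-search the earliest index where the two copies disagree.

    Returns the index of the first differing row, or None if the copies
    are identical. Uses the fact that once prefixes diverge, every longer
    prefix also diverges.
    """
    n = min(len(active_rows), len(passive_rows))
    if prefix_checksum(active_rows, n) == prefix_checksum(passive_rows, n):
        # Common prefix matches; any divergence is a missing tail.
        return None if len(active_rows) == len(passive_rows) else n
    lo, hi = 0, n  # invariant: prefix of length lo matches, length hi differs
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prefix_checksum(active_rows, mid) == prefix_checksum(passive_rows, mid):
            lo = mid
        else:
            hi = mid
    return lo  # index of the first differing row
```

Once the first differing row is known, its timestamp gives the moment of divergence, which is what we need to locate the corresponding binlog coordinates.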
We take downtime very seriously and we want to be as transparent as possible when it happens.
We are sorry for the degradation of service and we are working on making sure it doesn't happen again. If you have any questions, please don't hesitate to contact us at firstname.lastname@example.org.