Yesterday from 2:20am PST until 10:22am PST, we experienced a service degradation that caused our customers to see processing delays reaching nearly 7 hours. While no data was lost, alerts were not being sent and new data was not appearing in the rollbar.com interface during this time.
We know that you rely on Rollbar to monitor your applications and alert you when things go wrong, and we're very sorry that we let you down during this outage. We'd like to share some more details about what happened and what we're doing to prevent this kind of issue from happening again.
Yesterday from about 2:30pm PDT until 4:55pm PDT, we experienced a service degradation that caused our customers to see processing delays of up to about 2 hours. While no data was lost, alerts were not being sent and new data was not appearing in the rollbar.com interface. Instead, customers would see alert notices on the Dashboard and Items page about the delay.
We know that you rely on Rollbar to monitor your applications and alert you when things go wrong, and we are very sorry that we let you down during this outage.
The service degradation began after planned database maintenance that we had expected to have no significant impact on service.
tl;dr: from about 9:30pm to 12:30am last night, our website was unreachable and we weren’t sending out any notifications. Our API stayed up nearly the whole time thanks to an automatic failover.
We had our first major outage last night. We want to apologize to all of our customers, and we’re going to continue working to make the Rollbar.com service stable, reliable, and performant.
What follows is a timeline of events, and a summary of what went wrong, what went right, and what we’re doing to address what went wrong.