October 25th, 2016 • By Brian Rue, Dan St. Clair, Cory Virok, and Ken Sheppardson
We had a significant outage this past Friday, October 21. Many customers were not able to reach rollbar.com or api.rollbar.com, some data was lost, and a few customers experienced cascading issues caused by our API outage. The root cause was a Distributed Denial of Service (DDoS) attack on our DNS provider.
We weren't the only service that had issues on Friday, but that is no excuse—we know that our customers rely on Rollbar as a critical monitoring service and it's all the more important that we are up when everything else is down. We'd like to share some of the details about what happened and what we're doing to prevent this kind of issue from happening again.
All times PDT.
During the outages:
The problems that this outage exposed were:
We had some problems communicating with customers during the outage:
In response to Friday's events and what we've learned since:
We are re-evaluating our DNS setup so that we will be resilient against failures on this tier, likely by using multiple DNS providers. Since we use advanced DNS features like dynamic routing, this is not quite as simple as it may sound, so we're doing some research first.
We are adding monitoring coverage for DNS.
We're moving our status page from status.rollbar.com to rollbarstatus.com, which will use a different DNS provider than our primary domain.
We will investigate how our client libraries handle network failures and audit them to ensure that a network outage has minimal impact on the host application.
For any customers who are looking for an immediate solution to this problem, we recommend rollbar-agent, which moves the network request out of the application and into a separate process that reads queued events from disk. This also has the benefit that queued requests will eventually succeed once a network outage ends. This model is powerful, but currently fairly difficult to set up. We'll be scheduling some improvements to rollbar-agent soon.
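To make the queue-on-disk model concrete, here is a minimal Python sketch of the general pattern. It is not rollbar-agent's actual implementation or file format: the function names `enqueue_event` and `drain_spool`, the one-JSON-file-per-event layout, and the `send` callback are all assumptions made for illustration. The application only writes a file, and a separate process delivers the queued events whenever the network is reachable.

```python
import json
import os
import uuid
from typing import Callable


def enqueue_event(spool_dir: str, event: dict) -> str:
    """Called by the application: write one event to disk, no network I/O.

    Hypothetical helper; the file layout is an assumption for this sketch.
    """
    os.makedirs(spool_dir, exist_ok=True)
    final_path = os.path.join(spool_dir, f"{uuid.uuid4().hex}.json")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(event, fh)
    # Rename atomically so the drainer never reads a half-written file.
    os.replace(tmp_path, final_path)
    return final_path


def drain_spool(spool_dir: str, send: Callable[[dict], None]) -> int:
    """Called by a separate process: deliver queued events, keep failures.

    `send` is a caller-supplied function that performs the network request
    and raises OSError (for example ConnectionError) when it cannot.
    Returns the number of events delivered in this pass.
    """
    if not os.path.isdir(spool_dir):
        return 0
    delivered = 0
    for name in sorted(os.listdir(spool_dir)):
        if not name.endswith(".json"):
            continue  # skip in-progress ".tmp" files
        path = os.path.join(spool_dir, name)
        with open(path, encoding="utf-8") as fh:
            event = json.load(fh)
        try:
            send(event)
        except OSError:
            # Network is probably down: stop and leave the rest on disk
            # so a later pass can retry them.
            break
        os.remove(path)
        delivered += 1
    return delivered
```

Under these assumptions, a DNS or network outage costs the application nothing more than a local file write, and events written during the outage are delivered by the next successful `drain_spool` pass.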
There has been a lot of chatter around the web about this outage. Here are a few links that we found helpful or interesting:
We know our customers depend on Rollbar as a critical, reliable piece of infrastructure, and we're sorry that we did not meet this standard on Friday. If you have any questions, don't hesitate to contact us at email@example.com.