By Brian Rue, Dan St. Clair, Cory Virok, and Ken Sheppardson

Postmortem from the outage on October 21, 2016

We had a significant outage this past Friday, October 21. Many customers were not able to reach rollbar.com or api.rollbar.com, some data was lost, and a few customers experienced cascading issues caused by our API outage. The root cause was a Distributed Denial of Service (DDoS) attack on our DNS provider.

We weren't the only service that had issues on Friday, but that is no excuse—we know that our customers rely on Rollbar as a critical monitoring service and it's all the more important that we are up when everything else is down. We'd like to share some of the details about what happened and what we're doing to prevent this kind of issue from happening again.

Outage Timeline and Resolution

All times PDT.

  • Approximately 4:00 a.m.: DDoS attack began against Dyn, Rollbar's DNS provider
  • 4:00 a.m. - 6:00 a.m.: A few Rollbar customers noticed problems and contacted our support team
  • 6:00 a.m.: First attack was mitigated by Dyn
  • 8:00 a.m.: We became aware of the first attack, but noted that it had ended
  • 9:15 a.m.: We noticed timeouts reaching rollbar.com and that a second wave of the attack was under way
  • 9:23 a.m.: We updated our status page about the issue
  • 9:40 a.m.: We began migrating to an alternate DNS provider (AWS Route 53)
  • 10:40 a.m.: Configuration changes for the DNS migration were complete; however, we then had to wait for the nameserver change to propagate (see the sketch after this timeline)
  • 11:07 a.m.: We updated our status page, as well as Twitter and Facebook, regarding the migration and posted static IP addresses that customers could use in the meantime
  • Over the next 6-12 hours, DNS traffic moved over to AWS and all customers were able to resolve rollbar.com and api.rollbar.com
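
For readers curious what "waiting for the nameserver change to propagate" looks like in practice, here is a minimal sketch of the kind of check one can run during such a migration. It is illustrative only, not our tooling, and assumes the dnspython library (2.x); the public resolver IPs are just examples.

    # Illustrative only: watch a domain's NS records as seen by public resolvers,
    # to see when a nameserver change has propagated. Requires dnspython.
    import dns.resolver

    RESOLVERS = {"Google": "8.8.8.8", "OpenDNS": "208.67.222.222"}

    def ns_set(domain, resolver_ip):
        """Return the set of nameservers a given resolver currently reports."""
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [resolver_ip]
        answer = resolver.resolve(domain, "NS", lifetime=5)
        return {str(rdata.target).rstrip(".") for rdata in answer}

    for name, ip in RESOLVERS.items():
        try:
            print(name, sorted(ns_set("rollbar.com", ip)))
        except Exception as exc:  # timeouts, SERVFAIL, etc.
            print(name, "lookup failed:", exc)

Once each resolver consistently reports the new provider's nameservers, the cutover is effectively complete for clients behind those resolvers, subject to previously cached records' TTLs.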

Customer Impact

During the outages:

  • Many customers were not able to reach rollbar.com
  • Many customers' applications (servers, web and mobile clients) were not able to reach api.rollbar.com to send data, leading to data loss as those requests failed and, in some cases, to application problems caused by the slow requests

What We Learned

The problems that this outage exposed were:

  • DNS was a single point of failure. While Dyn has been an extremely reliable provider over the past 4+ years, Friday's outage showed that they are not infallible.
  • Using DNS for geographic load balancing and failover made us more susceptible to an outage like this, since we use short TTLs and have a generally more complex DNS setup.
  • Our monitoring systems did not detect this kind of outage. Traffic and all internal systems looked normal; we only learned about it through customer reports and manual testing (see the external DNS check sketch after this list).
  • Some Rollbar client libraries do not handle failed requests well and are susceptible to resource starvation caused by retries. We're investigating two reports of this in the Sidekiq backend for the Ruby library, and one other that we believe involves a custom Java library.
  • We had some problems communicating with customers during the outage:
    • Our status page was hosted on the same domain as our website, so customers who could not reach rollbar.com likely couldn't reach status.rollbar.com either
    • Twitter, our backup channel, also uses Dyn for DNS and was affected by the same issue.
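
To make the monitoring gap concrete, here is a minimal sketch of the kind of external DNS check we were missing: it resolves api.rollbar.com against a couple of public resolvers and flags lookups that fail or take too long. This is a sketch only, assuming the dnspython library; the resolver IPs and thresholds are illustrative, not our production configuration.

    # Illustrative sketch of an external DNS health check: resolve a hostname
    # against public resolvers and flag failures or slow answers. Requires dnspython.
    import time
    import dns.resolver

    HOSTNAME = "api.rollbar.com"
    RESOLVERS = ["8.8.8.8", "208.67.222.222"]      # Google Public DNS, OpenDNS
    TIMEOUT = 5.0                                  # seconds before a lookup counts as failed
    SLOW = 1.0                                     # seconds before a lookup counts as slow

    def check(hostname, resolver_ip):
        """Resolve `hostname` via one resolver; return (ok, elapsed, detail)."""
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [resolver_ip]
        start = time.monotonic()
        try:
            answer = resolver.resolve(hostname, "A", lifetime=TIMEOUT)
            return True, time.monotonic() - start, [r.address for r in answer]
        except Exception as exc:  # timeout, SERVFAIL, NXDOMAIN, ...
            return False, time.monotonic() - start, exc

    for ip in RESOLVERS:
        ok, elapsed, detail = check(HOSTNAME, ip)
        status = "OK" if ok and elapsed < SLOW else "ALERT"
        print("%s resolver=%s elapsed=%.2fs detail=%s" % (status, ip, elapsed, detail))

A check along these lines, run from outside our own network and wired into paging, would have surfaced the problem without waiting for customer reports.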

What We're Doing

In response to Friday's events and what we've learned since:

  • We are re-evaluating our DNS setup so that we will be resilient against failures on this tier, likely by using multiple DNS providers. Since we use advanced DNS features like dynamic routing, this is not quite as simple as it may sound and we're doing some research first.
  • We are adding monitoring coverage for DNS.
  • We're moving our status page from status.rollbar.com to rollbarstatus.com, which will use a different DNS provider than our primary domain.
  • We will investigate how our client libraries handle network failures and audit them to ensure that the impact on the host application is minimal in the case of a network outage.

    For any customers who are looking for an immediate solution to this problem, we'd like to refer you to rollbar-agent, which moves the network request out of the application and into a separate process that reads queued events from disk. This also has the benefit that queued events will eventually be delivered once the network recovers. This model is powerful, but currently fairly difficult to set up. We'll be scheduling some improvements to rollbar-agent soon.
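
To illustrate the model described above (not rollbar-agent's actual implementation), here is a minimal sketch: the application appends events to a local spool file and never touches the network, while a separate process drains the spool and posts events with a short timeout, leaving anything undelivered on disk to be retried later. The file path, endpoint, and payload format below are placeholders.

    # Illustrative sketch of the "spool to disk, send from a separate process" model.
    # File path, endpoint, and payload format are placeholders, not rollbar-agent's.
    import json
    import os
    import time
    import urllib.request

    SPOOL = "/var/tmp/events.spool"
    ENDPOINT = "https://example.invalid/ingest"    # placeholder; a real sender would post to the API
    TIMEOUT = 3.0                                  # short timeout so the sender never hangs

    def record(event):
        """Called inside the application: append one event; no network involved."""
        with open(SPOOL, "a") as f:
            f.write(json.dumps(event) + "\n")

    def drain_once():
        """Run in a separate process: try to deliver spooled events one pass at a time."""
        # NOTE: a real spool needs locking or file rotation so events appended
        # while draining are not lost; this sketch skips that for brevity.
        if not os.path.exists(SPOOL):
            return
        with open(SPOOL) as f:
            lines = f.readlines()
        undelivered = []
        for line in lines:
            req = urllib.request.Request(
                ENDPOINT, data=line.encode(), headers={"Content-Type": "application/json"}
            )
            try:
                urllib.request.urlopen(req, timeout=TIMEOUT)
            except Exception:
                undelivered.append(line)           # keep it on disk; retry next pass
        with open(SPOOL, "w") as f:
            f.writelines(undelivered)

    if __name__ == "__main__":
        while True:                                # sender loop, separate from the application
            drain_once()
            time.sleep(5)

Because record() only appends to a local file, a network outage cannot block or starve application threads, and spooled events are delivered once connectivity returns. The bounded timeout in the sender plays the same role we want our client libraries to adopt when they talk to the network directly.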

Other Resources

There has been a lot of chatter around the web about this outage. Here are a few links that we found helpful or interesting:

Conclusion

We know our customers depend on Rollbar as a critical, reliable piece of infrastructure, and we're sorry that we did not meet this standard on Friday. If you have any questions, don't hesitate to contact us at support@rollbar.com.
