Yesterday from 2:20am PST until 10:22am PST, we experienced a service degredation that caused our customers to see processing delays reaching nearly 7 hours. While no data was lost, alerts were not being sent and new data was not appearing in the rollbar.com interface during this time.
We know that you rely on Rollbar to monitor your applications and alert when things go wrong, and we're very sorry that we let you down during this outage. We'd like to share some more details about what happened and what we're doing to prevent this kind of issue from happening again.
Overview of API Tier and Processing Pipeline
When data is received by our API endpoints (api.rollbar.com), it hits a load balancer which proxies to an array of "API servers". Those servers do some basic validation (access control, rate limiting, etc.) and then write the data to local disk. Next, a separate process (the "offline loader") loads these files asynchronously into our database clusters. Then, a series of workers process the raw data into the aggregated and indexed form you see on rollbar.com, and send alerts for any triggered events. This system is designed for reliability first and performance second.
When occurrence processing latency exceeds 30 seconds, we show an in-app notification that processing is behind. This is calculated as follows:
- For each API server, calcuate ([timestamp of last occurrence received] - [for the last occurrence that was fully processed by the pipeline, the timestamp it was received on the API server])
- Report the maximum delay of API servers as the processing delay
The API tier primarily receives three kinds of data: occurrences (the "item" endpoints), deploys, and symbol mapping files (source maps, Android Proguard files, and iOS dSYMs). Currently, all three of these are loaded by the same offline loader process, to different database clusters depending on the type of data.
Outage Timeline, Cause, and Resolution
At about 2:00am PST, a node in the database cluster that stores the symbol mapping files ran out of disk space. Unfortunately, this did not set off any alerts in our monitoring system because the disk space alert had been previously triggered and acknowledged, but not yet resolved.
At about 2:20am PST, the next symbol mapping file arrived on one of the API servers and since the database server was out of disk, it could not be loaded. This caused other files on that API server–containing occurrences and deploys–to not be loaded either. At this time, a processing delay first appeared in the Rollbar interface, and some (but not all) data was delayed. Over the next several hours, the delay continued to rise (as data on some API servers was not processed at all) and the percent of data that was delayed also rose (as more API servers enocuntered the same problem).
At 8:25am PST, a Rollbar engineer started work for the day and noticed a support ticket about the processing delay. He immediately escalated to a second engineer who began investigating. At 8:40am PST, a third engineer joined and updated status.rollbar.com to say that we are investigating the issue.
At 9:05am PST, we identified the immediate problem that the symbol mapping files were blocking occurrences from being loaded. We began mitigating by moving those files aside to allow the higher-priority occurrence data to load. This began the recovery process, but created a backlog at first level in the processing pipeline, causing all data to be delayed (instead of just some).
At 9:11am PST, we identified disk space as the root cause, and resolved this a few minutes later. At 9:35am PST, we updated status.rollbar.com to state that we had identified the issue.
At 9:55am PST, processing latency hit a peak of about 25,000 seconds. We updated status.rollbar.com with our estimate of 36 minutes to full recovery.
At 10:43am PST, processing was fully caught up. status.rollbar.com was updated a minute later.
Evaluating our Response
Once our team became aware of the issue, we were able to identify and fix it relatively quickly (40 minutes from awareness to identification, with fix immediately afterwards). Recovery was relatively fast as well, given the length of the backlog (1hr 38minutes to recover from 7hrs 45min of backlog).
It took far too long to for us to notice this issue, however, as our automated monitoring failed to alert us and we only discovered the issue via customer reports.
We've identified and planned a number of improvements to our processes, tools, and systems to address what went wrong. Here are the highlights:
- We're auditing our internal monitoring to ensure that checks are meaningful, needed checks are present, and alerts are functioning correctly
- We've created a standing Google Hangout so we can more easily coordinate our response when we're not all in the same location
- We're investigating whether we can automate the component status on status.rollbar.com, so that it doesn't need to be manually updated. (Side note: go there now and click "Subscribe to Updates"!)
- We've identified and planned system-level improvements to decouple symbol mapping uploads from the critical occurrence processing pipeline
- We've planned a project to improve our ability to recover more quickly from processing backlogs
We're very sorry for the degradation of service yesterday. We know that you rely on Rollbar for critical areas of your operations and we hate to let you down. If you have any questions, please don't hesitate to contact us at firstname.lastname@example.org.