Post-mortem for website assets outage

We had an issue from late last night through this morning where many users were not able to use the
rollbar.com website because CSS and JavaScript assets were not loading in some regions. This post
will cover what happened, its cause, why we didn't notice it sooner, and the changes we're making
going forward.

Problem and Cause

From 5:04am UTC until 4:21pm UTC today, users in or near Europe got 404 responses when trying to load
our static CSS and JS assets for rollbar.com. This resulted in a visually broken or unusable site,
depending on browser and CDN cache.

The cause was a misconfigured load balancer in our Amsterdam data center that was put into service
at 12:05am UTC. The assets should have been present on that machine, but they weren't. For several
hours, the problem wasn't visible, since the assets were still cached in our CDN. When they fell out of the
CDN's cache, requests for those assets from Europe-based CDN endpoints began to 404. Users hitting
those endpoints would then get 404s as well.
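The delay between the misconfiguration and the first visible failure can be illustrated with a toy model of a CDN edge cache. This is only a sketch: the `EdgeCache` class, the one-hour TTL, and the asset path are hypothetical, and real CDNs often apply a different (usually shorter) TTL to error responses.

```python
class EdgeCache:
    """Toy CDN edge: caches whatever status the origin returns for `ttl` seconds."""

    def __init__(self, origin, ttl):
        self.origin = origin   # callable: path -> HTTP status (stands in for the load balancer)
        self.ttl = ttl         # assumed cache lifetime in seconds
        self._entries = {}     # path -> (status, expires_at)

    def get(self, path, now):
        entry = self._entries.get(path)
        if entry is not None and now < entry[1]:
            return entry[0]    # cache hit: the origin is not consulted
        status = self.origin(path)
        self._entries[path] = (status, now + self.ttl)
        return status


edge = EdgeCache(lambda path: 200, ttl=3600)
edge.get("/static/app.css", now=0)       # 200, now cached

edge.origin = lambda path: 404           # origin without the assets enters service
edge.get("/static/app.css", now=10)      # still 200: served from cache
edge.get("/static/app.css", now=3600)    # 404: entry expired, origin asked again

edge.origin = lambda path: 200           # bad origin removed from service
edge.get("/static/app.css", now=3610)    # still 404: the error itself is cached
```

The last call shows why removing the bad origin does not immediately help: the cached 404 keeps being served until it expires or the cache key changes.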

Detection and Fix

We first noticed this via the last line of defense: customer support channels. IRC, Twitter, and our
support inbox were full of reports (thanks everyone!). Unfortunately, since this happened during the
middle of the night for our team (we're based in San Francisco), we didn't notice until 3:33pm UTC
when I checked my phone on the way to the office. I immediately escalated to the rest of the team.
At 3:36pm UTC, Cory started investigating.

The website appeared fine for us, so we suspected it might be some sort of CDN-related issue. Based on
the time of the reports, we suspected it might be localized to Europe (later confirmed via Twitter).
At 4:05pm UTC, we identified the load balancer addition as the most likely cause.

At 4:06pm UTC, we took the load balancer back out of service. The CDN had cached the 404 responses
for the assets, though, so this alone wasn't enough to fix the problem. At 4:16pm UTC, we deployed a
change to the web tier to bust the cache; at 4:21pm UTC, the deploy was complete and the issue fixed. We confirmed
the fix with several of the users who had reported the issue.
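The post doesn't say how the cache was busted. One common approach is to change the asset URLs so the CDN treats them as new objects, for example by adding a version parameter. The sketch below assumes that approach; the `versioned_url` helper, the `v` parameter name, and the example URL are all hypothetical, and it only works if the CDN includes the query string in its cache key.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def versioned_url(url, version):
    """Return `url` with its `v` query parameter set to `version`."""
    parts = urlsplit(url)
    # Drop any existing version parameter, keep everything else as-is.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "v"]
    query.append(("v", version))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Bumping the version on deploy changes every asset URL the pages reference.
print(versioned_url("https://cdn.example.com/static/js/app.js", "2"))
```

Because the new URL has never been requested before, the CDN has no cached 404 for it and has to go back to the origin.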

You can see the start, rise, and fix of this issue in the Rollbar Item for the error that occurs when
jQuery doesn't load. It's normal for this to happen intermittently (all networks occasionally fail),
but not this much.

Planned Improvements

While we'd of course prefer not to make configuration mistakes, the primary failure here was that
none of our automated monitoring detected this. That meant instead of being paged as soon as it
happened, we didn't know anything was wrong for over 10 hours. "Whatever can go wrong that isn't
monitored, is probably already broken."

Today, we'll be making the following changes:

  • adding a Nagios alert to monitor 404s on static assets on our load balancers (would have alerted
    us as soon as the issue started)
  • adding a header to all requests to identify which load balancer served the request (would have made
    this easier to identify)
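A check for the first item could look roughly like the following Nagios-style plugin. It is a minimal sketch, not the check we deployed: the function names are hypothetical, and it assumes the URLs passed in point at each load balancer directly rather than through the CDN, so a cached copy can't hide a missing file. The exit codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import sys
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for a GET of `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is on the exception.
        return err.code


def check_assets(urls, fetch=fetch_status):
    """Probe each asset URL and return (exit_code, message) for Nagios."""
    if not urls:
        return UNKNOWN, "ASSETS UNKNOWN - no URLs given"
    failures = []
    for url in urls:
        try:
            status = fetch(url)
        except OSError as err:  # connection refused, DNS failure, timeout
            failures.append(f"{url} ({err})")
            continue
        if status != 200:
            failures.append(f"{url} (HTTP {status})")
    if failures:
        return CRITICAL, "ASSETS CRITICAL - " + ", ".join(failures)
    return OK, f"ASSETS OK - {len(urls)} assets returned 200"


# Run as a plugin only when asset URLs are passed on the command line.
if __name__ == "__main__" and sys.argv[1:]:
    code, message = check_assets(sys.argv[1:])
    print(message)
    sys.exit(code)
```

Passing `fetch` as a parameter keeps the check logic testable without touching the network.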

We're also considering:

  • a Nagios alert if there is a sudden large number of support tickets
  • a Nagios or SMS alert if there is a Rollbar Item affecting a large number of unique users
    (Rollbar actually already supports sending notifications like this, so it's just a matter of
    hooking it up for ourselves)
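The unique-users idea boils down to counting distinct users hit by an error within a recent window and paging past a threshold. The sketch below shows that logic on plain data; it does not use Rollbar's notification feature or API, and the function names, the 10-minute window, and the threshold of 50 users are all hypothetical.

```python
def unique_users_in_window(occurrences, now, window_seconds):
    """Count distinct user ids among (timestamp, user_id) pairs in the window."""
    cutoff = now - window_seconds
    return len({user for ts, user in occurrences if cutoff < ts <= now})


def should_page(occurrences, now, window_seconds=600, threshold=50):
    """True when enough distinct users hit the error recently to page someone."""
    return unique_users_in_window(occurrences, now, window_seconds) >= threshold
```

Counting distinct users rather than raw occurrences keeps one user stuck in a reload loop from triggering a page.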
