Problem and Cause
From 5:04am UTC until 4:21pm UTC today, users in or near Europe got 404 responses when trying to load our static CSS and JS assets for rollbar.com. This resulted in a visually broken or unusable site, depending on browser and CDN cache.
The cause was a misconfigured load balancer in our Amsterdam data center that was put into service at 12:05am UTC. The assets should have been present on that machine, but they weren't. For several hours the problem wasn't visible, since the assets were still cached in our CDN. When they fell out of the CDN's cache, the Europe-based CDN endpoints requested them from the new load balancer and got 404s. Users hitting those endpoints then got 404s as well.
Detection and Fix
We first noticed this via the last line of defense: customer support channels. IRC, Twitter, and our support inbox were full of reports (thanks everyone!). Unfortunately, since this happened during the middle of the night for our team (we're based in San Francisco), we didn't notice until 3:33pm UTC when I checked my phone on the way to the office. I immediately escalated to the rest of the team. At 3:36pm UTC, Cory started investigating.
The website appeared fine for us, so we suspected it may be some sort of CDN-related issue. Based on the time of the reports, we suspected it might be localized to Europe (later confirmed via Twitter). At 4:05pm UTC, we identified the load balancer addition as the most likely cause.
At 4:06pm UTC, we took the load balancer back out of service. The CDN had cached the 404 responses for the assets, though, so this alone wasn't enough to fix the problem. At 4:16pm UTC, we deployed a change to the web tier to bust the cache; at 4:21pm UTC, the deploy was complete and the issue was fixed. We confirmed the fix with several of the users who had reported the issue.
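As an illustration of cache busting in general, here is a minimal sketch of one common approach: appending a version parameter to each asset URL so the CDN sees a URL it has never cached. This is not necessarily the change we deployed; the function name `bust_cache`, the parameter name `v`, and the example URL are all hypothetical, and it assumes the CDN includes the query string in its cache key.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def bust_cache(asset_url: str, version: str) -> str:
    """Return asset_url with a `v` query parameter set to `version`.

    Assumption: the CDN treats URLs with different query strings as
    different objects, so a new version forces a fresh fetch from origin.
    """
    parts = urlsplit(asset_url)
    # Drop any existing version parameter, keep everything else in order.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "v"
    ]
    query.append(("v", version))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical asset URL and version string, for illustration only.
print(bust_cache("https://cdn.example.com/static/js/app.js", "2"))
```

Changing the version on every deploy means a bad response cached under the old URL is simply never requested again.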
You can see the start, rise, and fix of this issue on the Rollbar Item that is triggered when jQuery doesn't load. It's normal for this to happen intermittently (all networks occasionally fail), but not this much.
While we'd of course prefer not to make configuration mistakes, the primary failure here was that none of our automated monitoring detected this. That meant instead of being paged as soon as it happened, we didn't know anything was wrong for over 10 hours. "Whatever can go wrong that isn't monitored, is probably already broken."
Today, we'll be making the following changes:
- add a Nagios alert to monitor 404s on static assets on our load balancers (would have alerted us as soon as the issue started)
- add a header to all responses to identify which load balancer served the request (would have made this easier to identify)
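A minimal sketch of what the static-asset check could look like, assuming it is pointed at each load balancer directly rather than at the CDN (otherwise the CDN cache would hide the problem, as it did here). It follows the standard Nagios plugin convention of one status line plus an exit code of 0 (OK) or 2 (CRITICAL). The function names and the asset paths passed in are hypothetical.

```python
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code of a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; we want the code.
        return err.code


def check_assets(base_url, paths, fetch=fetch_status):
    """Request each asset path from one load balancer.

    Returns (exit_code, message) in Nagios plugin style.
    """
    failures = []
    for path in paths:
        url = base_url.rstrip("/") + "/" + path.lstrip("/")
        try:
            status = fetch(url)
        except OSError as err:  # connection refused, timeout, DNS failure
            failures.append(f"{path} -> {err}")
            continue
        if status != 200:
            failures.append(f"{path} -> {status}")
    if failures:
        return CRITICAL, "CRITICAL - " + ", ".join(failures)
    return OK, f"OK - {len(paths)} static assets returned 200"


def main(argv):
    """Usage: check_static_assets.py BASE_URL PATH [PATH ...]"""
    code, message = check_assets(argv[1], argv[2:])
    print(message)
    return code
```

Saved as a script ending in `sys.exit(main(sys.argv))`, this could be registered as a Nagios command once per load balancer. The `fetch` parameter exists only so the check can be exercised without a network.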
We're also considering:
- a Nagios alert if there is a sudden large number of support tickets
- a Nagios or SMS alert if there is a Rollbar Item affecting a large number of unique users (Rollbar actually already supports sending notifications like this, so it's just a matter of hooking it up for ourselves)
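For the support-ticket idea, one simple way to define "sudden large number" is to compare the most recent window against the average of the windows before it. The sketch below assumes we can get a list of ticket creation times from the support system; the function name and every threshold (one-hour window, 24-hour baseline, 3x factor, minimum of 5 tickets) are illustrative guesses, not tuned values.

```python
from datetime import datetime, timedelta


def ticket_spike(timestamps, now, window=timedelta(hours=1),
                 baseline_windows=24, factor=3.0, min_count=5):
    """Return True if the latest window has unusually many tickets.

    "Unusually many" means at least min_count tickets AND more than
    factor times the per-window average over the preceding
    baseline_windows windows.
    """
    recent_start = now - window
    baseline_start = recent_start - window * baseline_windows
    recent = sum(1 for t in timestamps if recent_start < t <= now)
    baseline = sum(1 for t in timestamps if baseline_start < t <= recent_start)
    average = baseline / baseline_windows
    return recent >= min_count and recent > factor * average
```

The `min_count` floor keeps a quiet period (say, zero tickets overnight followed by two in the morning) from paging anyone.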