January 11th, 2013 • By Brian Rue
tl;dr: from about 9:30pm to 12:30am last night, our website was unreachable and we weren’t sending out any notifications. Our API stayed up nearly the whole time thanks to an automatic failover.
We had our first major outage last night. We want to apologize to all of our customers for this outage, and we’re going to continue to work to make the Rollbar.com service stable, reliable, and performant.
What follows is a timeline of events, and a summary of what went wrong, what went right, and what we’re doing to address what went wrong.
First some background: our infrastructure is currently hosted at Softlayer and laid out like this (simplified):
We’ve been in the process of setting up lb3 and lb4, along with some fancy DNS functionality from Dyn, to provide redundancy and faster response times to our customers outside of the Americas. Each is running a stripped-down version of our infrastructure, including:
Switching DNS to Dyn requires changing the nameservers, which can take “up to 48 hours”. At the start of this story, it’s been about 36 hours. To play it safe, after testing out Dyn on a separate domain, we configured it to have the same settings as we had before – lb3 and lb4 are not in play yet.
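During a nameserver change like this, it helps to have a quick way to confirm that a hostname still resolves to the load balancer you expect. Below is a minimal sketch of such a check, not the tooling we actually used. The function names are made up for illustration, and the addresses in the usage note come from the reserved documentation range, not our real IPs. The resolver is injectable so the comparison logic can be exercised without touching the network.

```python
import socket
from typing import Callable, Iterable, List, Set


def resolve_a_records(hostname: str) -> List[str]:
    """Return the IPv4 addresses the local resolver currently gives for hostname."""
    _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    return addresses


def unexpected_addresses(
    hostname: str,
    expected: Iterable[str],
    resolver: Callable[[str], List[str]] = resolve_a_records,
) -> Set[str]:
    """Return the resolved addresses that are NOT in the expected set.

    An empty result means every address the resolver returned was expected.
    """
    return set(resolver(hostname)) - set(expected)
```

Calling `unexpected_addresses("rollbar.com", {"192.0.2.20"})` (with a hypothetical address standing in for lb2) returns an empty set when DNS points where it should, and the stray addresses otherwise. Note that this only reflects what the local resolver sees; during propagation, other resolvers may return different answers.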
Now the (abbreviated) timeline. All times are PST.
9:30pm: Cory got an alert from Pingdom that our website (rollbar.com) was down. He tried visiting it but it wouldn’t load (just hung). Remembering the pending DNS change, he immediately checked DNS propagation and saw that rollbar.com was pointing at the wrong load balancer – lb1 (the API tier), not lb2.
Cory and Sergei investigated. The A record for rollbar.com showed as correct in Dyn, but DNS was resolving incorrectly.
9:47pm: Cory and Sergei looked at @SoftlayerNotify and saw that there was an issue underway with one of the routers in the San Jose data center.
9:49pm: Website accessible by its IP address.
9:51pm: No longer accessible by IP.
10:05pm: Twitter search for “softlayer outage” shows other people being affected.
10:05pm: API tier (api.rollbar.com) appears to be working. Sergei verifies that it’s hitting lb3 (in Singapore).
You might notice that we said earlier that lb3 wasn’t supposed to be in service yet. What appears to have happened is that DNS automatically failed over to lb3, since lb1 was down because of the Softlayer outage. We had set up something like this while testing Dyn, but it wasn’t supposed to be active yet. Fortunately, lb3 was ready to go and handled all of our API load just fine.
10:22pm: Sergei tries fiddling with the Dyn configuration to see if anything helps.
10:35pm: Sergei starts trying to get ahold of Dyn.
10:58pm: Softlayer posts that “13 out of 14 rows of servers are online”. We must be in the 14th, because we’re still unreachable at this point. Brian tries hard-rebooting the ‘dev’ server to see if it helps. It doesn’t.
11:15pm: Sergei gets a call from Dyn, who tells him that the problem was a “stale Real-Time Traffic Manager configuration” and they’re looking into it.
11:54pm: @SoftlayerNotify posts that “all servers are online however some intermittent problems remain”
11:55pm: Sergei notices that the A record for rollbar.com in the Dyn interface appears to have been deleted, and he can’t add it back.
12:00am: Brian sees that rollbar.com is working again. Cory notices that API calls are hitting lb2, which sends them to the old, non-optimized API handling code on our web tier. This overloads the web servers and causes the website to hang. Frequent process restarts minimize the impact.
12:19am: Sergei gets an email back from Dyn saying that they’re still looking into the problem.
12:28am: Dyn calls to say they were able to fix everything. Sergei confirms. lb3 and lb4 are now fully utilized.
12:42am: Brian tweets that all systems are stable.
2:58am: Softlayer tweets that they’re about to run some code upgrades on the troubled router, which will cause some public network disruption.
4:00am: A customer reports connectivity issues to rollbar.com.
4:10am: Softlayer tweets that the troubled router is finally stable.
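The failover seen in the timeline, where API traffic moved to lb3 once lb1 became unreachable, can be pictured as a priority list combined with health checks. The sketch below illustrates that idea only; it is not how Dyn implements its traffic management, and the function name and the shape of the health data are assumptions.

```python
from typing import Dict, Optional, Sequence


def pick_active_endpoint(
    priority: Sequence[str],
    healthy: Dict[str, bool],
) -> Optional[str]:
    """Return the first endpoint, in priority order, that is reported healthy.

    Endpoints missing from the health map are treated as unhealthy.
    Returns None when nothing is healthy.
    """
    for name in priority:
        if healthy.get(name, False):
            return name
    return None
```

With `priority=["lb1", "lb3"]` and a health map reporting lb1 as down, the function selects lb3, which matches what we observed for api.rollbar.com. When no endpoint is healthy it returns `None`, leaving the caller to decide what to serve.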
As a bonus, our Singapore and Amsterdam servers are now in service.
Parts of our service were unusable for a long period of time
In the short term (most of this will get done today):
1b. Set up a web server in a separate datacenter to serve a maintenance page.
1c. Add meta-level checks to status.rollbar.com. It currently gets data pushed from Nagios, but this isn’t helpful when San Jose is entirely unreachable.
2. Add another ‘dev’-like machine that we can use to administer servers, deploy code, etc. if San Jose is unreachable
3. Remove that old code, and make it an error if any API traffic hits the web tier.
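For item 3, one way to make stray API traffic fail loudly on the web tier is a small piece of WSGI middleware that checks the `Host` header. This is a sketch under assumptions rather than our actual change: the class name, the `on_reject` hook, and the choice of a 421 response are all illustrative.

```python
from typing import Callable, Iterable

# Hostname taken from the post; everything else here is hypothetical.
API_HOST = "api.rollbar.com"


class RejectApiTraffic:
    """WSGI middleware that refuses requests addressed to the API hostname."""

    def __init__(
        self,
        app: Callable,
        api_host: str = API_HOST,
        on_reject: Callable[[dict], None] = lambda environ: None,
    ):
        self.app = app
        self.api_host = api_host
        # Hook for reporting the misrouted request (for example, logging it).
        self.on_reject = on_reject

    def __call__(self, environ: dict, start_response: Callable) -> Iterable[bytes]:
        # Drop any port suffix and normalize case before comparing.
        host = environ.get("HTTP_HOST", "").split(":")[0].lower()
        if host == self.api_host:
            self.on_reject(environ)
            body = b"API requests are not served by the web tier\n"
            start_response(
                "421 Misdirected Request",
                [
                    ("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body))),
                ],
            )
            return [body]
        # Anything else passes through to the wrapped web application.
        return self.app(environ, start_response)
```

Wrapping the web application as `RejectApiTraffic(app, on_reject=report)` would turn misrouted API calls into a cheap, visible error instead of slow requests that tie up web processes. The tradeoff is that rejected API calls are lost unless the client retries, so this only makes sense once DNS failover for the API tier is trusted.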
And longer term:
1a. Add a host master standby in another datacenter for fast failover. If an episode like last night’s happens again, this will let us get notifications back online in a few minutes instead of a few hours.
1b. Set up a read-only web tier in another datacenter
We hope this was, if nothing else, an interesting look into our infrastructure and into the journey of building a highly available web service.
If you have any questions about the outage or otherwise, let us know in the comments or email us at email@example.com