CircleCI’s continuous integration and delivery platform helps software teams release code rapidly and with confidence by automating the build, test, and deploy process. It is used by companies of all sizes, from startups like Kickstarter, to larger businesses like GoPro and Spotify, to Fortune 500 companies like Facebook.
Since 2014, CircleCI has grown massively, both in number of customers and in the infrastructure required to support them. The engineering team is now more than 40 people strong, and they ship dozens of deploys per day.
Rob Zuber, CTO of CircleCI, said, “As the system we manage has gotten larger, being able to identify not only whether an issue is occurring but also where it is occurring across the system; what specific boxes might be triggering a particular error condition; and then digging through stacks and identifying what would have led to that; and then of course being able to associate that with new deploys – is a huge part of our debugging cycle.”
With a mixed fleet of hundreds of machines running in production, the team needs to detect and diagnose errors at this level of detail so that they can quickly eliminate an error’s impact on customers.
That’s where Rollbar comes in.
“If you don’t have Rollbar, it’s your customers telling you when something goes wrong, which is a terrible situation. Like, ‘Hey, I just got this 500’, and usually that’s on Twitter. There are so many emotional pains that developers and operators have from these kinds of hideous errors that they’ve shipped. What if you could make that go away? That’s what Rollbar does.”
Rob Zuber, CTO of CircleCI
CircleCI is primarily a Clojure shop, with ClojureScript on the front-end. The engineering team uses Datadog for metrics, PagerDuty to inform engineers about alert conditions, and Graylog (built on top of Elasticsearch) to consolidate log events. CircleCI chose Rollbar to complete their application monitoring stack. Rollbar is integrated throughout this stack, and has a very distinct place within it.
Rob said, “Rollbar really fits in as a cohesive piece of our debugging strategy for errors or issues in production. We often start from Rollbar when something is alerted to us. But we might come back and use it as an analytics mechanism to understand something we saw in Datadog or in Graylog and the pieces really fit together to give us a complete picture of errors in production.”
Rob described an example:
“In the case where there’s a specific host that is having problems in production (at the scale that we operate, it happens), we would see through an elevated error rate of a specific type, and all of those errors are coming from that one specific host.
“Because we use immutable infrastructure and there are always more of the same type of box, now we can just shut down that machine and have it replaced, at which point the problem is gone and the customer impact disappears very quickly,” says Rob.
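The host-isolation pattern Rob describes can be sketched in a few lines. This is an illustration only, not CircleCI’s tooling or Rollbar’s API: the event shape (dictionaries with `error` and `host` keys), the function name `suspect_hosts`, and both thresholds are assumptions made for the example.

```python
from collections import Counter, defaultdict


def suspect_hosts(events, min_share=0.8, min_count=5):
    """Map each error type to the single host producing most of its occurrences.

    events: iterable of dicts with hypothetical "error" and "host" keys.
    An error type is reported only if it occurred at least min_count times
    and one host accounts for at least min_share of those occurrences.
    """
    hosts_by_error = defaultdict(Counter)
    for event in events:
        hosts_by_error[event["error"]][event["host"]] += 1

    suspects = {}
    for error, hosts in hosts_by_error.items():
        total = sum(hosts.values())
        host, count = hosts.most_common(1)[0]
        if total >= min_count and count / total >= min_share:
            suspects[error] = host
    return suspects


# Made-up data: one error type concentrated on one box, another spread evenly.
events = (
    [{"error": "TimeoutError", "host": "box-17"}] * 9
    + [{"error": "TimeoutError", "host": "box-03"}]
    + [{"error": "KeyError", "host": f"box-{i:02d}"} for i in range(6)]
)
result = suspect_hosts(events)
```

With immutable infrastructure, a host flagged this way is a candidate for shutdown and replacement rather than in-place repair.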
CircleCI was also impressed by Rollbar’s feature set, particularly people tracking and deploy tracking. People tracking gives the team granular information on how each individual customer is experiencing their site, and on the progression of the bugs that customer encounters. Deploy tracking has revolutionized the way the engineering team tracks down and triages bugs, because identifying which deploy is associated with which error is now very easy.
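The idea behind associating an error with a deploy can be shown with a small sketch. It is not how Rollbar implements deploy tracking; it simply assumes a sorted list of hypothetical `(timestamp, revision)` pairs and looks up the deploy that was live when an error was first seen.

```python
from bisect import bisect_right


def deploy_for(first_seen, deploys):
    """Return the revision that was live at time first_seen, or None.

    deploys: list of (timestamp, revision) tuples sorted by timestamp.
    The timestamps and revision names are hypothetical illustration values.
    """
    timestamps = [ts for ts, _ in deploys]
    # Index of the first deploy strictly after first_seen.
    index = bisect_right(timestamps, first_seen)
    if index == 0:
        return None  # The error predates every known deploy.
    return deploys[index - 1][1]


deploys = [(100, "rev-a1"), (200, "rev-b2"), (300, "rev-c3")]
culprit = deploy_for(250, deploys)
```

With dozens of deploys per day, each deploy is small, so narrowing an error to a single revision already narrows the search to a handful of changes.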
The team also liked being able to customize when Rollbar alerts them, so they know when a known issue that has been low on the priority list suddenly becomes a big deal.
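A rule of that kind might look like the following sketch. The function name, the multiplier, and the minimum count are all assumptions chosen for illustration; they are not Rollbar’s alerting configuration.

```python
def should_alert(current_count, baseline_counts, factor=5, min_count=20):
    """Decide whether a known issue has spiked enough to page someone.

    current_count: occurrences in the current time window.
    baseline_counts: occurrences in earlier windows of the same length.
    Alerts only if the current window has at least min_count occurrences
    and exceeds factor times the average of the earlier windows.
    """
    if baseline_counts:
        baseline = sum(baseline_counts) / len(baseline_counts)
    else:
        baseline = 0.0
    return current_count >= min_count and current_count > factor * baseline


# A long-standing low-priority issue that normally occurs a few times per window.
quiet = should_alert(6, [3, 4, 5])
spike = should_alert(100, [3, 4, 5])
```

The minimum count keeps rare errors from paging anyone over small fluctuations, while the multiplier catches a genuine change in behavior.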
Today, CircleCI’s engineering team installs Rollbar whenever they create a new service, and they rely on it everywhere. “Without the visibility Rollbar provides in the production landscape, we simply would not be able to operate at the scale we do,” says Rob.
Rob stressed that catching unexpected errors is the indispensable part of having Rollbar track CircleCI’s code in production.
“There are things that you don’t think about or miss in the process of building a new piece of code. One of the things that gives you the comfort to operate in a model of continuous delivery is knowing that if you did miss something it will be caught immediately and alerted back to you very quickly. That’s the critical piece that allows you to ship with confidence.”
“One of the things I talk about with our customers is that continuous deployment, though it scares a lot of people, is actually a mechanism for reducing risk, because the change that you deploy is very, very small.
“But the key to that working is great instrumentation. You can’t just blindly throw things into production, and assume that everything is great. You really need to feel confident that you understand what’s happening.
“And without Rollbar giving us visibility into exceptions in production, we just wouldn’t be confident and we’d ship more slowly.
“So, I would say velocity, in terms of shipping, is what we gain by using Rollbar.”
Production deploys: dozens per day
Graylog on Elasticsearch
CircleCI’s debugging strategy entails not only getting alerted by Rollbar, but also using Rollbar as an analytics mechanism to understand something they see in Datadog or Graylog