Proactive Error Monitoring with Rollbar
Since its inception in 2012, Instacart has been working to make online grocery shopping commonplace. It allows consumers to order groceries from their phone or the web and have them delivered to their doors in minutes.
Instacart is currently present in over 25 markets and partners with large chains such as Whole Foods, Costco, and Safeway to offer users in-store prices on groceries. It employs thousands of shoppers across the U.S. to fulfill orders, and processes upwards of 20,000 requests per minute.
Growing along with the popularity of the app, the Instacart engineering team has scaled rapidly over the past two years, going from five engineers to over a hundred today. At this magnitude and pace, Instacart publishes new code releases around 30 times a day. Tests happen at several layers and stages of the production process before release, but the final and most important test is "actually having real customers use our product and use our API," said Arnaud Ferreri, Engineering Team Lead at Instacart.
The engineering team practices continuous delivery for many of its services. Engineers work in their own branches, and those branches get merged back into the master branch. With each merge, a new build is pushed to the staging servers, then to the beta servers, and finally to production.
Arnaud explained, "Because any new commit to the code base is going live to production, we can't have engineers looking at production every minute to see if everything is going right. We need to be proactively alerted very, very quickly when something goes wrong. That's what Rollbar does for us."
"Rollbar allows us to go from alerting to impact analysis and resolution in a matter of minutes. It's fully ingrained into our development cycle and monitoring. Without it we would be flying blind."
Arnaud Ferreri - Engineering Team Lead, Instacart
At its current scale - and still scaling - Instacart does a lot of meta-level monitoring. The team is hosted on AWS, so it uses CloudWatch; it views full logs in Papertrail; it runs build tests on its staging servers through CircleCI; and it uses New Relic to track performance. But for proactive alerting and for granularity on error forensics, the team turns to Rollbar.
“The fact that you get the exact stack trace that links into your code base; the fact that you have all parameters of a given request so you can easily reproduce the issue; the fact that you have information about the customer who triggered the error so you can easily see if it’s the same customer repeating the same error again… All of these make it very easy to understand the scale and complexity of a given problem at a given time,” said Arnaud.
Since code errors can potentially affect thousands of real customers and working shoppers, the ability for Instacart to prioritize error resolution according to the size of the problem is vital. “We’re using Rollbar to its maximum to make sure that we get alerted for the right reasons, and that we alert the right people for the right severity of problem,” said Arnaud.
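To make the idea of sizing a problem concrete, here is a minimal Python sketch of impact analysis. It is not Rollbar's API or Instacart's code: the `ErrorEvent` type, the `fingerprint` grouping key, the function names, and the thresholds are all hypothetical, chosen only to show how counting distinct affected customers (rather than raw occurrences) could drive who gets alerted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """One reported error. All field names here are illustrative assumptions."""
    fingerprint: str   # hypothetical grouping key, e.g. exception class + top frame
    customer_id: str   # the customer whose request triggered the error


def summarize_impact(events):
    """Group events by fingerprint; count occurrences and distinct customers."""
    occurrences = defaultdict(int)
    customers = defaultdict(set)
    for event in events:
        occurrences[event.fingerprint] += 1
        customers[event.fingerprint].add(event.customer_id)
    return {
        fp: {"occurrences": occurrences[fp], "customers": len(customers[fp])}
        for fp in occurrences
    }


def alert_level(affected_customers, page_threshold=100, notify_threshold=10):
    """Map the number of affected customers to an alert level.

    The thresholds are made-up defaults, not values from the article.
    """
    if affected_customers >= page_threshold:
        return "page"
    if affected_customers >= notify_threshold:
        return "notify"
    return "log"


if __name__ == "__main__":
    # One customer hitting the same error three times is a small problem.
    events = [ErrorEvent("CheckoutError", "customer-1") for _ in range(3)]
    for fp, stats in summarize_impact(events).items():
        print(fp, stats, alert_level(stats["customers"]))
```

Separating occurrences from distinct customers reflects the distinction Arnaud draws: a single customer repeating the same error is a different situation from the same error reaching many customers.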
The Instacart engineering teams use Rollbar so extensively that at one point, they made it their goal to achieve ‘Rollbar Zero’, which is to have zero Rollbar errors at any given time on any open project. The ambitious exercise not only boosted coding performance, but also helped to brilliantly clarify workflows for the teams.
As Arnaud explained it, this involved getting to “a level of health that we were not seeing errors that we were unaware of, or that we were able to control.”
Rollbar is now being used for other projects besides Instacart’s consumer app.
Since thousands of shoppers across the U.S. rely on their shopper apps for work, the Instacart teams carefully stage the releases of these apps. They release to a subset of beta shoppers first, then to one city, then to more cities, until finally they release across the board.
The teams have recently added Rollbar into these shopper apps to get the same centralized information they utilize for their web projects.
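The staged release described above (beta shoppers, then one city, then more cities, then everyone) can be sketched as a simple gating function. This is an illustrative Python sketch only; the stage names, the hash-based beta bucket, and every identifier below are assumptions, since the article does not describe how Instacart actually implements its rollout.

```python
import hashlib

# Hypothetical stage names; the article only describes the order of the rollout.
STAGES = ("beta", "cities", "everyone")


def is_beta_shopper(shopper_id, beta_percent):
    """Deterministically place a shopper in the beta group.

    Hashing the ID gives a stable bucket in 0..99, so the same shopper
    always gets the same answer for a given beta_percent.
    """
    digest = hashlib.sha256(shopper_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < beta_percent


def receives_release(shopper_id, city, stage, enabled_cities=frozenset(), beta_percent=5):
    """Decide whether a shopper should get the new release at this stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    if stage == "everyone":
        return True
    if is_beta_shopper(shopper_id, beta_percent):
        return True  # beta shoppers stay included in every later stage
    # In the "cities" stage, widen enabled_cities from one city to several.
    return stage == "cities" and city in enabled_cities


if __name__ == "__main__":
    print(receives_release("shopper-42", "Austin", "cities", {"Austin"}))
```

Widening `enabled_cities` one city at a time keeps each step small, so errors reported from a new release can be caught while only a limited group of shoppers is exposed to it.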
It’s an exciting time for the grocery industry, and Instacart is poised to change grocery shopping habits permanently. Of course, none of this would have been possible without the continuous integration and deployment pipeline driving Instacart’s code ever forward.
“The fact that we’re ramping up continuous deployment for a lot of our services - I think that’s only doable because we have Rollbar integrated into our alerting system,” said Arnaud.
Arnaud said that Rollbar has helped his engineering teams “dramatically improve the way we do our detective work, understanding when there’s an actual issue, where it’s coming from, at what volume, how many customers it’s affecting, and what action needs to be taken.”
Without Rollbar, they wouldn’t be able to solve issues as quickly as they need to, since they depend on the fine-grained error context that Rollbar provides.
Rollbar is so ingrained into Instacart’s development cycle that Arnaud admitted he and his team don’t even see it as a third-party tool.
“Rollbar is so tightly coupled into the way we work, it seems part of our system as a whole.”
"Rollbar is an essential part of our release process, helping us make sure that the code and the new features we’re shipping are as high quality as possible."Read their story
"There are so many emotional pains that developers and operators have from these kinds of hideous errors that they’ve shipped. What if you could make that go away?"Read their story