
Instacart, leader in grocery delivery, relies on Rollbar for production error monitoring

  • Instacart, a leader in the on-demand marketplace, provides one-hour grocery delivery to users of their app and employs thousands of shoppers across the US to support order fulfillment.
  • Of the one-trillion-dollar grocery industry, only 1% of market share currently comes from online grocery sales. As Instacart captures more of this market, they turn to Rollbar for continuous monitoring of their service’s health.
  • With a promise of one-hour delivery, and a shopper workforce relying on their apps, Instacart’s services must be up at all times. Rollbar’s proactive alerting and granular error forensics support the continuous integration and deployment pipeline at the heart of Instacart’s service.

Rollbar allows us to go from alerting to impact analysis and resolution in a matter of minutes. It's fully ingrained into our development cycle and monitoring. Without it we would be flying blind.

Arnaud Ferreri, Engineering Lead for the Consumer Team at Instacart.

The Challenge

While many industries have worked out how to participate in the on-demand marketplace, the one-trillion-dollar grocery industry is one of the last holdouts. Tight margins may be making some investors skittish. Changing customer perceptions about the quality and freshness of app-ordered groceries may also be a factor. And building and scaling the technology layer that acts as the on-demand engine has been a daunting challenge. It’s this last barrier to entry that Instacart has already dismantled.

Since its inception in 2012, Instacart has been wooing grocery shoppers away from retail spaces and onto smartphones, where they place grocery orders through the Instacart app. Instacart currently operates in about 26 markets, partnering with large chains such as Whole Foods, Costco, and Safeway to offer app users in-store prices on groceries. The only extra cost to the app user is a simple delivery charge. The big selling point, though, is the promise of one-hour delivery. Instacart employs thousands of shoppers across the U.S. to fulfill orders, and processes upwards of 20,000 requests per minute.

At this magnitude and pace, Instacart publishes new code releases around 30 times a day. Tests run at several layers and stages before release, but the final and most important test is “actually having real customers use our product and use our API endpoints,” says Arnaud Ferreri, Engineering Lead for the Consumer Team at Instacart. “At our stage, if anything goes wrong, we have to be alerted very, very quickly, and that’s what Rollbar has solved for us.”

The Solution

At its current scale (and still scaling), Instacart does a lot of meta-level monitoring. They are hosted on AWS, so they use CloudWatch; they view full logs in Papertrail; they run build tests on their staging servers through CircleCI; and they use New Relic to track performance. But for proactive alerting and granular error forensics, they turn to Rollbar.

“The fact that you get the exact stack trace that links into your code base; the fact that you have all parameters of a given request so you can easily reproduce the issue; the fact that you have information about the customer who triggered the error so you can easily see if it’s the same customer repeating the same error again… All of these make it very easy to understand the scale and complexity of a given problem at a given time,” says Arnaud.

Growing along with the popularity of the app, the Instacart engineering team has scaled pretty rapidly over the past two years, going from five engineers to 70. Working in Rails, JavaScript, and Python, the team has appreciated the way Rollbar integrates seamlessly with these languages, as well as with other tools they use such as GitHub, PagerDuty, and Slack. At this point they are in a continuous deployment cycle for a lot of their services, with any new commits in their code base going live to production. “Because of that, we can’t have engineers looking at production every minute to see if everything is going right. We need to be proactively alerted when something goes wrong,” says Arnaud.

The Instacart engineers work in their own branches, and those branches get merged back to the master project. With each merge a new build release gets pushed to their staging servers, then to their beta servers and finally the release hits production. At all stages of this process, the team gets pinged proactively by Rollbar about any new errors that come in.

For a time, the Instacart engineering teams adopted an ambitious power-user goal called ‘Rollbar Zero’, shooting for zero Rollbar errors at any given time on any open project. As Arnaud explains it, this involved getting to “a level of health that we were not seeing errors that we were unaware of, or that we were able to control.” This exercise not only boosted coding performance but also helped clarify workflows for the teams.

For Instacart, power usage of Rollbar isn’t limited to their consumer app. Since thousands of shoppers across the U.S. rely on their shopper apps for work, Instacart carefully goes with staged releases of these apps. First they release to a subset of beta shoppers, then to one city, then more cities, until finally they release across the board. The engineering teams have recently added Rollbar into these shopper apps to get the same centralized information they utilize for their web projects.
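
The staged release described above (beta shoppers, then one city, then more cities, then everyone) can be pictured as a simple gating function. The following Python sketch is purely illustrative and is not Instacart's implementation: the `Shopper` type, the stage names, the `receives_release` function, and the city names in the usage note are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical stage names, in the order the article describes.
STAGES = ("beta", "one_city", "more_cities", "everyone")


@dataclass(frozen=True)
class Shopper:
    shopper_id: int
    city: str
    is_beta: bool = False


def receives_release(shopper, stage, pilot_city, expanded_cities):
    """Return True if this shopper should get the new app build at this stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    if stage == "everyone":
        return True
    # Beta shoppers are included from the first stage onward.
    if shopper.is_beta:
        return True
    if stage == "beta":
        return False
    # From "one_city" onward, the pilot city is included.
    if shopper.city == pilot_city:
        return True
    # "more_cities" widens the release to an explicit set of cities.
    return stage == "more_cities" and shopper.city in expanded_cities
```

For example, with a pilot city of "San Francisco" and an expanded set containing "New York", a non-beta New York shopper would be excluded at the `one_city` stage and included at `more_cities`. Each stage is a superset of the one before it, so errors reported from the shopper apps at an early stage reach only a limited audience.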

The Results

It’s an exciting time for the grocery industry, and Instacart is poised to change grocery shopping habits permanently. Of course, none of this would be possible without the continuous integration and deployment pipeline driving Instacart’s code ever forward. “The fact that we’re ramping up continuous deployment for a lot of our services - I think that’s only doable because we have Rollbar integrated into our alerting system,” says Arnaud.

Since code errors can potentially affect thousands of real customers and working shoppers, Instacart’s ability to prioritize error resolution according to the size of the problem is vital. “We’re using Rollbar to its maximum to make sure that we get alerted for the right reasons, and that we alert the right people for the right severity of problem,” says Arnaud. One designated person is pinged if a new error pops up, whereas a whole team is pinged if a known error is happening multiple times a minute. The response is immediate and proportional to the size of the issue.
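
The escalation rule described above can be sketched in a few lines of Python. This is an illustration only, not Instacart's or Rollbar's actual configuration: the class name `AlertRouter`, the 60-second window, the threshold of three occurrences standing in for "multiple times a minute", and the recipient labels are all assumptions made for the example.

```python
from collections import defaultdict, deque


class AlertRouter:
    """Decide who to notify for each error occurrence (hypothetical sketch)."""

    def __init__(self, window_seconds=60.0, team_threshold=3):
        self.window_seconds = window_seconds  # assumed rate window
        self.team_threshold = team_threshold  # assumed "multiple times a minute"
        self._known = set()                   # fingerprints seen before
        self._recent = defaultdict(deque)     # fingerprint -> recent timestamps

    def route(self, fingerprint, timestamp):
        """Return "on_call_engineer", "team", or None for one occurrence."""
        recent = self._recent[fingerprint]
        recent.append(timestamp)
        # Drop occurrences that have fallen out of the sliding window.
        while timestamp - recent[0] >= self.window_seconds:
            recent.popleft()

        if fingerprint not in self._known:
            # A brand-new error: ping one designated person.
            self._known.add(fingerprint)
            return "on_call_engineer"
        if len(recent) >= self.team_threshold:
            # A known error firing repeatedly within the window: ping the team.
            return "team"
        return None
```

Calling `route("checkout-500", t)` with timestamps 0, 10, and 20 seconds would notify the designated engineer on the first occurrence, nobody on the second, and the team on the third. Keeping the "known" set separate from the sliding window means an error that goes quiet and later returns is still treated as known rather than new.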

Arnaud says that Rollbar has helped his engineering teams “dramatically improve the way we do our detective work, understanding when there’s an actual issue, where it’s coming from, at what volume, how many customers it’s affecting, and what action needs to be taken.” Without Rollbar, they wouldn’t be able to solve issues as quickly as they need to since they depend on the fine, granular context on errors that Rollbar provides.

Rollbar is so ingrained into Instacart’s development cycle that Arnaud admits he and his team don’t even see it as a third-party tool: “Rollbar’s so tightly coupled into the way we work, it seems part of our system as a whole.”

Thank you to Arnaud and the team at Instacart for sharing this level of insight and for leading by example when it comes to continuous delivery and working to maintain error-free experiences for their users.

If you haven’t already, sign up for a 14-day free trial of Rollbar and let us help you take control of your application errors.
