Identify and Resolve Issues As Quickly As Possible
Founded in 2009, Duolingo is a leading language education platform with more than 300 million users worldwide. In the company’s early days, developers would often learn of problems in their user experience from posts on social media by users – or worse yet, from their CEO, who actively reviewed the platform. “It was already late in the cycle,” said Hector Villafuerte, one of the first engineers at Duolingo. “We didn’t want issues being found by users.”
The company released updates every three weeks, and it was an expensive, time-consuming process. “Each release was costly. Developers had to test their code, and then there’s all the back and forth with other stakeholders to get their piece into the release,” said Severin Hacker, Duolingo Co-founder and CTO.
When errors occurred, they were challenging to identify. Teams would have to look at what recently changed within the platform – be it a release, a feature flag, or an A/B test – in order to triangulate an issue. “The errors were never that obvious,” Villafuerte said.
As the company grew, it sought to improve its incident tracking in order to identify and resolve issues as fast as possible. It chose to partner with Rollbar in 2013, drawn to the software’s real-time capabilities, the rich contextual information surrounding each error, and open API access.
Leverage Rollbar to Enable Real-Time Issue Identification & Accelerate Time to Resolution
Duolingo instrumented Rollbar directly into its architecture, using the technology to aid in phases throughout the software lifecycle, including quality assurance testing, release, and production monitoring. “I really can’t imagine the development cycle working without Rollbar…It’s hard to imagine life at Duolingo without it,” Villafuerte said.
Duolingo’s engineering team has long been focused on automation, and has leveraged Rollbar’s API to build on top of the software, enabling even faster error detection and resolution. “It’s all about speed of development. Rollbar helps make it easier for everyone involved — not just developers, but also for QA and Operations to surface issues and resolve them faster,” Villafuerte said.
To aid in quality assurance, Duolingo leverages Rollbar to ensure front-end issues in the company’s iOS app, Android app, and web experience do not escape to production. For every pull request in its CI/CD pipeline, Duolingo automatically runs a number of tests (including unit tests, integration tests, and smoke tests) and checks for any new Rollbar occurrences within those tests. If any new errors occur, the test fails, but if Rollbar and other complementary systems give the “green light,” the release goes straight into production. “For most of our services, we deploy to production on merge. That wouldn’t be the case if we didn’t have the confidence we could recover quickly,” Villafuerte explained. Armed with the confidence Rollbar gives them, the Duolingo team regularly pushes dozens of changes per day, with minimal reversions.
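As a rough illustration of the pull-request check described above, the sketch below fails a gate whenever a test run produces an error group that was not already known. It is not Duolingo's implementation and does not call the Rollbar API: the `Occurrence` type, the `fingerprint` field, and both function names are hypothetical, and the occurrences are assumed to have been fetched already by some other step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    """One error seen during a test run (hypothetical shape, not a Rollbar type)."""
    fingerprint: str  # assumed grouping key that identifies "the same" error
    message: str


def new_fingerprints(known: set[str], observed: list[Occurrence]) -> set[str]:
    """Return the error groups seen in this run that were not known before it."""
    return {occ.fingerprint for occ in observed} - known


def gate_passes(known: set[str], observed: list[Occurrence]) -> bool:
    """The gate passes only when the run introduced no new error groups."""
    return not new_fingerprints(known, observed)


if __name__ == "__main__":
    baseline = {"a1"}  # error groups that existed before the pull request
    run = [Occurrence("a1", "old timeout"), Occurrence("b2", "new null reference")]
    print(sorted(new_fingerprints(baseline, run)))
    print("pass" if gate_passes(baseline, run) else "fail")
```

Comparing against a baseline, rather than requiring zero errors, mirrors the article's emphasis on new occurrences: pre-existing noise does not block a merge, but anything the change introduces does.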
For release and production monitoring, Rollbar is used in conjunction with CloudWatch, Grafana, and PagerDuty, identifying errors in real time and ensuring the right teams are alerted to address them. When Rollbar occurrences happen, emails and Slack notifications are automatically sent to the correct teams; for the most important issues, PagerDuty alerts are triggered as well.
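The routing rule in the paragraph above (email and Slack for every occurrence, PagerDuty only for the most important ones) can be sketched as a small pure function. Everything here is assumed for illustration: the service names, the team names, the `critical` level, and the fallback to an operations team are hypothetical, and no real notification API is called.

```python
from dataclasses import dataclass

# Hypothetical ownership map from service name to the team that should hear about it.
OWNERS = {"lessons-api": "learning", "payments": "monetization"}

# Hypothetical set of severity levels considered important enough to page someone.
PAGE_LEVELS = {"critical"}


@dataclass(frozen=True)
class Alert:
    service: str
    level: str


def route(alert: Alert) -> list[tuple[str, str]]:
    """Return (channel, team) pairs for an alert.

    Every alert goes to email and Slack; paging levels also go to PagerDuty.
    Services without a known owner fall back to the operations team.
    """
    team = OWNERS.get(alert.service, "operations")
    channels = ["email", "slack"]
    if alert.level in PAGE_LEVELS:
        channels.append("pagerduty")
    return [(channel, team) for channel in channels]


if __name__ == "__main__":
    print(route(Alert("payments", "critical")))
    print(route(Alert("unknown-service", "warning")))
```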
The team especially appreciates being able to see where each error originates: “Not all the errors are relevant to the current release, so I look for new errors. I have all the microservices monitoring turned on at once across screens. My wife calls it ‘The Matrix,’ and I can see the error appear in one microservice, then the next and the next on down the line,” said Max Blaze, Head of Operations at Duolingo.
“Rollbar helps make it easier for everyone involved, not just developers, but also for QA and Operations to surface issues and resolve them faster.”
Hector Villafuerte, Engineer, Duolingo
When it comes to production monitoring, the team relies on the robust contextual information Rollbar provides to help prioritize and inform issue resolution. As Blaze explained, “In any given Rollbar item, there are a lot of things we look at to more quickly narrow down which code to look at: How big the issue is by number of occurrences? Is it just one or many accounts that are triggering it? Could it be an attack?”
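The triage questions Blaze lists (how many occurrences, and whether one account or many is triggering them) amount to a small summary over an item's occurrences. The sketch below is one way to compute it, assuming each occurrence is a dictionary with a hypothetical `account_id` key; it is not a Rollbar feature or data format.

```python
from collections import Counter


def summarize(occurrences: list[dict]) -> dict:
    """Summarize an error item: size, spread across accounts, and concentration.

    A high top_account_share means a single account produces most of the
    occurrences, which may point to one broken client or to abuse.
    """
    total = len(occurrences)
    per_account = Counter(occ["account_id"] for occ in occurrences)
    top_count = per_account.most_common(1)[0][1] if per_account else 0
    return {
        "occurrences": total,
        "distinct_accounts": len(per_account),
        "top_account_share": top_count / total if total else 0.0,
    }


if __name__ == "__main__":
    sample = [{"account_id": "a"}] * 3 + [{"account_id": "b"}]
    print(summarize(sample))
```

Whether a concentrated item is an attack still takes human judgment; the summary only narrows down where to look first.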
Once new incidents are identified and documented, they are added to the company’s playbooks, enabling knowledge sharing. “It prevents silos, and saves time for similar issues down the road,” Villafuerte said.
“I really can’t imagine the development cycle working without Rollbar… It’s hard to imagine life at Duolingo without it.”
Hector Villafuerte, Engineer, Duolingo
Real-Time Error Monitoring and Faster Time to Resolution
Rollbar has become integral to Duolingo’s systems and processes, accelerating incident identification, shortening time to resolution, and improving the customer experience. “We don’t know where we would be today without Rollbar. It is mission critical,” Hacker said.
REDUCED TIME TO RESOLUTION: All 200+ of Duolingo’s engineers use Rollbar. Because Rollbar groups errors and stores important contextual data around each one, it helps engineers triangulate issues and reduces the amount of code they need to examine to investigate and resolve them.
MORE A/B TESTS AND INNOVATION: Duolingo’s A/B testing framework is built on top of a suite of tools including Rollbar, enabling the company to run 100-200 experiments concurrently. Experiments that improve KPIs such as retention, engagement, and monetization without any Rollbar occurrences detected are automatically pushed to the company’s CD pipeline. “If you can run 10x as many experiments, you can be 10x as innovative. Without Rollbar we would not be able to have as many releases or A/B tests,” Hacker said.
MORE CONFIDENT RELEASES: If the Rollbar smoke test doesn’t turn up issues, changes are pushed directly into production. “If we weren’t able to identify issues as soon as possible, or recover as fast as possible, you’d probably spend tons of money testing things. Now we have immediate alarms and we can revert or push a fix. It’s part of the DevOps philosophy to minimize the release cycle,” Villafuerte said.
FEWER LARGE-SCALE INCIDENTS: With Rollbar integrated into the testing and release phases, the company has experienced fewer large-scale incidents. “We have a weekly post-mortem on the calendar,” Villafuerte said. “Sometimes we cancel it, because there’s nothing to discuss!”
MORE TIME SAVED: “We can see the error right away, which saves a ton of developer time,” Hacker said. “It only gets more and more valuable as the team continues to grow.” Since Rollbar’s APIs are open and easy to work with, Duolingo has also been able to integrate Rollbar directly into its architecture and automated workflows. “It saves a lot of time,” Villafuerte said. “Other services don’t have management APIs; manual tasks could take up your whole day.”
INCREASED OPERATIONAL EXCELLENCE: Errors are tracked and documented in Rollbar, promoting knowledge sharing and making junior developers more productive. “They don’t have to rely on more senior developers for help. They can just look at the information in Rollbar and navigate it themselves.”
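The experiment rule described under MORE A/B TESTS AND INNOVATION above (promote an experiment when it improves KPIs and triggers no Rollbar occurrences) can be written as a simple predicate. This is a sketch under stated assumptions, not Duolingo's framework: the `ExperimentResult` type is hypothetical, and "improves KPIs" is interpreted here as at least one KPI rising with none falling, which the article does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentResult:
    """Outcome of one A/B experiment (hypothetical shape)."""
    name: str
    kpi_deltas: dict[str, float]  # relative change versus control, per KPI
    new_error_count: int          # new error occurrences attributed to the variant


def should_promote(result: ExperimentResult) -> bool:
    """Promote only error-free experiments that help at least one KPI and hurt none."""
    if result.new_error_count > 0:
        return False
    deltas = result.kpi_deltas.values()
    return any(d > 0 for d in deltas) and all(d >= 0 for d in deltas)


if __name__ == "__main__":
    good = ExperimentResult("streak-reminder", {"retention": 0.02, "engagement": 0.0}, 0)
    noisy = ExperimentResult("new-paywall", {"monetization": 0.05}, 3)
    print(should_promote(good), should_promote(noisy))
```

Checking errors first means a variant that lifts a metric but introduces new failures is never promoted automatically, which matches the article's pairing of KPI gains with a clean error signal.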
“We don’t know where we would be today without Rollbar. It is mission critical.”
Severin Hacker, Co-founder and CTO, Duolingo
Rollbar is used throughout the SDLC, including QA, release, and production monitoring
Monitors 100-200 concurrent A/B experiments
New incidents are identified, documented, and added to Rollbar playbooks
Application monitoring stack: Rollbar, CloudWatch, Grafana, PagerDuty