Twilio is a software company based in San Francisco, CA that empowers business communication through its voice, video, messaging and two-factor authentication services. Developers from companies such as Netflix, Airbnb, Lyft, and Intuit use Twilio’s cloud API to build complex communication systems that are embedded into web, mobile and desktop platforms.
With 50,000+ customers on the Twilio platform, plus hundreds of millions of end-users interacting with Twilio’s services on a regular basis, the company’s number one objective is to build trust.
“Developers and end-users have to rely on us to provide communications at scale, but also to provide highly resilient communications,” says Tyler Wells, Director of Engineering: WebRTC/SDKs at Twilio. “We have banks and customer service companies building communication using Twilio, customer service, but also nonprofits utilizing us for things like a suicide prevention hotline. Our services have to be up and available all the time.”
To maintain this trust, Twilio focuses heavily on providing and measuring operational excellence across all of its teams, services and products. A few years ago, at SIGNAL, Twilio’s annual developer conference, it announced the Operational Maturity Model (OMM), a framework that personifies operational excellence across a number of dimensions, from operations to build and test to security. To achieve “Iron Man” status, the highest level in the OMM, all teams must check off a number of criteria, including the implementation of Rollbar.
Twilio runs a large number of distributed systems over multiple AWS instances. Teams are shipping code continuously multiple times a day. As a result of running such a large-scale operation in the cloud, errors are part of the equation. Before Rollbar, the typical way for engineers to deal with errors was to log into a box or aggregate logs across multiple services to pinpoint the issue. This was time-consuming, frustrating and didn’t always provide the directional visibility to solve problems.
After learning about Rollbar in 2014, Tyler and the WebRTC/SDKs team decided to try it out to see if it would provide a better way to track and manage errors. They immediately noticed the benefits of having a tool programmatically trap these exceptions and provide rich error data in a dashboard.
“Rollbar is our early warning system for errors. The worst thing that can happen is a customer writes in to the support team to say something is broken. Rollbar allows us to be ahead of our customers and to fix issues before they know something is wrong.”
Tyler Wells, Director of Engineering: WebRTC/SDKs
After other teams caught wind of the benefits that the WebRTC/SDKs team was getting from Rollbar, the platforms team, which provides services for all the product teams within Twilio, decided to build Rollbar directly into its tooling. Now every team at Twilio uses Rollbar as part of the OMM because it not only captures errors faster and more easily than before, but also provides direction on how to solve issues.
Twilio runs systems across 27 different data centers in 9 regions across the world, and Rollbar helps pinpoint things like when a specific data center or region is impacted. Teams benefit from this level of visibility, says Tyler, and from how easy it is to get directional data without digging into stack traces or log files.
“If a tool is a burden to implement, developers won’t use it. That obviously isn’t the case with Rollbar. It’s simple to set up. And it gives us value by providing actionable exceptions, aggregation, alerting, and directionality.”
Tyler Wells, Director of Engineering: WebRTC/SDKs
Rollbar has provided Twilio with many benefits, but the most quantifiable has been its impact on Mean Time to Discovery (MTTD), or how quickly an issue is identified. While there is plenty of focus on Mean Time to Recovery (MTTR), or how quickly an error gets resolved, equally important is how fast issues are identified and whether this happens before a customer ever notices.
“If my Mean Time To Discovery is based on a customer support ticket, I’ve failed,” says Tyler. “If it’s based on the number of exceptions or errors that I’ve tracked and sent to Rollbar, we’re doing a pretty good job.”
Tyler says that the combination of Rollbar’s tracking, aggregating and alerting of errors, along with Twilio engineers becoming more thoughtful about how error cases are logged and designed, has resulted in better directionality and faster resolution of issues, especially for those early morning incidents.
“It’s 3am, you’re sleepy eyed, the last thing you want to do is log into a box and grep log files,” says Tyler. “As you’re walking to your computer, you want to look at the PagerDuty alert, see the information from Rollbar and have a sense of direction quickly. You’ve hit that mean time to discovery - now you focus on the resolution.”
As Twilio has scaled its business (the company went public in 2016), Rollbar has scaled seamlessly alongside it.
“Rollbar lets us sleep at night,” says Tyler. “Our on-call engineers can trust that Rollbar is receiving the error information, properly aggregating it, and triggering incidences that need to be handled. It’s something we’ve come to rely upon.”
Companies such as Lyft and Netflix, banks, and nonprofits running services like a suicide prevention hotline all use Twilio
Teams must use Rollbar to achieve Iron Man, the highest status in Twilio’s Operational Maturity Model
Rollbar helps with Mean Time to Discovery (MTTD), not just MTTR
Rollbar has scaled alongside Twilio from its early days through its IPO and beyond