“The more we can catch in staging and shadow, the less we have to deal with in production. We have basically everything in Rollbar. It has helped us respond more quickly.”
Senior Site Reliability Engineer
Since its founding in 2011, this online lending company, which provides loans for education, home-buying and other large purchases, envisioned doing things differently. By providing loan services completely online, it has been able to differentiate itself from competitors by offering lower interest rates and savings to its customers. Eight years after launch, this major enterprise organization is now a thriving company in the FinTech space with over 1,000 employees.
Innovation is simply part of the culture at this company, and the engineering organization is a reflection of this. Their practices, tech stack, and team resemble those of a small and lean startup rather than a financial institution. This has allowed them to deliver vast amounts of value to their users and disrupt the loans and financial management market in a short amount of time.
Given the sensitive nature of the industry they are in, stability in production is critical to their brand reputation and growth. Observability and reliability practices have always been important for their processes. To support the extraordinary need for stability and quality required in the financial space, a new environment was created. This shadow environment matches production to a higher degree than most staging environments in tech companies. Scale, data, and infrastructure are matched one to one. Stress tests can push the shadow environment to handle up to 12 times the load of production.
Before implementing Rollbar for error monitoring and tracking as part of their staging and shadow setup, this company was relying on an ELK setup (Elasticsearch + Logstash + Kibana) along with an internal PaaS (Platform as a Service) that allowed individuals to spin up a full ephemeral environment when needed.
While this company was proud of achieving a high level of stability in the staging phase, its ELK setup could not keep up with the pace of exceptions or the need to report on them in real time during CI builds, regardless of the investment and manpower dedicated to keeping it responsive. After a bad deploy, the ELK stack could fill up and stop reporting or, on the contrary, keep generating errors incessantly. They realized that their existing tech stack alone wasn’t capable of handling the level of complexity and the deployment schedule they were aiming for to maintain the speed of innovation.
Rollbar was the answer they were looking for. After implementing Rollbar to handle all error tracking and monitoring in their staging and shadow environments, this FinTech company was able to ensure that no unexpected exceptions were slipping through to production, no matter how many teams pushed or how often. “The more we can catch in staging and shadow, the less we have to deal with in production. We have basically everything in Rollbar,” says Marcus, a Senior Site Reliability Engineer at this company.
By implementing such a robust system early in the development process, they continued to keep up with their focus on innovation and continuous deployment. According to the Site Reliability Engineer, “The ELK stack was limited on what it could do. Rollbar has helped us respond more quickly. We don’t have any false alarms or on the other hand, don’t have to worry about not getting an alarm when we should.”
The organization makes extensive use of Rollbar's features and integrations. They have heavily adopted “custom fingerprinting”, which lets them get the maximum benefit from the grouping functionality. In addition, attaching a transaction identifier to each exception has been of great value, because it allows exceptions to be grouped and correlated across multiple microservices.
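The idea behind these two practices can be illustrated with a minimal Python sketch. It is not the company's implementation and it does not call Rollbar; the function names, the event fields, and the masking rule (replace digits in the message) are assumptions made for illustration only. It shows how a custom fingerprint can place two occurrences of the same failure in one group, and how a shared transaction identifier ties together exceptions raised in different services.

```python
import hashlib
import re
from collections import defaultdict


def custom_fingerprint(exc_class: str, message: str) -> str:
    """Hypothetical grouping key: exception class plus the message
    with volatile numbers masked, hashed to a stable string."""
    normalized = re.sub(r"\d+", "<n>", message)
    raw = f"{exc_class}:{normalized}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def correlate_by_transaction(events):
    """Collect the services that reported an exception for each
    transaction ID, in the order the events were seen."""
    by_txn = defaultdict(list)
    for event in events:
        by_txn[event["transaction_id"]].append(event["service"])
    return dict(by_txn)


# Illustrative events; service names and messages are made up.
events = [
    {"service": "loan-api", "transaction_id": "txn-1",
     "exc_class": "TimeoutError", "message": "upstream timed out after 3000 ms"},
    {"service": "rates", "transaction_id": "txn-1",
     "exc_class": "TimeoutError", "message": "upstream timed out after 4500 ms"},
    {"service": "loan-api", "transaction_id": "txn-2",
     "exc_class": "ValueError", "message": "invalid term 0"},
]

fingerprints = [custom_fingerprint(e["exc_class"], e["message"]) for e in events]
correlated = correlate_by_transaction(events)
```

In this sketch the two timeout events share one fingerprint even though their durations differ, while the transaction ID shows that both came from the same request path across two services.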
By integrating Rollbar with JIRA, tickets are created automatically for every new error generated across all environments and assigned to the team that is responsible for fixing it. Being able to “link” or “de-duplicate” bug reports by tracking the originating exception that caused them, both within the same build and across previous occurrences in unrelated builds, has delivered huge gains.
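The linking behaviour described above can be sketched in a few lines of Python. This is an assumption-laden illustration, not the Rollbar or JIRA integration itself: the `route_occurrence` function, the in-memory `tickets` dictionary, and the `BUG-n` ticket keys are all hypothetical. A new fingerprint opens a ticket for the owning team, and later occurrences in other builds are linked to that ticket instead of opening a duplicate.

```python
def route_occurrence(tickets, fingerprint, build, team):
    """Open a ticket for an unseen fingerprint, otherwise link the
    build to the existing ticket. Returns (ticket_id, created)."""
    ticket = tickets.get(fingerprint)
    if ticket is None:
        ticket = {
            "id": f"BUG-{len(tickets) + 1}",  # hypothetical ticket key
            "team": team,
            "builds": [build],
        }
        tickets[fingerprint] = ticket
        return ticket["id"], True
    if build not in ticket["builds"]:
        ticket["builds"].append(build)
    return ticket["id"], False


tickets = {}
first = route_occurrence(tickets, "fp-timeout", "build-101", "lending")
repeat = route_occurrence(tickets, "fp-timeout", "build-102", "lending")
other = route_occurrence(tickets, "fp-invalid-term", "build-102", "pricing")
```

Keying tickets by fingerprint is the design choice that makes de-duplication cheap: the same originating exception maps to the same ticket regardless of which build surfaced it.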
Rollbar has been adopted across the system and in every environment of the Software Development Lifecycle past DEV. This allows the team to catch and fix bugs faster in each environment before they make it to production. The faster iterations in every environment lead to faster builds, more deployments, and ultimately, delivering more value to customers faster.
The ability to correlate errors across different builds enables the team to assign the right person to the error, saving a lot of crucial developer time.
In addition to providing the stability and confidence the teams need to release to production, an added bonus of using Rollbar is that it has reduced the need to hand-hold engineers. Instead, this company has built a relationship of trust with each individual and team, and engineers have a high degree of ownership through the DEV and QA environments.
Using the Rollbar data, error reports are created across engineering teams over time to track performance. A planned next step is creating cross-environment reports to measure the effectiveness of the staging and QA phases. In addition, a natural progression of their current observability practices is building correlations between test failures and their original exceptions.
“By implementing Rollbar across the entirety of our applications and environments, we can leverage the features and data a lot more. Because of Rollbar, we’ve been able to achieve best practices that we never thought we could get to,” shared the Site Reliability Engineer.
This process and these practices across the set of environments have allowed their engineering teams to release code to production around 90 times a day. Especially given their industry, this is a major achievement, as legacy financial organizations still release only a handful of times a year.