June 19th, 2017 • By Rivkah Standig
Hopefully you've had the chance to try out our latest feature, error merging. We've heard a lot of positive feedback from our users. They're especially excited to be able to easily merge and un-merge related errors. We thought it would be useful to share how the Rollbar team made this happen from a technical standpoint. If you're interested in the nitty-gritty of how we implemented error merging, read on.
I interviewed an engineer here at Rollbar who was instrumental in making error merging possible, and asked what was involved in engineering this feature.
First, let's start with why we created error merging in the first place. Our users were asking for the ability to merge (and unmerge) their errors via our UI, as opposed to having to create custom fingerprinting rules. Prior to creating error merging, if you wanted to merge two items together, you created a custom fingerprinting rule, which doesn't apply retroactively. You also could never unmerge items that had been merged together. As the engineer explains it, "custom fingerprinting is extremely static and only applies when the rule is in existence. You can look at an occurrence of an item and extract the fingerprint, and it is always associated with the item. Item merging is very dynamic. You never change the fingerprint of a given occurrence, and that fingerprint could even come from custom fingerprinting. You can change your mind over and over. Merging is a tool for managing complexity. You can take your dashboard from a zillion items to a tenth of that."
In order to delve into the details of how our engineering team implemented error merging, let's define some terminology we'll be using:
| Term | Definition |
|-------------------|-------------|
| Occurrence | An instance of an error item. |
| Native Item | An aggregation of occurrences that all share a distinct fingerprint. This fingerprint is computed by either custom fingerprint rules or Rollbar's default fingerprinting algorithm. |
| Group Item | An item comprising one or more Native Items that have been merged together under user direction. |
| Item | Either a Native Item or a Group Item. |
| Merge | The process of creating or altering a Group Item by adding one or more Native Items to its constituents. |
| Un-merge | The process of altering (or altogether destroying) a Group Item by removing one or more Native Items from its constituents. |
| Merge Transaction | A unique identifier created whenever a user merges or un-merges items. |
Implementing error merging was tricky, since we needed to be able to merge and un-merge every Group Item without changing any behaviors of the Native Items. To do this, the team created dual identities for items: Native Items and their associated fingerprints, and other aspects that are merged into an umbrella item (the Group Item). A Group Item, under the hood, "both is and isn’t an item in its own right," according to the engineer. It is a single point of visibility of the constituent Native Items, where they can be managed and controlled. All the Native Items in a Group Item can become activated, muted, and resolved together. In order to do this, we needed to tease apart every aspect of item behavior and decide whether to approach this in a top-down (the property is defined by the Group Item) or bottom-up (it is defined by the Native Item) fashion. For example, the team needed to decide if you should be able to search for a Native Item's original title and find the Group Item or not (we decided you should).
The team implemented error merging as a logical view on top of Native Items. All Native Items are still captured using the current fingerprinting algorithm, as they were previously. We now also store a mapping from a Group Item to each merged Native Item it comprises in our database. This allows constituent Native Items to be un-merged at any time, and their original item counts will still be accurate. One interesting thing to note about Group Items is that they can be merged with other Group Items, yet the data structure we maintain is flat. We discard the intermediate Group Items in such cases, keeping only the final resulting Group Item. What this means is that if two different Group Items are merged, the Native Items forming those two original Group Items will now constitute one new Group Item. One of the two previous Group Items will be 'archived', meaning it no longer exists as an active object.
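To make the flat structure concrete, here is a minimal Python sketch. It is not Rollbar's code: the `GroupItem` class, its fields, and the `merge_groups` and `unmerge` helpers are hypothetical names chosen for illustration, and the sketch assumes the surviving Group Item is one of the two being merged.

```python
from dataclasses import dataclass, field


@dataclass
class GroupItem:
    """Hypothetical in-memory stand-in for a Group Item."""
    group_id: int
    # Flat by design: holds Native Item ids only, never other Group Items.
    native_ids: set = field(default_factory=set)
    archived: bool = False


def merge_groups(survivor: GroupItem, other: GroupItem) -> GroupItem:
    """Fold `other` into `survivor` without nesting groups."""
    survivor.native_ids |= other.native_ids
    other.native_ids = set()
    other.archived = True  # the absorbed group stops being an active object
    return survivor


def unmerge(group: GroupItem, native_id: int) -> None:
    """Remove one Native Item; archive the group if nothing is left."""
    group.native_ids.discard(native_id)
    if not group.native_ids:
        group.archived = True
```

Because a group only ever references Native Items, merging two groups is a set union, and un-merging never has to walk a tree of nested groups.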
In order to track these Native Item to Group Item mappings, the team created an entirely new table in our database. In this table there is a unique transaction ID associated with every merge and un-merge action, so there is a complete historical record of every merge or un-merge a Native Item undergoes. Since items are the meat of what Rollbar deals with, error merging touches a lot of the tables in our database. We added some new columns to our item table to indicate whether a given item is a Group Item or a Native Item. If it is a Native Item, the status column notes whether the item is active, resolved, muted, or part of a Group Item. If the Native Item is part of a Group Item, the status of the item is determined by the status of the Group Item. When two items are merged, a new Group Item is created in our item table and the constituent Native Items are updated (but we do not notify you of a new item according to your New Item notification settings). When we update all the tables that are affected when a Group Item is created or changed, we lock the Group Items and Native Items respectively to isolate the calculation and database updates and avoid race conditions.
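As a rough illustration of the bookkeeping described above, the sketch below uses Python's built-in `sqlite3` module. Everything about the schema is an assumption made for the example: the table names `item` and `merge_log`, the column names, and the `in_group` status value are hypothetical, and a single database transaction stands in for the item locking that the real system performs.

```python
import sqlite3
import uuid

# Hypothetical schema, not Rollbar's actual tables.
SCHEMA = """
CREATE TABLE item (
    id INTEGER PRIMARY KEY,
    is_group INTEGER NOT NULL DEFAULT 0,
    status TEXT NOT NULL DEFAULT 'active',
    group_item_id INTEGER REFERENCES item(id)
);
CREATE TABLE merge_log (
    transaction_id TEXT NOT NULL,
    action TEXT NOT NULL CHECK (action IN ('merge', 'unmerge')),
    group_item_id INTEGER NOT NULL,
    native_item_id INTEGER NOT NULL
);
"""


def merge_items(conn, native_ids):
    """Create a Group Item and log one history row per Native Item."""
    txn_id = uuid.uuid4().hex  # one identifier per user merge action
    with conn:  # commits on success, rolls back if anything raises
        cur = conn.execute("INSERT INTO item (is_group) VALUES (1)")
        group_id = cur.lastrowid
        for native_id in native_ids:
            conn.execute(
                "UPDATE item SET status = 'in_group', group_item_id = ? "
                "WHERE id = ?",
                (group_id, native_id),
            )
            conn.execute(
                "INSERT INTO merge_log VALUES (?, 'merge', ?, ?)",
                (txn_id, group_id, native_id),
            )
    return group_id, txn_id
```

Keeping the log append-only means an un-merge would add rows with `action = 'unmerge'` rather than delete the merge rows, which is one way to preserve the complete history the paragraph describes.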
There are top-down properties and bottom-up properties that are associated with an item. Top-down properties of a Group Item are set on the Group Item and affect the Native Items that constitute the Group Item. These include the level of the item, the status, the assigned user, timestamps denoting when the item's status was last set to active/muted/resolved, and the users who are subscribed to the item. Bottom-up properties of a Group Item are determined by the Native Items in the Group Item. Some bottom-up properties of a Group Item include the IDs and timestamps for the first occurrence, last occurrence, and activating occurrence, as well as the total number of occurrences. When a Native Item is un-merged from a Group Item, all of the properties that were top-down now must be set on the Native Item. This includes users who were subscribed to the Group Item; all subscribers are thus subscribed to the Native Items that once constituted the Group Item.
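The split between the two kinds of properties can be sketched in Python as follows. This is only an illustration under assumed names: the `NativeItem` and `GroupItem` classes, their fields, and the `unmerge` helper are hypothetical, timestamps are plain integers, and only a few of the properties listed above are modeled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class NativeItem:
    item_id: int
    first_seen: int          # timestamp of the first occurrence
    last_seen: int           # timestamp of the last occurrence
    total_occurrences: int
    level: str = "error"
    status: str = "active"
    assigned_user: Optional[str] = None
    subscribers: set = field(default_factory=set)


@dataclass
class GroupItem:
    natives: list
    # Top-down: stored on the group and applied to its constituents.
    level: str = "error"
    status: str = "active"
    assigned_user: Optional[str] = None
    subscribers: set = field(default_factory=set)

    # Bottom-up: derived from the constituent Native Items.
    @property
    def total_occurrences(self) -> int:
        return sum(n.total_occurrences for n in self.natives)

    @property
    def first_seen(self) -> int:
        return min(n.first_seen for n in self.natives)

    @property
    def last_seen(self) -> int:
        return max(n.last_seen for n in self.natives)


def unmerge(group: GroupItem, native: NativeItem) -> NativeItem:
    """Detach a Native Item and push the top-down properties onto it."""
    group.natives.remove(native)
    native.level = group.level
    native.status = group.status
    native.assigned_user = group.assigned_user
    native.subscribers |= group.subscribers
    return native
```

Computing the bottom-up values from the constituents, rather than storing them on the group, is what keeps the remaining counts correct after an un-merge in this sketch.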
Let's walk through what all of this means in practice. What happens when an occurrence hits our pipeline? For any occurrence, we first store it in our database, generate a fingerprint, and find a row associated with that fingerprint in our item table. If it's an occurrence of an unmerged item, that row corresponds to a Native Item. The code then checks whether that Native Item is associated with a Group Item. Since this is an unmerged item, it isn't associated, and we continue down our processing pipeline as normal. For an occurrence of a merged item, we will find a reference to a Group Item ID. We then increment the total occurrences for both the Native and Group Items, and update other tables as needed.
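The walk-through above can be sketched as a small Python class. It is a simplified illustration, not the real pipeline: the `Pipeline` class and its attributes are hypothetical, the SHA-1 of the message is only a placeholder for Rollbar's fingerprinting, and creating a Native Item when no row matches is an assumption of the sketch.

```python
import hashlib
from collections import defaultdict


class Pipeline:
    """Hypothetical in-memory model of occurrence processing."""

    def __init__(self):
        self.occurrences = []               # stored occurrences
        self.native_by_fingerprint = {}     # fingerprint -> Native Item id
        self.group_of = {}                  # Native Item id -> Group Item id
        self.native_counts = defaultdict(int)
        self.group_counts = defaultdict(int)

    def fingerprint(self, occurrence):
        # Placeholder only; the real fingerprinting algorithm is different.
        return hashlib.sha1(occurrence["message"].encode()).hexdigest()

    def process(self, occurrence):
        self.occurrences.append(occurrence)        # 1. store the occurrence
        fp = self.fingerprint(occurrence)          # 2. generate a fingerprint
        native_id = self.native_by_fingerprint.setdefault(
            fp, len(self.native_by_fingerprint) + 1
        )                                          # 3. find the Native Item
        self.native_counts[native_id] += 1
        group_id = self.group_of.get(native_id)    # 4. merged into a group?
        if group_id is not None:
            self.group_counts[group_id] += 1       # 5. count on the group too
        return native_id, group_id
```

An occurrence of an unmerged item returns `None` for the group and only touches the Native Item's count; once the Native Item has an entry in `group_of`, both counters advance.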
While we have been discussing the back end of how error merging works, it's important to note that implementing the user experience side of this was no easy feat either. We needed to create new views for Group Items, including all of the associated functionality that a user would expect, such as the ability to comment on the item. Additionally, the team implemented restrictions on a user's ability to edit a Native Item once it has been merged into a Group Item. A wide variety of user interface considerations were discussed, including how to visually distinguish a Native Item from a Group Item on a user's items view page. Some new user interface components specific to error merging were created, such as a popup that allows you to toggle between editing and merging items.
Given how complex implementing all of this was, I asked the engineer what his favorite part of working on error merging was. "The relational database aspect of it. You can always go to the database and cut through the layers of code. The database ultimately offered a way out of the stickiest corners of the functionality."
To learn more about how to merge your items, check out the docs here.
If you haven’t already, sign up for a 14-day free trial of Rollbar and let us help you take control of your application error monitoring. :-)