Connect Rollbar to Bitbucket Issue Tracker

Posted by Mike Smith on March 17, 2015

 

New integration now available - Bitbucket Issue Tracker

Supercharge your issue and error tracking workflow when you connect your Rollbar and Bitbucket accounts. New Items in Rollbar will instantly create Issues in your Bitbucket repo, or you can create and link Issues with the click of a button within Rollbar.

Here's how:

Go to your project's Settings, then Notifications, and select Bitbucket Issues from the list of channels. Click 'Connect with Bitbucket' to grant Rollbar access to your account.

 

 

From here, you can choose which repository to use, and add, edit, or remove rules for creating Issues automatically.

Like magic, your Rollbar error items and details now show up in your Bitbucket repo. Success! 

 

 

Create Bitbucket Issues manually

Prefer to create Issues by hand? You can create an Issue directly from the error Item page in Rollbar, or link with an Issue that already exists. You can use this alongside the automatic rules, or remove the rules for full manual control.

 

 

What's next?

We're working toward full support for Bitbucket, like we have for GitHub: Issues, Source Control, and Authentication. I know Rollbar users who rely on Bitbucket in their workflows are rejoicing. :)

Let us know if you have any feedback or questions. We're here to help.

Deploy and enjoy!


Daily, Hourly, New Errors and Trend graphs are now clickable

Posted by Mike Smith on March 10, 2015 · product

Yes, that's correct.

Daily, Hourly, New Errors, and Trend graphs are now clickable. You can find and fix bugs even faster, and in fewer clicks. :D

Common usability feedback we get from our users:

 

Sure would be nice if I could click the dashboard bar graphs and sparklines to quickly see what caused a spike in error events etc.

 

Couldn't agree more. We love aggregating data and we love it clickable. So we enabled it!

The Daily, Hourly, New Errors, and Trend graphs are now clickable in the project Dashboard.

Trends are also clickable on the Items page. For reference, Trends are the small inline graphs also called 'sparklines'.

  


  

When viewing a specific error item, the Last 60 Minutes, Hours, and Days are now clickable and aggregate error data by your selection.

 

 

We're excited to get this feature out the door. It reduces a lot of friction in navigating Rollbar, and it's one of many UI and UX improvements to come. :)

Log in today and click through your data.

Don't have a Rollbar account? No worries, you can give our Live Demo a try and 'click all the things'. ;)

Deploy and enjoy!

 


Using Logstash and Rollbar Together

Posted by Ken Sheppardson on March 2, 2015

The infrastructure behind most modern web applications includes an assortment of tools for collecting server and application metrics, logging events, aggregating logs, and providing alerts. Most systems are made up of a collection of best-in-class tools and services, selected and deployed over time as team members arrive and depart, needs change, the system grows, and new tools are introduced. One of the challenges web development and operations teams face is collecting and analyzing data from these disparate sources and systems and then piecing together what’s happening by looking at multiple reports and dashboards.

Two common pieces in this puzzle are Logstash and Rollbar.

Logstash (and the Kibana web interface, both of which are heavily supported by and integrated with Elasticsearch) lets you collect and parse logs, store them in a central location, search and explore the data via the Kibana UI, and output events to other services. Logstash provides a powerful tool for taking logs in many different formats, converting them into JSON events, then routing and storing those events.

 


 

Rollbar collects errors from your application, notifies you of those errors, and analyzes them so you can more efficiently debug and fix them. With a few lines of code or config changes to your application, you can make errors, complete stack traces, trends and affected user reports accessible via your Rollbar dashboard. Like Logstash, Rollbar collects and analyzes events represented in JSON.

 


 

 

By connecting Logstash and Rollbar, you can not only centralize and analyze your system and application logs, but also improve error tracking and simplify debugging by providing context to developers looking at errors generated by their code.

Most Logstash users have tried to configure Logstash to parse multi-line exception messages, or have tried to convince a development team to adopt standards for application debugging and error messages. For code your team controls, it’s likely much simpler to install the Rollbar notifier for the language and framework you’re using. You can send errors from your Ruby, Python, or PHP application or browser JavaScript to Rollbar, and the service will parse stack traces automatically and update your dashboard in real time. You can see the values of local variables at the time the error occurred, and each error is grouped with other errors of the same type.

For Rollbar users, Logstash allows you to collect errors from external applications and ship them to Rollbar, where they'll appear on the same dashboard as your application errors. Database and web server errors, for example, can be passed along to Rollbar to help developers determine whether the error is due to a bug, database performance issue, or operational issue with the web server.
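As a rough illustration of the idea, the sketch below maps a Logstash-style JSON event onto the JSON payload shape used by Rollbar's item API. The sample event, the function name, the `logstash_type` custom field, and the `POST_SERVER_ITEM_TOKEN` placeholder are assumptions made for this example. It only builds the payload and does not send anything, and it is not a Logstash output configuration.

```python
import json
from datetime import datetime, timezone


def logstash_event_to_rollbar_item(event, access_token, environment="production"):
    """Build a Rollbar-style item payload from a Logstash-style event (sketch)."""
    # Logstash timestamps are ISO 8601 in UTC, e.g. "2015-03-02T18:04:05.000Z".
    ts = datetime.strptime(event["@timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
    ts = ts.replace(tzinfo=timezone.utc)
    return {
        "access_token": access_token,
        "data": {
            "environment": environment,
            # Fall back to "error" when the event carries no level (assumption).
            "level": event.get("level", "error"),
            "timestamp": int(ts.timestamp()),
            "server": {"host": event.get("host", "unknown")},
            "body": {"message": {"body": event["message"]}},
            # Hypothetical custom field that keeps the Logstash event type around.
            "custom": {"logstash_type": event.get("type")},
        },
    }


# Hypothetical event, shaped like what Logstash emits for a web server error log.
sample_event = {
    "@timestamp": "2015-03-02T18:04:05.000Z",
    "host": "web-1",
    "type": "nginx-error",
    "message": "upstream timed out while reading response header",
}

payload = logstash_event_to_rollbar_item(sample_event, "POST_SERVER_ITEM_TOKEN")
print(json.dumps(payload, indent=2))
```

Keeping the mapping in one small function makes it easy to see which Logstash fields end up where on the Rollbar side before wiring up any real delivery.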

To get started...


Increasing max-open files for beanstalkd

Posted by Cory Virok on February 28, 2015 · engineering, infrastructure

Quick tip: If you are running out of file descriptors in your Beanstalkd process, use /etc/default/beanstalkd to set the ulimit before the init script starts the process.

e.g.

 

# file: /etc/default/beanstalkd
BEANSTALKD_LISTEN_ADDR=127.0.0.1
BEANSTALKD_LISTEN_PORT=11300
START=yes
BEANSTALKD_EXTRA="-b /var/lib/beanstalkd -f 1"

# Should match your /etc/security/limits.conf settings
ulimit -n 100000

 

Lots of resources online tell you to update your /etc/security/limits.conf and /etc/pam.d/common-session* settings to increase your maximum number of available file descriptors. However, the default beanstalkd installation on Ubuntu 12.04+ uses an init script that starts the daemon process using start-stop-daemon, which does not use your system settings when setting the process's ulimits. Just add this line to your defaults and you're good to go!
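To confirm the new limit actually reached the daemon, one option on Linux is to look at the "Max open files" row of /proc/<pid>/limits for the beanstalkd process. The sketch below parses that row. The function name and the sample text are made up for illustration; in practice you would read the real file for your beanstalkd pid.

```python
def max_open_files(limits_text):
    """Return (soft, hard) for the 'Max open files' row of /proc/<pid>/limits."""
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # Remaining columns are: soft limit, hard limit, units.
            soft, hard = line[len(label):].split()[:2]
            return soft, hard
    raise ValueError("no 'Max open files' row found")


# Sample text in the layout Linux uses for /proc/<pid>/limits (values are made up).
sample = """\
Limit                     Soft Limit           Hard Limit           Units
Max processes             15000                15000                processes
Max open files            100000               100000               files
"""

soft, hard = max_open_files(sample)
print("soft:", soft, "hard:", hard)

# For a real process (Linux only), read the file instead of the sample:
#   with open("/proc/<pid>/limits") as f:
#       print(max_open_files(f.read()))
```

If the soft limit still shows the old value after a restart, the ulimit line is probably not being picked up by the init script.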

 


Get notifications every time an error occurs

Posted by Mike Smith on February 26, 2015

You can now set up notifications every time an error occurs. Previously, notifications for specific errors were only available for New Items and 10^nth Occurrences. Notification Rules are available for all Channels (Email, Slack, HipChat, Trello, PagerDuty).

 


 

 


Assign errors to your team

Posted by Mike Smith on February 26, 2015

Ever wanted to assign error items to other team members in Rollbar? Of course you have. Now you can. It is a pretty straightforward enhancement, but here is an overview.

On the error item details page, there's an 'Assigned to' dropdown with the members of your team. Once assigned, we'll shoot an email to that team member letting them know you assigned that specific item to them, including a link and details. They'll be automatically added as a 'watcher' for that specific item and will receive notifications about any comments and updates.

 

 

Assignment events will be listed in the item history section, so you can see who assigned it to whom, and when.

 

 

To quickly find items assigned to yourself or others on your team, search 'assigned:me', 'assigned:username', or 'assigned:unassigned' on the Items page.

 

 

We're excited to get this out into the wild, especially for some of the larger teams using Rollbar. Let us know what you think and how we can make it better for you and your team.

 


Debugging Node.js Apps in Production with PyCharm

Posted by Cory Virok on December 19, 2014 · articles, javascript, nodejs

Node.js has a built-in debugger that you can start in running processes. To do this, send a SIGUSR1 signal to the running process and connect a debugger. The one big caveat here is that the debugger only listens on the local interface, 127.0.0.1.

The following are instructions for debugging Node.js applications running in your company's private network from your laptop, through a bastion host.

  • SSH into the production host that is running the Node.js app
    • Put your production app into debug mode.
    • prod-host $> kill -s SIGUSR1 <pid> 
      
    • As root, start an SSH tunnel to connect your private network with localhost.
    • prod-host $> ssh -N -q -L <private-ip>:8585:localhost:5858 <private-ip>
      
  • On your laptop
    • Start an SSH tunnel to the production host, through your bastion host.
    • laptop $> ssh -N -q -L 5858:<private-ip>:8585 <username>@<bastion-host>
      
  • Open PyCharm and create a remote debugging configuration.
    • Run → Edit Configurations
    • Click the + button on the top-left of the window and select “Node.js Remote Debug”
    • Set the host to 127.0.0.1 using port 5858, name it and save.

  • Run the new Debug configuration.
    • Run → Debug...
    • Select the new configuration.

At this point, your laptop is connected to your local SSH tunnel, which connects through the bastion host to your production host's private network interface, which in turn is tunneled to your production host's local network interface and your Node.js process.

PyCharm → local SSH tunnel → bastion host → production host private network → production host localhost → Node.js

Set some breakpoints in PyCharm and watch as your production process waits for you to step through your app.

Note: If you'd rather use the command line instead of PyCharm, just run the node debugger from your laptop:

laptop $> node debug localhost:5858

Happy debugging!

Troubleshooting

Sometimes PyCharm will just not connect to the running process on your production machine. Try restarting each of the SSH tunnels.

  1. Restart the SSH tunnel on the production machine.
  2. Restart the SSH tunnel on your laptop.
  3. Restart the PyCharm debugger.
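Before restarting everything, it can help to check whether anything is listening on the laptop side of the tunnel at all. Here is a minimal sketch that tries a TCP connection to 127.0.0.1:5858, the host and port used in the steps above. The function name is made up for this example.

```python
import socket


def tunnel_is_listening(host="127.0.0.1", port=5858, timeout=2.0):
    """Return True if a TCP connection to host:port is accepted."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


print("local tunnel listening:", tunnel_is_listening())
```

A True result only shows that the SSH tunnel on your laptop is accepting connections. It does not prove that the tunnel on the production host or the Node.js debugger behind it is up, so a False points at the laptop tunnel and a True points further down the chain.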

RQL String Functions

Posted by Brian Rue on December 16, 2014 · changelog

RQL now includes a basic library of string functions. You can use these to slice and group your data in arbitrary ways. For example, "email domains with the most events in the past hour":

SELECT substring(person.email, locate('@', person.email)), count(*)
FROM item_occurrence
WHERE timestamp >= unix_timestamp() - 3600 AND person.email IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC

The new functions: concat, concat_ws, lower, upper, left, right, substring, locate, length, char_length. The functions are implemented to be compatible with MySQL; see the RQL docs for details.
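To make the substring/locate expression concrete, here is a small Python sketch of roughly what the query computes, assuming MySQL-style semantics: locate returns a 1-based position (0 when the '@' is missing), and substring from that position keeps the '@' itself. The function name and sample addresses are invented for this example, and the timestamp filter is left out.

```python
from collections import Counter


def top_email_domains(emails):
    """Count events per '@domain' suffix, most frequent first."""
    counts = Counter()
    for email in emails:
        if email is None:
            continue  # mirrors "person.email IS NOT NULL"
        pos = email.find("@") + 1  # 1-based like locate(); 0 means not found
        # substring(email, pos) keeps everything from the '@' onward;
        # a position of 0 yields an empty string, as in MySQL.
        key = email[pos - 1:] if pos > 0 else ""
        counts[key] += 1
    return counts.most_common()


emails = ["a@example.com", "b@example.com", "c@rollbar.com", None]
print(top_email_domains(emails))
# [('@example.com', 2), ('@rollbar.com', 1)]
```

Note that the grouped value includes the leading '@'. To drop it in RQL, you would start the substring one character later.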


Processing Delay Postmortem

Posted by Brian Rue on December 5, 2014

Yesterday from 2:20am PST until 10:22am PST, we experienced a service degradation that caused our customers to see processing delays reaching nearly 7 hours. While no data was lost, alerts were not being sent and new data was not appearing in the rollbar.com interface during this time.

We know that you rely on Rollbar to monitor your applications and alert when things go wrong, and we're very sorry that we let you down during this outage. We'd like to share some more details about what happened and what we're doing to prevent this kind of issue from happening again.

Overview of API Tier and Processing Pipeline

When data is received by our API endpoints (api.rollbar.com), it hits a load balancer which proxies to an array of "API servers". Those servers do some basic validation (access control, rate limiting, etc.) and then write the data to local disk. Next, a separate process (the "offline loader") loads these files asynchronously into our database clusters. Then, a series of workers process the raw data into the aggregated and indexed form you see on rollbar.com, and send alerts for any triggered events. This system is designed for reliability first and performance second.

When occurrence processing latency exceeds 30 seconds, we show an in-app notification that processing is behind. This is calculated as follows:

  • For each API server, calculate [timestamp of the last occurrence received] - [timestamp at which the last fully processed occurrence was received on that API server]
  • Report the maximum delay across all API servers as the processing delay
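The calculation above can be sketched in a few lines. This is an illustration of the formula only, not our production code. The server names, timestamps, and function name are made up.

```python
NOTIFICATION_THRESHOLD_SECONDS = 30  # the in-app notification threshold described above


def processing_delay(servers):
    """Return the processing delay in seconds.

    `servers` maps an API server name to a pair of Unix timestamps:
    (when its last occurrence was received,
     when its last fully processed occurrence was received).
    """
    return max(
        last_received - last_processed_received
        for last_received, last_processed_received in servers.values()
    )


# Hypothetical snapshot: api-2 has fallen 60 seconds behind.
servers = {
    "api-1": (1417700000, 1417699995),
    "api-2": (1417700000, 1417699940),
}

delay = processing_delay(servers)
print(delay)                                   # 60
print(delay > NOTIFICATION_THRESHOLD_SECONDS)  # True
```

Taking the maximum rather than the average means a single stuck API server is enough to raise the reported delay.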

The API tier primarily receives three kinds of data: occurrences (the "item" endpoints), deploys, and symbol mapping files (source maps, Android Proguard files, and iOS dSYMs). Currently, all three of these are loaded by the same offline loader process, to different database clusters depending on the type of data.

Outage Timeline, Cause, and Resolution

At about 2:00am PST, a node in the database cluster that stores the symbol mapping files ran out of disk space. Unfortunately, this did not set off any alerts in our monitoring system because the disk space alert had been previously triggered and acknowledged, but not yet resolved.

At about 2:20am PST, the next symbol mapping file arrived on one of the API servers and since the database server was out of disk, it could not be loaded. This caused other files on that API server--containing occurrences and deploys--to not be loaded either. At this time, a processing delay first appeared in the Rollbar interface, and some (but not all) data was delayed. Over the next several hours, the delay continued to rise (as data on some API servers was not processed at all) and the percentage of data that was delayed also rose (as more API servers encountered the same problem).
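This failure mode is a form of head-of-line blocking. The toy model below is not our loader code; the file names, kinds, and both functions are invented for illustration. It contrasts a single in-order queue, where one unloadable file holds up everything behind it, with one possible decoupled variant in which a failing kind of data only blocks files of that kind.

```python
def load_in_order(files, failing_kinds):
    """Single shared queue: stop at the first file that cannot be loaded."""
    loaded = []
    for name, kind in files:
        if kind in failing_kinds:
            break  # every file behind this one waits
        loaded.append(name)
    return loaded


def load_per_kind(files, failing_kinds):
    """Decoupled queues: a failing kind only holds back its own files."""
    return [name for name, kind in files if kind not in failing_kinds]


# Files on one hypothetical API server, in arrival order.
files = [
    ("occ-1", "occurrence"),
    ("map-1", "symbol_map"),
    ("occ-2", "occurrence"),
    ("deploy-1", "deploy"),
]

print(load_in_order(files, {"symbol_map"}))  # ['occ-1']
print(load_per_kind(files, {"symbol_map"}))  # ['occ-1', 'occ-2', 'deploy-1']
```

In the first case, occurrences and deploys that arrived after the symbol mapping file never load, which matches the growing delay described above.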

At 8:25am PST, a Rollbar engineer started work for the day and noticed a support ticket about the processing delay. He immediately escalated to a second engineer who began investigating. At 8:40am PST, a third engineer joined and updated status.rollbar.com to say that we were investigating the issue.

At 9:05am PST, we identified the immediate problem: the symbol mapping files were blocking occurrences from being loaded. We began mitigating by moving those files aside to allow the higher-priority occurrence data to load. This began the recovery process, but created a backlog at the first level of the processing pipeline, causing all data to be delayed (instead of just some).

At 9:11am PST, we identified disk space as the root cause, and resolved this a few minutes later. At 9:35am PST, we updated status.rollbar.com to state that we had identified the issue.

At 9:55am PST, processing latency hit a peak of about 25,000 seconds. We updated status.rollbar.com with our estimate of 36 minutes to full recovery.

At 10:43am PST, processing was fully caught up. status.rollbar.com was updated a minute later.

Evaluating our Response

Once our team became aware of the issue, we were able to identify and fix it relatively quickly (40 minutes from awareness to identification, with fix immediately afterwards). Recovery was relatively fast as well, given the length of the backlog (1hr 38minutes to recover from 7hrs 45min of backlog).

It took far too long for us to notice this issue, however, as our automated monitoring failed to alert us and we only discovered the issue via customer reports.

Improvements

We've identified and planned a number of improvements to our processes, tools, and systems to address what went wrong. Here are the highlights:

  • We're auditing our internal monitoring to ensure that checks are meaningful, needed checks are present, and alerts are functioning correctly
  • We've created a standing Google Hangout so we can more easily coordinate our response when we're not all in the same location
  • We're investigating whether we can automate the component status on status.rollbar.com, so that it doesn't need to be manually updated. (Side note: go there now and click "Subscribe to Updates"!)
  • We've identified and planned system-level improvements to decouple symbol mapping uploads from the critical occurrence processing pipeline
  • We've planned a project to improve our ability to recover more quickly from processing backlogs

In Conclusion

We're very sorry for the degradation of service yesterday. We know that you rely on Rollbar for critical areas of your operations and we hate to let you down. If you have any questions, please don't hesitate to contact us at support@rollbar.com.