Using Logstash and Rollbar Together

Posted by Ken Sheppardson on March 2, 2015

The infrastructure behind most modern web applications includes an assortment of tools for collecting server and application metrics, logging events, aggregating logs, and providing alerts. Most systems are made up of a collection of best-in-class tools and services, selected and deployed over time as team members arrive and depart, needs change, the system grows, and new tools are introduced. One of the challenges web development and operations teams face is collecting and analyzing data from these disparate sources and systems and then piecing together what’s happening by looking at multiple reports and dashboards.

Two common pieces in this puzzle are Logstash and Rollbar.

Logstash (and the Kibana web interface, both of which are heavily supported by and integrated with Elasticsearch) lets you collect and parse logs, store them in a central location, search and explore the data via the Kibana UI, and output events to other services. Logstash provides a powerful tool for taking logs in many different formats, converting them into JSON events, then routing and storing those events.
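
To make "converting logs into JSON events" concrete, here is a minimal sketch in Python of the kind of transformation involved. It is not how Logstash is implemented (a real setup would normally use a grok filter in the Logstash config); the regular expression, the field names, and the `_parse_failure` tag are assumptions chosen for illustration.

```python
import json
import re

# Hypothetical pattern for a common-log-style access line. Real Logstash
# setups would normally use a grok filter rather than a hand-written regex.
ACCESS_LINE = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def line_to_event(line):
    """Turn one raw log line into a dict that can be serialized as JSON."""
    match = ACCESS_LINE.match(line)
    if match is None:
        # Keep unparsed lines, tagged, rather than dropping them.
        return {"message": line, "tags": ["_parse_failure"]}
    event = match.groupdict()
    event["status"] = int(event["status"])
    event["message"] = line  # keep the original line alongside the fields
    return event


sample = '203.0.113.7 - - [02/Mar/2015:10:15:32 +0000] "GET /checkout HTTP/1.1" 500 1024'
print(json.dumps(line_to_event(sample), sort_keys=True))
```

Once a line is a structured event like this, routing it (for example, "send every event with status 500 somewhere else") becomes a simple field comparison.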

 

kibana-screenshot

 

Rollbar collects errors from your application, notifies you of those errors, and analyzes them so you can more efficiently debug and fix them. With a few lines of code or config changes to your application, you can make errors, complete stack traces, trends and affected user reports accessible via your Rollbar dashboard. Like Logstash, Rollbar collects and analyzes events represented in JSON.

 

rollbar-screenshot

By connecting Logstash and Rollbar, you can not only centralize and analyze your system and application logs, but also improve error tracking and simplify debugging by providing context to developers looking at errors generated by their code.

Most Logstash users have tried to configure Logstash to parse multi-line exception messages, or have tried to convince a development team to adopt standards for application debugging and error messages. For code your team controls, it’s likely much simpler to install the Rollbar notifier for the language and framework you’re using. You can send errors from your Ruby, Python, or PHP application or browser JavaScript to Rollbar, and the service will parse stack traces automatically and update your dashboard in real time. You can see the values of local variables at the time the error occurred, and each error is grouped with other errors of the same type.

For Rollbar users, Logstash allows you to collect errors from external applications and ship them to Rollbar, where they'll appear on the same dashboard as your application errors. Database and web server errors, for example, can be passed along to Rollbar to help developers determine whether the error is due to a bug, database performance issue, or operational issue with the web server.
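
As a rough illustration of what "shipping" an event involves, the sketch below wraps a Logstash-style event in a JSON payload shaped like a Rollbar item. It only builds the payload and does not send anything; sending would be an HTTP POST to the item endpoint on api.rollbar.com. The event field names (`message`, `host`, `level`) and the helper name are assumptions for illustration, so check the payload layout against the Rollbar API docs before relying on it.

```python
import json


def event_to_rollbar_item(event, access_token, environment="production"):
    """Wrap a Logstash-style event dict in a Rollbar-style item payload.

    The input field names are assumptions about what the Logstash
    pipeline emits; adjust them to match your own events.
    """
    reserved = ("message", "level", "host")
    return {
        "access_token": access_token,
        "data": {
            "environment": environment,
            "level": event.get("level", "error"),
            "body": {"message": {"body": event["message"]}},
            "server": {"host": event.get("host", "unknown")},
            # Everything else from the event rides along as custom data.
            "custom": {k: v for k, v in event.items() if k not in reserved},
        },
    }


event = {
    "message": "Too many connections",
    "host": "db1",
    "level": "critical",
    "type": "mysql",
}
item = event_to_rollbar_item(event, access_token="YOUR_PROJECT_TOKEN")
print(json.dumps(item, indent=2, sort_keys=True))
```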

To get started...


Increasing max-open files for beanstalkd

Posted by Cory Virok on February 28, 2015 · engineering, infrastructure

Quick tip: If you are running out of file descriptors in your Beanstalkd process, use /etc/default/beanstalkd to set the ulimit before the init script starts the process.

e.g.

 

# file: /etc/default/beanstalkd
BEANSTALKD_LISTEN_ADDR=127.0.0.1
BEANSTALKD_LISTEN_PORT=11300
START=yes
BEANSTALKD_EXTRA="-b /var/lib/beanstalkd -f 1"

# Should match your /etc/security/limits.conf settings
ulimit -n 100000

 

Lots of resources online tell you to update your /etc/security/limits.conf and /etc/pam.d/common-session* settings to increase your maximum number of available file descriptors. However, the default beanstalkd installation on Ubuntu 12.04+ uses an init script that starts the daemon process using start-stop-daemon, which does not use your system settings when setting the process's ulimits. Just add this line to your defaults file and you're good to go!
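
To confirm the new limit actually reached the running daemon, one option on Linux is to read /proc/<pid>/limits for the beanstalkd process. The sketch below assumes the usual layout of that file (a "Max open files" row with soft and hard columns); the helper names are made up for this example.

```python
def parse_open_files_limit(limits_text):
    """Return (soft, hard) for 'Max open files' from /proc/<pid>/limits text.

    Values are returned as strings because a limit may be 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max", "open", "files", soft, hard, units
            soft, hard = line.split()[3:5]
            return soft, hard
    return None


def open_files_limit(pid):
    """Read the open-files limit of a running process (Linux only)."""
    with open("/proc/%d/limits" % pid) as handle:
        return parse_open_files_limit(handle.read())


sample_limits = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max cpu time              unlimited            unlimited            seconds\n"
    "Max open files            100000               100000               files\n"
)
print(parse_open_files_limit(sample_limits))
```

Calling `open_files_limit(pid)` with the beanstalkd PID after a restart should report the value from the ulimit line if the defaults file was picked up.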

 


Assign errors to your team

Posted by Mike Smith on February 26, 2015

Ever wanted to assign error items to other team members in Rollbar? Of course you have. Now you can. It is a pretty straightforward enhancement, but here is an overview.

On the error 'items' details page, there's an "Assigned to" dropdown with the members of your team. Once you assign an item, we'll shoot an email to that team member letting them know you assigned that specific item to them, including a link and details. They'll be automatically added as a 'watcher' for that specific item and will receive notifications about any comments and updates.

 

 

Assignment events will be listed in the item history section, so you can see who assigned it to whom, when.

 

 

To quickly find items assigned to yourself or others on your team, search 'assigned:me', ‘assigned:username’, or 'assigned:unassigned' on the Items page.

 

 

We're excited to get this out into the wild, especially for some of the larger teams using Rollbar. Let us know what you think and how we can make it better for you and your team.

 


Debugging Node.js Apps in Production with PyCharm

Posted by Cory Virok on December 19, 2014 · articles, javascript, nodejs

Node.js has a built-in debugger that you can start in running processes. To do this, send a SIGUSR1 signal to the running process and connect a debugger. The one big caveat here is that the debugger only listens on the local interface, 127.0.0.1.

The following are instructions for debugging Node.js applications running in your company's private network from your laptop, through a bastion host.

  • SSH into the production host that is running the Node.js app
    • Put your production app into debug mode.
    • prod-host $> kill -s SIGUSR1 <pid> 
      
    • As root, start an SSH tunnel to connect your private network with localhost.
    • prod-host $> ssh -N -q -L <private-ip>:8585:localhost:5858 <private-ip>
      
  • On your laptop
    • Start an SSH tunnel to the production host, through your bastion host.
    • laptop $> ssh -N -q -L 5858:<private-ip>:8585 <username>@<bastion-host>
      
  • Open PyCharm and create a remote debugging configuration.
    • Run → Edit Configurations
    • Click the + button on the top-left of the window and select “Node.js Remote Debug”
    • Set the host to 127.0.0.1 using port 5858, name it and save.

  • Run the new Debug configuration.
    • Run → Debug...
    • Select the new configuration.

At this point your laptop will have connected to your local SSH tunnel which will be connected to your production host's private network interface which will be tunneled to your production host's local network interface and your Node.js process.

PyCharm → local SSH tunnel → bastion host → production host private network → production host localhost → Node.js

Set some breakpoints in PyCharm and watch as your production process waits for you to step through your app.

Note: If you'd rather use the command line instead of PyCharm just run the node debugger from your laptop:

laptop $> node debug localhost:5858

Happy debugging!

Troubleshooting

Sometimes PyCharm will just not connect to the running process on your production machine. Try restarting each of the SSH tunnels.

  1. Restart the SSH tunnel on the production machine.
  2. Restart the SSH tunnel on your laptop.
  3. Restart the PyCharm debugger.
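
Before restarting everything, it can help to find out which hop is down. The sketch below is a plain TCP check, not part of PyCharm or Node.js: run it on your laptop against 127.0.0.1:5858, and on the production host against <private-ip>:8585 and 127.0.0.1:5858. A successful connection only shows that something is listening on that port. It is probably best to run it while the PyCharm debugger is disconnected, on the assumption that the debug port may only serve one client at a time.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # On the laptop: is the local end of the SSH tunnel accepting connections?
    print("laptop tunnel up:", port_is_open("127.0.0.1", 5858))
```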

RQL String Functions

Posted by Brian Rue on December 16, 2014 · changelog

RQL now includes a basic library of string functions. You can use these to slice and group your data in arbitrary ways. For example, "email domains with the most events in the past hour":

SELECT substring(person.email, locate('@', person.email)), count(*)
FROM item_occurrence
WHERE timestamp >= unix_timestamp() - 3600 AND person.email IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC

The new functions: concat, concat_ws, lower, upper, left, right, substring, locate, length, char_length. The functions are implemented to be compatible with MySQL; see the RQL docs for details.
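
For readers less familiar with the MySQL string functions, here is a rough Python equivalent of what the example query computes. It is a sketch of the query's logic, not of how RQL is implemented. Note that because locate() returns the position of the '@' itself, the grouped value keeps the leading '@'.

```python
from collections import Counter


def email_domain(email):
    """Mimic substring(email, locate('@', email)): the text from '@' onward."""
    at = email.find("@")
    # MySQL's locate() returns 0 when there is no match, and substring(s, 0)
    # yields an empty string; str.find() returns -1, so handle that here.
    return email[at:] if at != -1 else ""


def top_domains(emails):
    """GROUP BY domain, ORDER BY count DESC, skipping NULL emails."""
    counts = Counter(email_domain(e) for e in emails if e is not None)
    return counts.most_common()


emails = ["a@example.com", "b@example.com", None, "c@example.org"]
print(top_domains(emails))  # [('@example.com', 2), ('@example.org', 1)]
```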


Processing Delay Postmortem

Posted by Brian Rue on December 5, 2014

Yesterday from 2:20am PST until 10:43am PST, we experienced a service degradation that caused our customers to see processing delays reaching nearly 7 hours. While no data was lost, alerts were not being sent and new data was not appearing in the rollbar.com interface during this time.

We know that you rely on Rollbar to monitor your applications and alert when things go wrong, and we're very sorry that we let you down during this outage. We'd like to share some more details about what happened and what we're doing to prevent this kind of issue from happening again.

Overview of API Tier and Processing Pipeline

When data is received by our API endpoints (api.rollbar.com), it hits a load balancer which proxies to an array of "API servers". Those servers do some basic validation (access control, rate limiting, etc.) and then write the data to local disk. Next, a separate process (the "offline loader") loads these files asynchronously into our database clusters. Then, a series of workers process the raw data into the aggregated and indexed form you see on rollbar.com, and send alerts for any triggered events. This system is designed for reliability first and performance second.

When occurrence processing latency exceeds 30 seconds, we show an in-app notification that processing is behind. This is calculated as follows:

  • For each API server, calculate ([timestamp of the last occurrence received] - [timestamp at which the last fully processed occurrence was received on that API server])
  • Report the maximum delay of API servers as the processing delay
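
The calculation above can be sketched in a few lines of Python. The server names, the data layout, and the helper name are hypothetical; only the formula and the 30-second threshold come from the description.

```python
NOTIFY_THRESHOLD_SECONDS = 30


def processing_delay(servers):
    """Compute the reported processing delay.

    servers maps an API server name to a pair of unix timestamps:
    (when the last occurrence was received,
     when the last fully processed occurrence was received).
    Returns (maximum delay, per-server delays).
    """
    delays = {
        name: last_received - last_processed_received
        for name, (last_received, last_processed_received) in servers.items()
    }
    return max(delays.values()), delays


# Hypothetical snapshot: api2 has stopped loading, so it dominates the result.
snapshot = {
    "api1": (1417700000, 1417699995),
    "api2": (1417700000, 1417696400),
}
delay, per_server = processing_delay(snapshot)
print(delay, delay > NOTIFY_THRESHOLD_SECONDS)
```

Taking the maximum means a single stuck API server is enough to raise the reported delay, even if the others are keeping up.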

The API tier primarily receives three kinds of data: occurrences (the "item" endpoints), deploys, and symbol mapping files (source maps, Android Proguard files, and iOS dSYMs). Currently, all three of these are loaded by the same offline loader process, to different database clusters depending on the type of data.

Outage Timeline, Cause, and Resolution

At about 2:00am PST, a node in the database cluster that stores the symbol mapping files ran out of disk space. Unfortunately, this did not set off any alerts in our monitoring system because the disk space alert had been previously triggered and acknowledged, but not yet resolved.

At about 2:20am PST, the next symbol mapping file arrived on one of the API servers and since the database server was out of disk, it could not be loaded. This caused other files on that API server--containing occurrences and deploys--to not be loaded either. At this time, a processing delay first appeared in the Rollbar interface, and some (but not all) data was delayed. Over the next several hours, the delay continued to rise (as data on some API servers was not processed at all) and the percent of data that was delayed also rose (as more API servers encountered the same problem).

At 8:25am PST, a Rollbar engineer started work for the day and noticed a support ticket about the processing delay. He immediately escalated to a second engineer who began investigating. At 8:40am PST, a third engineer joined and updated status.rollbar.com to say that we were investigating the issue.

At 9:05am PST, we identified the immediate problem: the symbol mapping files were blocking occurrences from being loaded. We began mitigating by moving those files aside to allow the higher-priority occurrence data to load. This began the recovery process, but created a backlog at the first level in the processing pipeline, causing all data to be delayed (instead of just some).

At 9:11am PST, we identified disk space as the root cause, and resolved this a few minutes later. At 9:35am PST, we updated status.rollbar.com to state that we had identified the issue.

At 9:55am PST, processing latency hit a peak of about 25,000 seconds. We updated status.rollbar.com with our estimate of 36 minutes to full recovery.

At 10:43am PST, processing was fully caught up. status.rollbar.com was updated a minute later.
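
As a quick arithmetic check on the timeline, the intervals between the key timestamps above can be computed directly. This sketch only uses times quoted in the timeline; the variable names are ours.

```python
from datetime import datetime, timedelta


def at(clock):
    """Parse a clock time such as '8:25am' (all times are the same day, PST)."""
    return datetime.strptime(clock, "%I:%M%p")


aware = at("8:25am")        # first engineer notices the support ticket
identified = at("9:05am")   # immediate problem identified, mitigation begins
caught_up = at("10:43am")   # processing fully caught up

print("awareness to identification:", identified - aware)
print("mitigation to full recovery:", caught_up - identified)
print("peak latency of 25,000 seconds:", timedelta(seconds=25000))
```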

Evaluating our Response

Once our team became aware of the issue, we were able to identify and fix it relatively quickly (40 minutes from awareness to identification, with the fix immediately afterwards). Recovery was relatively fast as well, given the length of the backlog (1hr 38min to recover from 6hrs 45min of backlog).

It took far too long for us to notice this issue, however, as our automated monitoring failed to alert us and we only discovered the issue via customer reports.

Improvements

We've identified and planned a number of improvements to our processes, tools, and systems to address what went wrong. Here are the highlights:

  • We're auditing our internal monitoring to ensure that checks are meaningful, needed checks are present, and alerts are functioning correctly
  • We've created a standing Google Hangout so we can more easily coordinate our response when we're not all in the same location
  • We're investigating whether we can automate the component status on status.rollbar.com, so that it doesn't need to be manually updated. (Side note: go there now and click "Subscribe to Updates"!)
  • We've identified and planned system-level improvements to decouple symbol mapping uploads from the critical occurrence processing pipeline
  • We've planned a project to improve our ability to recover more quickly from processing backlogs

In Conclusion

We're very sorry for the degradation of service yesterday. We know that you rely on Rollbar for critical areas of your operations and we hate to let you down. If you have any questions, please don't hesitate to contact us at support@rollbar.com.


October Release Roundup

Posted by Brian Rue on October 31, 2014

Happy Halloween, everyone! Here's a roundup of what's new in Rollbar this month. 

Ruby Upgrades

The rollbar gem for Ruby got a lot of attention in October. Early in the month, we released version 1.1.0, which added support for Ruby 2.1 exception causes, and a new 'failover_handlers' feature for more reliable async reporting. Mid-month, we released version 1.2, which adds a new, much nicer and more powerful interface for sending the data you want into Rollbar.

In 1.2, you can do:

begin
  Rollbar.info("About to do_something")
  do_something
rescue => e
  # send a message and extra data along with an exception
  Rollbar.error("Something went wrong", e, :foo => "bar")

  # customize payload attributes, like the 'person' or 'fingerprint'
  Rollbar.scope({:fingerprint => "something"}).error(e)
end

More in the docs. It's available now on Rubygems (latest version is 1.2.7). 

New Status Site

We've upgraded status.rollbar.com. We'll be using it to communicate about outages, so if you'd like to be notified, go there and subscribe to updates. The new status site also shows the current maximum latencies for the processing pipeline.

Link Rollbar Items with Existing 3rd-party Issues

You can now link a Rollbar item with an existing issue in your issue tracker:

Or if you have a Rollbar item that is already linked, you can now change or remove the link. This works with Asana, GitHub Issues, JIRA, Pivotal Tracker, Sprintly, and Trello.

Geolocation for IP Addresses

Rollbar now shows geolocation information on the IP address detail page:

You can get there by clicking on an IP address anywhere in the app. More features on this theme to come; stay tuned.

Basic stats on the Item page

The basic item statistics--when it was first and last seen, how many times it has occurred since being resolved, and how many total IPs have been affected--are finally available at a glance on the Item detail page.

Time Format setting

If you'd rather see timestamps use a 24-hour clock, head over to your project's Settings page:

 

Misc

  • Improved processing pipeline speed by ~33%
  • Added API methods to set rate limits
  • Improved support for code versions in rollbar-android 0.1.0 
  • Fixed a bug with wrapped functions in rollbar.js 1.1.11
  • Added links to detailed usage reports from the Invoice and Usage pages
  • Fixed an issue where rate limits would over-limit when duplicate items were received
  • Deploy emails now include the deploy comment
  • RQL aggregate functions now support arbitrary expressions as parameters
  • Improved filename linking for vanilla ruby stack traces
  • Added the High Occurrence Rate trigger for HipChat notifications
  • Long param values are now displayed better on the occurrence detail page
  • The "viewing now" list no longer includes you
  • Fixed a few edge cases with the Slack integration
  • Fixed a bug where the trace_chain section in the Raw JSON part of the occurrence detail page would sometimes be reversed
  • Fixed an issue where gravatar images would block the page load

RQL minor updates

Posted by Brian Rue on September 25, 2014 · changelog

A couple minor updates to RQL today:

  • IS NULL and IS NOT NULL are now supported
  • Fixed a crash in queries that contain a GROUP BY plus an ORDER BY on a column referenced only in the ORDER BY.

 


New "Reports" API calls

Posted by Brian Rue on August 20, 2014 · changelog

We've released two new API calls, exposing some of the data on the Dashboard via our JSON API.

Use /reports/top_active_items to fetch the same data as "Top 10 Active Items in last 24 hours", and /reports/occurrence_counts to fetch the same data as "Daily Error/Critical Occurrences" and "Hourly Error/Critical Occurrences".
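
As a starting point, here is a small sketch that builds the request URLs for these two calls. The base path and the use of an `access_token` query parameter are assumptions based on how the rest of the Rollbar API is commonly called; confirm both, and any extra parameters, against the docs. The sketch does not make any network request.

```python
from urllib.parse import urlencode

# Assumed base path for the JSON API; verify against the API docs.
API_BASE = "https://api.rollbar.com/api/1"


def report_url(report, access_token, **params):
    """Build the URL for a reports call such as 'top_active_items'.

    Extra keyword arguments are passed through as query parameters;
    which ones the API accepts is not covered here.
    """
    query = urlencode(dict(access_token=access_token, **params))
    return "%s/reports/%s?%s" % (API_BASE, report, query)


print(report_url("top_active_items", "YOUR_READ_TOKEN"))
print(report_url("occurrence_counts", "YOUR_READ_TOKEN"))
```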

More details in the docs. If you give this a try, send us any feedback at team@rollbar.com.


Occurrence counts by minute

Posted by Brian Rue on July 31, 2014 · product, changelog

We've released an improvement to our Item Detail pages, adding a graph showing the aggregate occurrence counts per minute. It's live now for everyone and looks like this:

You can use this to see patterns that previously were hard to spot, like errors that occur on a regular, sub-hour interval (like the one shown above). It's also useful for quickly seeing how the occurrence rate changes after a deploy.
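
If you want to reproduce this kind of view from your own data, bucketing occurrence timestamps by minute takes only a few lines. This is a minimal sketch of the aggregation idea, not Rollbar's implementation; it assumes unix timestamps in seconds.

```python
from collections import Counter


def counts_per_minute(timestamps):
    """Map the start of each minute (unix seconds) to its occurrence count."""
    return Counter(int(ts) // 60 * 60 for ts in timestamps)


# Hypothetical occurrences: two in the first minute, two in the second,
# and one in the fourth.
occurrences = [0, 59, 60, 61, 185]
for minute_start, count in sorted(counts_per_minute(occurrences).items()):
    print(minute_start, count)
```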