Last week we fixed a bug in our daily summary emails that a customer reported on May 1st. The bug caused some of the emails to miss information on errors that occurred during certain parts of the day. This was a pretty serious bug. It affected almost all of our customers and had been around for a very long time.
The issue ended up being a miscalculation in our timezone handling code. Most developers who have worked with timezones know how difficult it can be to get it right. I'll explain how we did it and how Rollbar helped us quickly narrow in on the problem.
- The occurrence counts in some of the daily summary emails were too low
- Some new errors were not even in the daily summary emails
The first thing I did was to recreate the database query by hand. After reading the code and pulling out the relevant queries, I was able to recreate the daily summaries for production but in a development environment. At this point, it was clear that we were missing data but the query was correct.
Some possible culprits:
- MySQL slave delay
- Long-running MySQL transactions
- Nebulous time-zone related bug in our code
I was able to verify that MySQL slave delay was not the cause by looking at our Scout graphs for the database host. Slave delay was nowhere near where it needed to be to cause this.
Next, I looked for long-running MySQL transactions. We have quite a few long-running connections so
it was impossible to determine if we had long-running transactions by looking at just the output
SHOW FULL PROCESSLIST.
I added some code to the script that sends the daily summary emails to print out the maximum id on a
table in our database during each run of the script. If the maximum id never changed, it would mean
the transaction in which the query was executing was never being closed, (since we use
REPEATABLE READ by default on all connections.)
The ids were changing. This wasn't the problem…
At this point it was clear that the problem was in our code. I had pored over it for a couple of hours, testing everything and still could not find the problem. Then, I decided to add some Rollbar debugging to get more information about the code that was running in production.
After looking at the first occurrence it was clear what the problem was. The
were incorrect. This code path was pretty straight forward and it was easy to find the culprit -
The docs for
time.mktime() say that it assumes the
datetime object passed in uses the local timezone. Our code was passing in a
datetime object for
UTC which was getting ignored when we generated the timestamps for our query.
The fix was a one-liner. Instead of using
time.mktime() we now us
calendar.timegm() since it
Happy coding and watch out for timezone bugs!