Find performance problems in your web logs

A user recently reported a performance problem on one of our Rails apps. I pulled up New Relic's performance graphs and checked what was happening around the time that we received the email. Sure enough there was a massive spike in the time taken to serve a request.

We gave the problem a couple of minutes thought and came up with a hypothesis; we've got one long running request in this app that really ought to be run in a background job. Could that request be responsible? Due to the way we've configured our web servers, there would need to be at least 8 concurrent requests for that page.

So how many concurrent requests were there? Each one creates a line in the Rails log file like this:

Processing MyController#create (for 172.16.5.45 at 2009-11-16 04:03:44) [POST]

The key thing here is that the time is in the log file, so we can get a rough feel for usage by counting how many requests MyController#create receives in a minute.

You could write a script to work that out for you. It probably wouldn't take more than 10 minutes, but it feels like a bit of a chore and -- frankly -- I'm too lazy to bother.

Alternatively, you could spend 10 seconds cooking up a Unix command line like this one:

$ zgrep MyController#create production.log.1.gz | \
    cut -d ' ' -f 7 | cut -d : -f 1,2 | \
    sort | uniq -c | sort -nr | head -n 5
  18 18:21
  16 17:04
  15 18:31
  14 22:47
  14 20:42

Neat. Looks really clever, but it's actually really easy and anybody can use it.

The output tell us how many times MyController#create was called per minute, for the five busiest minutes in the log file. To give you an idea of how to read it, we peaked at 18 requests per minute at 18:21. Given that we already know that we can only support 8 of these requests at once there's a very good chance we've found our performance problem. It's probably time to rewrite the relevant code as a background job.

To nobody's surprise, our customer feedback dates from 18:30, which is also when the New Relic graphs indicated that we had a performance problem (requests were backing up in Mongrel's queue while the Mongrel servers were tied up by our long running create action).

Let's break that command down a bit while I show you how I built it.

Step 1: Finding the time of each request

Step 1 is to have a quick look at some example log messages to see what information we've got to play with. All we really need is the time of each request, and lo and behold, there it is...

$ zgrep MyController#create production.log.1.gz | head -n 1
Processing MyController#create (for 172.16.5.45 at 2009-11-16 04:03:44) [POST]

zgrep is just like grep, but it can read compressed logs.

The time the request was made is the seventh field in the log message, if you split the line on white space. The cut command can do that for us, like so:

$ zgrep MyController#create production.log.1.gz | \
    cut -d ' ' -f 7 | head -n 3
  04:03:44)
  04:03:48)
  04:04:51)

The -d switch specifies the delimiter (a space) and -f tells cut which fields we want. We need to clean "04:03:44)" up a bit. As we're counting the number of requests per minute we can throw away the seconds. We can do that with cut again:

$ zgrep MyController#create production.log.1.gz | \
    cut -d ' ' -f 7 | cut -d : -f 1,2 | head -n 3
  04:03
  04:03
  04:04

Step 2: Counting requests per minute

We can count each minute with the uniq command. Normally uniq will just remove all the duplicate lines from a sorted list, but with the -c switch we can ask it to tell us how many times each line occurs.

$ zgrep MyController#create production.log.1.gz | \
    cut -d ' ' -f 7 | cut -d : -f 1,2 | \
    sort | uniq -c | head -n 3
  2 04:03
  1 04:04
  3 04:06

uniq has added the count in front of the time. Now we're getting somewhere (though there's clearly not a lot going on at 4 o'clock in the morning)!

Note that uniq requires it's input to be sorted, so I've piped the output of cut through sort first. I really didn't need to do that in this case as the lines in my web logs are already in chronological order, but it's a good habit to get into when using uniq -c.

Step 3: Finding the hotspots

Okay, so how do we find out when things start to get busy? We can use sort again, and ask it to list the rows in descending order of popularity. sort -r will do that for us, but we also need to point out to sort that it's sorting numbers (otherwise when sorting "23" and "100" it will just compare "2" to "1" and conclude that "23" is bigger). So that gives us sort -nr:

$ zgrep MyController#create production.log.1.gz | \
    cut -d ' ' -f 7 | cut -d : -f 1,2 | \
    sort | uniq -c | sort -nr | head -n 5
  18 18:21
  16 17:04
  15 18:31
  14 22:47
  14 20:42

And there you have it -- it doesn't seem so clever now does it?

Update: Andrew McDonough got in touch on Twitter to show me how he'd do it with awk. Nice.

I love feedback and questions — please get in touch on Twitter or leave a comment.