Clustering events from a log file – with grep

written on 2019-06-24

Sometimes you need to track down an error in a software system by looking at log files.

Here's the useful "two greps and a count" pattern that has emerged for me:

Step 1: find the log lines that represent interesting events

Let's say you have a suspicion that there have been problems with your database connection. So let's find all lines that are related to that:

$ grep mariadb error.log

This will output possibly a lot of log lines that report errors regarding the database driver. When did they start? When did they stop? Let's take the next step.

Step 2: find some pattern to cluster by

In many cases, it is useful to find out about the distribution of the events over time. Let's extract the hour from each of the lines

$ grep mariadb error.log | grep -o " [0-9][0-9]:"

" [0-9][0-9]:" will match the hour part of the log line time stamp (adapt to match the time formatting of your particular log file). The -o switch will output only the matching part, not the entire line. The output will look like this:

 07:
 07:
 08:

Of course, you can choose to cluster by something other than hours, for example by the database server hostname, if you suspect that some host causes more error messages than the others.

Step 3: count occurrences per cluster

$ grep mariadb error.log | grep -o " [0-9][0-9]:" | sort | uniq -c

This will group and count the occurrences by the pattern extracted in step 2. The uniq command will reduce the input to, well, unique lines, the -c switch will tell you how often each line occurred. We also need sort, because uniq expects the input to be ordered, so the duplicate lines are next to each other.

So, of those interesting log lines, most appeared at eight-something, so during the hour between 08:00 and 08:59. Now you know when your trouble started and where to keep looking.