Sometimes you need to track down an error in a software system by looking at log files.
Here's the useful "two greps and a count" pattern that has emerged for me:
Let's say you suspect that there have been problems with your database connection. So let's find all log lines related to that:
$ grep mariadb error.log
This will likely output a lot of log lines that report errors from the database driver. When did they start? When did they stop? Let's take the next step.
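For concreteness, here is a minimal sketch of this first step against a made-up error.log (the timestamps and messages are invented for illustration; real driver messages will differ):

```shell
# Build a tiny made-up error.log for demonstration purposes
cat > error.log <<'EOF'
2024-05-01 07:58:12 [error] mariadb: server has gone away
2024-05-01 08:00:03 [error] mariadb: connection refused
2024-05-01 08:00:04 [error] nginx: upstream timed out
EOF
# Only the two mariadb lines survive the filter
grep mariadb error.log
```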
In many cases it is useful to find out how the events are distributed over time. Let's extract the hour from each of the lines:
$ grep mariadb error.log | grep -o " [0-9][0-9]:"
" [0-9][0-9]:" will match the hour part of the log line's timestamp (adapt this to the time format of your particular log file). The -o switch makes grep output only the matching part, not the entire line. The output will look like this:
 07:
 07:
 08:
Of course, you can cluster by something other than the hour, for example by the database server hostname, if you suspect that one host causes more error messages than the others.
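As a sketch of that variant, assuming the log lines mention hostnames like db1.example.com (the hostnames and messages below are made up), you would swap the hour pattern for a hostname pattern:

```shell
# Made-up sample log; real lines will look different
cat > error.log <<'EOF'
2024-05-01 08:01:10 [error] mariadb timeout talking to db2.example.com
2024-05-01 08:02:30 [error] mariadb timeout talking to db2.example.com
2024-05-01 08:04:02 [error] mariadb connection refused by db1.example.com
EOF
# Extract the hostname instead of the hour, then count per host
grep mariadb error.log | grep -oE 'db[0-9]+\.example\.com' | sort | uniq -c
```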
$ grep mariadb error.log | grep -o " [0-9][0-9]:" | sort | uniq -c
This will group and count the occurrences of the pattern extracted in step 2. The uniq command reduces its input to, well, unique lines; the -c switch prints how often each line occurred. We also need the sort, because uniq expects its input to be ordered so that duplicate lines are next to each other.
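A quick way to see why the sort matters (the letters here are just stand-in data):

```shell
# uniq only collapses adjacent duplicates, so without sort the two a's
# are counted separately
printf 'a\nb\na\n' | uniq -c
# with sort, duplicates become neighbours first and are counted together
printf 'a\nb\na\n' | sort | uniq -c
```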
  2  07:
249  08:
 80  09:
 16  10:
So, of those interesting log lines, most appeared at eight-something, so during the hour between 08:00 and 08:59. Now you know when your trouble started and where to keep looking.