The simplest method to count lines matching specific patterns, including '0' if line is not found?

Question

I have very large logs (several gigabytes per day) that may or may not contain specific lines. I have to count the number of occurrences of each of these lines on a daily basis.

I have a file, patterns.in, that contains the desired lines. For example:

aaaa
bbbb
cccc
dddd
eeee
ffff

The log files can look like this:

asd
dfg
aaaa
aaaa
sa
sdf
dddd
dddd
dddd
dddd
ghj
bbbb
cccc
cccc
cccc
fgg
fgh
hjk

The first (and perhaps most obvious) approach is to use grep, sort and uniq in the following way:

grep -f patterns.in logfile.txt | sort | uniq -c

which gives the following result:

   2 aaaa
   1 bbbb
   3 cccc
   4 dddd

It is close to what I want to achieve, but my desired result is:

   2 aaaa
   1 bbbb
   3 cccc
   4 dddd
   0 eeee
   0 ffff

So the problem is: how do I print '0' if a line from the patterns.in file is not matched? It needs to be done in the simplest possible way, as all I have available is the Cygwin environment.

BTW, you might need grep's -o option; otherwise, multiple lines that contain, e.g., 'aaaa' plus other (different) data will be treated as distinct lines by sort | uniq -c.
– cas, Commented Oct 11, 2012 at 5:36
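
To see the effect cas describes, here is a small sketch. The sample data (log lines that carry extra text after the pattern) and the scratch directory are made up for illustration; only the grep, sort and uniq calls come from the discussion above.

```shell
# Scratch directory with hypothetical sample data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' aaaa bbbb > patterns.in
printf '%s\n' 'aaaa foo' 'aaaa bar' 'bbbb' > logfile.txt

# Without -o, grep prints whole lines, so "aaaa foo" and "aaaa bar"
# are counted separately. The trailing awk only strips uniq's padding.
without_o=$(grep -f patterns.in logfile.txt | LC_ALL=C sort | uniq -c | awk '{$1=$1; print}')

# With -o, grep prints only the matched text, so both lines count as "aaaa".
with_o=$(grep -o -f patterns.in logfile.txt | LC_ALL=C sort | uniq -c | awk '{$1=$1; print}')

printf 'without -o:\n%s\n\nwith -o:\n%s\n' "$without_o" "$with_o"
```

With this data, the run without -o reports three separate lines with a count of 1 each, while the run with -o reports 2 for aaaa and 1 for bbbb.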

iruvar · Accepted Answer · 2012-10-10 16:58:24Z

8

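How about feeding the pattern file back in as a data file, so that each pattern finds at least one match, and then subtracting one from the final reported count for each pattern? With two input files, grep prefixes every match with its file name, which is why the cut step is there.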

grep -f patterns.in logfile.txt patterns.in | cut -f2 -d':' | sort | uniq -c | awk '{print($1 - 1" "$2)}'
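
As a sanity check, the command can be run against the sample data from the question. The sketch below assumes a throwaway scratch directory and pins the sort order with LC_ALL=C so the output is reproducible; neither detail is part of the original command.

```shell
# Recreate the question's sample files in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' aaaa bbbb cccc dddd eeee ffff > patterns.in
printf '%s\n' asd dfg aaaa aaaa sa sdf dddd dddd dddd dddd ghj bbbb \
    cccc cccc cccc fgg fgh hjk > logfile.txt

# patterns.in is searched as data too, so every pattern matches at least once.
# cut drops the "filename:" prefix, and awk subtracts that one extra match.
counts=$(grep -f patterns.in logfile.txt patterns.in | cut -f2 -d':' |
    LC_ALL=C sort | uniq -c | awk '{print($1 - 1" "$2)}')

printf '%s\n' "$counts"
```

For the sample data this prints 2 aaaa, 1 bbbb, 3 cccc, 4 dddd, 0 eeee and 0 ffff, one per line. Note that grep -f treats each line of patterns.in as a regular expression; if the lines are meant as literal text, adding -F is worth considering.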


Seems to work well on the examples. I will need to check tomorrow in production, as I am not sure whether awk is available there.

– Paweł Rumian, Commented Oct 10, 2012 at 18:33

+1, nice answer. You can use grep's -h or --no-filename option to stop grep from printing filenames, e.g. grep -h -o -f patterns.in logfile.txt patterns.in | sort | uniq -c | awk '{print($1 - 1" "$2)}'

– cas, Commented Oct 11, 2012 at 5:37
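
One practical reason to prefer -h over the cut step is that cut -f2 -d':' truncates any matched line that itself contains a colon. The sketch below uses made-up log lines to show the difference; the counts are the raw uniq -c output, before the final subtraction of one.

```shell
# Hypothetical patterns and log lines that contain a colon.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'ERROR: disk full' 'WARN: low memory' > patterns.in
printf '%s\n' 'ERROR: disk full' 'all good' > logfile.txt

# cut keeps only the text between the first and second colon,
# so "logfile.txt:ERROR: disk full" is reduced to "ERROR".
with_cut=$(grep -f patterns.in logfile.txt patterns.in | cut -f2 -d':' |
    LC_ALL=C sort | uniq -c | awk '{$1=$1; print}')

# -h suppresses the filename prefix, so the matched line stays intact.
with_h=$(grep -h -f patterns.in logfile.txt patterns.in |
    LC_ALL=C sort | uniq -c | awk '{$1=$1; print}')

printf 'with cut:\n%s\n\nwith -h:\n%s\n' "$with_cut" "$with_h"
```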

It almost works, with one exception: awk '{print($1 - 1" "$2)}' prints only the first word of the matched line (the second field, $2). If a line has multiple words, how can I print its whole contents, that is, from $2 to the end of the line?

– Paweł Rumian, Commented Oct 11, 2012 at 9:52

@gorkypl, I incorporated Craig Sander's grep -h suggestion and modified the command to work with multi-word lines. Here goes: grep -h -f patterns.in logfile.txt patterns.in | sort | uniq -c | tr -s ' ' |awk ' {count=$1 - 1; file_name=$0; sub($1, "", file_name);print(count" "file_name)}'

– iruvar, Commented Oct 11, 2012 at 16:57
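
A possible variation on the command above, not iruvar's exact version: instead of squeezing spaces with tr and removing the count with sub($1, ...), the awk step can strip the leading count with a single anchored regular expression and print the rest of the line untouched. The multi-word sample lines are hypothetical.

```shell
# Hypothetical multi-word patterns and log lines.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'error one two' 'warn three' > patterns.in
printf '%s\n' 'error one two' 'noise' 'error one two' > logfile.txt

# Save the adjusted count first, then delete uniq's "   N " prefix from $0
# so that the whole matched line, however many words, is printed.
counts=$(grep -h -f patterns.in logfile.txt patterns.in | LC_ALL=C sort | uniq -c |
    awk '{count = $1 - 1; sub(/^ *[0-9]+ /, ""); print count " " $0}')

printf '%s\n' "$counts"
```

With this data the output is 2 error one two followed by 0 warn three.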
