I have very large logs (several gigabytes per day) that may, but need not, contain specific lines. I have to count the number of occurrences of each of these lines on a daily basis.
I have a file, patterns.in, that contains the desired lines. For example:
aaaa
bbbb
cccc
dddd
eeee
ffff
The log files can look like this:
asd
dfg
aaaa
aaaa
sa
sdf
dddd
dddd
dddd
dddd
ghj
bbbb
cccc
cccc
cccc
fgg
fgh
hjk
The first (and perhaps most obvious) approach is to use grep, sort and uniq in the following way:
grep -f patterns.in logfile.txt | sort | uniq -c
which gives the following result:
2 aaaa
1 bbbb
3 cccc
4 dddd
It is close to what I want to achieve, but my desired result is:
2 aaaa
1 bbbb
3 cccc
4 dddd
0 eeee
0 ffff
So the problem is: how do I print '0' if a line from the patterns.in file is not matched? It needs to be done in the simplest possible way, as all I have available is the Cygwin environment.
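One way to get the zero counts is to let awk load patterns.in first and then tally the log in a single pass, which also avoids sorting several gigabytes of data. This is only a sketch: the function name count_patterns is made up for illustration, and it assumes that a log line counts only when it equals a pattern exactly, as in the sample above.

```shell
# Hypothetical helper: count_patterns PATTERN_FILE LOG_FILE
# Assumes whole-line matches and a non-empty pattern file.
count_patterns() {
  awk '
    # First file (patterns): start each distinct pattern at 0, remember order.
    NR == FNR {
      if (!($0 in count)) { count[$0] = 0; order[++n] = $0 }
      next
    }
    # Second file (log): bump the counter when the line is a known pattern.
    $0 in count { count[$0]++ }
    # Print every pattern, including those never seen, in patterns.in order.
    END { for (i = 1; i <= n; i++) print count[order[i]], order[i] }
  ' "$1" "$2"
}
```

Called as count_patterns patterns.in logfile.txt, it prints one "count pattern" line per pattern, in the order the patterns appear in patterns.in rather than sorted.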
Use grep's -o option, otherwise multiple lines that, e.g., contain 'aaaa' plus other (different) data will be treated as different lines by the sort | uniq -c step and will not be counted together.
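A possible way to combine grep -o with the zero counts is to feed patterns.in itself into the pipeline, so that every pattern appears at least once, and then subtract one from each count. The sketch below rests on several assumptions: the function name count_with_grep is hypothetical, the patterns are treated as fixed strings (-F), they contain no runs of whitespace (awk rebuilds each output line with single spaces), and no pattern is a substring of another.

```shell
# Hypothetical helper: count_with_grep PATTERN_FILE LOG_FILE
count_with_grep() {
  {
    cat "$1"              # one guaranteed occurrence of every pattern
    grep -oFf "$1" "$2"   # -o prints only the matched text, -F fixed strings
  } | sort | uniq -c |
    awk '{ $1 = $1 - 1; print }'   # remove the guaranteed occurrence again
}
```

Because of the sort, this variant reads the whole match list before printing anything, and the output comes out in sorted order rather than in the order of patterns.in.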