
I'm doing some data gathering on massive log files and I need to count the occurrences of unique strings. The usual way to do this is with a command like:

zcat <file> | grep -o <filter> | sort | uniq -c | sort -n

What I'm looking to do is not pay the performance penalty of the sort after the grep. Is this possible to do without leaving bash?


3 Answers


You can use awk to count the uniques and avoid sort:

zgrep -o <filter> <file> |
awk '{count[$0]++} END{for (i in count) print count[i], i}'

Also note you can avoid zcat and call zgrep directly.
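For illustration, here is that same awk one-liner run on stand-in input (the `printf` takes the place of the real `zgrep -o <filter> <file>` output). Note that awk's `for (i in count)` iterates in no particular order, so a final `sort -n` is applied here purely for readable output; unlike the sort in the original pipeline, it only runs over the small counted result, not the full log stream:

```shell
printf 'a\nb\na\n' |
awk '{count[$0]++} END{for (i in count) print count[i], i}' |
sort -n   # cheap: sorts only the unique counts, not the raw lines
# prints:
# 1 b
# 2 a
```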




Since you mentioned you don't want to leave bash, you could try using associative arrays: use each input line as a key and the count as its value. To learn about associative arrays, see http://www.gnu.org/software/bash/manual/html_node/Arrays.html.

But, be sure to benchmark the performance - you may nevertheless be better off using sort and uniq, or perl, or ...


Part of the motivation in not using sort is also disk space. These hosts have very little of it on the non-log drive (which is read-only). Sort blows it away for large queries because it caches its list on disk. That said, I will be benchmarking the performance.

jq has built-in associative arrays (JSON objects), so you could consider one of the following approaches, which are both efficient (like awk):

zgrep -o <filter> <file> |
  jq -nR 'reduce inputs as $line ({}; .[$line] += 1)'

This would produce the results as a JSON object with the frequencies as the object's values, e.g.

{
  "a": 2,
  "b": 1,
  "c": 1
}

If you want each line of output to consist of a count and value (in that order), then an appropriate jq invocation would be:

jq -nRr 'reduce inputs as $line ({}; .[$line] += 1)
         | to_entries[] | "\(.value) \(.key)"'

This would produce output like so:

2 a
1 b
1 c

The jq options used here are:

-n # for use with `inputs`
-R # "raw" input
-r # "raw" output

