
I'm doing some data gathering on massive log files and I need to count the occurrences of unique strings. The usual way to do this is with a command like:

zcat <file> | grep -o <filter> | sort | uniq -c | sort -n

What I'm looking to do is not pay the performance penalty of the sort after the grep. Is this possible to do without leaving bash?


3 Answers


You can use awk to count the uniques and avoid sort:

zgrep -o <filter> <file> |
awk '{count[$0]++} END{for (i in count) print count[i], i}'

Also note you can avoid zcat and call zgrep directly.
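For illustration, here is that same awk one-liner run on stand-in input (the `printf` takes the place of the real `zgrep -o <filter> <file>` output). Note that awk's `for (i in count)` iterates in no particular order, so a final `sort -n` is applied here purely for readable output; unlike the sort in the original pipeline, it only runs over the small counted result, not the full log stream:

```shell
printf 'a\nb\na\n' |
awk '{count[$0]++} END{for (i in count) print count[i], i}' |
sort -n   # cheap: sorts only the unique counts, not the raw lines
# prints:
# 1 b
# 2 a
```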




Since you mentioned you don't want to leave bash, you could try using associative arrays: use each input line as a key and the count as its value. To learn about associative arrays, see http://www.gnu.org/software/bash/manual/html_node/Arrays.html.

But, be sure to benchmark the performance - you may nevertheless be better off using sort and uniq, or perl, or ...


Part of the motivation in not using sort is also disk space. These hosts have very little of it on the non-log drive (which is read-only). Sort blows it away for large queries because it caches its list on disk. That said, I will be benchmarking the performance.

jq has built-in associative arrays (JSON objects), so you could consider one of the following approaches, which are both efficient (like awk):

zgrep -o <filter> <file> |
  jq -nR 'reduce inputs as $line ({}; .[$line] += 1)'

This would produce the results as a JSON object with the frequencies as the object's values, e.g.

{
  "a": 2,
  "b": 1,
  "c": 1
}

If you want each line of output to consist of a count and value (in that order), then an appropriate jq invocation would be:

jq -nRr 'reduce inputs as $line ({}; .[$line] += 1)
         | to_entries[] | "\(.value) \(.key)"'

This would produce output like so:

2 a
1 b
1 c

The jq options used here are:

-n # for use with `inputs`
-R # "raw" input
-r # "raw" output

