Awk: Removing duplicate lines without sorting after matching conditions

Question

I have a list of devices from which I need to remove duplicates (keeping only the first occurrence) while preserving order and matching a condition. In this case I'm looking for a specific string and then printing the field with the device name. Here is some example raw data from the sar application:

10:02:01 AM       sdc      0.70      0.00      8.13     11.62      0.00      1.29      0.86      0.06
10:02:01 AM       sda      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
10:02:01 AM       sdb      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:          sdc      1.31      3.73     99.44     78.46      0.02     17.92      0.92      0.12
Average:          sda      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:          sdb      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
10:05:01 AM       sdc      2.70      0.00     39.92     14.79      0.02      5.95      0.31      0.08
10:05:01 AM       sda      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
10:05:01 AM       sdb      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
10:06:01 AM       sdc      0.83      0.00     10.00     12.00      0.00      0.78      0.56      0.05
11:04:01 AM       sda      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:04:01 AM       sdb      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:          sdc      0.70      2.55      8.62     15.91      0.00      1.31      0.78      0.05
Average:          sda      0.12      0.95      0.00      7.99      0.00      0.60      0.60      0.01
Average:          sdb      0.22      1.78      0.00      8.31      0.00      0.54      0.52      0.01

The following gives me the list of devices from lines containing the word "Average", but the output does not keep the input order (it comes out sorted):

sar -dp | awk '/Average/ {devices[$2]} END {for (device in devices) {print device}}'
sda
sdb
sdc

The following gives me exactly what I want (command from here):

sar -dp | awk '/Average/ {print $2}' | awk '!devices[$0]++'
sdc
sda
sdb

Maybe I'm missing something painfully obvious, but I can't figure out how to do the same in one awk command, that is, without piping the output of the first awk into the second awk.

Jotne · Accepted Answer · 2014-07-17 17:57:00Z

3

You can do:

sar -dp | awk '/Average/ && !devices[$2]++ {print $2}' 
sdc
sda
sdb

The problem is the for (device in devices) loop. Awk does not guarantee any particular order when it walks an array with for ... in, so the devices do not come out in the order they were read.
I have read a long, complicated explanation of why somewhere, but I don't have the link.

edited Jul 17, 2014 at 17:57

answered Jul 17, 2014 at 17:51

Jotne


4 Comments

Etan Reisner Over a year ago

As far as I know, awk makes no claims about the order of keys retrieved from an array. In GNU awk 4, though, you can tell it which sort order to use when retrieving keys (but I don't know if "input order" is an option).

Ed Morton Over a year ago

Awk arrays are stored as hash tables for efficiency. The in operator retrieves the elements from the array in the order they are stored in memory, i.e. in whatever order the hashing algorithm arranges them. If you need an array traversed in a specific order you need to decide which order (insertion order? alphabetical? numerical? by element? by index? something else?) and program that order somehow. With GNU awk you can assign an order by populating PROCINFO["sorted_in"], see gnu.org/software/gawk/manual/gawk.html#Scanning-an-Array.
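One way to "program that order somehow" for insertion order, as a minimal sketch in portable awk: record each new key in a second array under an increasing counter, then loop over the counter in the END block instead of using for (device in devices). The names list_devices, order and n are illustrative and not part of the original commands, and the sample lines are a trimmed subset of the data in the question.

```shell
# list_devices is a hypothetical helper name; it reads sar-style lines on stdin
# and prints each device from "Average" lines once, in first-seen order.
list_devices() {
  awk '
    /Average/ && !($2 in devices) {   # first time this device is seen
        devices[$2] = 1               # mark the device as seen
        order[++n] = $2               # remember the position it arrived in
    }
    END {
        for (i = 1; i <= n; i++)      # numeric loop, so the order is deterministic
            print order[i]
    }'
}

# Sample lines taken from the question (columns truncated for brevity).
printf '%s\n' \
  'Average:          sdc      1.31      3.73' \
  '10:05:01 AM       sdc      2.70      0.00' \
  'Average:          sda      0.00      0.00' \
  'Average:          sdb      0.00      0.00' \
  'Average:          sdc      0.70      2.55' | list_devices
```

For this particular task the END loop is more than is needed, since printing on first sight already preserves input order, but the pattern is useful when the keys have to be processed after all input has been read.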

Jotne Over a year ago

@EdMorton Thanks for the refresher. My memory is somewhat limited and for some reason has started to remove stuff by itself without telling me :) This is the link to the sorted_in documentation: gnu.org/software/gawk/manual/…

Ed Morton Over a year ago

@Jotne tell me about it. I learned French in school and a few years ago started learning Spanish which I eventually realized was just pushing the French out of my brain to make room. The net result is that I now can speak neither of them and am just barely holding onto English....

Etan Reisner · Accepted Answer · 2014-07-17 17:52:57Z

1

awk '/Average/ && !devices[$2]++ {print $2}' sar.in

You just need to combine the two tests. The only caveat is that in your second awk command the entire line ($0) is field two of the original input, so in the combined command you need to replace $0 with $2.
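To see why the combined condition works, here is a small sketch that prints the counter next to each matching line. devices[$2]++ evaluates to the count before the increment, so !devices[$2]++ is true only the first time a device appears. The helper name trace_dedup and the variable before are illustrative, and the sample lines are trimmed from the question's data.

```shell
# trace_dedup is a hypothetical helper: for each "Average" line it shows the
# device, the count before the increment, and whether the line would be printed.
trace_dedup() {
  awk '/Average/ {
      before = devices[$2]++          # value before the increment (0 on first sight)
      print $2, before, (before ? "skip" : "print")
  }'
}

# Sample lines taken from the question (columns truncated for brevity).
printf '%s\n' \
  'Average:          sdc      1.31' \
  '10:05:01 AM       sdc      2.70' \
  'Average:          sda      0.00' \
  'Average:          sdc      0.70' | trace_dedup
```

The second sdc line under "Average" is reported as skip because its counter is already 1, while the 10:05:01 line never reaches the counter at all since it fails the /Average/ test.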

answered Jul 17, 2014 at 17:52

Etan Reisner


1 Comment

Jotne Over a year ago

This looks very like my post :)
