
I am trying to get the unique lines in a file with multiple columns.

My file "file.txt" contains sample record below

20230830,52678,004,Apple,21
20230830,52678,004,Apple,20
20230830,52678,004,Apple,19
20230831,47689,001,Orange,15
20230901,47620,002,Grape,29

My desired output is to print only the lines that are unique across columns 1 to 4, regardless of the value in column 5:

20230831,47689,001,Orange,15
20230901,47620,002,Grape,29

I tried using sed to add a unique separator between columns 1-4 and column 5

Then I use an awk command to get the unique lines based on columns 1-4:

sed 's/,/|/4' file.txt | awk -F"|" '{arr[$1]++} END{for(i in arr) if(arr[i]==1) print $0}'

This code works with a small set of data, but when I use it on a file with 1000 lines, I get...

20230831,47689,001,Orange,15
20230831,47689,001,Orange,15
20230831,47689,001,Orange,15
20230831,47689,001,Orange,15
...

The unique values keep on coming, and they are duplicated. It seems like I'm only getting one unique line and it keeps repeating.

Can you help me figure out if there's something wrong with my code?

I am expecting to print only the unique lines, like this:

20230831,47689,001,Orange,15
20230901,47620,002,Grape,29
  • What part of your code is supposed to eliminate duplicates? Commented Sep 1, 2023 at 14:41
  • @markp-fuso I've edited the code. You can try it again Commented Sep 1, 2023 at 14:53

5 Answers

$ awk -F'[^,]*$' 'FNR==NR{cnt[$1]++; next} cnt[$1]==1' inputfile inputfile
20230831,47689,001,Orange,15
20230901,47620,002,Grape,29

As Ed Morton mentioned in the comments, the field separator -F'[^,]*$' matches the last field, so we can use the rest of the line (from the beginning up to and including the last ,) as the key $1

In the first pass we count each key: cnt[$1]++

In the second pass we look for cnt[$1]==1 to print only the unique keys
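
To see what that key looks like, here is a quick check on one sample line (output as observed with GNU awk; note the trailing comma is part of $1):

$ echo '20230830,52678,004,Apple,21' | awk -F'[^,]*$' '{print "key=<" $1 ">"}'
key=<20230830,52678,004,Apple,>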


3 Comments

I like the concept of processing the file twice. Except the key is the concatenation of the first 4 fields.
@glennjackman ufopilot is using the whole 5th field as the FS so $1 in the script is the concatenation of the first 4 ,-separated fields. Cute.
@ufopilot I do like that idea you used of setting FS in a way that it'd match the last field to isolate the first 4 fields to use as a key and obviously it can be expanded to -F'(,[^,]*){n}$' to isolate the first NF-n fields (n can even be calculated in the script), I don't think I've seen that before and may have to adopt it for similar problems in future. Thanks!
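
As a concrete illustration of that generalisation (a sketch; assumes GNU awk and hard-codes n=2, i.e. keep everything before the last two fields as $1):

$ echo '20230830,52678,004,Apple,21' | awk -F'(,[^,]*){2}$' '{print $1}'
20230830,52678,004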

This Unix pipe will identify the lines in your file that are duplicates based on the first 4 fields (note that uniq -d only reports adjacent duplicates, so rows sharing a key are assumed to be grouped together, as in the sample):

$ cut -d, -f 1-4 file | uniq -d
20230830,52678,004,Apple

You can then use grep to inverse that match so duplicates are skipped:

$ grep -vF -f <(cut -d, -f 1-4 file | uniq -d) file  

Prints:

20230831,47689,001,Orange,15
20230901,47620,002,Grape,29

The advantage here is lower memory requirement since you do not have to hold the whole file in memory.


Ed Morton brings up a legitimate point: the first four columns of a,b,c,d,e, used as a fixed string, would also match a line such as Xa,b,c,dY,e in grep.

To solve this, use awk instead of grep:

awk -F, '{k=$1 FS $2 FS $3 FS $4} FNR==NR{d[k]; next} !(k in d)' <(cut -d , -f 1-4 file | uniq -d ) file

This still has the advantage of only holding the duplicates in memory rather than the entire file (which matters if you are on something small like a Raspberry Pi), but in most environments you can simply use awk alone.

7 Comments

That will fail when the first 4 fields are a subset of a different 4 fields, e.g. if a,b,c,d,e and Xa,b,c,dY,e both exist in the input. It'll also fail if the input can contain regexp metachars, e.g. if a,.*,c,d,e exists in the input. Unfortunately you can't solve those problems without introducing a call to sed or similar to transform the cut | uniq output before running grep using it.
The grep -F command is for fixed strings so metachar and substrings are ignored, no?
grep -F would solve the metachar problem but not the substrings one. You could use sed to provide anchors around the cut | uniq output to solve the substring problem but then you'd need to remove -F and then you'd have the metachars problem (which you could do more sed escaping to solve). With different data you could use -Fx to solve both of those problems but that won't work in this case since they need to just match part of the line.
Right but then you'd still have a,b,c,d,e vs Xa,b,c,d,e to deal with. That was the point of my comment - you need to introduce 1 or more calls to sed or similar to make this approach work robustly.
Fair enough. Use awk instead of grep. See edit.
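
To make the substring pitfall concrete, a small hypothetical example (both lines invented for illustration): the fixed string a,b,c,d matches inside both lines, so grep -v drops both of them and prints nothing.

$ printf 'a,b,c,d,e\nXa,b,c,dY,e\n' | grep -vF 'a,b,c,d'
$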

There's no need to use sed to convert the field delimiter from , to | since awk is able to parse the file on ,.

One awk idea:

awk  '
BEGIN { FS=OFS="," }
      { key = $1 OFS $2 OFS $3 OFS $4
        lines[key]=$5
        counts[key]++
      }
END   { for (i in counts) 
            if (counts[i]==1)                # unique if count == 1
               print i,lines[i]
      }
' file.txt

This generates:

20230901,47620,002,Grape,29
20230831,47689,001,Orange,15

NOTE: the order in which array indices are processed is not guaranteed; if the output must be sorted in a specific order, more code can be added (see the sketch below)
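
For instance, with GNU awk one option (a sketch, not required by the OP) is to have the END loop visit the keys in sorted order via PROCINFO["sorted_in"]:

awk '
BEGIN { FS=OFS="," }
      { key = $1 OFS $2 OFS $3 OFS $4
        lines[key]=$5
        counts[key]++
      }
END   { PROCINFO["sorted_in"] = "@ind_str_asc"   # GNU awk: iterate indices as strings, ascending
        for (i in counts)
            if (counts[i]==1)
               print i,lines[i]
      }
' file.txt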


re: OP's comment "I just need to make it in one line". A couple of options come to mind:

Jam the current code into one line, eg:

awk 'BEGIN {FS=OFS=","} {key = $1 OFS $2 OFS $3 OFS $4;lines[key]=$5;counts[key]++} END {for (i in counts) if (counts[i]==1) print i,lines[i]}' file.txt

Place the awk code into a separate file (eg, key.awk) then reference the file in the awk invocation, eg:

$ cat key.awk
BEGIN { FS=OFS="," }
      { key = $1 OFS $2 OFS $3 OFS $4
        lines[key]=$5
        counts[key]++
      }
END   { for (i in counts) 
            if (counts[i]==1)
               print i,lines[i]
      }

$ awk -f key.awk file.txt
20230901,47620,002,Grape,29
20230831,47689,001,Orange,15

6 Comments

Would there be a small change to your code if I want to reverse the logic? Say I want to get only the duplicates now.
change if (counts[i]==1) to if (counts[i]>1)
I did try that, but it only prints one duplicate line per group. Now I'm stuck.
ah, yeah, that's right ... you'd need to store multiple rows per key (see the sketch after these comments); at this point you're modifying the requirements of the original question; chameleon questions are frowned upon; the suggested approach is to take the answer(s) you've received so far, see if you can modify them to address the new requirement, and if you run into issues then ask a new question
@Candy a recent Q&A that may be of interest
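
For reference, a sketch of the "store multiple rows" idea mentioned above (not part of the original answer): keep every line for a key and print only the keys seen more than once.

awk '
BEGIN { FS="," }
      { key = $1 FS $2 FS $3 FS $4
        lines[key] = (key in lines) ? lines[key] ORS $0 : $0   # append every row for this key
        counts[key]++
      }
END   { for (i in counts)
            if (counts[i] > 1)               # duplicates only
               print lines[i]
      }
' file.txt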

awk alone can solve your problem:

kent$ awk -F, '{k=$1 FS $2 FS $3 FS $4; a[k]++; b[k]=$0}
               END{for(x in a) if(a[x]==1)print b[x]}' file
20230901,47620,002,Grape,29
20230831,47689,001,Orange,15



"something wrong with my code?"

$0 used inside END's action denotes the last line read, therefore

awk 'END{print $0}' file.txt

will give the same output as

tail --lines=1 file.txt
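
For completeness, a sketch of one way to keep the question's sed | awk structure but remember a line per key instead of relying on $0 in END ($2 here is the original 5th column):

sed 's/,/|/4' file.txt |
awk -F'|' '{cnt[$1]++; val[$1]=$2}
           END{for (i in cnt) if (cnt[i]==1) print i "," val[i]}'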

1 Comment

The value of $0 inside an END section is undefined by POSIX, so it can have different values in different awk variants, or even in different versions of the same awk variant, and for any given awk it could have a different value in the next release than it does today; e.g. it might contain the last record read, or it might be null.
