
I am trying to get the unique lines in a file with multiple columns.

My file "file.txt" contains sample record below

20230830,52678,004,Apple,21
20230830,52678,004,Apple,20
20230830,52678,004,Apple,19
20230831,47689,001,Orange,15
20230901,47620,002,Grape,29

My desired output is to print only the lines that are unique across columns 1 to 4, regardless of the value in column 5:

20230831,47689,001,Orange,15
20230901,47620,002,Grape,29

I tried using sed to add a unique separator between columns 1-4 and column 5

Then I use an awk command to get the unique lines based on columns 1-4:

sed 's/,/|/4' file.txt | awk -F"|" '{arr[$1]++} END{for(i in arr) if(arr[i]==1) print $0}'

This code works with a small set of data, but when I use it on a file with 1000 lines, I get...

20230831,47689,001,Orange,15
20230831,47689,001,Orange,15
20230831,47689,001,Orange,15
20230831,47689,001,Orange,15
...

The unique values keep on coming, and they are duplicated. It seems like I'm only getting one unique line and it keeps repeating.

Can you help me figure out if there's something wrong with my code?

I am expecting to print only the unique lines, like this:

20230831,47689,001,Orange,15
20230901,47620,002,Grape,29
  • What part of your code is supposed to eliminate duplicates? Commented Sep 1, 2023 at 14:41
  • @markp-fuso I've edited the code. You can try it again Commented Sep 1, 2023 at 14:53

5 Answers

$ awk -F'[^,]*$' 'FNR==NR{cnt[$1]++; next} cnt[$1]==1' inputfile inputfile
20230831,47689,001,Orange,15
20230901,47620,002,Grape,29

As Ed Morton mentioned in the comments, the field separator -F'[^,]*$' matches the last field, so we can use the rest of the line (from the beginning up to and including the last ,) as the key $1

In the first pass we count each key: cnt[$1]++

In the second pass we look for cnt[$1]==1 to print only the unique keys
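
To see what that key looks like, here is a quick check on one sample line (output as observed with GNU awk; note the trailing comma is part of $1):

$ echo '20230830,52678,004,Apple,21' | awk -F'[^,]*$' '{print "key=<" $1 ">"}'
key=<20230830,52678,004,Apple,>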


3 Comments

I like the concept of processing the file twice. Except the key is the concatenation of the first 4 fields.
@glennjackman ufopilot is using the whole 5th field as the FS so $1 in the script is the concatenation of the first 4 ,-separated fields. Cute.
@ufopilot I do like that idea you used of setting FS in a way that it'd match the last field to isolate the first 4 fields to use as a key and obviously it can be expanded to -F'(,[^,]*){n}$' to isolate the first NF-n fields (n can even be calculated in the script), I don't think I've seen that before and may have to adopt it for similar problems in future. Thanks!
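
As a concrete illustration of that generalisation (a sketch; assumes GNU awk and hard-codes n=2, i.e. keep everything before the last two fields as $1):

$ echo '20230830,52678,004,Apple,21' | awk -F'(,[^,]*){2}$' '{print $1}'
20230830,52678,004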

This Unix pipe will identify the lines in your file that are duplicates based on the first 4 fields (note that uniq -d only reports adjacent duplicates, so rows sharing a key are assumed to be grouped together, as in the sample):

$ cut -d, -f 1-4 file | uniq -d
20230830,52678,004,Apple

You can then use grep to inverse that match so duplicates are skipped:

$ grep -vF -f <(cut -d, -f 1-4 file | uniq -d) file  

Prints:

20230831,47689,001,Orange,15
20230901,47620,002,Grape,29

The advantage here is lower memory requirement since you do not have to hold the whole file in memory.


Ed Morton brings up a legitimate point: the first four columns of a,b,c,d,e, used as a fixed string, would also match a line such as Xa,b,c,dY,e in grep.

To solve this, use awk instead of grep:

awk -F, '{k=$1 FS $2 FS $3 FS $4} FNR==NR{d[k]; next} !(k in d)' <(cut -d , -f 1-4 file | uniq -d ) file

This still has the advantage of only holding the duplicates in memory rather than the entire file (which matters if you are on something small like a Raspberry Pi), but in most environments you can simply use awk alone.

7 Comments

That will fail when the first 4 fields are a subset of a different 4 fields, e.g. if a,b,c,d,e and Xa,b,c,dY,e both exist in the input. It'll also fail if the input can contain regexp metachars, e.g. if a,.*,c,d,e exists in the input. Unfortunately you can't solve those problems without introducing a call to sed or similar to transform the cut | uniq output before running grep using it.
The grep -F command is for fixed strings so metachar and substrings are ignored, no?
grep -F would solve the metachar problem but not the substrings one. You could use sed to provide anchors around the cut | uniq output to solve the substring problem but then you'd need to remove -F and then you'd have the metachars problem (which you could do more sed escaping to solve). With different data you could use -Fx to solve both of those problems but that won't work in this case since they need to just match part of the line.
Right but then you'd still have a,b,c,d,e vs Xa,b,c,d,e to deal with. That was the point of my comment - you need to introduce 1 or more calls to sed or similar to make this approach work robustly.
Fair enough. Use awk instead of grep. See edit.
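
To make the substring pitfall concrete, a small hypothetical example (both lines invented for illustration): the fixed string a,b,c,d matches inside both lines, so grep -v drops both of them and prints nothing.

$ printf 'a,b,c,d,e\nXa,b,c,dY,e\n' | grep -vF 'a,b,c,d'
$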

There's no need to use sed to convert the field delimiter from , to | since awk is able to parse the file on ,.

One awk idea:

awk  '
BEGIN { FS=OFS="," }
      { key = $1 OFS $2 OFS $3 OFS $4
        lines[key]=$5
        counts[key]++
      }
END   { for (i in counts) 
            if (counts[i]==1)                # unique if count == 1
               print i,lines[i]
      }
' file.txt

This generates:

20230901,47620,002,Grape,29
20230831,47689,001,Orange,15

NOTE: the order in which array indices are processed is not guaranteed; if the output must be sorted in a specific order, more code can be added (see the sketch below)
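
For instance, with GNU awk one option (a sketch, not required by the OP) is to have the END loop visit the keys in sorted order via PROCINFO["sorted_in"]:

awk '
BEGIN { FS=OFS="," }
      { key = $1 OFS $2 OFS $3 OFS $4
        lines[key]=$5
        counts[key]++
      }
END   { PROCINFO["sorted_in"] = "@ind_str_asc"   # GNU awk: iterate indices as strings, ascending
        for (i in counts)
            if (counts[i]==1)
               print i,lines[i]
      }
' file.txt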


re: OP's comment "I just need to make it in one line". A couple of options come to mind:

Jam the current code into one line, eg:

awk 'BEGIN {FS=OFS=","} {key = $1 OFS $2 OFS $3 OFS $4;lines[key]=$5;counts[key]++} END {for (i in counts) if (counts[i]==1) print i,lines[i]}' file.txt

Place the awk code into a separate file (eg, key.awk) then reference the file in the awk invocation, eg:

$ cat key.awk
BEGIN { FS=OFS="," }
      { key = $1 OFS $2 OFS $3 OFS $4
        lines[key]=$5
        counts[key]++
      }
END   { for (i in counts) 
            if (counts[i]==1)
               print i,lines[i]
      }

$ awk -f key.awk file.txt
20230901,47620,002,Grape,29
20230831,47689,001,Orange,15

6 Comments

Would there be a small change to your code if I want to reverse the logic? Say I want to get only the duplicates now.
change if (counts[i]==1) to if (counts[i]>1)
I did try that, but it only prints one duplicate line per group. Now I'm stuck.
ah, yeah, that's right ... you'd need to store multiple rows per key (see the sketch after these comments); at this point you're modifying the requirements of the original question; chameleon questions are frowned upon; the suggested approach is to take the answer(s) you've received so far, see if you can modify them to address the new requirement, and if you run into issues then ask a new question
@Candy a recent Q&A that may be of interest
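
For reference, a sketch of the "store multiple rows" idea mentioned above (not part of the original answer): keep every line for a key and print only the keys seen more than once.

awk '
BEGIN { FS="," }
      { key = $1 FS $2 FS $3 FS $4
        lines[key] = (key in lines) ? lines[key] ORS $0 : $0   # append every row for this key
        counts[key]++
      }
END   { for (i in counts)
            if (counts[i] > 1)               # duplicates only
               print lines[i]
      }
' file.txt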

awk alone can solve your problem:

kent$ awk -F, '{k=$1 FS $2 FS $3 FS $4; a[k]++; b[k]=$0}
               END{for(x in a) if(a[x]==1)print b[x]}' file
20230901,47620,002,Grape,29
20230831,47689,001,Orange,15



"something wrong with my code?"

$0 used inside END's action denotes the last line read, therefore

awk 'END{print $0}' file.txt

will give the same output as

tail --lines=1 file.txt
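
For completeness, a sketch of one way to keep the question's sed | awk structure but remember a line per key instead of relying on $0 in END ($2 here is the original 5th column):

sed 's/,/|/4' file.txt |
awk -F'|' '{cnt[$1]++; val[$1]=$2}
           END{for (i in cnt) if (cnt[i]==1) print i "," val[i]}'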

1 Comment

The value of $0 inside an END section is undefined by POSIX, so it can have different values in different awk variants, or even in different versions of the same awk variant, and for any given awk it could have a different value in the next release than it does today; e.g. it might contain the last record read, or it might be null.
