I have a file with two columns separated by tabs as follows:
OG0000000 PF03169,PF03169,PF03169,MAC1_004431-T1,
OG0000002 PF07690,PF00083,PF00083,PF07690,PF00083,
OG0000003 MAC1_000127-T1,
OG0000004 PF13246,PF00689,PF00690,
OG0000005 PF00012,PF01061,PF12697,PF00012,
I just want to remove duplicate strings within the second column, while not changing anything in the first column, so that my final output looks like this:
OG0000000 PF03169,MAC1_004431-T1,
OG0000002 PF07690,PF00083,
OG0000003 MAC1_000127-T1,
OG0000004 PF13246,PF00689,PF00690,
OG0000005 PF00012,PF01061,PF12697,
I tried to start this by using awk.
awk 'BEGIN{RS=ORS=","} !seen[$0]++' file.txt
But my output still contains duplicates whenever the repeated string happens to be the first entry in the second column:
OG0000000 PF03169,PF03169,MAC1_004431-T1,
OG0000002 PF07690,PF00083,PF07690,
OG0000003 MAC1_000127-T1,
OG0000004 PF13246,PF00689,PF00690,
OG0000005 PF00012,PF01061,PF12697,PF00012,
I realize the problem is that the first record awk grabs is everything up to the first comma, which includes the first column, but I'm still rough with awk and couldn't figure out how to fix this without mangling the first column. Thanks in advance!
`$0` denotes the whole record. With `RS=ORS=","`, each record is the text between commas, so the first record on every line (e.g. `OG0000002 PF07690`) still carries the first column; `seen` stores that whole string, which never matches the bare `PF07690` that comes later. Also, one clarification: suppose line 1 has `OG1 A,B,C,B` and line 2 has `OG2 B,D`. Should the `B` from line 2 be removed too, because it already appeared in line 1?
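Assuming duplicates only need to be removed within each line (not across lines), one way to sidestep the `RS` problem entirely is to leave the record separator alone and split the second column manually. This is a sketch, not the only approach; it assumes the two columns are tab- or space-separated, as in the sample:

```shell
# De-duplicate the comma-separated tokens in column 2, line by line,
# leaving column 1 untouched.
awk '{
    n = split($2, parts, ",")      # break column 2 into tokens
    split("", seen)                # reset the per-line lookup table
    out = ""
    for (i = 1; i <= n; i++)
        if (parts[i] != "" && !seen[parts[i]]++)
            out = out parts[i] ","  # keep first occurrence, restore comma
    printf "%s\t%s\n", $1, out
}' file.txt
```

`split("", seen)` empties the array in any POSIX awk; GNU awk also accepts the shorter `delete seen`. Because `seen` is reset on every line, a token that appeared on an earlier line is not removed from later lines.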