
I have a file with two columns separated by tabs as follows:

OG0000000   PF03169,PF03169,PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,PF00083,PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,PF00012,

I just want to remove duplicate strings within the second column, while not changing anything in the first column, so that my final output looks like this:

OG0000000   PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,

I tried to start this by using awk.

awk 'BEGIN{RS=ORS=","} !seen[$0]++' file.txt

But my output looks like this; duplicates remain whenever the duplicated string occurs first on the line.

OG0000000   PF03169,PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,PF07690,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,PF00012,

I realize the problem is that the first record awk grabs is everything up to the first comma, so it includes the first column, but I'm still rough with awk commands and couldn't figure out how to fix this without messing up the first column. Thanks in advance!
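
Printing the records awk actually sees with RS="," makes this visible; a minimal sketch of the diagnosis, using the same file.txt as above:

# with RS="," the first record on each line is "<ID><TAB><first value>",
# so seen[] can never match it against the bare values that follow
awk 'BEGIN{RS=","} NR<=4{printf "record %d: [%s]\n", NR, $0}' file.txt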

2 Comments

  • $0 denotes the whole record (here, everything between commas), so your seen array tracks unique records, while you are interested in the individual values of the second column only.
  • I think you also did not specify the following case: line 1 has OG1 A,B,C,B and line 2 has OG2 B,D. Should the B from line 2 be removed too, because it already appeared in line 1?

6 Answers


This awk should work for you:

awk -F '[\t,]' '
{
   printf "%s", $1 "\t"
   # the trailing comma leaves an empty last field, so stop at i<NF
   for (i=2; i<NF; ++i) {
      if (!seen[$i]++)
         printf "%s,", $i
   }
   print ""
   delete seen
}' file

OG0000000   PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,

PS: Matching the expected output shown in the question, this solution also prints a trailing comma on each line.
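
If the trailing comma were ever unwanted, a minimal variation of the same logic (untested sketch) buffers each line and joins values with a separator instead of printing field by field:

awk -F '[\t,]' '
{
   out = $1 "\t"; sep = ""
   # skip the empty last field from the trailing comma; join values with sep
   for (i=2; i<=NF; ++i)
      if ($i != "" && !seen[$i]++) { out = out sep $i; sep = "," }
   print out
   delete seen
}' file

This also sidesteps the (i<NF ? "," : ORS) pitfall mentioned in the comments below, since the separator is decided by what has already been printed rather than by the field position.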


5 Comments

Damn, compound FS -- shorter again!
You can't rely on the usual (i<NF ? "," : ORS) idiom for this because if $NF is a duplicate then you won't print ORS for that line.
Yes, that's a good point Ed. I have noted that OP expects a trailing comma anyway, so I kept it simple.
I don't see any other notes or comments anywhere suggesting a trailing , is required so I'll leave my comment in place for now but if you update your answer to mention that I'll delete it.
Ed: it is as per the expected output shown in the question, which has a trailing comma on every line. I have made a note of it in my answer.

Another approach, using the same split of $2 into an array but keeping a separate counter for the position of the non-duplicated values, could be done as:

awk '
  { 
    printf "%s\t", $1
    delete seen
    n = split($2,arr,",")
    pos = 0
    for (i=1;i<=n;i++) { 
      if (! (arr[i] in seen)) { 
        printf "%s%s", pos ? "," : "", arr[i]
        seen[arr[i]]=1
        pos++ 
      }
    }
    print ""
  }
' file.txt

Example Output

With your input in file.txt, the output is:

OG0000000       PF03169,MAC1_004431-T1,
OG0000002       PF07690,PF00083,
OG0000003       MAC1_000127-T1,
OG0000004       PF13246,PF00689,PF00690,
OG0000005       PF00012,PF01061,PF12697,

2 Comments

if (! (arr[i] in seen)) { foo; seen[arr[i]]=1 } can be done a bit more concisely and idiomatically with if (!seen[arr[i]]++) { foo }
++ so many good solutions

With your shown samples and attempts, please try the following awk code. There is no need to set RS and ORS (the record and output record separators) for this requirement; instead, set FS and OFS to tab, split the second field on commas, and print the fields accordingly.

awk '
BEGIN{ FS=OFS="\t" }
{
  val=""
  delete seen
  num=split($2,arr,",")
  for(i=1;i<=num;i++){
   if(!seen[arr[i]]++){
      val=(val?val ",":"") arr[i]
   }
  }
  print $1,val
}
' Input_file
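
With the sample input this should print the same deduplicated lines as the answers above; the trailing comma survives because split() leaves a final empty element, which gets appended once:

OG0000000   PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,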

2 Comments

An array used in the context of if(!arr[$i]++){ is idiomatically named seen[] rather than arr[].
Hang on - you can't split $2 by FS when $2 is already the result of splitting by FS (ditto for splitting by , when FS is ,)

This might work for you (GNU sed):

sed -E ':a;s/(\s+.*(\b\S+,).*)\2/\1/;ta' file

Iterate through a line removing any duplicate strings after whitespace.
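
A quick one-line demo, assuming GNU sed:

printf 'OG0000002\tPF07690,PF00083,PF00083,PF07690,PF00083,\n' |
sed -E ':a;s/(\s+.*(\b\S+,).*)\2/\1/;ta'
# prints: OG0000002   PF07690,PF00083,

Each pass deletes the last occurrence of some repeated token, and ta loops back until no substitution succeeds.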

2 Comments

++ Nice, a shorter gnu-sed
Very nice, but not sure - have you missed another \b before \2?

Using GNU sed

$ sed -E ':a;s/([^ \t]*[ \t]+)?(([[:alnum:]]+,).*)\3/\1\2/;ta' input_file
OG0000000   PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,

4 Comments

This deserves a nod for pure sed creativity. I've used sed a long time, and appreciate the ta repeat on successful substitution, but I'm still scratching my head a bit on the identification of dups and the use of the first two backreferences to make it so. (I'll get there, it will just take a bit more scratching...) The biggest question is: what if the dups were non-adjacent?
@DavidC.Rankin Make the first backreference optional so the second backreference can loop. Nest a third parenthesis within the second backreference then use a greedy regex to remove the last occurrence of the third match returning everything within the second parenthesis in the loop. Sure, it will also handle non-adjacent dups on the same line as it does in the sample provided.
I had kinda sorted that from the 20,000 foot view, but I have got to tell you, that is certainly an impressive use of sed. Well done. (note: when I say "kinda" I mean I had sorted the flow and recognized the backreference nesting -- but was far from digesting it to the point where I had an "Ahah!" moment :)
++ but I think this can be modified to POSIX sed also
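
For what it's worth, one untested POSIX-flavoured sketch along those lines, borrowing the simpler whitespace anchor from the other sed answer above: BRE syntax, [[:space:]] instead of \s, and a bracket expression standing in for \S (covering the alphanumerics, _, . and - seen in the sample; the missing-\b caveat above applies here too):

sed -e ':a' -e 's/\([[:space:]].*\([[:alnum:]_.-]\{1,\},\).*\)\2/\1/' -e 'ta' file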

Here is a ruby:

ruby -ane 'puts "#{$F[0]}\t#{$F[1].split(",").uniq.join(",")},"' file
OG0000000   PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,

Note: splitting with the lookaround regex /(?<=.),(?=.)/ leaves the final element as "PF00083," with the comma attached, so it never compares equal to an earlier bare "PF00083" and duplicates survive. A plain split(",") drops the trailing empty string, and the trailing comma is added back after the join.

