
I have a file with two columns separated by tabs as follows:

OG0000000   PF03169,PF03169,PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,PF00083,PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,PF00012,

I just want to remove duplicate strings within the second column, while not changing anything in the first column, so that my final output looks like this:

OG0000000   PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,

I tried to start this by using awk.

awk 'BEGIN{RS=ORS=","} !seen[$0]++' file.txt

But my output looks like this; duplicates remain whenever the duplicated string occurs first on the line.

OG0000000   PF03169,PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,PF07690,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,PF00012,

I realize the problem is that the first record awk grabs is everything up to the first comma, so it includes the first column, but I'm still rough with awk commands and couldn't figure out how to fix this without messing up the first column. Thanks in advance!
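
Printing the records awk actually sees with RS="," makes this visible; a minimal sketch of the diagnosis, using the same file.txt as above:

# with RS="," the first record on each line is "<ID><TAB><first value>",
# so seen[] can never match it against the bare values that follow
awk 'BEGIN{RS=","} NR<=4{printf "record %d: [%s]\n", NR, $0}' file.txt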

2 Comments

  • $0 denotes the whole record (here, everything between commas), so your seen array tracks unique records, while you are interested in the individual values of the second column only.
  • I think you also did not specify the following case: line 1 has OG1 A,B,C,B and line 2 has OG2 B,D. Should the B from line 2 be removed too, because it already appeared in line 1?

6 Answers


This awk should work for you:

awk -F '[\t,]' '
{
   printf "%s", $1 "\t"
   # the trailing comma leaves an empty last field, so stop at i<NF
   for (i=2; i<NF; ++i) {
      if (!seen[$i]++)
         printf "%s,", $i
   }
   print ""
   delete seen
}' file

OG0000000   PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,

PS: Matching the expected output shown in the question, this solution also prints a trailing comma on each line.
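
If the trailing comma were ever unwanted, a minimal variation of the same logic (untested sketch) buffers each line and joins values with a separator instead of printing field by field:

awk -F '[\t,]' '
{
   out = $1 "\t"; sep = ""
   # skip the empty last field from the trailing comma; join values with sep
   for (i=2; i<=NF; ++i)
      if ($i != "" && !seen[$i]++) { out = out sep $i; sep = "," }
   print out
   delete seen
}' file

This also sidesteps the (i<NF ? "," : ORS) pitfall mentioned in the comments below, since the separator is decided by what has already been printed rather than by the field position.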


5 Comments

Damn, compound FS -- shorter again!
You can't rely on the usual (i<NF ? "," : ORS) idiom for this because if $NF is a duplicate then you won't print ORS for that line.
Yes, that's a good point Ed. I have noted that OP expects a trailing comma anyway, so I kept it simple.
I don't see any other notes or comments anywhere suggesting a trailing , is required so I'll leave my comment in place for now but if you update your answer to mention that I'll delete it.
Ed: it is as per the expected output shown in the question, which has a trailing comma on every line. I have made a note of it in my answer.

Another approach, using the same split of $2 into an array but keeping a separate counter for the position of the non-duplicated values, could be done as:

awk '
  { 
    printf "%s\t", $1
    delete seen
    n = split($2,arr,",")
    pos = 0
    for (i=1;i<=n;i++) { 
      if (! (arr[i] in seen)) { 
        printf "%s%s", pos ? "," : "", arr[i]
        seen[arr[i]]=1
        pos++ 
      }
    }
    print ""
  }
' file.txt

Example Output

With your input in file.txt, the output is:

OG0000000       PF03169,MAC1_004431-T1,
OG0000002       PF07690,PF00083,
OG0000003       MAC1_000127-T1,
OG0000004       PF13246,PF00689,PF00690,
OG0000005       PF00012,PF01061,PF12697,

2 Comments

if (! (arr[i] in seen)) { foo; seen[arr[i]]=1 } can be done a bit more concisely and idiomatically with if (!seen[arr[i]]++) { foo }
++ so many good solutions

With your shown samples and attempts, please try the following awk code. There is no need to set RS and ORS (the record and output record separators) for this requirement; instead, set FS and OFS to tab, split the second field on commas, and print the fields accordingly.

awk '
BEGIN{ FS=OFS="\t" }
{
  val=""
  delete seen
  num=split($2,arr,",")
  for(i=1;i<=num;i++){
   if(!seen[arr[i]]++){
      val=(val?val ",":"") arr[i]
   }
  }
  print $1,val
}
' Input_file
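
With the sample input this should print the same deduplicated lines as the answers above; the trailing comma survives because split() leaves a final empty element, which gets appended once:

OG0000000   PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,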

2 Comments

An array used in the context of if(!arr[$i]++){ is idiomatically named seen[] rather than arr[].
Hang on - you can't split $2 by FS when $2 is already the result of splitting by FS (ditto for splitting by , when FS is ,)

This might work for you (GNU sed):

sed -E ':a;s/(\s+.*(\b\S+,).*)\2/\1/;ta' file

Iterate through a line removing any duplicate strings after whitespace.
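
A quick one-line demo, assuming GNU sed:

printf 'OG0000002\tPF07690,PF00083,PF00083,PF07690,PF00083,\n' |
sed -E ':a;s/(\s+.*(\b\S+,).*)\2/\1/;ta'
# prints: OG0000002   PF07690,PF00083,

Each pass deletes the last occurrence of some repeated token, and ta loops back until no substitution succeeds.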

2 Comments

++ Nice, a shorter gnu-sed
Very nice, but not sure - have you missed another \b before \2?

Using GNU sed

$ sed -E ':a;s/([^ \t]*[ \t]+)?(([[:alnum:]]+,).*)\3/\1\2/;ta' input_file
OG0000000   PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,

4 Comments

This deserves a nod for pure sed creativity. I've used sed a long time, and appreciate the ta repeat on successful substitution, but I'm still scratching my head a bit on the identification of dups and the use of the first two backreferences to make it so. (I'll get there, it will just take a bit more scratching...) The biggest question is: what if the dups were non-adjacent?
@DavidC.Rankin Make the first backreference optional so the second backreference can loop. Nest a third parenthesis within the second backreference then use a greedy regex to remove the last occurrence of the third match returning everything within the second parenthesis in the loop. Sure, it will also handle non-adjacent dups on the same line as it does in the sample provided.
I had kinda sorted that from the 20,000 foot view, but I have got to tell you, that is certainly an impressive use of sed. Well done. (note: when I say "kinda" I mean I had sorted the flow and recognized the backreference nesting -- but was far from digesting it to the point where I had an "Ahah!" moment :)
++ but I think this can be modified to POSIX sed also
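
For what it's worth, one untested POSIX-flavoured sketch along those lines, borrowing the simpler whitespace anchor from the other sed answer above: BRE syntax, [[:space:]] instead of \s, and a bracket expression standing in for \S (covering the alphanumerics, _, . and - seen in the sample; the missing-\b caveat above applies here too):

sed -e ':a' -e 's/\([[:space:]].*\([[:alnum:]_.-]\{1,\},\).*\)\2/\1/' -e 'ta' file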

Here is a ruby:

ruby -ane 'puts "#{$F[0]}\t#{$F[1].split(",").uniq.join(",")},"' file
OG0000000   PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,

Note: splitting with the lookaround regex /(?<=.),(?=.)/ leaves the final element as "PF00083," with the comma attached, so it never compares equal to an earlier bare "PF00083" and duplicates survive. A plain split(",") drops the trailing empty string, and the trailing comma is added back after the join.

