Removing duplicates using awk in Unix

Question

My file is in the format

>id1
sequence1
>id2
sequence2
>id1
sequence3

The output I want is:

>id1
sequence1
>id2
sequence2

That is, if an ID is repeated, I need to remove both the ID and its sequence as a pair.

I tried the following code, but it doesn't work.

awk '{
if(NR%2 == 1)
{
    fastaheader = $0; x[fasta_header] = x[fasta_header] + 1; 
}
else 
{
    seq = $0; {if(x[fasta_header] <= 1) {print fasta_header;print seq;}}
}
}' filename.txt

If you get two entries for ID = 'id1', will the sequence information always be the same in both entries? Or are you really looking at id1 with sequence1A and id1 with sequence1B, where you only want the sequence1A entry to be shown? Or is it the combination of id1 plus the sequence data that must be duplicated in its entirety (so you'd want both id1 with sequence1A and id1 with sequence1B to appear in the output)? Your question says "Remove ID and sequence if the ID is repeated"; your comments say "Remove ID and sequence if the combination of ID and sequence are repeated".
– Jonathan Leffler, Commented Jan 22, 2014 at 6:30
If you need to compare both the ID and the sequence information, then the answer by Jotne is the way to go. However, you also need to fix your question so it asks for that, not just for detecting repeated ID values as it currently does.
– Jonathan Leffler, Commented Jan 22, 2014 at 6:36

Jonathan Leffler · Accepted Answer · 2014-01-22 06:21:31Z

1

It looks as though the ID lines start with >. Given the order of the output, you want the first sequence associated with a given ID, not the last. This means you need something like:

awk '/^>/ { if (id[$1]++ == 0) printing = 1; else printing = 0 }
          { if (printing) print }'

The first line checks whether the current ID has been seen before, setting printing to 1 on its first occurrence and to 0 otherwise. The second line prints the current line whenever printing is set. Because the flag stays in force until the next ID line, a sequence that spans more than one line is printed in full; the script does not rely on there being just one line of sequence data.
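
To illustrate the multi-line behaviour, here is a small sketch that runs the same awk program against a made-up sample. The file contents and the shell variable names `sample` and `result` are assumptions for illustration, not taken from the question.

```shell
# Hypothetical sample: id1 appears twice, and its first record spans two lines.
sample=$(mktemp)
printf '%s\n' '>id1' 'seq1-part1' 'seq1-part2' '>id2' 'sequence2' '>id1' 'sequence3' > "$sample"

# Same program as in the answer: the flag is set on each header line and
# applies to every line up to the next header.
result=$(awk '/^>/ { if (id[$1]++ == 0) printing = 1; else printing = 0 }
              { if (printing) print }' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Only the first id1 record (both of its sequence lines) and the id2 record should be printed. Because the array is keyed on $1, a header such as ">id1 some description" would be matched on ">id1" alone.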

6 Comments

Jotne Over a year ago

You do not take into account that the sequences need to be tested too; see file2 in my example.

priyanka Over a year ago

What does /^>/ do?

Jonathan Leffler Over a year ago

It looks for lines that start with a > sign.

Jotne Over a year ago

See my post for file2, but it would be interesting to see how priyanka responds to your question.

Jonathan Leffler Over a year ago

@Jotne: I've stated in my second comment to the question that your answer is good for one interpretation of what's wanted (and left unsaid that mine is good for a different interpretation of what's wanted). I confess to being puzzled about what actually is wanted; the comments don't match the question precisely, and seemed to favour your interpretation of what's wanted. Ambiguity — what would we do without it?

William Pursell · Accepted Answer · 2014-01-22 06:22:42Z

1

Assuming your ids and sequences are always exactly one line:

awk 'NR%2 && !a[$0]++ { print; getline l ; print l }' input
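
For comparison, here is a sketch of the same idea without getline, under the same assumption that every id and every sequence occupies exactly one line. The names `keep` and `seen` and the sample data are illustrative and not part of the answer.

```shell
# Hypothetical sample matching the layout in the question.
sample=$(mktemp)
printf '%s\n' '>id1' 'sequence1' '>id2' 'sequence2' '>id1' 'sequence3' > "$sample"

# On odd-numbered lines (the ids), record whether this id is new.
# The bare pattern `keep` then prints the id line and the sequence line after it.
result=$(awk 'NR % 2 { keep = !seen[$0]++ } keep' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Like the getline version, this tests only the id line, so a repeated id is dropped even when its sequence differs.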

3 Comments

Jotne Over a year ago

If I am correct, you only test the ID and ignore the sequence. It needs to be tested too. See the file2 example in my post.

William Pursell Over a year ago

Jotne, you are correct. I am ignoring the sequence number, since the question states "remove sequences and id both in pairs if id is repeat."

Jotne Over a year ago

OP has changed his post to reflect that.

Jotne · Accepted Answer · 2014-01-22 06:29:35Z

1

This should do:

awk '{a[$0]++} END {for (i in a) print RS i}' RS=">" file | awk '!/^>?$/'
>id1
sequence1
>id2
sequence2

Setting RS=">" changes the record separator, so each record contains both the id and its sequence:

awk '{$1=$1}1' RS=">" file
id1 sequence1
id2 sequence2
id1 sequence1

The array then collapses all duplicate records.

The final awk '!/^>?$/' just removes the blank lines and the stray > line.

cat file2
>id1
sequence1
>id2
sequence2
>id1
sequence3

This file should come out intact, since all the sequences are different.

awk '{a[$0]++} END {for (i in a) print RS i}' RS=">" file2 | awk '!/^>?$/'
>id1
sequence1
>id2
sequence2
>id1
sequence3
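
One caveat with the pipeline above: awk does not define the order in which for (i in a) visits the array, so the records may not come out in input order. A possible single-awk sketch that keeps input order is shown below. The name `seen` and the sample data are assumptions for illustration, and the sketch assumes the file ends with a newline so that identical records compare equal.

```shell
# Hypothetical sample: one exact duplicate (id1/sequence1) and one
# repeated id with a different sequence (id1/sequence3).
sample=$(mktemp)
printf '%s\n' '>id1' 'sequence1' '>id2' 'sequence2' '>id1' 'sequence1' '>id1' 'sequence3' > "$sample"

# RS=">" makes each record "id\nsequence\n"; ORS="" avoids adding a second
# newline on output. The empty record before the first ">" is skipped.
result=$(awk 'BEGIN { RS = ">"; ORS = "" }
              length($0) && !seen[$0]++ { print ">" $0 }' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Here the second id1/sequence1 pair is dropped, while id1/sequence3 is kept because the whole record, not just the id, is used as the key.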

edited Jan 22, 2014 at 6:29

answered Jan 22, 2014 at 6:20

1 Comment

priyanka Over a year ago

Thanks, but can you please explain this? I need to generalise it to the case where id, sequence, and num are triplets, and num may differ even when id and sequence are the same.

ray · Accepted Answer · 2014-01-22 08:02:16Z

0

I prefer awk: you don't need a pipe, and it prints lines in the order they appear in the original file.

If you don't care about line order, you can use sort:

xargs -n2 < file  | sort -uk1,1 | xargs -n1
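
To show what each stage of that pipeline does, here is a sketch that runs it step by step on a made-up sample. The variable names `sample`, `paired`, and `result` are illustrative. It assumes ids and sequences contain no whitespace, quotes, or backslashes, since xargs treats those specially.

```shell
# Hypothetical sample matching the layout in the question.
sample=$(mktemp)
printf '%s\n' '>id1' 'sequence1' '>id2' 'sequence2' '>id1' 'sequence3' > "$sample"

# Stage 1: xargs -n2 joins each id/sequence pair onto one line.
paired=$(xargs -n2 < "$sample")
printf '%s\n' "$paired"

# Stage 2: sort -u -k1,1 keeps one line per distinct first field (the id).
# Stage 3: xargs -n1 splits the pairs back into one item per line.
result=$(printf '%s\n' "$paired" | sort -u -k1,1 | xargs -n1)
printf '%s\n' "$result"
rm -f "$sample"
```

The output is sorted by id rather than kept in input order, and POSIX does not say which of the lines sharing a key survives sort -u, so the sequence printed for id1 may depend on the sort implementation.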
