0

My file is in the following format:

>id1
sequence1
>id2
sequence2
>id1
sequence3

The output I want is:

>id1
sequence1
>id2
sequence2

That is, when an ID is repeated, I need to remove both the repeated ID and its sequence as a pair, keeping only the first occurrence.

I tried the following code, but it doesn't work.

awk '{
if(NR%2 == 1)
{
    fastaheader = $0; x[fasta_header] = x[fasta_header] + 1; 
}
else 
{
    seq = $0; {if(x[fasta_header] <= 1) {print fasta_header;print seq;}}
}
}' filename.txt
2
  • If you get two entries for ID = 'id1', will the sequence information always be the same in both entries? Or are you really looking at id1 with sequence1A and id1 with sequence1B, and you only want the sequence1A entry to be shown. Or is it the combination of id1 plus the sequence data that must be duplicated in its entirety (so you'd want both id1 with sequence1A and id1 with sequence1B to appear in the output)? Your question says "Remove ID and sequence if the ID is repeated"; your comments say "Remove ID and sequence if the combination of ID and sequence are repeated". Commented Jan 22, 2014 at 6:30
  • If you need to compare both the ID and the sequence information, then the answer by Jotne is the way to go. However, you also need to fix your question so it asks for that, not just for detecting repeated ID values as it currently does. Commented Jan 22, 2014 at 6:36

4 Answers

1

It looks as though the ID lines start with >. Given the order of the output, you want the first sequence associated with a given ID, not the last. This means you need something like:

awk '/^>/ { if (id[$1]++ == 0) printing = 1; else printing = 0 }
          { if (printing) print }'

The first line checks whether this is the first occurrence of the current ID, setting printing to 1 if it is and 0 otherwise. The second line prints the current line whenever printing is set. Note that if the sequence data spans more than one line, all of those lines are printed; the script does not rely on there being just one line of sequence data.
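As a rough illustration of the multi-line behaviour, here is a small sketch that runs the same idea on made-up data. The file contents (AAA, CCC, GGG, TTT), the temporary file, and the array name seen are assumptions for the example, not part of the question.

```shell
# Hypothetical input: id1 appears twice, and its sequences span two lines.
input=$(mktemp)
printf '>id1\nAAA\nCCC\n>id2\nGGG\n>id1\nTTT\nTTT\n' > "$input"

# On each header line, turn printing on only the first time that ID is seen;
# every line (header or sequence) is then printed or skipped by that flag.
result=$(awk '/^>/ { printing = (seen[$1]++ == 0) }
              printing' "$input")

printf '%s\n' "$result"
rm -f "$input"
```

The second id1 block is dropped in full, including both of its sequence lines, while the first block keeps both of its lines.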


6 Comments

You do not take into account that the sequences need to be tested too; see the file2 example in my answer.
What does /^>/ do?
It looks for lines that start with a > sign.
See my post for file2, but it would be interesting to see how priyanka responds to your question.
@Jotne: I've stated in my second comment to the question that your answer is good for one interpretation of what's wanted (and left unsaid that mine is good for a different interpretation of what's wanted). I confess to being puzzled about what actually is wanted; the comments don't match the question precisely, and seemed to favour your interpretation of what's wanted. Ambiguity — what would we do without it?
1

Assuming your ids and sequences are always exactly one line:

awk 'NR%2 && !a[$0]++ { print; getline l ; print l }' input
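For comparison, the attempt in the question assigns to fastaheader but then reads fasta_header, so the counter is always keyed on an empty string and the header that gets printed is empty. Under the same one-line assumption, a minimal corrected sketch might look like this (the array name count and the temporary file are just for illustration):

```shell
# Sample input from the question, written to a temporary file.
input=$(mktemp)
printf '>id1\nsequence1\n>id2\nsequence2\n>id1\nsequence3\n' > "$input"

result=$(awk '
NR % 2 == 1 {                 # odd lines are headers
    fasta_header = $0         # same variable name used everywhere
    count[fasta_header]++
    next
}
count[fasta_header] == 1 {    # even lines: print the pair only for a first sighting
    print fasta_header
    print $0
}' "$input")

printf '%s\n' "$result"
rm -f "$input"
```

Like the one-liner above, this breaks if a sequence wraps over several lines, because it relies on headers sitting on odd line numbers.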

3 Comments

If I am correct, you only test the ID and ignore the sequence. It needs to be tested too. See the file2 example in my post.
Jotne, you are correct. I am ignoring the sequence number, since the question states "remove sequences and id both in pairs if id is repeat."
OP has changed his post to reflect that.
1

This should do:

awk '{a[$0]++} END {for (i in a) print RS i}' RS=">" file | awk '!/^>?$/'
>id1
sequence1
>id2
sequence2

Setting RS=">" makes each record contain both the id and its sequence.

awk '{$1=$1}1' RS=">" file
id1 sequence1
id2 sequence2
id1 sequence1

Using each record as an array index then collapses all duplicate records into one entry.

The last awk '!/^>?$/' just removes the blank lines and the line containing only a stray >


cat file2
>id1
sequence1
>id2
sequence2
>id1
sequence3

This file should come out intact, since all the sequences are different.

awk '{a[$0]++} END {for (i in a) print RS i}' RS=">" file2 | awk '!/^>?$/'
>id1
sequence1
>id2
sequence2
>id1
sequence3
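One caveat: for (i in a) visits array entries in an unspecified order, so the output is not guaranteed to follow the input order. A possible single-awk variant that prints each id and sequence record the first time it is seen is sketched below. The sample data, the array name seen, and the assumption that the file starts with > and ends with a newline are mine.

```shell
# Hypothetical input: one exact duplicate (id1/sequence1) and one id1 with a new sequence.
input=$(mktemp)
printf '>id1\nsequence1\n>id2\nsequence2\n>id1\nsequence1\n>id1\nsequence3\n' > "$input"

# RS=">" makes each record "id\nsequence\n"; ORS="" avoids adding extra blank lines.
# NR > 1 skips the empty record that precedes the first ">".
result=$(awk 'BEGIN { RS = ">"; ORS = "" }
              NR > 1 && !seen[$0]++ { print ">" $0 }' "$input")

printf '%s\n' "$result"
rm -f "$input"
```

Because the whole record is the array key, id1 with sequence3 survives while the repeated id1 with sequence1 is dropped, and no second awk is needed to clean up the output.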

1 Comment

Thanks, but can you please explain this? I need to generalise it to the case where id, sequence, and num are triplets, and num may differ even when the id and sequence are the same.
0

I prefer awk: you don't need a pipe, and it prints the lines in the order they appear in the original file.

If you don't mind the order of the lines, you can use sort:

xargs -n2 < file | sort -uk1,1 | xargs -n1
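To make the stages of that pipeline a bit more concrete, here is a small sketch on the sample data from the question (the temporary file and variable names are only for illustration). It assumes exactly one sequence line per id and no spaces or quotes in the data, since xargs splits on whitespace and treats quotes specially.

```shell
input=$(mktemp)
printf '>id1\nsequence1\n>id2\nsequence2\n>id1\nsequence3\n' > "$input"

# Stage 1: join every two lines into one "id sequence" line.
joined=$(xargs -n2 < "$input")

# Stage 2: keep one line per distinct first field (the id); output is sorted by id.
unique=$(printf '%s\n' "$joined" | sort -u -k1,1)

# Stage 3: split each "id sequence" line back into two lines.
result=$(printf '%s\n' "$unique" | xargs -n1)

printf '%s\n' "$result"
rm -f "$input"
```

Which of the duplicate lines sort -u keeps may depend on the sort implementation, so if it matters that the first sequence wins, the awk approach is the safer choice.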
