Count multiple unique strings in a line

Question

This is my first post. I would like to write a small script to count multiple unique repeats in a line. The text is a DNA sequence, so it consists of combinations of four letters: A, T, G and C. If a string appears twice, it should be counted twice, and so on.

The unique strings I want to look for are exactly three repeats of AG, GA, CT or TC, that is (AG)3, (GA)3, (CT)3 and (TC)3, respectively. The program should not count repeats of four or more.

Strings to count:

AGAGAG
GAGAGA
CTCTCT
TCTCTC

Example input file (two columns separated by a tab):

Sequence_1    AGAGAG                   
Sequence_2    AGAGAGT                  
Sequence_3    AGAGAGAG                 
Sequence_4    AGAGAT                   
Sequence_5    AGAGAGAGAGAGAGAGAGT      
Sequence_6    AGAGAGTAGAGAG 
Sequence_7    CTCTCTCTCTCAAGAGAG
Sequence_8    TAGAGAGAT                
Sequence_9    TAAGAGAGAAG

Desired output:

Sequence_1    AGAGAG                   1
Sequence_2    AGAGAGT                  1
Sequence_3    AGAGAGAG                 0
Sequence_4    AGAGAT                   0
Sequence_5    AGAGAGAGAGAGAGAGAGT      0
Sequence_6    AGAGAGTAGAGAG            2
Sequence_7    CTCTCTCTCTCAAGAGAG       1 
Sequence_8    TAGAGAGAT                1
Sequence_9    TAAGAGAGAAG              1

I have a small one-liner written with awk, but I think it is not specific when matching the strings:

awk '{if($1 ~ /AGAGAG/)x++; if($1 ~ /TCTCTC/)x++;if($1 ~ /GAGAGA/)x++;if($1 ~ /CTCTCT/)x++;print x;x=0}' inputfile.tab

Thanks so much for your help. All the best, Bernardo

janos · Accepted Answer · 2013-09-28 22:11:43Z

I think there are some inconsistencies in your description and in the sample input and outputs. So this script might not be perfect, but I hope it comes close enough that you can figure out the rest:

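#!/usr/bin/perl -n
# Count exact 3-fold repeats of AG, GA, CT and TC in column 2 of each input line.

my ($seq, $dna) = split(/\s+/);    # column 1: sequence name, column 2: DNA string
my @strings = qw/AG GA CT TC/;     # repeat units to look for
my $count = 0;
foreach my $s (@strings) {
    my ($b, $e) = split(//, $s);   # first and second character of the unit
    # Three copies of $s, not preceded by its 2nd character and not
    # followed by its 1st character, so longer runs are rejected.
    my @matches = $dna =~ m/(?<!$e)($s){3}(?!$b)/g;
    $count += scalar(@matches);
}
print join("\t", $seq, sprintf("%-20s", $dna), $count), "\n";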

You can use it with:

./script.pl < sample.txt

For input:

Sequence_1    AGAGAG
Sequence_2    AGAGAGT
Sequence_3    AGAGAGAG
Sequence_4    AGAGAT
Sequence_5    AGAGAGAGAGAGAGAGAGT
Sequence_6    AGAGAGTAGAGAG
Sequence_7    CTCTCTCTCTCAAGAGAG

It gives:

Sequence_1    AGAGAG                1
Sequence_2    AGAGAGT               1
Sequence_3    AGAGAGAG              0
Sequence_4    AGAGAT                0
Sequence_5    AGAGAGAGAGAGAGAGAGT   0
Sequence_6    AGAGAGTAGAGAG         2
Sequence_7    CTCTCTCTCTCAAGAGAG    1

How it works:

Thanks to the -n flag in the shebang, the script is executed once for each line read from stdin.
@strings is the list of repeat units we are interested in.
For each item in @strings, we count the matches:
- $s takes on the values AG, GA, CT and TC in turn; $b and $e are its 1st and 2nd characters.
- The expression (?<!$e)($s){3}(?!$b) matches 3 consecutive copies of $s that are not preceded by the 2nd character of $s and not followed by the 1st character of $s.
- A simpler expression, (?<!$s)($s){3}(?!$s), only requires that the match is not preceded or followed by a full copy of $s. It would still count the GAGAGA inside AGAGAGAG (Sequence_3), which is why the single-character checks are used instead.
- In list context, $x =~ m//g returns the list of all matches (here, one captured group per match).
- scalar(@matches) is the size of that list, which we add to the count.
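To make the behaviour of the lookarounds easier to check, here is a minimal sketch that wraps the same expression in a sub and runs it on a few of the sample sequences. The sub name count_exact_repeats is made up for this illustration, and a non-capturing group is used in place of the capturing one, which does not change the number of matches.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the lookaround expression from the script
# above to one DNA string and returns the total number of matches.
sub count_exact_repeats {
    my ($dna) = @_;
    my $count = 0;
    for my $unit (qw/AG GA CT TC/) {
        my ($first, $last) = split //, $unit;
        # Exactly three copies of the unit, not preceded by its last
        # character and not followed by its first character.
        my @hits = $dna =~ /(?<!$last)(?:$unit){3}(?!$first)/g;
        $count += scalar(@hits);
    }
    return $count;
}

# Print the count for a few of the sample sequences.
for my $dna (qw/AGAGAG AGAGAGAG AGAGAGTAGAGAG TAGAGAGAT/) {
    printf "%-20s %d\n", $dna, count_exact_repeats($dna);
}
```

AGAGAGAG gives 0 because every candidate window touches another A or G on one side. TAGAGAGAT also gives 0 for the same reason, which is the case raised in the comments.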

edited Sep 28, 2013 at 22:11

answered Sep 28, 2013 at 11:10

janos


5 Comments

biotech Over a year ago

Hi janos. Definitely, Sequence_3 must have zero counts. Although it contains the string (GA)3, that string is at the same time inside an (AG)4, so it should not be counted as positive. Sorry if I didn't emphasize this enough!

janos Over a year ago

I modified the script to produce the desired output for your input. But I'm not sure it handles all corner cases. Test it well, and if you find a case that is not handled correctly, then update the sample in your question.

biotech Over a year ago

Hi janos, I added Sequences 8 and 9. They are counted as negative because one end of the match is followed by the 1st character of $s or preceded by the 2nd character of $s. We should modify the script to tolerate one of these cases but not both at the same time. Is that correct?

janos Over a year ago

That's a toughie... I'm really busy now, but I'll try to figure this out and get back to you in a few days. Maybe by then a real regex ninja will step in and finish it up for you, I hope...

biotech Over a year ago

Hi janos, I will try to help you this weekend
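
One possible reading of the rule discussed in these comments (an assumption, not something confirmed in the thread) is to look at maximal alternating runs of A/G or C/T and count a run when it is 6 or 7 characters long: 6 is a clean (XY)3, and 7 is a (XY)3 with a single extra character on one side only. The sketch below follows that reading; the sub name count_runs_of_three is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, assuming the "length 6 or 7" interpretation:
# collect every maximal alternating run over A/G or C/T, then count
# the runs that are exactly 6 or 7 characters long.
sub count_runs_of_three {
    my ($dna) = @_;
    # Each alternative matches greedily, so a match extends until the
    # alternation breaks and is therefore a maximal run.
    my @runs = $dna =~ /((?:AG)+A?|(?:GA)+G?|(?:CT)+C?|(?:TC)+T?)/g;
    my $count = grep { length($_) == 6 || length($_) == 7 } @runs;
    return $count;
}

# Print the count for the two sequences added to the question later.
for my $dna (qw/TAGAGAGAT TAAGAGAGAAG/) {
    printf "%-20s %d\n", $dna, count_runs_of_three($dna);
}
```

Under this assumption all nine sample sequences yield the counts listed in the desired output, but inputs beyond those samples have not been checked against what the asker intends.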
