This is my first post. I would like to write a small script to count multiple unique repeats in a line. The text is a DNA sequence, so it consists of combinations of four letters: A, T, G and C. If a string appears twice, it should be counted twice, and so on.

The unique strings I want to look for are exactly three repeats of AG, GA, CT or TC, that is (AG)3, (GA)3, (CT)3 and (TC)3, respectively. I don't want the program to count repeats of four or more.

Strings to count:

AGAGAG
GAGAGA
CTCTCT
TCTCTC

Example input file (two columns separated by a tab):

Sequence_1    AGAGAG                   
Sequence_2    AGAGAGT                  
Sequence_3    AGAGAGAG                 
Sequence_4    AGAGAT                   
Sequence_5    AGAGAGAGAGAGAGAGAGT      
Sequence_6    AGAGAGTAGAGAG 
Sequence_7    CTCTCTCTCTC  
Sequence_8    TAGAGAGAT                
Sequence_9    TAAGAGAGAAG              

Desired output:

Sequence_1    AGAGAG                   1
Sequence_2    AGAGAGT                  1
Sequence_3    AGAGAGAG                 0
Sequence_4    AGAGAT                   0
Sequence_5    AGAGAGAGAGAGAGAGAG       0
Sequence_6    AGAGAGTAGAGAG            2
Sequence_7    CTCTCTCTCTCAAGAGAG       1 
Sequence_8    TAGAGAGAT                1
Sequence_9    TAAGAGAGAAG              1

I have a small one-liner written with awk, but I think it is not specific enough when matching the strings:

awk '{if($1 ~ /AGAGAG/)x++; if($1 ~ /TCTCTC/)x++;if($1 ~ /GAGAGA/)x++;if($1 ~ /CTCTCT/)x++;print x;x=0}' inputfile.tab

Thanks so much for your help. All the best, Bernardo

1 Answer

I think there are some inconsistencies in your description and in the sample input and outputs. So this script might not be perfect, but I hope it comes close enough that you can figure out the rest:

#!/usr/bin/perl -n

my ($seq, $dna) = split(/\s+/);
my @strings = qw/AG GA CT TC/;
my $count = 0;
foreach my $s (@strings) {
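    # $b is the first character of the unit, $e the second
    my ($b, $e) = split(//, $s);
    # list-context /g match: one element per non-overlapping match
    my @matches = $dna =~ m/(?<!$e)($s){3}(?!$b)/g;
    $count += scalar(@matches);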
}
print join("\t", $seq, sprintf("%-20s", $dna), $count), "\n";

You can use it with:

./script.pl < sample.txt

For input:

Sequence_1    AGAGAG
Sequence_2    AGAGAGT
Sequence_3    AGAGAGAG
Sequence_4    AGAGAT
Sequence_5    AGAGAGAGAGAGAGAGAGT
Sequence_6    AGAGAGTAGAGAG
Sequence_7    CTCTCTCTCTCAAGAGAG

It gives:

Sequence_1    AGAGAG                1
Sequence_2    AGAGAGT               1
Sequence_3    AGAGAGAG              0
Sequence_4    AGAGAT                0
Sequence_5    AGAGAGAGAGAGAGAGAGT   0
Sequence_6    AGAGAGTAGAGAG         2
Sequence_7    CTCTCTCTCTCAAGAGAG    1

How it works:

  • Thanks to the -n flag in the shebang, the script is executed for each line coming from stdin
  • @strings is the list of strings we are interested in
  • For each item in @strings, we count the matches
    • $s takes on the value of AG, GA, CT, TC
    • The simpler expression (?<!$s)($s){3}(?!$s) would match 3 consecutive $s that are not followed and not preceded by another full $s; it is not strict enough, because it still counts the GAGAGA inside AGAGAGAG
    • The expression (?<!$e)($s){3}(?!$b), used in the script, matches 3 consecutive $s that are not followed by the 1st character of $s ($b) and not preceded by the 2nd character of $s ($e)
    • In list context, the operation $x =~ m///g returns a list with one element per match
    • scalar(@matches) is the number of matches, which we add to the count
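
For readers who want to try the same lookaround idea outside Perl, here is a minimal sketch in Python (not the language of the script above). The names UNITS and count_triple_repeats are made up for this illustration; the pattern is meant to mirror (?<!$e)($s){3}(?!$b), nothing more.

```python
import re

# The four two-letter units from the question.
UNITS = ["AG", "GA", "CT", "TC"]


def count_triple_repeats(dna: str) -> int:
    """Count occurrences of (unit) x 3 that are not preceded by the unit's
    2nd character and not followed by its 1st character."""
    total = 0
    for unit in UNITS:
        first, second = unit[0], unit[1]
        # Python equivalent of the Perl pattern (?<!$e)($s){3}(?!$b);
        # the group is non-capturing so findall returns whole matches.
        pattern = re.compile(rf"(?<!{second})(?:{unit}){{3}}(?!{first})")
        total += len(pattern.findall(dna))
    return total


if __name__ == "__main__":
    for seq in ["AGAGAG", "AGAGAGAG", "AGAGAGTAGAGAG", "CTCTCTCTCTCAAGAGAG"]:
        print(seq, count_triple_repeats(seq), sep="\t")
```

On the seven sample sequences above it gives the same counts as the Perl script (1, 1, 0, 0, 0, 2, 1). It returns 0 for TAGAGAGAT and TAAGAGAGAAG, because in both the AGAGAG is followed by an A and the GAGAGA is preceded by an A.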

5 Comments

Hi janos. Sequence_3 must definitely have a count of zero. Although it contains the string (GA)3, that string sits inside an (AG)4, so it should not be counted. Sorry if I didn't emphasize that enough!
I modified the script to produce the desired output for your input. But I'm not sure it handles all corner cases. Test it well, and if you find a case that is not handled correctly, then update the sample in your question.
Hi janos, I added Sequences 8 and 9. They are counted as negative because one end is followed by the 1st character of $s or preceded by the 2nd character of $s. We should modify the script to tolerate one of these cases but not both at the same time. Is that correct?
That's a toughie... I'm really busy now, but I'll try to figure this out and get back to you in a few days. Maybe by then a real regex ninja will step in and finish it up for you I hope...
Hi janos, I will try to help you this weekend
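
One possible reading of the Sequence 8 and 9 corner cases, sketched in Python rather than Perl: treat each maximal stretch of strictly alternating A/G or C/T as one run, and count a run only if it is 6 or 7 letters long, that is, three full units plus at most one extra letter. This rule is an assumption inferred from the sample table, not something the thread confirms, and the name count_short_runs is invented for the sketch.

```python
import re

# Each alternative greedily consumes one maximal alternating run,
# e.g. "AGAGAGA" or "TCTCTC". Runs of a single letter are ignored.
RUN = re.compile(r"(?:AG)+A?|(?:GA)+G?|(?:CT)+C?|(?:TC)+T?")


def count_short_runs(dna: str) -> int:
    """Count alternating runs of length 6 or 7 (assumed rule, see lead-in)."""
    return sum(1 for m in RUN.finditer(dna) if 6 <= len(m.group()) <= 7)


if __name__ == "__main__":
    for seq in ["AGAGAGAG", "TAGAGAGAT", "TAAGAGAGAAG"]:
        print(seq, count_short_runs(seq), sep="\t")
```

Under this assumed rule the counts for the nine sequences in the desired output are 1, 1, 0, 0, 0, 2, 1, 1, 1, but other inputs should be checked by hand before relying on it.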
