This is my first post. I would like to write a small script that counts occurrences of several specific repeats in each line. The text is a DNA sequence, so it consists of combinations of four letters: A, T, G and C. If a string appears twice in a line, it should be counted twice, and so on.
The strings I want to look for are exactly three consecutive repeats of AG, GA, CT or TC, that is (AG)3, (GA)3, (CT)3 and (TC)3, respectively. I don't want the program to count runs of four or more repeats.
Strings to count:
AGAGAG
GAGAGA
CTCTCT
TCTCTC
Example input file (two columns separated by a tab):
Sequence_1 AGAGAG
Sequence_2 AGAGAGT
Sequence_3 AGAGAGAG
Sequence_4 AGAGAT
Sequence_5 AGAGAGAGAGAGAGAGAGT
Sequence_6 AGAGAGTAGAGAG
Sequence_7 CTCTCTCTCTCAAGAGAG
Sequence_8 TAGAGAGAT
Sequence_9 TAAGAGAGAAG
Desired output:
Sequence_1 AGAGAG 1
Sequence_2 AGAGAGT 1
Sequence_3 AGAGAGAG 0
Sequence_4 AGAGAT 0
Sequence_5 AGAGAGAGAGAGAGAGAGT 0
Sequence_6 AGAGAGTAGAGAG 2
Sequence_7 CTCTCTCTCTCAAGAGAG 1
Sequence_8 TAGAGAGAT 1
Sequence_9 TAAGAGAGAAG 1
I have a small awk one-liner, but I don't think it is specific enough when matching the strings:
awk '{if($1 ~ /AGAGAG/)x++; if($1 ~ /TCTCTC/)x++;if($1 ~ /GAGAGA/)x++;if($1 ~ /CTCTCT/)x++;print x;x=0}' inputfile.tab
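As far as I can tell, the one-liner above falls short in three ways. It tests $1, although in the sample file the sequence is in the second column. A plain /AGAGAG/ also matches inside longer runs such as AGAGAGAG. And each pattern adds at most 1 per line, so a repeat that occurs twice is not counted twice. A minimal sketch of a stricter count follows, assuming uppercase sequences in the second tab-separated column and a POSIX awk. It treats each maximal alternating A/G or C/T run as one unit and counts it only when it is 6 or 7 letters long, which is my reading of the desired output (TAGAGAGAT counts once even though it contains both AGAGAG and GAGAGA). The function name count_exact_repeats is made up for illustration.

```shell
# count_exact_repeats is a hypothetical helper name, not something from the post.
# Assumption: input is "name<TAB>sequence", sequence in uppercase A/C/G/T.
count_exact_repeats() {
  awk -F '\t' -v OFS='\t' '
  {
    seq = $2   # the sequence is assumed to be the second column
    n = 0
    # Each match is one maximal alternating A/G or C/T run
    # (awk regex matching is leftmost-longest).
    while (match(seq, /G?(AG)+A?|T?(CT)+C?/)) {
      # 6 or 7 letters: exactly three full dinucleotide repeats.
      # 8 or more letters: four or more repeats, so not counted.
      if (RLENGTH == 6 || RLENGTH == 7) n++
      # Continue scanning after the run just examined.
      seq = substr(seq, RSTART + RLENGTH)
    }
    print $1, $2, n
  }' "$@"
}

# Example: feed one tab-separated record on stdin.
printf 'Sequence_6\tAGAGAGTAGAGAG\n' | count_exact_repeats
```

To process the file from the post, this would be called as count_exact_repeats inputfile.tab. Run on the sequences as they appear in the desired output, the sketch gives the counts listed there.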
Thanks so much for your help. All the best, Bernardo