I am very new to Perl and am not sure how to do this task. I have two files:
- Seq.txt, which contains many sequences (the database)
- PID.txt, which contains only the IDs (the query) of the sequences I need to extract from Seq.txt.
Here is a small part of both files:
Seq.txt contains:
>SCO0700, probable ABC transporter protein, ATP-binding component.
MASSMEKPLDHRYRGEHPIRTLVYLFRADRRRLAGAVAVFTVKHSPIWLLPLVTAAIVDT
VVQHGPITDLWTSTGLIMFILVVNYPLHLLYVRLLYGSVRRMGTALRSALCTRMQQLSIG
>SCO0755, putative ABC transporter 797720:799942 forward MW:79858
VSTAQETRGRRRAAPPRRSVPKSRARTVRTPTVLQMEAVECGAASLAMVLGHYGRHVPLE
ELRIACGVSRDGSRASNLLKAARSYGFTAKGMQMDLAALAEVTAPAILFWEFNHYVVYDG
>SCO2305, putative ABC transporter ATP-binding subunit 2474063:2474989 forward MW:32345
MRPTEGTTPAVAFTGAAKAYGDVRAVDGVDLRIGCGETVALLGRNGAGKSTTIALLLGLC
PPDAGTVELFGGPAERAVRAGRVGAMLQEARAVPRVTVGELVAFVAGRYPAPMPVGQALE
>SCO1144, putative ABC transporter ATP-binding protein 1202480:1204282 reverse MW:64637
MHPDRESAWTAPADAVEQPRQVRRILKLFRPYRGRLAVVGLLVGAASLVSVATPFLLKEI
LDVAIPEGRTGLLSLLALGMIFGAVLTSVFGVLQTLISTTVGQRVMHDLRTAVYGRLQQM
>SCO1148, putative ABC transporter 1207772:1209553 forward MW:63721
MIGVAPPSYDPAAPTTANTLPVGARPTVRAYVGELLRRHRRAFLFLVTVNTVAVIASMAG
PYLLGGLVERVSDDARELRLGLTATLFVLALVVQAVFVREVRLRGAVLGERMLADLREDF
PID.txt contains:
SCO0755
SCO1144
Code I have written:
open (PID, 'PID.txt');
my @PID = <PID>;
close(PID);
open (MSD, 'Seq.txt');
my @MSD = <MSD>;
close(MSD);
chomp(@MSD);
my $MSD=join (' ', @MSD);
print "$MSD \n";
for ($i = 0; $i<=2; $i++) {
    my $a=$PID[$i];
    if ($MSD =~ m/$a(.*?)>/) # ">" end of the string
    {
        print "$1 \n";
        $output= ">".$a.$1;
        print $output;
        open (MYFILE, '>>data.txt');
        print MYFILE "$output\n";
        close (MYFILE);
    }
}
Why does the regex not match when I use $a? If I write [$a] instead, the binding operator does match, but it does not return the sequence whose ID is stored in $a; it returns the very first sequence instead.
The result I expect is:
>SCO0755, putative ABC transporter 797720:799942 forward MW:79858
VSTAQETRGRRRAAPPRRSVPKSRARTVRTPTVLQMEAVECGAASLAMVLGHYGRHVPLE
ELRIACGVSRDGSRASNLLKAARSYGFTAKGMQMDLAALAEVTAPAILFWEFNHYVVYDG
>SCO1144, putative ABC transporter ATP-binding protein 1202480:1204282 reverse MW:64637
MHPDRESAWTAPADAVEQPRQVRRILKLFRPYRGRLAVVGLLVGAASLVSVATPFLLKEI
LDVAIPEGRTGLLSLLALGMIFGAVLTSVFGVLQTLISTTVGQRVMHDLRTAVYGRLQQM
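A possible explanation, offered as a sketch rather than a definitive answer: only @MSD is chomped, so each element of @PID most likely still ends with a newline. The pattern then looks for "SCO0755" followed by a newline, which cannot occur in the joined string. Writing [$a] turns the ID into a character class that matches any single one of its characters, which would explain why the first record comes back. The code below is one way to avoid the problem by reading line by line instead of joining everything into one string. The helper name extract_records and the inline sample data (headers and sequences shortened from the files above) are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines of every record whose header ID
# appears in @$ids. Both arguments are array references.
sub extract_records {
    my ($ids, $lines) = @_;

    # Build a lookup table of wanted IDs, stripping newlines and spaces.
    my %wanted;
    for my $id (@$ids) {
        (my $clean = $id) =~ s/\s+//g;
        $wanted{$clean} = 1 if length $clean;
    }

    my ($keep, @out) = (0);
    for my $line (@$lines) {
        (my $text = $line) =~ s/\s+\z//;     # strip the trailing newline
        if ($text =~ /^>\s*([^\s,]+)/) {      # header line: capture the ID
            $keep = exists $wanted{$1};       # start or stop copying here
        }
        push @out, $text if $keep;
    }
    return @out;
}

# Sample data standing in for PID.txt and Seq.txt (sequences truncated).
my @ids   = ("SCO0755\n", "SCO1144\n");
my @lines = (
    ">SCO0700, probable ABC transporter protein\n",
    "MASSMEKPLDHRYRGEHPIR\n",
    ">SCO0755, putative ABC transporter\n",
    "VSTAQETRGRRRAAPPRRSV\n",
    ">SCO1144, putative ABC transporter ATP-binding protein\n",
    "MHPDRESAWTAPADAVEQPR\n",
);

print "$_\n" for extract_records(\@ids, \@lines);
```

With real files, the two arrays could be filled with something like open(my $fh, '<', 'Seq.txt') or die "Cannot open Seq.txt: $!"; followed by my @lines = <$fh>;. Looking IDs up in a hash also means the sequence file is scanned once, no matter how many IDs PID.txt contains.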