1

I have many files in a folder. And I want to open and read them in the sequence order depending on a reference file. my file name:

AAAAA_AAAAA.CCCCC3.1.bbb.DDDDD.1.fa   
AAAAA_AAAAA.CCCCC3.1.bbb.DDDDD.2.fa   
AAAAA_AAAAA.CCCCC3.1.bbb.DDDDD.3.fa   
AAAAA_AAAAA.CCCCC3.1.bbb.DDDDD.4.fa   
.  
.  
.   

The reference file structure:

 chr1   744     745  
 chr1   1208    1209  
 chr2   1250    1251   
 chr2   1454    1455  
 chr3   1676    1677  
 chr3   1683    1684  

The input file structure:

AAAAA_AAAAA.CCCCC3.1.bbb.DDDDD.1.fa
>1 dna:
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTATGTGAGAAGATAGCTGAA
CGCCTTGTCCACATCATCTTACTGCTGAGAGTTGAGCTCACCCTCAGTCCCTCACAGTTC   

AAAAA_AAAAA.CCCCC3.1.bbb.DDDDD.2.fa
>2 dna:
GAGAGCTGGCTTCTAGGCATGCTTCCTTTTGAGAGCTGAGGACAGGACAGAACCCTCCCG
CATCCTGCCTGACTGTAGACGTACCTGCTAACCTCCTCATGTTAGTGGCTGGGATAGATT
GTGGGAAAAGCATGTGTAAGCATTGGGCCTGAACTCCCGTGTATCTGAGTTGAATACAGC
GATTTCCAACATCCTTCTTCAATAGGAGTGTAGCTAGGTTCCAACTCCCATGTCCGAGTG
GGTAGCAGACATCTGCCTTCCATGCATACACACTTCTGAGAGTTGAGCTTATGGCCTGTA
ACCCTACCTCCTGCCTGCAGCTACCTTTTGCTTCCAAAAGTCCTAGGCTCGCTGCTTCAC
CAAAGTGTTGGGAGAGGTAACTGTTGTCTCCCGGCACACAAGACTAGTGCCTCCAAGCTC
AATCCAGCGATTTCCCAGTAATTCCTGGGTTAGACTGGTGCTACATACTAAGTTCCATAC
GTGAGTAGGTAGTTGAAAGCCTTGTCCAAAAACATCTTACTTCTGAGAGTTGAGCTCACC
CTCAGTCCCTCACAGTTCCACACTGCCTGCAGAGTGAGTTTCCCACGTCTTCATCAGAGA
CTTTTGCCAGAGGCTTCTGAGACGCAAGTTAACAATGCAAACAGGAGGGTATACCCAGGT
GCAGTAGATTGGTTATCTGGGAACCTCCTTACTCAGAATACTGTTACCTTCACACTGTCA
TAAGAATGCAGCTAGTTGAGAGCTGGCTTCTAGGCATGCTTCCCTGTGAGAGCTGAGGAC

my outputs:
chr1 A
chr1 G
chr2 C
chr2 C
chr3 T
chr3 T

I can use bioperl to find the position and print out the values one by one (file by file).

Then I try to do the open and read files from a folder.

my $dir = '/home/Documents/Folder/';
opendir(DIR, $dir) or die $!;
my @files = grep (/.fa$/, readdir(DIR));

for my $list(@files){  ##try to get the last number from file name##
my @lines = split /\./, $list}  

open and read my reference file

open my $POS, '<', 'CanFam3_SNP_POS.txt' or die $!;

I put all the files into an array and sort them.

  my @sorted = @files;
  foreach my $i (0..$#sorted)

Then I try to use a loop control to open and read a file depending on reference file column 1 values. For example chr1, the AAAAA_AAAAA.CCCCC3.1.bbb.DDDDD.1.fa should be read and process. If reading the chr2 from reference file, break the loop, and then open and read the AAAAA_AAAAA.CCCCC3.1.bbb.DDDDD.2.fa, process the file with chr2.

open my $fh, '<', "/home/Documents/Folder/$sorted[$i]" or die $!;
while (my $line = <$POS>){

  chomp($line);

  if ($line =~ /chr$lines[5]/g){

  my @positions = split (/\t/, $line);

  print "$positions[0]","\t","$positions[1]","\t", substr($so->seq(), 
  $positions[1], $positions[2] - $positions[1]),"\n";

  last if ($line !~ /chr$lines[5]/g)

    }
 }

I think I have some problems with this codes. Can I use perl to do this process? Do I misunderstand some points?

7
  • How do the "chr1", "chr2", etc. map to your fasta filenames? Or do they? Commented Oct 15, 2018 at 1:15
  • No, chr1 and chr2...do not match the file name. I only can identiy thme by the end of number from the file name Commented Oct 15, 2018 at 1:21
  • If the filenames don't correspond to the chr lines... how in the world do you know what file to look in? Commented Oct 15, 2018 at 1:23
  • And for the files name, some files name as AAAAA_AAAAA.CCCCC3.1.bbb.DDDDD.X.fa; AAAAA_AAAAA.CCCCC3.1.bbb.DDDDD.MT.fa Commented Oct 15, 2018 at 1:23
  • When I read the number at the end of filename. I can understand this file is for which chr Commented Oct 15, 2018 at 1:26

1 Answer 1

2

The key here is to only look in the particular FastA file that you need for a given line in your reference file. From the excerpts of code you posted, it looks like you're trying to read every single file for every single line (And not managing to do that).

So, consider the following:

#!/usr/bin/perl
use warnings;
use strict;
use autodie;
use feature qw/say/;
use File::Basename;

# Map the fasta files in a given directory to chr numbers.
my $fa_dir = '/home/Documents/Folder/';
my %fa_files =
  map { (split /\./, fileparse($_, '.fa'))[5] => $_ } glob("$fa_dir/*.fa");

open my $chrs, '<', 'CanFam3_SNP_POS.txt';
# Read each line of the reference file
while (<$chrs>) {
  chomp;
  # Split up the chr and offsets
  my @fields = split /\s+/, $_;  #/

  # Extract the chr number
  my $chr = $fields[0];
  $chr =~ s/^chr//;      #/
  warn "Unknown chr $chr!\n" and next unless exists $fa_files{$chr};

  # And read from the appropriate fasta file
  # You should probably use a library to read the file instead of
  # this mess, but I don't know which ones are good. Based on your code
  # you might be trying to use one already?
  open my $fa, '<', $fa_files{$chr};
  my $hdr = <$fa>;
  my $data = join "", <$fa>;
  $data =~ s/[^ACGT]+//sg;
  close $fa;

  # And display the requested part
  warn "Invalid offset for chr $chr\n" and next unless length($data) > $fields[1];
  my $range = substr $data, $fields[1], $fields[2] - $fields[1];
  say "chr$chr $range";
}

It stores every .fa file in a given directory into a hashtable, keyed by that last element of the filename that corresponds to what's after chr in the reference file. That makes it easy to look up the exact file you need to read from to print out the requested fragment.

Also note the use of glob() to read the filenames, instead of opendir()/readdir(). Simpler to filter based on the extension that way, and using File::Basename to get just the filename minus the path and extension in an OS-independent way.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.