
My current script reads the data in a PDB file, stores it in arrays, and uses those arrays for the rest of the script. It runs very well on a small PDB file, but on a real PDB file it exhausts all of my computer's memory on just one file. I have 2000 PDB files that need these calculations done on them.

This is my full current script with a few notes.

Full script:

#!/usr/bin/perl

use warnings;
use strict;

#my $inputfile = $ARGV[0];
#my $inputfile = '8ns_emb_alt_test.pdb';
my $inputfile = '8ns_emb_alt_101.pdb';

open( INPUTFILE, "<", $inputfile ) or die $!;

my @array = <INPUTFILE>;

### Protein
my $protein = 'PROT';
my @protx;
my @proty;
my @protz;

for ( my $line = 0; $line <= $#array; ++$line ) {
    if ( ( $array[$line] =~ m/\s+$protein\s+/ ) ) {
        chomp $array[$line];
        my @splitline = ( split /\s+/, $array[$line] );
        push @protx, $splitline[5];    #This has 2083 x-coordinates
        push @proty, $splitline[6];    #this has 2083 y-coordinates
        push @protz, $splitline[7];    #this has 2083 z-coordinates
    }
}

### Lipid
my $lipid1 = 'POPS';
my $lipid2 = 'POPC';
my @lipidx;
my @lipidy;
my @lipidz;

for ( my $line = 0; $line <= $#array; ++$line ) {
    if ( ( $array[$line] =~ m/\s+$lipid1\s+/ ) || ( $array[$line] =~ m/\s+$lipid2\s+/ ) ) {
        chomp $array[$line];
        my @splitline = ( split /\s+/, $array[$line] );
        push @lipidx, $splitline[5];    #this has approximately 35,000 x coordinates
        push @lipidy, $splitline[6];    #same as above for y
        push @lipidz, $splitline[7];    #same as above for z
    }
}

### Calculation
my @deltaX = map {
    my $diff = $_;
    map { $diff - $_ } @lipidx
} @protx;    #so this has 2083*35000 x-coordinate differences
my @squared_deltaX = map { $_ * $_ } @deltaX;    #this has all the x-coordinates squared from @deltaX

my @deltaY = map {
    my $diff = $_;
    map { $diff - $_ } @lipidy
} @proty;
my @squared_deltaY = map { $_ * $_ } @deltaY;

my @deltaZ = map {
    my $diff = $_;
    map { $diff - $_ } @lipidz
} @protz;
my @squared_deltaZ = map { $_ * $_ } @deltaZ;

my @distance;
for ( my $ticker = 0; $ticker <= $#array; ++$ticker ) {
    my $distance_calc = sqrt( ( $squared_deltaX[$ticker] + $squared_deltaY[$ticker] + $squared_deltaZ[$ticker] ) );
    push @distance, $distance_calc;
}    #this runs the final calculation and computes all the distances between the atoms

### The Hunt
my $limit = 5;
my @DistU50;
my @resid_tagger;

for ( my $tracker = 0; $tracker <= $#array; ++$tracker ) {
    my $dist = $distance[$tracker];
    if ( ( $dist < $limit ) && ( $array[$tracker] =~ m/\s+$protein\s+/ ) ) {
        my @splitline = ( split /\s+/, $array[$tracker] );
        my $LT50 = $dist;
        push @resid_tagger, $splitline[4];    #stores the selected residue index number
        push @DistU50,      $LT50;            #stores the values within the $limit
    }
} #this 'for' loop searches '@distance' for values under the limit, pushes them into the final arrays, and collects the matching residue index numbers into another array.

### Le'Finali
print "@resid_tagger = resid \n";
print "5 > @DistU50 \n";

close INPUTFILE;

One of my lab friends said that I could store some of the data in files so that it takes up less memory. I think that is a fine idea, but I am not sure where the most efficient place to do that would be, or how many times I would have to do it. I did this with arrays because it is the best way I knew how.

If anyone could show me an example of taking an array, turning it into a file, and then using the data in that file again, that would be really helpful. Otherwise, any ideas I can look up, things to try, or just suggestions would at least start me somewhere.
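For example, something like this is the kind of round trip I mean: writing an array out to a plain-text file and then streaming the values back one line at a time instead of holding everything in memory (the filename is made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @protx = ( -19.992, -19.781, -19.713 );

# write each coordinate on its own line
open my $out, '>', 'protx.dat' or die $!;
print {$out} "$_\n" for @protx;
close $out;

# later: stream the values back one at a time instead of
# slurping the whole file into an array
my @back;
open my $in, '<', 'protx.dat' or die $!;
while ( my $x = <$in> ) {
    chomp $x;
    push @back, $x;    # or use $x directly in a calculation
}
close $in;

unlink 'protx.dat';    # clean up the example file
```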

  • I am not a chemist... I have no idea what you're doing exactly. Personally, if I were working with large datasets... I would use a database. (But I'm sure there's a good chance you need functions that you can't mimic in a DB, too...) Commented Oct 10, 2014 at 16:35
  • Well, it just finds the distance between coordinates in a file. I don't really know what you mean by "use a database", and the problem is dealing with the arrays. I have made some large arrays that are used in various ways throughout the script. I am just trying to find a way to use less of my computer's memory. Commented Oct 10, 2014 at 16:39
  • Yeah. Personally, I would parse the files and load them into tables in a database, because databases are meant to handle large data and calculations. Is this what we're talking about? Commented Oct 10, 2014 at 16:40
  • Yes, that is the exact kind of file I am using. If you could show me an example of parsing them into a table, that would really help. Commented Oct 10, 2014 at 16:44
  • Okay, I will... Hey, quick question: in the file format, is there always whitespace separating values, or can they run together (so a value's start/stop is based specifically on column position)? (I skimmed that link I posted... lol, sorry.) Also, do you have a database to use? MySQL or SQL Server? Commented Oct 10, 2014 at 16:55

2 Answers


You're trying to store ~66 million results in an array, which, as you've noticed, is both slow and memory-intensive. Perl arrays are not great for massive calculations like this, but PDL is.

The core of your problem is to calculate the distance between a number of 3D coordinates. Let's do this for a simplified data set first just to prove we can do it:

  Start         End
---------    ---------
(0, 0, 0)    (1, 2, 3)
(1, 1, 1)    (1, 1, 1)
(4, 5, 6)    (7, 8, 9)

We can represent this data set in PDL like this:

use PDL;
                  # x        # y        # z    
my $start = pdl [ [0, 1, 4], [0, 1, 5], [0, 1, 6] ];
my $end   = pdl [ [1, 1, 7], [2, 1, 8], [3, 1, 9] ];

We now have two sets of 3D coordinates. To compute the distances, first we subtract our start coordinates from our end coordinates:

my $diff = $end - $start;
print $diff;

This outputs

[
 [1 0 3]
 [2 0 3]
 [3 0 3]
]

where the differences in the x-coordinates are in the first row, the differences in the y-coordinates are in the second row, and the differences in the z-coordinates are in the third row.

Next we have to square the differences:

my $squared = $diff**2;
print $squared;

which gives us

[
 [1 0 9]
 [4 0 9]
 [9 0 9]
]

Finally we need to sum the square of the differences for each pair of points and take the square root:

foreach my $i (0 .. $squared->dim(0) - 1) {
    say sqrt sum $squared($i,:);
}

(There's probably a better way to do this, but I haven't used PDL much.)

This prints out

3.74165738677394
0
5.19615242270663

which are our distances.

Putting it all together:

use strict;
use warnings;
use 5.010;

use PDL;
use PDL::NiceSlice;

my $start = pdl [ [0, 1, 4], [0, 1, 5], [0, 1, 6] ];
my $end   = pdl [ [1, 1, 7], [2, 1, 8], [3, 1, 9] ];

my $diff = $end - $start;
my $squared = $diff**2;

foreach my $i (0 .. $squared->dim(0) - 1) {
    say sqrt sum $squared($i,:);
}

It takes ~35 seconds on my desktop to calculate the distance between one million pairs of coordinates and write the results to a file. When I try with ten million pairs, I run out of memory, so you'll probably have to split your data set into pieces.
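If you do have to split, the idea (sketched here in plain Perl rather than PDL, with a made-up batch size) is to process the start coordinates in fixed-size batches, so only one batch's worth of distances is live at a time and each batch's results can be appended to a file before the next begins:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @protx      = ( 0 .. 9 );    # stand-in coordinates
my $batch_size = 4;             # tune to your available memory
my @sizes;

for ( my $start = 0; $start < @protx; $start += $batch_size ) {
    my $end = $start + $batch_size - 1;
    $end = $#protx if $end > $#protx;
    my @batch = @protx[ $start .. $end ];

    # compute the distances for this batch and append the results
    # to an output file; @batch is freed before the next iteration
    push @sizes, scalar @batch;
}
```

With ten coordinates and a batch size of four, the loop sees batches of 4, 4, and 2 elements; the peak memory use is bounded by the batch size rather than the full data set.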

Reading data from files

Here's an example that reads data in from two files, using sample input you included in an earlier question:

use strict;
use warnings;
use 5.010;

use PDL;
use PDL::IO::Misc;
use PDL::NiceSlice;

my $start_file = 'start.txt';
my $end_file   = 'end.txt';

my $start = rcols $start_file, [ 5..7 ];
my $end   = rcols $end_file, [ 5..7 ];

my $diff = $end - $start;
my $squared = $diff**2;
   
foreach my $i (0 .. $squared->dim(0) - 1) {
    say sqrt sum $squared($i,:);
}

start.txt

ATOM      1  N   GLU    1     -19.992  -2.816  36.359  0.00  0.00      PROT
ATOM      2  HT1 GLU    1     -19.781  -1.880  35.958  0.00  0.00      PROT
ATOM      3  HT2 GLU    1     -19.713  -2.740  37.358  0.00  0.00      PROT
ATOM      4  HT3 GLU    1     -21.027  -2.910  36.393  0.00  0.00      PROT
ATOM      5  CA  GLU    1     -19.344  -3.944  35.652  0.00  0.00      PROT
ATOM      6  HA  GLU    1     -19.817  -4.852  35.998  0.00  0.00      PROT
ATOM      7  CB  GLU    1     -19.501  -3.795  34.119  0.00  0.00      PROT

end.txt

ATOM   2084  N   POPC   1     -44.763  27.962  20.983  0.00  0.00      MEM1  
ATOM   2085  C12 POPC   1     -46.144  27.379  20.551  0.00  0.00      MEM1  
ATOM   2086  C13 POPC   1     -44.923  28.611  22.367  0.00  0.00      MEM1  
ATOM   2087  C14 POPC   1     -43.713  26.889  21.099  0.00  0.00      MEM1  
ATOM   2088  C15 POPC   1     -44.302  29.004  20.059  0.00  0.00      MEM1  
ATOM   2089 H12A POPC   1     -46.939  28.110  20.555  0.00  0.00      MEM1  
ATOM   2090 H12B POPC   1     -46.486  26.769  21.374  0.00  0.00      MEM1  

Output

42.3946824613654
42.2903357636233
42.9320321205507
40.4541893133455
44.1770768272415
45.3936402704167
42.7174829080553

The rcols function comes from PDL::IO::Misc and can be used to read specific columns from a file into a PDL object (in this case, columns 5 through 7, zero-indexed).
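If PDL isn't available, the same column extraction can be sketched in core Perl with split (sample line taken from start.txt above). Note that split ' ' (a literal space) discards leading whitespace, whereas split /\s+/ on a line that starts with whitespace produces an empty leading field and shifts every index by one:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = 'ATOM      1  N   GLU    1     -19.992  -2.816  36.359  0.00  0.00      PROT';

# split ' ' skips leading whitespace, so the coordinates land
# reliably in fields 5..7
my @fields = split ' ', $line;
my ( $x, $y, $z ) = @fields[ 5 .. 7 ];
```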



edit...

You guys... we should have checked first. You might want to look into the Perl modules that already exist for processing and manipulating PDB data.


Okay... so Perl is not my first language and I don't know exactly what your data looks like.

edit: there was a weird row in my test data. There are two sets of code here... one splits on whitespace, and the other uses expected/known column positions and lengths to determine values.

#!/usr/bin/perl
use strict;
use warnings;

use DBI; 

my $db   = 'test'; 
my $host = 'localhost'; 
my $user = 'root';
my $pass = ''; 

my $dbh = DBI->connect("dbi:mysql:$db:$host",$user,$pass) 
            or die "Connection Error: $DBI::errstr\n";

my $localpath = 'C:\path\to\folder\with\datums';

my @filenames = ('glucagon');  # i am using this as my table name, too 
my $colnum    = 12;  # number of columns in data, I assumed this was fixed

my @placeholders;
for (1..$colnum) { push @placeholders, '?'; }
my $placeholders = join(',',@placeholders);  # builds a string like:  ?, ?, ?, ?, ...
                                             # for our query that uses binding
foreach my $file (@filenames) {
   my $filename = "$localpath\\$file.txt"; 

   if (open(my $fh => $filename)) {  

      # the null at the start of the insert is because my first column is an
      # auto_increment primary key that will be generated on insert by the db
      my $stmt = $dbh->prepare("insert into $file values (null, $placeholders)");

      while(my $line = <$fh>) {
         $line =~ s/\s+$//; # trim whitespace

         if ($line ne q{}) {  # if not totally blank
            my @row = split(/ +/,$line); # split on whitespace

            # @row is zero-indexed, but bind_param positions are one-indexed;
            # also, the loop variable of a foreach over a range is read-only,
            # so it must not be incremented by hand
            for my $index (1..$colnum) {
               $stmt->bind_param($index, $row[$index - 1]);
            }
            $stmt->execute(); 
         }
      }
      close $fh; 
   }
   else {  print "$file not opened\n";  }
}  
-- i didn't know appropriate names for any of it 
create table glucagon (
   row_id   int unsigned auto_increment primary key,
   name    varchar(10),
   seq     int,
   code1   varchar(5),
   code2   varchar(5),
   code3   varchar(5),
   code4   int,
   val1    decimal(10,2),
   val2    decimal(10,2),
   val3    decimal(10,2),
   val4    decimal(10,2),
   val5    decimal(10,2),
   code5   varchar(5)
)
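As an aside, the placeholder loop in the script above can be written more compactly with list repetition; the result is identical:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $colnum       = 12;
my $placeholders = join ',', ('?') x $colnum;
```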

the following is found in C:\path\to\folder\with\datums\glucagon.txt

ATOM   1058  N   ARG A 141      -6.466  12.036 -10.348  7.00 19.11           N
ATOM   1059  CA  ARG A 141      -7.922  12.248 -10.253  6.00 26.80           C
ATOM   1060  C   ARG A 141      -8.119  13.499  -9.393  6.00 28.93           C
ATOM   1061  O   ARG A 141      -7.112  13.967  -8.853  8.00 28.68           O
ATOM   1062  CB  ARG A 141      -8.639  11.005  -9.687  6.00 24.11           C
ATOM   1063  CG  ARG A 141      -8.153  10.551  -8.308  6.00 19.20           C
ATOM   1064  CD  ARG A 141      -8.914   9.319  -7.796  6.00 21.53           C
ATOM   1065  NE  ARG A 141      -8.517   9.076  -6.403  7.00 20.93           N
ATOM   1066  CZ  ARG A 141      -9.142   8.234  -5.593  6.00 23.56           C
ATOM   1067  NH1 ARG A 141     -10.150   7.487  -6.019  7.00 19.04           N
ATOM   1068  NH2 ARG A 141      -8.725   8.129  -4.343  7.00 25.11           N
ATOM   1069  OXT ARG A 141      -9.233  14.024  -9.296  8.00 40.35           O
TER    1070      ARG A 141
HETATM 1071 FE   HEM A   1       8.128   7.371 -15.022 24.00 16.74          FE
HETATM 1072  CHA HEM A   1       8.617   7.879 -18.361  6.00 17.74           C
HETATM 1073  CHB HEM A   1      10.356  10.005 -14.319  6.00 18.92           C
HETATM 1074  CHC HEM A   1       8.307   6.456 -11.669  6.00 11.00           C
HETATM 1075  CHD HEM A   1       6.928   4.145 -15.725  6.00 13.25           C

end result...

mysql> select * from glucagon;
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
| row_id | name   | seq  | code1 | code2 | code3 | code4 | val1   | val2  | val3   | val4  | val5  | code5 |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
|      1 | ATOM   | 1058 | N     | ARG   | A     |   141 |  -6.47 | 12.04 | -10.35 |  7.00 | 19.11 | N     |
|      2 | ATOM   | 1059 | CA    | ARG   | A     |   141 |  -7.92 | 12.25 | -10.25 |  6.00 | 26.80 | C     |
|      3 | ATOM   | 1060 | C     | ARG   | A     |   141 |  -8.12 | 13.50 |  -9.39 |  6.00 | 28.93 | C     |
|      4 | ATOM   | 1061 | O     | ARG   | A     |   141 |  -7.11 | 13.97 |  -8.85 |  8.00 | 28.68 | O     |
|      5 | ATOM   | 1062 | CB    | ARG   | A     |   141 |  -8.64 | 11.01 |  -9.69 |  6.00 | 24.11 | C     |
|      6 | ATOM   | 1063 | CG    | ARG   | A     |   141 |  -8.15 | 10.55 |  -8.31 |  6.00 | 19.20 | C     |
|      7 | ATOM   | 1064 | CD    | ARG   | A     |   141 |  -8.91 |  9.32 |  -7.80 |  6.00 | 21.53 | C     |
|      8 | ATOM   | 1065 | NE    | ARG   | A     |   141 |  -8.52 |  9.08 |  -6.40 |  7.00 | 20.93 | N     |
|      9 | ATOM   | 1066 | CZ    | ARG   | A     |   141 |  -9.14 |  8.23 |  -5.59 |  6.00 | 23.56 | C     |
|     10 | ATOM   | 1067 | NH1   | ARG   | A     |   141 | -10.15 |  7.49 |  -6.02 |  7.00 | 19.04 | N     |
|     11 | ATOM   | 1068 | NH2   | ARG   | A     |   141 |  -8.73 |  8.13 |  -4.34 |  7.00 | 25.11 | N     |
|     12 | ATOM   | 1069 | OXT   | ARG   | A     |   141 |  -9.23 | 14.02 |  -9.30 |  8.00 | 40.35 | O     |
|     13 | TER    | 1070 | ARG   | A     | 141   |  NULL |   NULL |  NULL |   NULL |  NULL |  NULL | NULL  |
|     14 | HETATM | 1071 | FE    | HEM   | A     |     1 |   8.13 |  7.37 | -15.02 | 24.00 | 16.74 | FE    |
|     15 | HETATM | 1072 | CHA   | HEM   | A     |     1 |   8.62 |  7.88 | -18.36 |  6.00 | 17.74 | C     |
|     16 | HETATM | 1073 | CHB   | HEM   | A     |     1 |  10.36 | 10.01 | -14.32 |  6.00 | 18.92 | C     |
|     17 | HETATM | 1074 | CHC   | HEM   | A     |     1 |   8.31 |  6.46 | -11.67 |  6.00 | 11.00 | C     |
|     18 | HETATM | 1075 | CHD   | HEM   | A     |     1 |   6.93 |  4.15 | -15.73 |  6.00 | 13.25 | C     |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
18 rows in set (0.00 sec)

Ohh... look... this row makes it dirty: TER 1070 ARG A 141. I can easily fix this if you go my route, but if you use the other answer/approach, I'm not going to bother to update this.


Okay... for the stupid row: I went through and counted the starting position and length of each value in my test dataset. I didn't know whether that information changes for you when you load different files or not... so I made it settable for each file you use.

#!/usr/bin/perl
use strict;
use warnings;

use DBI;  

my $db   = 'test'; 
my $host = 'localhost'; 
my $user = 'root';
my $pass = ''; 

my $dbh = DBI->connect("dbi:mysql:$db:$host",$user,$pass) 
            or die "Connection Error: $DBI::errstr\n";

my $localpath = 'C:\path\to\datums'; 

                                # first num is starting pos, second is length
my $fileinfo = { 'glucagon' => [[0,6],   #  'name'
                                [7,4],   #  'seq'
                                [12,4],  #  'code1'
                                [17,3],  #  'code2'
                                [21,1],  #  'code3'
                                [23,3],  #  'code4'
                                [27,12], #  'val1'
                                [39,7],  #  'val2'
                                [47,7],  #  'val3'
                                [55,5],  #  'val4'
                                [61,5],  #  'val5'
                                [69,10]  #  'code5'
                                ] 
               #  'second_file' => [ [0,5], # col1 
               #                     [6,5], # col2                     
               #                   ]  
               };  # i am using this as my table name, too  

foreach my $file (keys %$fileinfo) {
   my $filename = "$localpath\\$file.txt";  

   if (open(my $fh => $filename)) {   
      my $colnum = scalar @{$fileinfo->{$file}}; 

      my @placeholders;
      for (1..$colnum) { push @placeholders, '?'; }
      my $placeholders = join(',',@placeholders);  # builds a string like:  ?, ?, ?, ?, ...
                                                   # for our query that uses binding

      # the null at the start of the insert is because my first column is an
      # auto_increment primary key that will be generated on insert by the db
      my $stmt = $dbh->prepare("insert into $file values (null, $placeholders)");

      while(my $line = <$fh>) {
         $line =~ s/\s+$//;   # trim trailing whitespace 
         if ($line ne q{}) {  # if not totally blank  
            my @row;
            my $index = 1; 
            # foreach value column position & length 
            foreach my $col (@{$fileinfo->{$file}}) {   
               my $value;
               if ($col->[0] <= length($line)) {
                  $value = substr($line,$col->[0],$col->[1]); 
                  $value =~ s/^\s+|\s+$//g;  # trim trailing & leading whitespace
                  if ($value eq q{}) { undef $value; } # i like null values vs blank
               }
               $row[$index] = $value;
               $index++; 
            } 

            for my $index (1..$colnum) { 
               $stmt->bind_param($index, $row[$index]); 
            }
            $stmt->execute(); 
         }
      }
      close $fh; 
   }
   else {  print "$file not opened\n";  }
}  

new data:

mysql> select * from  glucagon;
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
| row_id | name   | seq  | code1 | code2 | code3 | code4 | val1   | val2  | val3   | val4  | val5  | code5 |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
|      1 | ATOM   | 1058 | N     | ARG   | A     |   141 |  -6.47 | 12.04 | -10.35 |  7.00 | 19.11 | N     |
|      2 | ATOM   | 1059 | CA    | ARG   | A     |   141 |  -7.92 | 12.25 | -10.25 |  6.00 | 26.80 | C     |
|      3 | ATOM   | 1060 | C     | ARG   | A     |   141 |  -8.12 | 13.50 |  -9.39 |  6.00 | 28.93 | C     |
|      4 | ATOM   | 1061 | O     | ARG   | A     |   141 |  -7.11 | 13.97 |  -8.85 |  8.00 | 28.68 | O     |
|      5 | ATOM   | 1062 | CB    | ARG   | A     |   141 |  -8.64 | 11.01 |  -9.69 |  6.00 | 24.11 | C     |
|      6 | ATOM   | 1063 | CG    | ARG   | A     |   141 |  -8.15 | 10.55 |  -8.31 |  6.00 | 19.20 | C     |
|      7 | ATOM   | 1064 | CD    | ARG   | A     |   141 |  -8.91 |  9.32 |  -7.80 |  6.00 | 21.53 | C     |
|      8 | ATOM   | 1065 | NE    | ARG   | A     |   141 |  -8.52 |  9.08 |  -6.40 |  7.00 | 20.93 | N     |
|      9 | ATOM   | 1066 | CZ    | ARG   | A     |   141 |  -9.14 |  8.23 |  -5.59 |  6.00 | 23.56 | C     |
|     10 | ATOM   | 1067 | NH1   | ARG   | A     |   141 | -10.15 |  7.49 |  -6.02 |  7.00 | 19.04 | N     |
|     11 | ATOM   | 1068 | NH2   | ARG   | A     |   141 |  -8.73 |  8.13 |  -4.34 |  7.00 | 25.11 | N     |
|     12 | ATOM   | 1069 | OXT   | ARG   | A     |   141 |  -9.23 | 14.02 |  -9.30 |  8.00 | 40.35 | O     |
|     13 | TER    | 1070 | NULL  | ARG   | A     |   141 |   NULL |  NULL |   NULL |  NULL |  NULL | NULL  |
|     14 | HETATM | 1071 | FE    | HEM   | A     |     1 |   8.13 |  7.37 | -15.02 | 24.00 | 16.74 | FE    |
|     15 | HETATM | 1072 | CHA   | HEM   | A     |     1 |   8.62 |  7.88 | -18.36 |  6.00 | 17.74 | C     |
|     16 | HETATM | 1073 | CHB   | HEM   | A     |     1 |  10.36 | 10.01 | -14.32 |  6.00 | 18.92 | C     |
|     17 | HETATM | 1074 | CHC   | HEM   | A     |     1 |   8.31 |  6.46 | -11.67 |  6.00 | 11.00 | C     |
|     18 | HETATM | 1075 | CHD   | HEM   | A     |     1 |   6.93 |  4.15 | -15.73 |  6.00 | 13.25 | C     |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
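The substr handling that rescues the short TER record can be seen in isolation (positions follow the glucagon table above; only the first two columns are shown):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = 'TER    1070      ARG A 141';
my @spec = ( [ 0, 6 ], [ 7, 4 ] );    # 'name' and 'seq' columns
my @vals;

for my $col (@spec) {
    my $value;
    if ( $col->[0] <= length($line) ) {
        $value = substr( $line, $col->[0], $col->[1] );
        $value =~ s/^\s+|\s+$//g;         # trim leading & trailing whitespace
        undef $value if $value eq q{};    # prefer NULL over blank
    }
    push @vals, $value;
}
# columns whose start position lies past the end of a short record
# simply come back as undef, which becomes NULL in the database
```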

