
I have a Perl script that compares two sets of data loaded into two arrays, and I'm trying to make the comparison more efficient. Currently the code is as follows:

foreach (@FILE_DATA) {

    if((++$file_current_linenum % 200) == 0) {
        $progress = int($file_current_linenum / $file_total_lines * 10000) / 100;
        logg("Processed $file_current_linenum file rows, $progress%, $mismatches mismatches.");
    }


    $file_current_line = $_;

    $match_found = 0;

    foreach (@DB_DATA) {

        $db_current_line = $_;

        if($file_current_line->{"channel"} == $db_current_line->{"channel"} ) {

            if ($file_current_line->{"checksum"} == $db_current_line->{"checksum"} &&
                $file_current_line->{"time"} > ($db_current_line->{"date_part"} - $TIME_MATCH_TOLERANCE) && 
                $file_current_line->{"time"} < ($db_current_line->{"date_part"} + $TIME_MATCH_TOLERANCE) ){

                $match_found = 1;
                last; # break;
            }
        }
    }


    if($match_found != 1) {

        push(@results, $file_current_line);
        $mismatches++;

    }
}

My first thought was to remove matches from both arrays to reduce the pool size, but would that affect the iterator's position?

Both sets of data can have up to a couple of million elements and the comparison can take a few hours to complete.

2 Answers

Your solution is O(DB * FILE).

The following is O(DB + FILE), provided there are never more than a few lines with the same channel and checksum:

my %DB_DATA;
for my $db_line (@DB_DATA) {
   push @{ $DB_DATA{ $db_line->{channel} }{ $db_line->{checksum} } }, $db_line;
}

for my $file_line_idx (0..$#FILE_DATA) {
   my $file_line = $FILE_DATA[$file_line_idx];
   my $found = 0;
   if (my $r1 = $DB_DATA{ $file_line->{channel} } ) {
      if (my $r2 = $r1->{ $file_line->{checksum} } ) {
         my $file_time = $file_line->{time};
         for my $db_line (@$r2) {
            my $db_time = $db_line->{date_part};
            if (abs($file_time - $db_time) < $TIME_MATCH_TOLERANCE) {
               $found = 1;
               last;
            }
         }
      }
   }

   push @mismatches, $file_line if !$found;

   if ((($file_line_idx+1) % 200) == 0) {
      logg(sprintf("Processed %d file rows, %d%%, %d mismatches.",
         $file_line_idx+1,
         int(($file_line_idx+1)/@FILE_DATA * 100),
         0+@mismatches,
      ));
   }
}

The following is O(DB + FILE) even if there are many lines with the same channel and checksum, but uses a lot of memory if $TIME_MATCH_TOLERANCE is big:

my %DB_DATA;
for my $db_line (@DB_DATA) {
   for my $db_time (
      $db_line->{date_part} - $TIME_MATCH_TOLERANCE + 1
        ..
      $db_line->{date_part} + $TIME_MATCH_TOLERANCE - 1
   ) {
      ++$DB_DATA{ $db_line->{channel} }{ $db_line->{checksum} }{$db_time};
   }
}

for my $file_line_idx (0..$#FILE_DATA) {
   my $file_line = $FILE_DATA[$file_line_idx];
   my $found = 0;
   if (my $r1 = $DB_DATA{ $file_line->{channel} } ) {
      if (my $r2 = $r1->{ $file_line->{checksum} } ) {
         if ($r2->{ $file_line->{time} }) {
            $found = 1;
         }
      }
   }

   push @mismatches, $file_line if !$found;

   if ((($file_line_idx+1) % 200) == 0) {
      logg(sprintf("Processed %d file rows, %d%%, %d mismatches.",
         $file_line_idx+1,
         int(($file_line_idx+1)/@FILE_DATA * 100),
         0+@mismatches,
      ));
   }
}

Note: Assumes the timestamps are integers. If they're not, convert them to integers before using them as keys.
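
For example, a minimal sketch of that conversion (assuming fractional epoch timestamps that can safely be floored to whole seconds):

use POSIX qw( floor );

# Hypothetical normalization pass: floor fractional timestamps to
# whole seconds so they can be used as integer hash keys.
$_->{date_part} = floor($_->{date_part}) for @DB_DATA;
$_->{time}      = floor($_->{time})      for @FILE_DATA;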


The following is O((DB + FILE) log DB), which is very close to O(DB + FILE), even if there are many lines with the same channel and checksum, and it uses minimal memory:

# Returns 1 if $compare returns 0 for some element of the sorted
# array, and 0 otherwise. The comparator sees each candidate element
# in $_ and returns <0, 0 or >0, as with sort.
sub binsearch(&\@) {
   my ($compare, $array) = @_;

   my $i = 0;
   my $j = $#$array;
   return 0 if $j == -1;  # Empty array.

   while (1) {
      my $k = int(($i+$j)/2);
      for ($array->[$k]) {  # Alias $_ to the middle element.
         my $cmp = $compare->()
            or return 1;    # Zero means an acceptable match was found.

         if ($cmp < 0) {
            $j = $k-1;      # Continue in the lower half.
            return 0 if $i > $j;
         } else {
            $i = $k+1;      # Continue in the upper half.
            return 0 if $i > $j;
         }
      }
   }
}

my %DB_DATA;
for my $db_line (@DB_DATA) {
   push @{ $DB_DATA{ $db_line->{channel} }{ $db_line->{checksum} } }, $db_line;
}

for my $r1 (values(%DB_DATA)) {
   for my $r2 (values(%$r1)) {
      @$r2 = sort { $a->{date_part} <=> $b->{date_part} } @$r2;
   }
}

for my $file_line_idx (0..$#FILE_DATA) {
   my $file_line = $FILE_DATA[$file_line_idx];
   my $found = 0;
   if (my $r1 = $DB_DATA{ $file_line->{channel} } ) {
      if (my $r2 = $r1->{ $file_line->{checksum} } ) {
         my $file_time = $file_line->{time};
         my $min_db_time = $file_time - $TIME_MATCH_TOLERANCE;
         my $max_db_time = $file_time + $TIME_MATCH_TOLERANCE;
         if ( binsearch {
              $_->{date_part} >= $max_db_time ? -1
            : $_->{date_part} <= $min_db_time ? +1
            : 0
         } @$r2 ) {
            $found = 1;
         }
      }
   }

   push @mismatches, $file_line if !$found;

   if ((($file_line_idx+1) % 200) == 0) {
      logg(sprintf("Processed %d file rows, %d%%, %d mismatches.",
         $file_line_idx+1,
         int(($file_line_idx+1)/@FILE_DATA * 100),
         0+@mismatches,
      ));
   }
}

4 Comments

Thanks for this awesome response! I'm getting an error trying to implement it though: "Global symbol "%DB_DATA" requires explicit package name at ...", regarding this line: "push @{ $DB_DATA{ $db_line->{channel} }{ $db_line->{checksum} } }, $db_line;" (sorry, Perl is still quite foreign to me)
That means you left out the my %DB_DATA;
Okay, I don't understand the syntax then. DB_DATA is a global and it's already declared with "my @DB_DATA;" (and populated with the data from the database).
That @DB_DATA is a file-scoped variable declared with my @DB_DATA; has no bearing on the fact that you didn't declare %DB_DATA, a hash populated from @DB_DATA by my code. Put the my %DB_DATA; back in.

You could probably reduce the time significantly by pre-building a hash from DB_DATA, using the concatenation of the "channel" and "checksum" values as the key, and each value being a list of all of the DB_DATA entries with that channel and checksum. That way, for each FILE_DATA entry, you only need to check that list.
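
A minimal sketch of that idea, assuming the same @DB_DATA, @FILE_DATA, @results and $TIME_MATCH_TOLERANCE as in the question (the "\0" separator is an arbitrary choice that keeps the concatenated keys unambiguous):

my %db_index;
for my $db_line (@DB_DATA) {
   # Composite key: channel and checksum joined by a separator that
   # can't appear in either value.
   push @{ $db_index{ join "\0", @$db_line{qw( channel checksum )} } }, $db_line;
}

for my $file_line (@FILE_DATA) {
   my $candidates = $db_index{ join "\0", @$file_line{qw( channel checksum )} } || [];

   # Only the few entries sharing this channel and checksum need a
   # time comparison.
   my $found = grep {
      abs($file_line->{time} - $_->{date_part}) < $TIME_MATCH_TOLERANCE
   } @$candidates;

   push @results, $file_line unless $found;
}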

If there are a lot of entries with a given channel and checksum, you can try to improve things further by sorting them by date_part and then binary searching to find a valid entry.

If there are very few entries with a given channel and checksum, this should reduce your run time by a factor of a million or so, since it reduces the run time from O($#FILE_DATA * $#DB_DATA) to O($#FILE_DATA + $#DB_DATA).

1 Comment

To explain with database terminology: the original code is a nested loop join, which can be optimized by converting it to a hash join.
