
I have a Perl script that compares two sets of data loaded into two arrays, and I'm trying to make the comparison more efficient. Currently the code is as follows:

foreach (@FILE_DATA) {

    if((++$file_current_linenum % 200) == 0) {
        $progress = int($file_current_linenum / $file_total_lines * 10000) / 100;
        logg("Processed $file_current_linenum file rows, $progress%, $mismatches mismatches.");
    }


    $file_current_line = $_;

    $match_found = 0;

    foreach (@DB_DATA) {

        $db_current_line = $_;

        if($file_current_line->{"channel"} == $db_current_line->{"channel"} ) {

            if ($file_current_line->{"checksum"} == $db_current_line->{"checksum"} &&
                $file_current_line->{"time"} > ($db_current_line->{"date_part"} - $TIME_MATCH_TOLERANCE) && 
                $file_current_line->{"time"} < ($db_current_line->{"date_part"} + $TIME_MATCH_TOLERANCE) ){

                $match_found = 1;
                last; # break;
            }
        }
    }


    if($match_found != 1) {

        push(@results, $file_current_line);
        $mismatches++;

    }
}

My first thought was to remove matches from both arrays to reduce the pool size, but would that affect the iterator's position?

Both sets of data can have up to a couple of million elements and the comparison can take a few hours to complete.

2 Answers

Your solution is O(DB * FILE).

The following is O(DB + FILE), provided there are never more than a few lines with the same channel and checksum:

my %DB_DATA;
for my $db_line (@DB_DATA) {
   push @{ $DB_DATA{ $db_line->{channel} }{ $db_line->{checksum} } }, $db_line;
}

for my $file_line_idx (0..$#FILE_DATA) {
   my $file_line = $FILE_DATA[$file_line_idx];
   my $found = 0;
   if (my $r1 = $DB_DATA{ $file_line->{channel} } ) {
      if (my $r2 = $r1->{ $file_line->{checksum} } ) {
         my $file_time = $file_line->{time};
         for my $db_line (@$r2) {
            my $db_time = $db_line->{date_part};
            if (abs($file_time - $db_time) < $TIME_MATCH_TOLERANCE) {
               $found = 1;
               last;
            }
         }
      }
   }

   push @mismatches, $file_line if !$found;

   if ((($file_line_idx+1) % 200) == 0) {
      logg(sprintf("Processed %d file rows, %d%%, %d mismatches.",
         $file_line_idx+1,
         int(($file_line_idx+1)/@FILE_DATA * 100),
         0+@mismatches,
      ));
   }
}

The following is O(DB + FILE) even if there are many lines with the same channel and checksum, but uses a lot of memory if $TIME_MATCH_TOLERANCE is big:

my %DB_DATA;
for my $db_line (@DB_DATA) {
   for my $db_time (
      $db_line->{date_part} - $TIME_MATCH_TOLERANCE + 1
        ..
      $db_line->{date_part} + $TIME_MATCH_TOLERANCE - 1
   ) {
      ++$DB_DATA{ $db_line->{channel} }{ $db_line->{checksum} }{$db_time};
   }
}

for my $file_line_idx (0..$#FILE_DATA) {
   my $file_line = $FILE_DATA[$file_line_idx];
   my $found = 0;
   if (my $r1 = $DB_DATA{ $file_line->{channel} } ) {
      if (my $r2 = $r1->{ $file_line->{checksum} } ) {
         if ($r2->{ $file_line->{time} }) {
            $found = 1;
         }
      }
   }

   push @mismatches, $file_line if !$found;

   if ((($file_line_idx+1) % 200) == 0) {
      logg(sprintf("Processed %d file rows, %d%%, %d mismatches.",
         $file_line_idx+1,
         int(($file_line_idx+1)/@FILE_DATA * 100),
         0+@mismatches,
      ));
   }
}

Note: Assumes the timestamps are integers. If they're not, convert them to integers before using them as keys.
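
For example, a minimal sketch of that conversion (assuming fractional epoch timestamps that can safely be floored to whole seconds):

use POSIX qw( floor );

# Hypothetical normalization pass: floor fractional timestamps to
# whole seconds so they can be used as integer hash keys.
$_->{date_part} = floor($_->{date_part}) for @DB_DATA;
$_->{time}      = floor($_->{time})      for @FILE_DATA;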


The following is O((DB + FILE) log DB), which is very close to O(DB + FILE), even if there are many lines with the same channel and checksum, and it uses minimal memory:

# Returns 1 if $compare returns 0 for some element of the sorted
# array, and 0 otherwise. The comparator sees each candidate element
# in $_ and returns <0, 0 or >0, as with sort.
sub binsearch(&\@) {
   my ($compare, $array) = @_;

   my $i = 0;
   my $j = $#$array;
   return 0 if $j == -1;  # Empty array.

   while (1) {
      my $k = int(($i+$j)/2);
      for ($array->[$k]) {  # Alias $_ to the middle element.
         my $cmp = $compare->()
            or return 1;    # Zero means an acceptable match was found.

         if ($cmp < 0) {
            $j = $k-1;      # Continue in the lower half.
            return 0 if $i > $j;
         } else {
            $i = $k+1;      # Continue in the upper half.
            return 0 if $i > $j;
         }
      }
   }
}

my %DB_DATA;
for my $db_line (@DB_DATA) {
   push @{ $DB_DATA{ $db_line->{channel} }{ $db_line->{checksum} } }, $db_line;
}

for my $r1 (values(%DB_DATA)) {
   for my $r2 (values(%$r1)) {
      @$r2 = sort { $a->{date_part} <=> $b->{date_part} } @$r2;
   }
}

for my $file_line_idx (0..$#FILE_DATA) {
   my $file_line = $FILE_DATA[$file_line_idx];
   my $found = 0;
   if (my $r1 = $DB_DATA{ $file_line->{channel} } ) {
      if (my $r2 = $r1->{ $file_line->{checksum} } ) {
         my $file_time = $file_line->{time};
         my $min_db_time = $file_time - $TIME_MATCH_TOLERANCE;
         my $max_db_time = $file_time + $TIME_MATCH_TOLERANCE;
         if ( binsearch {
              $_->{date_part} >= $max_db_time ? -1
            : $_->{date_part} <= $min_db_time ? +1
            : 0
         } @$r2 ) {
            $found = 1;
         }
      }
   }

   push @mismatches, $file_line if !$found;

   if ((($file_line_idx+1) % 200) == 0) {
      logg(sprintf("Processed %d file rows, %d%%, %d mismatches.",
         $file_line_idx+1,
         int(($file_line_idx+1)/@FILE_DATA * 100),
         0+@mismatches,
      ));
   }
}

4 Comments

Thanks for this awesome response! I'm getting an error trying to implement it though: "Global symbol "%DB_DATA" requires explicit package name at ...", regarding this line: "push @{ $DB_DATA{ $db_line->{channel} }{ $db_line->{checksum} } }, $db_line;" (sorry, Perl is still quite foreign to me)
That means you left out the my %DB_DATA;
Okay, I don't understand the syntax then. DB_DATA is a global and it's already declared with "my @DB_DATA;" (and populated with the data from the database).
That @DB_DATA is a file-scoped variable declared with my @DB_DATA; has no bearing on the fact that you didn't declare %DB_DATA, a hash populated from @DB_DATA by my code. Put the my %DB_DATA; back in.

You could probably reduce the time significantly by pre-building a hash from DB_DATA, using the concatenation of the "channel" and "checksum" values as the key, and each value being a list of all of the DB_DATA entries with that channel and checksum. That way, for each FILE_DATA entry, you only need to check that list.
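
A minimal sketch of that idea, assuming the same @DB_DATA, @FILE_DATA, @results and $TIME_MATCH_TOLERANCE as in the question (the "\0" separator is an arbitrary choice that keeps the concatenated keys unambiguous):

my %db_index;
for my $db_line (@DB_DATA) {
   # Composite key: channel and checksum joined by a separator that
   # can't appear in either value.
   push @{ $db_index{ join "\0", @$db_line{qw( channel checksum )} } }, $db_line;
}

for my $file_line (@FILE_DATA) {
   my $candidates = $db_index{ join "\0", @$file_line{qw( channel checksum )} } || [];

   # Only the few entries sharing this channel and checksum need a
   # time comparison.
   my $found = grep {
      abs($file_line->{time} - $_->{date_part}) < $TIME_MATCH_TOLERANCE
   } @$candidates;

   push @results, $file_line unless $found;
}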

If there are a lot of entries with a given channel and checksum, you can try to improve things further by sorting them by date_part and then binary searching to find a valid entry.

If there are very few entries with a given channel and checksum, this should reduce your run time by a factor of a million or so, since it reduces the run time from O($#FILE_DATA * $#DB_DATA) to O($#FILE_DATA + $#DB_DATA).

1 Comment

To explain with database terminology: the original code is a nested loop join, which can be optimized by converting it to a hash join.
