Your solution is O(DB * FILE).
The following is O(DB + FILE) if and only if there never more than a few lines with the same channel and checksum:
my %DB_DATA;
for my $db_line (@DB_DATA) {
push @{ $DB_DATA{ $db_line->{channel} }{ $db_line->{checksum} } }, $db_line;
}
for my $file_line_idx (0..$#FILE_DATA) {
my $file_line = $FILE_DATA[$file_line_idx];
my $found = 0;
if (my $r1 = $DB_DATA{ $file_line->{channel} } ) {
if (my $r2 = $r1->{ $file_line->{checksum} } ) {
my $file_time = $file_line->{time};
for my $db_line (@$r2) {
my $db_time = $db_line->{date_part};
if (abs($file_time - $db_time) < $TIME_MATCH_TOLERANCE) {
$found = 1;
last;
}
}
}
}
push @mismatches, $file_line if !$found;
if ((($file_line_idx+1) % 200) == 0) {
logg(sprintf("Processed %d file rows, %d%, %d mismatches.",
$file_line_idx+1,
int(($file_line_idx+1)/@FILE_DATA) * 100,
0+@mismatches,
));
}
}
The following is O(DB + FILE) even if there are many lines with the same channel and checksum, but uses a lot of memory if $TIME_MATCH_TOLERANCE is big:
my %DB_DATA;
for my $db_line (@DB_DATA) {
for my $db_time (
$db_line->{date_part} - $TIME_MATCH_TOLERANCE + 1
..
$db_line->{date_part} + $TIME_MATCH_TOLERANCE - 1
) {
++$DB_DATA{ $db_line->{channel} }{ $db_line->{checksum} }{$db_time};
}
}
for my $file_line_idx (0..$#FILE_DATA) {
my $file_line = $FILE_DATA[$file_line_idx];
my $found = 0;
if (my $r1 = $DB_DATA{ $file_line->{channel} } ) {
if (my $r2 = $r1->{ $file_line->{checksum} } ) {
if ($r2->{ $file_line->{time} } {
$found = 1;
last;
}
}
}
push @mismatches, $file_line if !$found;
if ((($file_line_idx+1) % 200) == 0) {
logg(sprintf("Processed %d file rows, %d%, %d mismatches.",
$file_line_idx+1,
int(($file_line_idx+1)/@FILE_DATA) * 100,
0+@mismatches,
));
}
}
Note: Assumes the timestamps are integers. If they're not, convert them to integers before using them as keys.
The following is O((DB + FILE) log DB) [ which is very close to O(DB + FILE) ] even if there are many lines with the same channel and checksum, and uses minimal memory:
sub binsearch(&\@) {
my ($compare, $array) = @_;
my $i = 0;
my $j = $#$array;
return 0 if $j == -1;
while (1) {
my $k = int(($i+$j)/2);
for ($array->[$k]) {
my $cmp = $compare->()
or return 1;
if ($cmp < 0) {
$j = $k-1;
return 0 if $i > $j;
} else {
$i = $k+1;
return 0 if $i > $j;
}
}
}
}
my %DB_DATA;
for my $db_line (@DB_DATA) {
push @{ $DB_DATA{ $db_line->{channel} }{ $db_line->{checksum} } }, $db_line;
}
for my $r1 (values(%DB_DATA)) {
for my $r2 (values(%$r1)) {
@$r2 = sort { $a->{date_part} <=> $b->{date_part} } @$r2;
}
}
for my $file_line_idx (0..$#FILE_DATA) {
my $file_line = $FILE_DATA[$file_line_idx];
my $found = 0;
if (my $r1 = $DB_DATA{ $file_line->{channel} } ) {
if (my $r2 = $r1->{ $file_line->{checksum} } ) {
my $file_time = $file_line->{time};
my $min_db_time = $file_time - $TIME_MATCH_TOLERANCE;
my $max_db_time = $file_time + $TIME_MATCH_TOLERANCE;
if ( binsearch {
$_->{date_part} >= $max_db_time ? -1
: $_->{date_part} <= $min_db_time ? +1
: 0
} @$r2 ) {
$found = 1;
last;
}
}
}
push @mismatches, $file_line if !$found;
if ((($file_line_idx+1) % 200) == 0) {
logg(sprintf("Processed %d file rows, %d%, %d mismatches.",
$file_line_idx+1,
int(($file_line_idx+1)/@FILE_DATA) * 100,
0+@mismatches,
));
}
}