4

I have a list of keywords and a blacklist. I want to delete every keyword that contains any blacklist item. At the moment I'm doing it this way:

my @keywords = ( 'some good keyword', 'some other good keyword', 'some bad keyword');
my @blacklist = ( 'bad' );

A: for my $keyword ( @keywords ) {
    B: for my $bl ( @blacklist ) {
        next A if $keyword =~ /$bl/i;      # omitting $keyword
    }
    # some keyword cleaning (for instance: erasing non a-zA-Z0-9 characters, etc)
}

I was wondering whether there is a faster way to do this, because at the moment I have about 25 million keywords and a couple of hundred words in the blacklist.

2
  • Do you want a new array with filtered @keywords? Commented May 24, 2013 at 9:21
  • It can be a new array. Commented May 24, 2013 at 9:22

3 Answers

4

The most straightforward option is to join the blacklist entries into a single regular expression, then grep the keyword list for those which don't match that regex:

#!/usr/bin/env perl    

use strict;
use warnings;
use 5.010;

my @keywords = 
  ('some good keyword', 'some other good keyword', 'some bad keyword');
my @blacklist = ('bad');

my $re = join '|', @blacklist;
my @good = grep { $_ !~ /$re/i } @keywords;    # /i keeps the match case-insensitive, as in the question

say join "\n", @good;

Output:

some good keyword
some other good keyword
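
If the blacklist entries are meant as literal words rather than patterns, a variant of the above could escape them with quotemeta and compile the alternation once with qr//. This is only a sketch: the sub name filter_keywords and the sample data are made up here, and the /i flag is an assumption that mirrors the case-insensitive match in the question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): returns the keywords
# that contain none of the blacklist entries, compared case-insensitively.
sub filter_keywords {
    my ($keywords, $blacklist) = @_;

    # An empty alternation would match every keyword, so bail out early.
    return @$keywords unless @$blacklist;

    # quotemeta makes characters such as '.' or '+' match literally.
    my $alternation = join '|', map { quotemeta($_) } @$blacklist;
    my $re = qr/$alternation/i;    # compiled once, reused for every keyword

    return grep { $_ !~ $re } @$keywords;
}

my @good = filter_keywords(
    [ 'some good keyword', 'some BAD keyword', 'c++ tips', 'c tips' ],
    [ 'bad', 'c++' ],
);
print "$_\n" for @good;
```

Without quotemeta, an entry such as 'c++' would be an invalid pattern, and an entry such as 'a.b' would also match 'axb'.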

3 Comments

Thanks a lot! For a test with 50k keywords, execution time went down from 34 sec to 0.6 sec.
Regexp::Assemble (metacpan.org/module/Regexp::Assemble) improves performance further.
To demonstrate: perl -MData::Printer -MRegexp::Assemble -E "my $ra = Regexp::Assemble->new(); for my $word (qw/apple asp application aspire applicate aardvark snake/) { $ra->add($word) } p($ra->re);" gives (?:a(?:ppl(?:icat(?:ion|e)|e)|sp(?:ire)?|ardvark)|snake)
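
The speed-up reported in the comments can be checked on your own data with the core Benchmark module. The following is a rough sketch on synthetic data (the keyword and blacklist contents, the sub names and the iteration count are all made up for illustration); the timings will vary by machine and Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic data, invented for this sketch: 1,000 keywords, 50 blacklist words.
my @blacklist = map { "bad$_" } 1 .. 50;
my @keywords  = map { $_ % 10 ? "good keyword $_" : "some bad$_ keyword" } 1 .. 1000;

# The approach from the question: test every blacklist word in turn.
sub nested_loops {
    my @good;
    KEYWORD: for my $keyword (@keywords) {
        for my $bl (@blacklist) {
            next KEYWORD if $keyword =~ /$bl/i;
        }
        push @good, $keyword;
    }
    return \@good;
}

# The approach from the answer: one alternation, one match per keyword.
sub single_regex {
    my $alternation = join '|', @blacklist;
    my $re = qr/$alternation/i;
    return [ grep { $_ !~ $re } @keywords ];
}

# Run each variant 10 times and print a comparison table.
cmpthese(10, {
    nested_loops => \&nested_loops,
    single_regex => \&single_regex,
});
```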
3

Precompiling the patterns may help if you want to keep the nested loops: my @blacklist = ( qr/bad/i ).

Alternatively, change my @blacklist = ( 'bad', 'awful', 'worst' ) to my $blacklist = qr/bad|awful|worst/; and replace the inner loop with if ( $keywords[$i] =~ $blacklist ) ....
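
As a rough sketch of the second suggestion applied to the loop from the question: the labelled inner loop disappears and one precompiled pattern is tested per keyword. The /i flag is an assumption carried over from the question, and the cleaning step (stripping everything except letters, digits and spaces) is only a stand-in for whatever cleaning the real code does.

```perl
use strict;
use warnings;

my @keywords  = ('some good keyword!', 'some other good keyword', 'some bad keyword');
my $blacklist = qr/bad|awful|worst/i;    # compiled once, outside the loop

my @cleaned;
for my $keyword (@keywords) {
    next if $keyword =~ $blacklist;      # skip blacklisted keywords

    # Placeholder cleaning step (an assumption, not from the question).
    (my $clean = $keyword) =~ s/[^a-zA-Z0-9 ]//g;
    push @cleaned, $clean;
}

print "$_\n" for @cleaned;
```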


0

This should do it:

my @indices;
for my $i (0..$#keywords) {
  for my $bl (@blacklist) {
    if ($keywords[$i] =~ $bl) {
      push(@indices, $i);
      last;
    }
  }
}
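# splice with no length would drop everything from $i onward, so remove
# one element at a time, highest index first to keep the indices valid.
for my $i (reverse @indices) {
  splice(@keywords, $i, 1);
}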
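
A note on splice: called as splice(@array, $offset) with no length, it removes everything from $offset to the end and returns the removed elements. Removing single elements therefore needs a length of 1, and working from the highest index down keeps the remaining indices valid. A minimal, self-contained sketch with made-up data (the /i flag is an assumption taken from the question):

```perl
use strict;
use warnings;

my @keywords  = ('bad one', 'good one', 'another bad one', 'fine');
my @blacklist = ('bad');

# Collect the index of every keyword that matches a blacklist entry.
my @indices;
for my $i (0 .. $#keywords) {
    for my $bl (@blacklist) {
        if ($keywords[$i] =~ /$bl/i) {
            push @indices, $i;
            last;
        }
    }
}

# Delete one element per index, highest index first, so that the
# positions of the elements still to be removed do not shift.
splice(@keywords, $_, 1) for reverse @indices;

print "$_\n" for @keywords;
```

Each splice in the middle of a large array has to shift the elements after it, so for millions of keywords a single grep pass is likely to be cheaper than repeated splicing.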

