Perl Regular Expression running faster than C++ Boost Implementation

Question

I am confused about what is happening here. Most benchmarks I have seen show Boost performing close to Perl or even beating it. In my scripts, however, the Perl implementation is roughly 5-6 times faster.

In both test_script.cpp and test_script.pl I open a file and read it line by line, populating an array. I then run each string against a list of regex definitions in linear order until one matches, at which point nothing happens (I/O was removed for testing purposes) and the next string is compared, and so on until all strings have been compared.

test_script.pl:

#make incomingList, which contains all incoming strings
my $start = Time::HiRes::gettimeofday();

foreach (@incomingList) {
  my $inString = $_;
  &find_pattern($inString);
}

my $end = Time::HiRes::gettimeofday();
printf("%.6f\n", $end - $start);

find_pattern method:

sub find_pattern {
  my $URLString = $_[0];

  #1 rewrite
  if($URLString =~ m/^\/stuff\/brands-([^\/]*)\/(.*)?$/) {

  }
  #2 rewrite
  elsif($URLString =~ m/^\/coupons(\/.*)?$/){

  }
  #3 rewrite
  elsif($URLString =~ m/^\/han\/(.+)$/){

  }
  # ...continues on, there are 100 patterns. 
}

test_script.cpp, main method:

populateArray();
//make stringArr, which contains all incoming strings
struct timeval time;
gettimeofday(&time, NULL);
double t1=time.tv_sec+(time.tv_usec/1000000.0);   

for(int j =0; j < 10000; j++){
  getRule(stringArr[j]);
 }

gettimeofday(&time, NULL);
double t2=time.tv_sec+(time.tv_usec/1000000.0);
printf("%.6lf seconds elapsed\n", t2-t1);

populateArray method:

static void populateArray(){
regexArray[1] =  boost::regex ("\\/stuff\\/brands-([^\\/]*)\\/(.*)?");
regexArray[2] =  boost::regex ("\\/coupons(\\/.*)?");
regexArray[3] =  boost::regex ("\\/han\\/(.+)"); 
//continues on, 100 definitions. 
}

getRule method:

static void getRule(string inQuery){
  for(int i =1; i < 100; i++){
    if(boost::regex_match(inQuery, regexArray[i])){
      break;
    }
  }
}

I understand that it might seem odd to use a linear chain of if/elsif checks in Perl, but that is because I have to reformat each rule independently later. Regardless, unless I'm misunderstanding something, the two scripts are very similar: each walks down the list of regex definitions until it finds a match, then moves on to the next incoming string.

So why are the results so different? For 100 rules (the same in both scripts) and 10,000 inputs, the C++ program averages around 0.155 seconds and the Perl script around 0.028 seconds. Edit: with compiler optimization enabled, the C++ program runs in roughly 0.091 seconds, which is still slower.

Any insight is appreciated.

Did you compile with optimizations? Are you running in debug?
– Ivan Rubinson, Commented Jun 28, 2016 at 19:12
Note that the second pattern in the Perl version isn't anchored at the end. Also, I don't use Boost, but if I remember correctly the default mode uses the ECMA regex engine. Did you try the PCRE regex engine (which has more optimization features)?
– Casimir et Hippolyte, Commented Jun 28, 2016 at 19:12
@Yayahii You need to build your C++ application with optimizations turned on, meaning -O2, -O3, etc. The results you're seeing are meaningless if you're timing an unoptimized build. In addition, giving us results from an optimized build ensures we're not wasting our time trying to solve an issue when there is no issue.
– PaulMcKenzie, Commented Jun 28, 2016 at 19:41
@Yayahii The -o option is not for optimizations. It names the output file. Right now you're compiling with the default -O0, which means no optimizations, so the timings you're showing us and your claim that Perl is faster than Boost are still in doubt. Please see this. Please specify one of the options I mentioned (-O2, -O3, etc.)
– PaulMcKenzie, Commented Jun 28, 2016 at 21:34
I don't know a thing about Boost, but your Perl regexps start with ^ while your Boost ones don't.
– oals, Commented Jun 29, 2016 at 9:25

T33C · Accepted Answer · 2016-06-29 18:34:09Z

3

In addition to turning on compiler optimisations, try the boost::regex_constants::optimize flag, which asks the regex library to favour matching speed over the time it takes to construct the regex object.

static void populateArray(){
  regexArray[1] = boost::regex("\\/stuff\\/brands-([^\\/]*)\\/(.*)?", boost::regex_constants::optimize);
  //continues on, 100 definitions.
}

Also, be sure to pass the string to getRule by const reference rather than by value. Passing by value copies the string on every call, which can mean a heap allocation per input.
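A minimal sketch of the by-reference signature, not the asker's actual code. It uses std::regex as a stand-in for boost::regex so that it compiles without Boost (an assumption; Boost offers the analogous boost::regex_match and boost::regex_constants::optimize). The helper buildRules, the vector of rules, and the integer return value are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: compile every rule once, before the timed loop.
// std::regex stands in for boost::regex here (assumption).
static std::vector<std::regex> buildRules() {
    // optimize hints that matching speed matters more than construction speed.
    const auto flags = std::regex::ECMAScript | std::regex::optimize;
    std::vector<std::regex> rules;
    rules.emplace_back("/stuff/brands-([^/]*)/(.*)?", flags);
    rules.emplace_back("/coupons(/.*)?", flags);
    rules.emplace_back("/han/(.+)", flags);
    return rules;
}

// The input is taken by const reference, so no std::string is copied per call.
// Returns the index of the first rule that matches the whole input, or -1.
static int getRule(const std::string& inQuery,
                   const std::vector<std::regex>& rules) {
    for (std::size_t i = 0; i < rules.size(); ++i) {
        if (std::regex_match(inQuery, rules[i])) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

Returning the index instead of void is a design choice made only so the result can be checked; the original getRule simply breaks out of the loop on the first match.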

If you can make sure the compiler inlines the function, that would be best.

Also, as oals commented above, you have not used the beginning and end of line anchors (^...$) in the C++ regexes as you have in the Perl ones.
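One caveat on the anchors, offered as a sketch rather than a verdict on the original benchmark: regex_match only succeeds when the pattern matches the entire input, so it already behaves as if the pattern were anchored at both ends, whereas regex_search accepts a match anywhere in the string. The example below uses std::regex as a stand-in for boost::regex (an assumption, made so it compiles without Boost); boost::regex_match and boost::regex_search draw the same whole-string versus substring distinction. The function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Whole-string match: behaves like ^...$ even when the pattern has no anchors.
static bool matchesWhole(const std::string& s, const std::regex& re) {
    return std::regex_match(s, re);
}

// Substring match: needs explicit ^ and $ in the pattern to be roughly
// equivalent to the Perl m/^...$/ tests.
static bool matchesAnywhere(const std::string& s, const std::regex& re) {
    return std::regex_search(s, re);
}
```

With the pattern /han/(.+), matchesWhole rejects "/x/han/solo" while matchesAnywhere accepts it, so the anchors only change the outcome if the C++ code is switched to regex_search.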

edited Jun 29, 2016 at 18:34

answered Jun 28, 2016 at 19:25


1 Comment

Yayahii Over a year ago

This could honestly be part of it, though I am not sure it would account for all of the difference on its own. Thank you, I will look into it.
