I'm confused about what is happening here. Most benchmarks I have seen show Boost performing close to Perl, or even beating it. In my tests, however, the Perl implementation is 5 to 6 times faster.

Both test_script.cpp and test_script.pl open a file and read it line by line into an array. Each string is then checked against a list of regex definitions in linear order until one matches. On a match nothing happens (I/O was removed for testing purposes) and the next string is checked, and so on until all strings have been compared.

Test_script.pl:

#make incomingList, which contains all incoming strings
my $start = Time::HiRes::gettimeofday();

foreach (@incomingList) {
  my $inString = $_;
  &find_pattern($inString);
}

my $end = Time::HiRes::gettimeofday();
printf("%.6f\n", $end - $start);

Find_pattern method:

sub find_pattern {
  my $URLString = $_[0];

  #1 rewrite
  if($URLString =~ m/^\/stuff\/brands-([^\/]*)\/(.*)?$/) {

  }
  #2 rewrite
  elsif($URLString =~ m/^\/coupons(\/.*)?$/){

  }
  #3 rewrite
  elsif($URLString =~ m/^\/han\/(.+)$/){

  }
  # ...continues on, there are 100 patterns. 
}

Test_script.cpp: Main method:

populateArray();
//make stringArr, which contains all incoming strings
struct timeval time;
gettimeofday(&time, NULL);
double t1=time.tv_sec+(time.tv_usec/1000000.0);   

for(int j =0; j < 10000; j++){
  getRule(stringArr[j]);
 }

gettimeofday(&time, NULL);
double t2=time.tv_sec+(time.tv_usec/1000000.0);
printf("%.6lf seconds elapsed\n", t2-t1);

populate array method:

static void populateArray(){
regexArray[1] =  boost::regex ("\\/stuff\\/brands-([^\\/]*)\\/(.*)?");
regexArray[2] =  boost::regex ("\\/coupons(\\/.*)?");
regexArray[3] =  boost::regex ("\\/han\\/(.+)"); 
//continues on, 100 definitions. 
}

getRule method:

static void getRule(string inQuery){
  for(int i =1; i < 100; i++){
    if(boost::regex_match(inQuery, regexArray[i])){
      break;
    }
  }
}

I understand that a linear chain of if/elsif checks in Perl might seem a little odd, but I need it because each rule has to be reformatted independently later. Regardless, unless I'm misunderstanding something, the two scripts are pretty similar: each walks down the list of regex definitions until it finds a match, then moves on to the next incoming string.

So why are the results so different? For 100 rules (the same in both scripts) and 10,000 inputs, the .cpp averages around 0.155 seconds and the .pl averages around 0.028 seconds. Edit: with compiler optimization enabled, the C++ version runs in roughly 0.091 seconds, which is still slower.

Any insight is appreciated.

  • Did you compile with optimizations? Are you running in debug? Commented Jun 28, 2016 at 19:12
  • Note that the second pattern in the Perl version isn't anchored at the end. Also, I don't use Boost, but if I remember correctly the default mode uses the ECMA regex engine. Did you try the PCRE regex engine (which has more optimization features)? Commented Jun 28, 2016 at 19:12
  • @Yayahii You need to build your C++ application with optimizations turned on, meaning -O2, -O3, etc. The results you're seeing are meaningless if you're timing an unoptimized build. In addition, giving us results from an optimized build ensures we're not wasting our time trying to solve an issue when there is no issue. Commented Jun 28, 2016 at 19:41
  • @Yayahii The -o option is not for optimizations; it names the output file. Right now you're compiling with the default -O0, which means no optimizations, so the timings you're showing us and your claim that Perl is faster than Boost are still uncertain. Please see this. Please specify one of the options I mentioned (-O2, -O3, etc.). Commented Jun 28, 2016 at 21:34
  • I don't know a thing about Boost, but your Perl regexps start with ^ while your Boost ones don't. Commented Jun 29, 2016 at 9:25

1 Answer

In addition to turning on compiler optimisations, try the boost::regex_constants::optimize option, which hints to the regex library that it should favour matching speed over construction time.

static void populateArray(){
regexArray[1] =  boost::regex ("\\/stuff\\/brands-([^\\/]*)\\/(.*)?", boost::regex_constants::optimize);
//continues on, 102 definitions. 
}
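
A minimal sketch of the same idea, assuming std::regex as a stand-in for boost::regex so that it compiles without Boost (std::regex has an equivalent optimize flag). The helper names buildRules and firstMatch are hypothetical and not from the question, and how much the flag helps depends on the library implementation, so measure before relying on it.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: compile every pattern exactly once, up front,
// with the optimize hint set. std::regex stands in for boost::regex
// (an assumption made so the sketch builds without Boost).
std::vector<std::regex> buildRules() {
    const std::regex::flag_type flags =
        std::regex::ECMAScript | std::regex::optimize;
    std::vector<std::regex> rules;
    rules.emplace_back("/stuff/brands-([^/]*)/(.*)?", flags);
    rules.emplace_back("/coupons(/.*)?", flags);
    rules.emplace_back("/han/(.+)", flags);
    return rules;
}

// Returns the index of the first rule that matches the whole input,
// or -1 if no rule matches. The input is taken by const reference.
int firstMatch(const std::vector<std::regex>& rules, const std::string& input) {
    for (std::size_t i = 0; i < rules.size(); ++i) {
        if (std::regex_match(input, rules[i])) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

Build the rule list once before the timed loop and call firstMatch for each incoming string, so that regex construction never lands inside the measurement.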

Also, be sure to pass the string to getRule by const reference rather than by value. Passing by value copies the string on every call, and that copy may incur a heap allocation.
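
To illustrate the difference, here is a small sketch with two hypothetical probe functions (not part of the question's code). Each reports whether its parameter is the caller's own string object: the by-value version always receives a copy, while the const-reference version aliases the original.

```cpp
#include <string>

// Hypothetical probe: s is a fresh copy of the caller's string, so its
// address always differs from the caller's object. For strings longer
// than the small-string buffer, making this copy typically allocates.
bool sameObjectByValue(std::string s, const std::string* caller) {
    return &s == caller;
}

// Hypothetical probe: s refers to the caller's string directly,
// so no copy (and no allocation) takes place.
bool sameObjectByRef(const std::string& s, const std::string* caller) {
    return &s == caller;
}
```

Applied to the question's code, this would mean declaring the function as getRule(const string& inQuery) instead of getRule(string inQuery).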

If you can make sure the compiler inlines the function, that would be best.

Also, as Oals commented above, the C++ regexes lack the beginning- and end-of-line anchors (^...$) that the Perl ones have.
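
Whether the missing anchors change the result depends on which matching function is called. regex_match succeeds only when the pattern matches the entire input, so it behaves as if the pattern were anchored, whereas regex_search accepts a match anywhere in the string. The sketch below shows this with std::regex, used here as a stand-in for boost::regex (an assumption so it builds without Boost). The function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Whole-string matching: the pattern must consume the entire input,
// so explicit ^ and $ anchors are redundant here.
bool matchesWhole(const std::string& s, const std::regex& re) {
    return std::regex_match(s, re);
}

// Substring matching: the pattern may match anywhere in the input,
// so ^ and $ are needed to pin it to the start and end.
bool matchesAnywhere(const std::string& s, const std::regex& re) {
    return std::regex_search(s, re);
}
```

Since the question's getRule uses regex_match, adding anchors there should not change which strings match, though it may still be worth timing both forms.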

1 Comment

This could honestly be part of it, though I am not sure it alone would account for all of the difference. Thank you, I will look into it.
