8

I have code like the following:

#include <regex>

int main()
{
   char buf[35000] = {};
   auto begin = std::cbegin(buf), end = std::cend(buf);

   std::cmatch groups;
   std::regex::flag_type flags = std::regex_constants::ECMAScript;
   std::regex re(R"REGEX(^[\x02-\x7f]\0..[\x01-\x0c]\0..\0\0)REGEX", flags);
   std::regex_search(begin, end, groups, re);
}

… and noticed that it performed suspiciously slowly.

Investigating, I plugged in different values for that buffer size, and found that the search gets slower as the buffer expands:

quick-bench.com graph showing the behaviour when unmatching, with various input sizes

(small=100, large=35000, huge=100000; "unanchored" is with ^ omitted from the pattern)

The results are as I'd expect if I provide an input that actually matches the pattern:

quick-bench.com graph showing the behaviour when matching, with various input sizes

std::regex is not in multiline mode by default (right?), so I'd have thought that anchor (^) would have prevented such shenanigans. Pattern not found at the start of the search string? Return no match, job done.

Seems to happen with both libc++ and libstdc++.

What am I missing about this? What's causing it?

Obvious workarounds include giving std::regex_search just a prefix of the buffer (a prefix large enough to encompass all possible matches but no more than necessary), or just examining the string in some other way. But I'm curious.


FWIW, with a near-equivalent JavaScript testcase, Safari 12.0 on OSX works as I'd expect, with only a very small variation between the three searches (which I'm attributing to random factors) and no obvious pattern:

jsperf.com graph showing that JavaScript does what I'd expect


For the avoidance of doubt, I retested with a regex as simple as ^what and got the same results. Might update the above examples later for coherence if I can work up the motivation. :)

20
  • What the regex is supposed to match and what are those REGEX literals? Commented Jan 17, 2019 at 14:11
  • 1
    @revo: R"REGEX(...)REGEX" is a raw string literal with [arbitrarily chosen] delimiter "REGEX" - it allows me to write this regex without double-backslashing everything. Could also have been "^[\\x02-\\x7f]\\0..[\\x01-\\x0c]\\0..\\0\\0". Read more here Commented Jan 17, 2019 at 14:12
  • I'm not sure there is any shortcut when the inner machine state is all null. You could have an "or" without a ^ that would impede implementing any simple shorcut like that. Commented Jan 17, 2019 at 14:13
  • 2
    Interestingly visual studio 2017 does better, large is 6.5 times slower, huge is 16.9 times slower than small Commented Jan 17, 2019 at 15:00
  • 2
    plus one for effort alone, those are some great graphs (also nice mvce) Commented Jan 17, 2019 at 15:31

2 Answers 2

2

It's simply because libstdc++ and libc++ do not implement such optimization.

The following is the main part of libstdc++'s implementation of regex_search:

template<typename _BiIter, typename _Alloc, typename _TraitsT,
     bool __dfs_mode>
  bool _Executor<_BiIter, _Alloc, _TraitsT, __dfs_mode>::
  _M_search()
  {
    if (_M_search_from_first())
      return true;
    if (_M_flags & regex_constants::match_continuous)
      return false;
    _M_flags |= regex_constants::match_prev_avail;
    while (_M_begin != _M_end)
    {
      ++_M_begin;
      if (_M_search_from_first())
        return true;
    }
    return false;
  }

So it does traverse the whole range even if the regular expression contains ^.

So is libc++'s implementation.

Sign up to request clarification or add additional context in comments.

1 Comment

Thanks for your answer! Just to add, I'm not quite ready to classify this as an "optimization", so much as "a sensible way to do this". Is the engine really searching all 35,000 bytes for "a character that's at the start of the string"? That's crazy! Seems like the implementation needs to be able to fail-fast if it's to be fit for purpose. I'm tempted to raise bugs for this but would like people to confirm that I'm not being unreasonable first.
0

I am now considering this to be a quality of implementation issue that affects all three toolchains.

Bugs raised:

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.