8

I am new to C++ and have to process a text file. I decided to do this with a regex. The regex I came up with:

(([^\\s^=]+)\\s*=\\s*)?\"?([^\"^\\s^;]+)\"?\\s*;[!?](\\w+)\\s*

I have written my C++ code according to the following Post:

c++ regex extract all substrings using regex_search()

Here is the C++ Code:

#include "pch.h"
#include <algorithm>   // std::for_each
#include <iostream>
#include <fstream>
#include <string>
#include <regex>
#include <chrono>
#include <iterator>

void print(std::smatch match)
{
}

int main()
{
    std::ifstream file{ "D:\\File.txt" };
    std::string fileData{};

    file.seekg(0, std::ios::end);
    fileData.reserve(file.tellg());
    file.seekg(0, std::ios::beg);

    fileData.assign(std::istreambuf_iterator<char>(file), 
    std::istreambuf_iterator<char>());

    static const std::string pattern{
        "(([^\\s^=]+)\\s*=\\s*)?\"?([^\"^\\s^;]+)\"?\\s*;[!?](\\w+)\\s*" };
    std::regex reg{ pattern };
    std::sregex_iterator iter(fileData.begin(), fileData.end(), reg);
    std::sregex_iterator end;

    const auto before = std::chrono::high_resolution_clock::now();

    std::for_each(iter, end, print);

    const auto after = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double, std::milli> delta = after - before;
    std::cout << delta.count() << "ms\n";

    file.close();
}

The file I am processing contains 541 lines. The program above needs 5 SECONDS to find all 507 matches. I have done things like this before in C# and never had a regex this slow, so I tried the same thing in C#:

var filedata = File.ReadAllText("D:\\File.txt", Encoding.Default);

const string regexPattern = 
    "(([^\\s^=]+)\\s*=\\s*)?\"?([^\"^\\s^;]+)\"?\\s*;[!?](\\w+)\\s*";

var regex = new Regex(regexPattern,
    RegexOptions.Multiline | RegexOptions.Compiled);
var matches = regex.Matches(filedata);

foreach (Match match in matches)
{
    Console.WriteLine(match.Value);
}

This needs only 500 MILLISECONDS to find all 507 matches and print them to the console. Since I have to work with C++, I need it to be faster.

How can I make my C++ program faster? What am I doing wrong?

Comments

  • How did you compile your program? Did you use -O2 or -O3? What is your compiler version? Commented Oct 8, 2018 at 7:22
  • You should also take a look at raw string literals (#6), so you don't have to escape your regex (which is very error-prone, in my opinion). Commented Oct 8, 2018 at 7:24
  • Please confirm that you did not benchmark a debug executable. They can be orders of magnitude slower, especially with MSVC and iterators. Commented Oct 8, 2018 at 7:42
  • You are using the Debug build. At the top of the screen there should be a drop-down called "Debug"; click on it and change it to "Release". Commented Oct 8, 2018 at 7:46
  • @MasterR8, even leaving compiler optimizations out, debug builds often have extra checks that help to catch bugs. For example, Microsoft's iterators constantly check that they are valid and dereferenceable, even where a plain pointer would otherwise suffice as an iterator. It's a lot slower, but it's a lot better at telling you when you're doing something wrong instead of letting the issue go unnoticed for an arbitrary period of time. Commented Oct 8, 2018 at 7:53
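
To make the raw string literal suggestion from the comments concrete, here is a minimal sketch. It is not the asker's code: `count_matches` is a hypothetical helper that only counts matches of the pattern from the question. The pattern is unchanged, just written without the extra escaping.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper (not from the question): counts how often the
// question's pattern matches in `text`.
std::size_t count_matches(const std::string& text)
{
    // Raw string literal: backslashes and quotes are written as-is.
    // The custom delimiter `re` is needed because the pattern itself
    // contains the sequence )" which would end a plain R"(...)" literal.
    // The regex is static, so it is compiled only once.
    static const std::regex reg{
        R"re((([^\s^=]+)\s*=\s*)?"?([^"^\s^;]+)"?\s*;[!?](\w+)\s*)re" };

    const std::sregex_iterator first(text.begin(), text.end(), reg);
    const std::sregex_iterator last;
    return static_cast<std::size_t>(std::distance(first, last));
}
```

As the other comments point out, any timing of code like this should be taken from a Release build (or with -O2/-O3), not from a Debug build.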

2 Answers

3

I just ran into the same problem and ended up replacing std::regex with boost::regex. You may also want to try another regex library (Boost, Google RE2, ...).

Update: I am using GCC 5.4.


3

I mentioned C-style strings, but I didn't mean that using pure C-style strings wherever possible would improve regex performance. What I meant is simply that we should be able to do more work at compile time and fewer operations on std::string at runtime, so that we get better performance.

Original Answer

As you know, we have two types of string: C-style strings and std::string.

std::string is fine to use while coding, but it is a fairly heavyweight type (which is why some people dislike it). First, std::string uses heap memory, meaning it calls new or malloc. Second, std::string follows a growth strategy as its size increases (allocate a larger buffer, often double the current size, move the current content into the new memory and free the old). Both can cause performance issues.

Regular expressions are obviously all about strings and need to work with strings everywhere. std::regex does a lot of work with std::string, which is why its performance is not good.

Furthermore, std::string is entirely a runtime facility. For example, you can initialize a C-style string at compile time, but you can't initialize a std::string at compile time. This means that code involving std::string can rarely be optimized at compile time. Here is an example of how compile-time evaluation can be used to get very good runtime performance: Why is initialization of variable with constexpr evaluated at runtime instead of at compile time
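
As a small illustration of the compile-time point, here is a sketch in which the length of a C-style string literal is computed by the compiler. The names `literal_length` and `kPattern` are made up for this example, and it assumes C++14 or later. It does not show anything about how std::regex itself is implemented.

```cpp
#include <cstddef>

// Hypothetical helper: length of a null-terminated string, usable in
// constant expressions (loops in constexpr functions need C++14 or later).
constexpr std::size_t literal_length(const char* s)
{
    std::size_t n = 0;
    while (s[n] != '\0') {
        ++n;
    }
    return n;
}

// A C-style string literal can be a compile-time constant.
constexpr const char* kPattern = "\\w+";

// Evaluated by the compiler; no work is left for runtime.
constexpr std::size_t kPatternLength = literal_length(kPattern);
static_assert(kPatternLength == 3, "pattern is backslash, w, plus");
```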

std::regex can't do much at compile time because it uses std::string, which can cause performance issues too. This is why people may prefer CTRE (compile-time regular expressions) libraries.

If std::string_view could be used inside the implementation of std::regex, I think the performance would be better.
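
On the caller's side, a std::string_view can already be searched without first copying it into a std::string, because the regex iterators also work on plain `const char*` ranges. The following is a minimal sketch assuming C++17; `count_words` and the `\w+` pattern are made up for illustration, and it changes nothing about what std::regex does internally.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string_view>

// Hypothetical helper: counts \w+ tokens in `text` without copying the
// text into a std::string first.
std::size_t count_words(std::string_view text)
{
    // Compiled once on first call.
    static const std::regex word{ R"(\w+)" };

    // std::cregex_iterator walks a [const char*, const char*) range, so
    // the buffer behind the view is searched in place.
    const std::cregex_iterator first(text.data(), text.data() + text.size(), word);
    const std::cregex_iterator last;
    return static_cast<std::size_t>(std::distance(first, last));
}
```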

The C++ promise is "don't pay for what you don't use." No offense to anyone, but I personally think that std::regex broke that promise. If the C++ standards committee believes that functionality is more important than anything else, I would ask: why not use Java?

4 Comments

No idea why this was downvoted. boost::regex uses far fewer calls to new and far less std::string, which is why it performs much better. Use a C++ profiler to verify what I said.
C-style strings are missing something rather important: their length. Certainly allocations are expensive, and algorithms that create and destroy std::string objects without a thought could easily end up doing a lot of allocation, but creating and destroying C-style strings will certainly involve a lot of allocation too, and malloc is not cheaper than new. There's no point in using footguns like C-style strings for this. Instead, postpone creating strings until the appropriate moment.
... or maybe just hoist the actual reason for using C string literals to the top of your answer: it does make sense that the compiler would have an easier time evaluating constexpr code involving them at compile time, and strlen is obviously quite speedy then, too.
@SamB I know what you meant. I just added a postscript to this answer. I'm not an expert on C++ and haven't read the source code of the C++ standard library, so I can't make it any clearer. What we already know is that there must be ways to improve the performance of std::regex, because boost::regex has been around for a long time and performs very well.
