
I have a large block of text stored in a std::string named 'text'. I am replacing certain patterns in this string with a single space using the Boost regex library. Here is my code:

// Remove times of the form (00:33) and (1:33)
boost::regex rgx("\\([0-9.:]*\\)");
text = boost::regex_replace(text, rgx, " ");

// Remove single word HTML tags
rgx.set_expression("<[a-zA-Z/]*>");
text = boost::regex_replace(text, rgx, " ");

// Remove comments like [pause], [laugh]
rgx.set_expression("\\[[a-zA-Z]* *[a-zA-Z]*\\]");
text = boost::regex_replace(text, rgx, " ");

// Remove comments of the form <...>
rgx.set_expression("<.+?>");
text = boost::regex_replace(text, rgx, " ");

// Remove comments of the form {...}
rgx.set_expression("\\{.+?\\}");
text = boost::regex_replace(text, rgx, " ");

// Remove comments of the form [...]
rgx.set_expression("\\[.+?\\]");
text = boost::regex_replace(text, rgx, " ");

From my understanding, each call to regex_replace creates a new string and writes the output to it. If I call regex_replace with N different patterns, it will allocate N new strings (deleting the old ones).

Since memory allocation is time-consuming, is there a way to perform the replacement 'in-place', without allocating a new string?

2 Answers

regex_replace has more than one overload: the one you're using right now, which returns a new string, and another which takes iterators. With the iterator overload, you can pass the start of the same range you're operating on as the output iterator:

boost::regex_replace(text.begin(), text.begin(), text.end(), rgx, 
                     " ");

5 Comments

Thanks, I think that will do it.
Warning: the result is likely undefined when the formatter replaces the match with a string of a different length. The documentation says nothing about aliased or overlapping input and output ranges.
@sehe So, to be clear, if the replacement string has a different length, this could be a dangerous operation?
@WBuck Like I said, the behaviour is unspecified ("the documentation says nothing about..."). I'd assume that it could crash, clone your in-laws, legally marry your pet. Of course, if you look at the implementation you can work out that it is in fact safe if the replacement has equal length, or maybe even if it is shorter, though that already has a slimmer chance. I'm going to boldly claim that all other cases are pure and simple UB (the documentation would certainly mention it if the implementors had made that effort).
This shouldn't be the accepted answer. @sehe is correct: the string gets corrupted if the replacement is a different length.
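
Given the aliasing concern raised in the comments above, one way to keep the iterator overload while avoiding overlapping ranges is to write into a separate scratch string and swap it with the input afterwards. The following is only a minimal sketch under stated assumptions: it uses std::regex from the standard library instead of boost::regex (the iterator overload has the same shape in both), and the helper name replace_into and the scratch parameter are hypothetical, not part of either library. Reusing one scratch buffer across several patterns tends to avoid most repeated allocations on common implementations, though that is not something the standard guarantees.

```cpp
#include <cassert>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper (not a Boost or standard function).
// Applies one pattern to `text`, writing the result into `scratch`
// and then swapping the two, so input and output never alias.
void replace_into(std::string& text,
                  std::string& scratch,
                  const std::regex& rgx,
                  const std::string& replacement) {
    // The whole point is that the output is a different string.
    assert(&text != &scratch);

    scratch.clear();               // typically keeps the existing capacity
    scratch.reserve(text.size());  // enough when replacements do not grow the text

    // Iterator overload: output iterator first, then the input range.
    std::regex_replace(std::back_inserter(scratch),
                       text.cbegin(), text.cend(),
                       rgx, replacement);

    text.swap(scratch);            // constant-time exchange of the buffers
}
```

Called once per pattern with the same scratch string, this ping-pongs between two buffers instead of creating a fresh string on every call, and it stays correct when the replacement is longer or shorter than the match.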

Since none of your replacements needs to process the output of an earlier replacement step, you can combine all of those patterns into one larger regex using alternation and run it once.

You could even specify a different replacement string for each alternative, but that isn't necessary here.

boost::regex rgx("(\\([0-9.:]*\\))|"
                 "(<[a-zA-Z/]*>)|"
                 "(\\[[a-zA-Z]* *[a-zA-Z]*\\])|"
                 "(<.+?>)|"
                 "(\\{.+?\\})|"
                 "(\\[.+?\\])");
text = boost::regex_replace(text, rgx, " ");
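
For readers without Boost at hand, the same single-pass idea can be sketched with std::regex. This is an illustration under assumptions, not the answer's exact code: std::regex with its default ECMAScript grammar is swapped in for boost::regex, the capture groups are dropped because the replacement never refers to them, and the function name strip_annotations is hypothetical. It still allocates one result string, but only once instead of once per pattern.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: removes all annotation patterns in a single pass.
// Uses std::regex (an assumption; the original answer uses boost::regex).
std::string strip_annotations(const std::string& text) {
    // Compiled once and reused on later calls.
    static const std::regex rgx(
        "\\([0-9.:]*\\)|"                 // times such as (00:33) and (1:33)
        "<[a-zA-Z/]*>|"                   // single-word HTML tags
        "\\[[a-zA-Z]* *[a-zA-Z]*\\]|"     // comments like [pause], [laugh]
        "<.+?>|"                          // comments of the form <...>
        "\\{.+?\\}|"                      // comments of the form {...}
        "\\[.+?\\]");                     // comments of the form [...]

    // Each match, whichever alternative produced it, becomes one space.
    return std::regex_replace(text, rgx, " ");
}
```

A call such as text = strip_annotations(text); would then replace the six separate regex_replace calls from the question.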
