In-place regex replacement using Boost

Question

I have a large block of text stored in a std::string named 'text'. I am replacing certain patterns in this string with a single space using the Boost.Regex library. Here is my code:

// Remove times of the form (00:33) and (1:33)
boost::regex rgx("\\([0-9.:]*\\)");
text = boost::regex_replace(text, rgx, " ");

// Remove single word HTML tags
rgx.set_expression("<[a-zA-Z/]*>");
text = boost::regex_replace(text, rgx, " ");

// Remove comments like [pause], [laugh]
rgx.set_expression("\\[[a-zA-Z]* *[a-zA-Z]*\\]");
text = boost::regex_replace(text, rgx, " ");

// Remove comments of the form <...>
rgx.set_expression("<.+?>");
text = boost::regex_replace(text, rgx, " ");

// Remove comments of the form {...}
rgx.set_expression("\\{.+?\\}");
text = boost::regex_replace(text, rgx, " ");

// Remove comments of the form [...]
rgx.set_expression("\\[.+?\\]");
text = boost::regex_replace(text, rgx, " ");

From my understanding, each time I run the regex_replace function, it creates a new string and writes the output to it. If I run regex_replace with N different patterns, it will allocate N new strings (deleting the old ones).

Since memory allocation is time consuming, is there a way to perform the replacement 'in-place', without allocating a new string?

user3920237 · Accepted Answer · 2015-02-14 04:14:09Z

1

regex_replace has two overloads: the one you're using right now, and another that takes iterators. You can specify the output iterator to be the same range you're operating on.

boost::regex_replace(text.begin(), text.begin(), text.end(), rgx, 
                     " ");

5 Comments

Nitish Satyavolu Over a year ago

Thanks, I think that will do it.

sehe Over a year ago

Warning: the result is likely undefined when the formatter replaces the match with a different length string. (!!!) (The documentation says nothing about aliased/overlapping input/output ranges)

WBuck Over a year ago

@sehe So, to be clear, if the string replacement is of a different length this could potentially be a dangerous operation?

sehe Over a year ago

@WBuck Like I said, the behaviour is unspecified ("the documentation says nothing about..."). I'd assume that it could crash, clone your in-laws, legally marry your pet. Of course if you look at the implementation you can work out that it is in fact safe if the replacement has equal length, or maybe even if it is shorter, which has a slimmer chance already. I'm going to boldly claim that all other cases are pure and simple UB (the documentation would certainly mention it if the implementors made that effort).

Steve Lorimer Oct 3 at 7:58

This shouldn't be the accepted answer. @sehe is correct: the string gets corrupted if the replacement is a different length.
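
A minimal sketch of a variant that sidesteps the aliasing concern raised in the comments above: the iterator overload writes into a second, pre-reserved string through std::back_inserter, and the two buffers are swapped after each pass, so the input and output ranges never overlap. It uses std::regex from the standard library rather than Boost so that it compiles without extra dependencies; boost::regex_replace offers an iterator overload of the same shape, so substituting boost:: for std:: should work, but that substitution is an assumption and is not tested here. The names replace_into and replace_all_patterns are hypothetical helpers, not part of either library.

```cpp
#include <cassert>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// One pass: writes `in`, with every match of `rgx` replaced, into `out`.
// `in` and `out` must be different strings so the ranges never alias.
void replace_into(const std::string& in, std::string& out,
                  const std::regex& rgx, const std::string& replacement) {
    assert(&in != &out && "input and output must be distinct buffers");
    out.clear();  // resets the length; implementations normally keep the capacity
    std::regex_replace(std::back_inserter(out),
                       in.begin(), in.end(), rgx, replacement);
}

// Applies each pattern in turn, reusing the same two buffers for every pass.
void replace_all_patterns(std::string& text,
                          const std::vector<std::regex>& patterns,
                          const std::string& replacement = " ") {
    std::string scratch;
    scratch.reserve(text.size());  // single up-front allocation for the scratch buffer
    for (const std::regex& rgx : patterns) {
        replace_into(text, scratch, rgx, replacement);
        text.swap(scratch);  // constant-time exchange of the buffers, no copy
    }
}
```

Because every pattern in the question matches at least two characters and the replacement is a single space, each pass produces output no longer than its input, so the reserved scratch buffer should not need to grow. Whether this is measurably faster than the string-returning overload depends on the implementation and is worth profiling.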

Felix Dombek · Answer · 2017-03-28 18:17:15Z

0

As none of your regex replacements processes the output of a previous replacement step, you can just combine all of those regexes into one larger regex and run that once.

You could even specify different replacement strings for each regex part, but that isn't necessary here.

boost::regex rgx("(\\([0-9.:]*\\))|"
                 "(<[a-zA-Z/]*>)|"
                 "(\\[[a-zA-Z]* *[a-zA-Z]*\\])|"
                 "(<.+?>)|"
                 "(\\{.+?\\})|"
                 "(\\[.+?\\])");
text = boost::regex_replace(text, rgx, " ");
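
The remark about using a different replacement string for each regex part can be sketched as follows. This is a hedged illustration rather than the answer's own method: it uses std::regex from the standard library so that it compiles without Boost, walks the matches with std::sregex_iterator, and picks the replacement according to which capture group took part in the match. The helper name replace_by_group and the placeholder strings in the usage note are hypothetical. It assumes the pattern consists only of top-level alternatives, each wrapped in exactly one capture group, as in the combined regex above.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Replaces each match of `rgx` with replacements[g - 1], where g is the
// first capture group that participated in that match.
std::string replace_by_group(const std::string& text, const std::regex& rgx,
                             const std::vector<std::string>& replacements) {
    // One replacement string per capture group in the pattern.
    assert(replacements.size() == rgx.mark_count());

    std::string out;
    out.reserve(text.size());
    auto copied_up_to = text.begin();  // end of the text already copied to `out`

    for (std::sregex_iterator it(text.begin(), text.end(), rgx), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        out.append(copied_up_to, m[0].first);  // unmatched text is copied verbatim
        for (std::size_t g = 1; g < m.size(); ++g) {
            if (m[g].matched) {
                out += replacements[g - 1];
                break;
            }
        }
        copied_up_to = m[0].second;
    }
    out.append(copied_up_to, text.end());  // tail after the last match
    return out;
}
```

For example, with the pattern "(\\([0-9.:]*\\))|(\\[.+?\\])" and the replacements {"<time>", "<note>"}, timestamps and bracketed comments are rewritten to different placeholders in a single pass. Boost.Regex also documents a conditional format-string syntax, enabled with the format_all flag, that may be what the answer has in mind; check the Boost documentation before relying on it.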
