Java, poor regex performance with lazy expressions

Question

The code is actually in Scala (Spark/Scala) but the library scala.util.matching.Regex, as per the documentation, delegates to java.util.regex.

The code, essentially, reads a bunch of regex from a config file and then matches them against logs fed to the Spark/Scala app. Everything worked fine until I added a regex to extract strings separated by tabs where the tab has been flattened to "#011" (by rsyslog). Since the strings can have white-spaces, my regex looks like: (.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)

The moment I add this regex to the list, the app takes forever to finish processing logs. To give you an idea of the magnitude of delay, a typical batch of a million lines takes less than 5 seconds to match/extract on my Spark cluster. If I add the expression above, a batch takes an hour!

In my code, I have tried a couple of ways to match regex:

if ( (regex findFirstIn log).nonEmpty ) { do something }
val allGroups = regex.findAllIn(log).matchData.toList if (allGroups.nonEmpty) { do something }
if (regex.pattern.matcher(log).matches()){do something}

All three suffer from poor performance when the regex mentioned above it added to the list of regex. Any suggestions to improve regex performance or change the regex itself?

The Q/A that's marked as duplicate has a link that I find hard to follow. It might be easier to follow the text if the referenced software, regexbuddy, was free or at least worked on Mac.

I tried negative lookahead but I can't figure out how to negate a string. Instead of /(.+?)#011/, something like /([^#011]+)/ but that just says negate "#" or "0" or "1". How do I negate "#011"? Even after that, I am not sure if negation will fix my performance issue.

AFAIK the regex is slow not because of the lazy/reluctant quantifiers, but because the underlying backtracking implementation is inherently slow. The real long-term fix is to change Java's engine to an NFA automaton matcher. In the meantime, for you as a workaround to the regex woes, how about using String.split() on "#011" and doing processing based on that? — Nayuki
– Nayuki, Commented May 15, 2015 at 19:35
For reference: swtch.com/~rsc/regexp/regexp1.html (exponential vs. polynomial time for regexes with many quantifiers) — Nayuki
– Nayuki, Commented May 15, 2015 at 19:37
@NayukiMinase Can't use split() because the code isn't specific to a specific log/string. The code/app takes a list of regex and for every log line fed to the app, tries to match a regex from the list. One approach I am considering is writing a regex function is C and calling it from Scala. — Joe Nate
– Joe Nate, Commented May 15, 2015 at 19:46
I reopened this question because I don't think non-greediness is necessarily the problem. You say you're applying the regex to one line at a time; why aren't you using anchors (^ and $)? Also, are there any other # symbols besides the ones in the flattened TABs? Maybe you could use ([^#]++) instead of (.+?). — Alan Moore
– Alan Moore, Commented May 15, 2015 at 19:53

Community · Accepted Answer · 2020-06-20 09:12:55Z

2

The simplest way would be to split on #011. If you want a regex, you can indeed negate the string, but that's complicated. I'd go for an atomic group

(?>(.+?)#011)

Once matched, there's no more backtracking. Done and looking forward for the next group.

Negating a string

The complement of #011 is anything not starting with a #, or starting with a # and not followed by a 0, or starting with the two and not followed... you know. I added some blanks for readability:

 ((?: [^#] | #[^0] | #0[^1] | #01[^1] )+) #011

Pretty terrible, isn't it? Unlike your original expression it matches newlines (you weren't specific about them).

An alternative is to use negative lookahead: (?!#011) matches iff the following chars are not #011, but doesn't eat anything, so we use a . to eat a single char:

 ((?: (?!#011). )+)#011

It's all pretty complicated and most probably less performant than simply using the atomic group.

Optimizations

Out of my above regexes, the first one is best. However, as Casimir et Hippolyte wrote, there's a room for improvements (factor 1.8)

( [^#]*+ (?: #(?!011) [^#]* )*+ ) #011

It's not as complicated as it looks. First match any number (including zero) of non-# atomically (the trailing +). Then match a # not followed by 011 and again any number of non-#. Repeat the last sentence any number of times.

A small problem with it is that it matches an empty sequence as well and I can't see an easy way to fix it.

edited Jun 20, 2020 at 9:12

CommunityBot

11 silver badge

answered May 15, 2015 at 19:36

maaartinus

46.8k40 gold badges176 silver badges343 bronze badges

Sign up to request clarification or add additional context in comments.

8 Comments

Joe Nate Over a year ago

The regex for atomic group doesn't seem to work. Try to match/extract this string "(empty)#011".

maaartinus Over a year ago

@JoeNate Thanks, fixed. The problem was '<' (lookbehind-like nonsense) instead of `>'.

Joe Nate Over a year ago

You are a life saver, thanks a bunch :) I was trying "^". Now the performance landmine is gone! :) Post a "gift a beer" link, if you have one :D

Casimir et Hippolyte Over a year ago

When you use an advanced regex engine, there is no need to use subpatterns designed for posix like:(([^#] | #[^0] | #0[^1] | #01[^1] )+) #011 or this inefficient workaround: ((?: (?!#011). )+)#011 (note the number of useless tests with this way). You can use something like this: [^#]*+(?:#(?!011)[^#]*)*+ that is from far more performant. However the first way can be useful and is good to know for mysql, sed or grep.

maaartinus Over a year ago

@CasimiretHippolyte I can't see the number of useless tests, can you elaborate? What I see is "if not looking at a hash sign, then eat a char, else do something more complicated", which should be pretty fast. +++ I'm also lost concerning the far more performant solution as I'd go for my very first regex instead (my reason for the negation section was actually assuming I can't use atomic groups).

|

Collectives™ on Stack Overflow

Java, poor regex performance with lazy expressions

1 Answer 1

Negating a string

Optimizations

8 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Negating a string

Optimizations

8 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related