1

The code is actually in Scala (Spark/Scala) but the library scala.util.matching.Regex, as per the documentation, delegates to java.util.regex.

The code, essentially, reads a bunch of regex from a config file and then matches them against logs fed to the Spark/Scala app. Everything worked fine until I added a regex to extract strings separated by tabs where the tab has been flattened to "#011" (by rsyslog). Since the strings can have white-spaces, my regex looks like: (.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)

The moment I add this regex to the list, the app takes forever to finish processing logs. To give you an idea of the magnitude of delay, a typical batch of a million lines takes less than 5 seconds to match/extract on my Spark cluster. If I add the expression above, a batch takes an hour!

In my code, I have tried a couple of ways to match regex:

  1. if ( (regex findFirstIn log).nonEmpty ) { do something }

  2. val allGroups = regex.findAllIn(log).matchData.toList if (allGroups.nonEmpty) { do something }

  3. if (regex.pattern.matcher(log).matches()){do something}

All three suffer from poor performance when the regex mentioned above it added to the list of regex. Any suggestions to improve regex performance or change the regex itself?

The Q/A that's marked as duplicate has a link that I find hard to follow. It might be easier to follow the text if the referenced software, regexbuddy, was free or at least worked on Mac.

I tried negative lookahead but I can't figure out how to negate a string. Instead of /(.+?)#011/, something like /([^#011]+)/ but that just says negate "#" or "0" or "1". How do I negate "#011"? Even after that, I am not sure if negation will fix my performance issue.

8
  • Why do you repeat #011(.+?) multiple times ? Commented May 15, 2015 at 18:23
  • AFAIK the regex is slow not because of the lazy/reluctant quantifiers, but because the underlying backtracking implementation is inherently slow. The real long-term fix is to change Java's engine to an NFA automaton matcher. In the meantime, for you as a workaround to the regex woes, how about using String.split() on "#011" and doing processing based on that? Commented May 15, 2015 at 19:35
  • For reference: swtch.com/~rsc/regexp/regexp1.html (exponential vs. polynomial time for regexes with many quantifiers) Commented May 15, 2015 at 19:37
  • @NayukiMinase Can't use split() because the code isn't specific to a specific log/string. The code/app takes a list of regex and for every log line fed to the app, tries to match a regex from the list. One approach I am considering is writing a regex function is C and calling it from Scala. Commented May 15, 2015 at 19:46
  • 1
    I reopened this question because I don't think non-greediness is necessarily the problem. You say you're applying the regex to one line at a time; why aren't you using anchors (^ and $)? Also, are there any other # symbols besides the ones in the flattened TABs? Maybe you could use ([^#]++) instead of (.+?). Commented May 15, 2015 at 19:53

1 Answer 1

2

The simplest way would be to split on #011. If you want a regex, you can indeed negate the string, but that's complicated. I'd go for an atomic group

(?>(.+?)#011)

Once matched, there's no more backtracking. Done and looking forward for the next group.

Negating a string

The complement of #011 is anything not starting with a #, or starting with a # and not followed by a 0, or starting with the two and not followed... you know. I added some blanks for readability:

 ((?: [^#] | #[^0] | #0[^1] | #01[^1] )+) #011

Pretty terrible, isn't it? Unlike your original expression it matches newlines (you weren't specific about them).

An alternative is to use negative lookahead: (?!#011) matches iff the following chars are not #011, but doesn't eat anything, so we use a . to eat a single char:

 ((?: (?!#011). )+)#011

It's all pretty complicated and most probably less performant than simply using the atomic group.

Optimizations

Out of my above regexes, the first one is best. However, as Casimir et Hippolyte wrote, there's a room for improvements (factor 1.8)

( [^#]*+ (?: #(?!011) [^#]* )*+ ) #011

It's not as complicated as it looks. First match any number (including zero) of non-# atomically (the trailing +). Then match a # not followed by 011 and again any number of non-#. Repeat the last sentence any number of times.

A small problem with it is that it matches an empty sequence as well and I can't see an easy way to fix it.

Sign up to request clarification or add additional context in comments.

8 Comments

The regex for atomic group doesn't seem to work. Try to match/extract this string "(empty)#011".
@JoeNate Thanks, fixed. The problem was '<' (lookbehind-like nonsense) instead of `>'.
You are a life saver, thanks a bunch :) I was trying "^". Now the performance landmine is gone! :) Post a "gift a beer" link, if you have one :D
When you use an advanced regex engine, there is no need to use subpatterns designed for posix like:(([^#] | #[^0] | #0[^1] | #01[^1] )+) #011 or this inefficient workaround: ((?: (?!#011). )+)#011 (note the number of useless tests with this way). You can use something like this: [^#]*+(?:#(?!011)[^#]*)*+ that is from far more performant. However the first way can be useful and is good to know for mysql, sed or grep.
@CasimiretHippolyte I can't see the number of useless tests, can you elaborate? What I see is "if not looking at a hash sign, then eat a char, else do something more complicated", which should be pretty fast. +++ I'm also lost concerning the far more performant solution as I'd go for my very first regex instead (my reason for the negation section was actually assuming I can't use atomic groups).
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.