Regular Expressions Java, why is this regex so slow?

Question

I just wrote a regular expression in Java to search about 5000 tweets, but each tweet takes almost one second to process. Why is it so slow?

Is the expression too complex, or does it contain something that is too expensive to execute? I had hoped to process the whole data set in less than 5 seconds.

The code is:

import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.io.FileUtils;

public class RegularExpression {
    public static void main(String[] args) throws IOException {
        String filter = ".*\"created_at\":\"(.*?)\".*\"content\":\"(.*?word.*?)\",\"id\".*";
        Pattern pattern = Pattern.compile(filter);
        // One tweet (a JSON object) per line
        List<String> tweets = FileUtils.readLines(new File("/tmp/tweets"));

        System.out.println("Start with " + tweets.size());
        int i = 0;
        for (String t : tweets) {
            Matcher matcher = pattern.matcher(t);
            matcher.find();
            System.out.println(i++);
        }
        System.out.println("End");
    }
}

The input lines are JSON tweets. If I make my regex simpler it runs faster, but I don't think my regex is that heavy. I'd like to understand why this is happening; this was just a quick test.

UPDATE:

The reason I'm using a regex rather than a JSON parser is that, in the end, the input could be simple text, XML, JSON, or a log from any kind of server. So I have to treat my input as plain text.

@Josay, no, it's not the same. Adding a question mark after a quantifier (the asterisk here) makes the quantifier non-greedy, i.e. it will match the shortest sequence instead of the longest.
– Njol, Commented Feb 3, 2014 at 13:48

Tim Pietzcker · Accepted Answer · 2014-02-03 13:48:53Z

2

Your regex is very imprecise in what it allows to match. Most importantly, you seem to be wanting to match text between quotes, but you're allowing quote characters to be part of the match (.* can and will happily match "!). This sets you up for a potentially very high number of permutations a regex engine has to check before declaring failure/success, depending on your input.

If in fact quotes may not be part of the text that you're currently matching with .*, then use [^"]* instead; that should speed it up a lot:

"[^\"]*\"created_at\":\"([^\"]*)\"[^\"]*\"content\":\"([^\"]*word[^\"]*)\",\"id\"[^\"]*"
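
As a rough illustration of how that pattern might be used, here is a minimal sketch. The class name TweetFilter, the extract helper, and the two sample lines are made up for this example; only the field names come from the question. Because find() is used, the leading and trailing [^"]* from the pattern above are left out, which should not change what is captured.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TweetFilter {
    // [^"]* can never cross a quote, so each capture stays inside one field value.
    // (Hypothetical helper class; field names are taken from the question.)
    private static final Pattern TWEET = Pattern.compile(
        "\"created_at\":\"([^\"]*)\"[^\"]*\"content\":\"([^\"]*word[^\"]*)\",\"id\"");

    /** Returns {created_at, content} if the line matches, otherwise empty. */
    public static Optional<String[]> extract(String line) {
        Matcher m = TWEET.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new String[] { m.group(1), m.group(2) });
    }

    public static void main(String[] args) {
        // Made-up sample lines, not real tweets.
        String hit = "{\"created_at\":\"Mon Feb 03\",\"content\":\"a word here\",\"id\":1}";
        String miss = "{\"created_at\":\"Mon Feb 03\",\"content\":\"nothing\",\"id\":2}";

        extract(hit).ifPresent(f -> System.out.println(f[0] + " | " + f[1])); // Mon Feb 03 | a word here
        System.out.println(extract(miss).isPresent());                        // false
    }
}
```

Note that this only matches when "content" is the field directly after "created_at", and it assumes the values contain no escaped quotes.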


2 Comments

Guille Over a year ago

It works much better. I'll have to think a bit about why it creates so many permutations when I use .*

nhahtdh Over a year ago

@Guille: When the input has many "content" entries, then your regex will check every single one of them, while the regex above will only check the nearest one (since disallowing " means disallowing jumping over many "key":"value" entries). Well, there is also this part (.*?)\".*\" where the engine can match on any pairs of double quotes ".

Roland Illig · Accepted Answer · 2014-02-03 13:50:54Z

2

Since you already know that your input is JSON, you should not use regular expressions to interpret it. Use a JSON parser, then you don't have to care about anything like escaping special characters.


ohaal · Accepted Answer · 2014-02-03 13:52:35Z

1

I'm not entirely sure why it takes almost a full second to process a single tweet, but lazy quantifiers are more expensive than a "match anything except" approach to a "match until"-scenario.

More information here: http://blog.stevenlevithan.com/archives/greedy-lazy-performance

You could try avoiding the use of lazy quantifiers, or just use a JSON parser instead, as it would likely be faster/cleaner.
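
Besides speed, the two approaches can also differ in what they match. The following sketch (class name, helper, and sample line are invented for illustration) shows a lazy .*? growing across quotes until the rest of the pattern fits, while [^"]* simply refuses to match:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyVsNegated {
    // "Match until": the lazy quantifier may still expand over quote characters.
    static final Pattern LAZY = Pattern.compile("\"content\":\"(.*?)\",\"id\"");
    // "Match anything except": the capture is confined to a single quoted value.
    static final Pattern NEGATED = Pattern.compile("\"content\":\"([^\"]*)\",\"id\"");

    /** Returns group 1 of the first match, or null if there is no match. */
    public static String firstGroup(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up line where "id" does not directly follow "content".
        String line = "{\"content\":\"hi\",\"lang\":\"en\",\"id\":7}";

        System.out.println(firstGroup(LAZY, line));    // hi","lang":"en
        System.out.println(firstGroup(NEGATED, line)); // null
    }
}
```

The lazy version has to try every intermediate length before it finds that match, which is the kind of extra work the negated class avoids.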

