1

I just created an regular expression in Java, I want to look for expressions in about 5000 tweets, each tweet takes almost one second, why is it so slow??

If it's too complex that expression or there're something on it that it's too expensive to execute? I'd hope to process the whole data in less than 5 seconds for sure.

The code is:

public class RegularExpression {
    public static void main(String[] args) throws IOException {                
        String filter = ".*\"created_at\":\"(.*?)\".*\"content\":\"(.*?word.*?)\",\"id\".*";       
        Pattern pattern = Pattern.compile(filter);
        List<String> tweets = FileUtils.readLines(new File("/tmp/tweets"));

        System.out.println("Start with " + tweets.size() );
        int i=0;
        for (String t : tweets){

            Matcher matcher = pattern.matcher(t);                      
            matcher.find();            
            System.out.println(i++);

        }
        System.out.println("End");
    }
}

The input are JSON tweets. If I do my RE simpler it runs faster, but, I think that my RE isn't so heavy. I'd like to understand why this's happenng, I was just checking a test.

UPDATED:

The reason why I'm using RE when I try to parse JSON, it's because in the end, I could get a simple text, and XML, a JSON format, a log from any kind of server. So, I have to work with my input like plain-text.

3
  • 4
    @Josay, no, it's not the same. Adding a question mark after a modifier (the asterisk here) makes the modifier non-greedy, i.e. it will match the shortest sequence instead of the longest. Commented Feb 3, 2014 at 13:48
  • 2
    Why don't you use a JSON parser to parse the JSON first? Commented Feb 3, 2014 at 13:50
  • Use the right tool for the job. Commented Feb 3, 2014 at 14:10

3 Answers 3

2

Your regex is very imprecise in what it allows to match. Most importantly, you seem to be wanting to match text between quotes, but you're allowing quote characters to be part of the match (.* can and will happily match "!). This sets you up for a potentially very high number of permutations a regex engine has to check before declaring failure/success, depending on your input.

If in fact quotes may not be part of the text that you're currently matching with .*, then use [^"]* instead; that should speed it up a lot:

"[^\"]*\"created_at\":\"([^\"]*)\"[^\"]*\"content\":\"([^\"]*word[^\"]*)\",\"id\"[^\"]*"
Sign up to request clarification or add additional context in comments.

2 Comments

It works much better, I have to think a little bit about it and why is crating so many permutations if I look for .*
@Guille: When the input has many "content" entries, then your regex will check every single one of them, while the regex above will only check the nearest one (since disallowing " means disallowing jumping over many "key":"value" entries). Well, there is also this part (.*?)\".*\" where the engine can match on any pairs of double quotes ".
2

Since you already know that your input is JSON, you should not use regular expressions to interpret it. Use a JSON parser, then you don't have to care about anything like escaping special characters.

Comments

1

I'm not entirely sure why it takes almost a full second to process a single tweet, but lazy quantifiers are more expensive than a "match anything except" approach to a "match until"-scenario.

More information here: http://blog.stevenlevithan.com/archives/greedy-lazy-performance

You could try avoiding the use of lazy quantifiers, or just use a JSON parser instead, as it would likely be faster/cleaner.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.