I just created a regular expression in Java to search for expressions in about 5000 tweets. Each tweet takes almost one second to process. Why is it so slow?
Is the expression too complex, or is there something in it that is too expensive to execute? I would expect to process the whole data set in less than 5 seconds.
The code is:
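import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.io.FileUtils; // Apache Commons IO

public class RegularExpression {
    public static void main(String[] args) throws IOException {
        String filter = ".*\"created_at\":\"(.*?)\".*\"content\":\"(.*?word.*?)\",\"id\".*";
        // Compile once, outside the loop.
        Pattern pattern = Pattern.compile(filter);
        List<String> tweets = FileUtils.readLines(new File("/tmp/tweets"));
        System.out.println("Start with " + tweets.size());
        int i = 0;
        for (String t : tweets) {
            Matcher matcher = pattern.matcher(t);
            matcher.find(); // result is ignored; this test only measures speed
            System.out.println(i++);
        }
        System.out.println("End");
    }
}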
The input consists of JSON tweets. If I make the regular expression simpler it runs faster, but I don't think my expression is that heavy. I'd like to understand why this is happening; this was just a quick test.
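To make the "simpler expression runs faster" comparison concrete, here is a minimal timing sketch. Everything in it apart from the original expression is an assumption for illustration: the class name, the two sample lines, and the simpler variant (which drops the leading and trailing `.*` and uses `[^"]*` inside quoted values) are not taken from my real test or data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTimingSketch {
    // The expression from the code above, unchanged.
    public static final Pattern ORIGINAL = Pattern.compile(
        ".*\"created_at\":\"(.*?)\".*\"content\":\"(.*?word.*?)\",\"id\".*");

    // Assumed simpler variant (hypothetical): no leading or trailing .*,
    // and [^"]* inside quoted values. It does not handle escaped quotes.
    public static final Pattern SIMPLER = Pattern.compile(
        "\"created_at\":\"([^\"]*)\".*?\"content\":\"([^\"]*word[^\"]*)\",\"id\"");

    // Returns the captured content field, or null when the line does not match.
    public static String content(Pattern pattern, String line) {
        Matcher matcher = pattern.matcher(line);
        return matcher.find() ? matcher.group(2) : null;
    }

    // Total wall-clock nanoseconds for `runs` match attempts on one line.
    public static long timeNanos(Pattern pattern, String line, int runs) {
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            content(pattern, line);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Made-up sample lines, shaped only to fit the expression.
        String hit = "{\"created_at\":\"Mon Jan 01 00:00:00 +0000 2013\","
            + "\"user\":\"someone\",\"content\":\"a word here\",\"id\":1}";
        String miss = "{\"created_at\":\"Mon Jan 01 00:00:00 +0000 2013\","
            + "\"user\":\"someone\",\"content\":\"nothing here\",\"id\":2}";
        for (String line : new String[] {hit, miss}) {
            System.out.println("content:  " + content(ORIGINAL, line)
                + " / " + content(SIMPLER, line));
            System.out.println("original: " + timeNanos(ORIGINAL, line, 1000) + " ns");
            System.out.println("simpler:  " + timeNanos(SIMPLER, line, 1000) + " ns");
        }
    }
}
```

The absolute numbers depend on the JVM and on warm-up, so only the relative difference between the two patterns on the same line is meaningful, and real tweets are much longer than these samples.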
UPDATE:
The reason I'm using a regular expression instead of parsing the JSON is that, in the end, the input could be simple text, XML, JSON, or a log from any kind of server. So I have to treat my input as plain text.