
The regexes below perform very poorly on a very large string (more than 2000 lines). The Java String contains a PL/SQL script.

1- Replace each occurrence of a delimiting character sequence, for example ||, != or >, with the same sequence padded by a space before and after. This never finishes, so no time could be recorded.

// Delimiting characters for SQLPlus
private static final String[] delimiters = { "\\|\\|", "=>", ":=", "!=", "<>", "<", ">", "\\(", "\\)", "!", ",", "\\+", "-", "=", "\\*", "\\|" };


for (int i = 0; i < delimiters.length; i++) {
    script = script.replaceAll(delimiters[i], " " + delimiters[i] + " ");
}

2- The following pattern looks for all occurrences of a forward slash / except the ones that are preceded by a *. That means: don't match forward slashes that belong to block comment syntax. This takes about 103 seconds for a 2000-line string.

Pattern p = Pattern.compile("([^\\*])([\\/])([^\\*])");
Matcher m = p.matcher(script);
while (m.find()) {
    script = script.replaceAll(m.group(2), " " + m.group(2) + " ");
}

3- Remove any whitespace from within a date or date format.

Pattern p = Pattern.compile("(?i)(\\w{1,2}) +/ +(\\w{1,2}) +/ +(\\w{2,4})");
// Create a matcher with an input string
Matcher m = p.matcher(script);
while (m.find()) {
    part1 = script.substring(0, m.start());
    part2 = script.substring(m.end());
    script = part1 + m.group().replaceAll("[ \t]+", "") + part2;
    m = p.matcher(script);
}

Is there any way to optimize all three regexes so that they take less time?

Thanks

Ali

    You should split up this question into three separate questions. This will also help you in creating more meaningful (and therefore interesting) question titles than "Optimizing several regexes"... Commented Nov 24, 2011 at 8:37

3 Answers


I'll answer the first question.

You can combine all this into a single regex replace operation:

script = script.replaceAll("\\|\\||=>|[:!]=|<>|[<>()!,+=*|-]", " $0 ");

Explanation:

\|\|            # Match ||
|               # or
=>              # =>
|               # or
[:!]=           # := or !=
|               # or
<>              # <>
|               # or
[<>()!,+=*|-]   # <, >, (, ), !, comma, +, =, *, | or -
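As a minimal sketch of the same idea, the combined regex can be compiled once and reused, so that repeated calls do not recompile it. The class name `OperatorSpacer` and the method name `padOperators` are hypothetical, chosen only for illustration; the regex is the one from this answer.

```java
import java.util.regex.Pattern;

class OperatorSpacer {
    // Compiled once and reused. Multi-character operators come before the
    // single-character class, so "||" and ":=" are matched as a whole.
    static final Pattern OPERATORS =
        Pattern.compile("\\|\\||=>|[:!]=|<>|[<>()!,+=*|-]");

    // Pads every operator with one space on each side in a single pass.
    static String padOperators(String script) {
        return OPERATORS.matcher(script).replaceAll(" $0 ");
    }

    public static void main(String[] args) {
        System.out.println(padOperators("a:=b||c"));
    }
}
```

Because the text is scanned only once, an operator that has already been padded is never matched again by a later, shorter delimiter.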

2 Comments

But will this be faster? I doubt it.
@FailedDev: I'm pretty sure it will be. The original solution would take a:=b||c, transform it into a:=b || c, then transform that into a := b || c, then into a : = b || c, then into a : = b | | c (I've only shown the steps where something actually happens, and I'm certain that's not what he intended to happen anyway). Lots of double spaces that are introduced by this don't show up here, but you get the general idea.
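The cascade described in this comment can be reproduced with a small sketch that runs the question's delimiter list and loop on `a:=b||c`. The class name `CascadeDemo` and the method name `padInLoop` are hypothetical; the array and loop body are copied from the question.

```java
class CascadeDemo {
    // Same delimiter list, in the same order, as in the question.
    static final String[] DELIMITERS = { "\\|\\|", "=>", ":=", "!=", "<>", "<", ">",
        "\\(", "\\)", "!", ",", "\\+", "-", "=", "\\*", "\\|" };

    // One full pass over the script per delimiter. Later, shorter delimiters
    // match again inside operators that an earlier pass already padded.
    static String padInLoop(String script) {
        for (String d : DELIMITERS) {
            script = script.replaceAll(d, " " + d + " ");
        }
        return script;
    }

    public static void main(String[] args) {
        // Brackets make the extra spaces in the result visible.
        System.out.println("[" + padInLoop("a:=b||c") + "]");
    }
}
```

In the result, `:=` and `||` are split into separate characters, because the later `=` and `|` passes match inside the operators that earlier passes had already padded.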

Sure. Your second approach is "almost" good. The problem is that you do not use your pattern for the replacement itself. When you use String.replaceAll(), a new Pattern instance is created every time you call the method: Pattern.compile() is called for you, and it takes 90% of the time.

You should use Matcher.replaceAll() instead.

    String script = "dfgafjd;fjfd;jfd;djf;jds\\fdfdf****\\/";

    // Compile once and reuse; list every character you want to remove in the class.
    Pattern p = Pattern.compile("[\\*\\/\\\\]");
    // Matcher.replaceAll() resets the matcher itself, so no find() check is needed.
    String result = p.matcher(script).replaceAll("");
    System.out.println(result);
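Applied to the slash pattern from the question, the same advice might look like the sketch below. It assumes the intent is to pad every slash that is not part of a `/*` or `*/` delimiter, and it swaps the question's capturing groups for lookarounds, which is a substitution on my part rather than something from the question. The names `SlashSpacer` and `padSlashes` are hypothetical.

```java
import java.util.regex.Pattern;

class SlashSpacer {
    // A slash that is neither preceded nor followed by '*', so the comment
    // delimiters "/*" and "*/" are left alone. Lookarounds consume no
    // characters, so only the slash itself is replaced.
    static final Pattern LONE_SLASH = Pattern.compile("(?<!\\*)/(?!\\*)");

    // Pads each lone slash in one pass with a pattern compiled only once.
    static String padSlashes(String script) {
        return LONE_SLASH.matcher(script).replaceAll(" / ");
    }

    public static void main(String[] args) {
        System.out.println(padSlashes("x:=a/b; /* note */"));
    }
}
```

Unlike `([^\\*])([\\/])([^\\*])`, this version also matches a slash at the very start or end of the string, and it still pads slashes that appear inside the body of a comment.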



It isn't the regexes causing your performance problem; it's the fact that you're doing many passes over the text and constantly creating new Pattern objects. And it's not just performance that suffers, as Tim pointed out; it's much too easy to mess up the results of prior passes when you do that.

In fact, I'm guessing that those extra spaces in the dates are just a side effect of your other replacements. If so, here's a way you can do all the replacements in one pass, without adding unwanted characters:

static String doReplace(String input)
{
  String regex = 
      "/\\*[^*]*(?:\\*(?!/)[^*]*)*\\*/|"      // a comment
    + "\\b\\d{2}/\\d{2}/\\d{2,4}\\b|"         // a date
    + "(/|\\|\\||=>|[:!]=|<>|[<>()!,+=*|-])"; // an operator

  Matcher m = Pattern.compile(regex).matcher(input);
  StringBuffer sb = new StringBuffer();
  while (m.find())
  {
     // if we found an operator, replace it
    if (m.start(1) != -1)
    {
      m.appendReplacement(sb, " $1 ");
    }
  }
  m.appendTail(sb);
  return sb.toString();
}

see the online demo

The trick is that if you don't call appendReplacement(), the append position is not updated, so it's as if the match didn't occur. Because I ignore them, the comments and dates are copied through unchanged along with the rest of the unmatched text (by the next appendReplacement() or appendTail() call), and I don't have to worry about matching the slash characters inside them.

EDIT Make sure the "comment" part of the regex comes before the "operator" part. Otherwise, the leading / of every comment will be treated as an operator.
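The effect of the ordering can be seen in a small sketch that runs the same append loop with the alternatives in both orders. The class name `AlternationOrder` and the method name `pad` are hypothetical, and the date alternative is left out to keep the example short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrder {
    static final String COMMENT = "/\\*[^*]*(?:\\*(?!/)[^*]*)*\\*/";
    static final String OPERATOR = "(/|\\|\\||=>|[:!]=|<>|[<>()!,+=*|-])";

    // Pads matches captured by group 1 (operators). Any other match is
    // skipped and later copied through unchanged.
    static String pad(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            if (m.start(1) != -1) {
                m.appendReplacement(sb, " $1 ");
            }
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "a/b /* x */";
        // Comment alternative first: the comment is matched whole and kept.
        System.out.println(pad(COMMENT + "|" + OPERATOR, input));
        // Operator alternative first: "/" and "*" in the comment get padded.
        System.out.println(pad(OPERATOR + "|" + COMMENT, input));
    }
}
```

With the comment alternative first, the division slash is padded and `/* x */` survives intact. With the operator alternative first, the comment delimiters are broken apart, because Java tries alternatives from left to right at each position.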
