I'm trying to manipulate a String in Java to recognize the markdown options in Facebook Messenger.

I tested the RegEx in a couple of online testers and it worked, but when I tried to implement it in Java, it only recognizes text surrounded by underscores. Here is an example that shows the problem:

private String process(String input) {
    String processed = input.replaceAll("(\\b|^)\\_(.*)\\_(\\b|$)", "underscore")
        .replaceAll("(\\b|^)\\*(.*)\\*(\\b|$)", "star")
        .replaceAll("(\\b|^)```(.*)```(\\b|$)", "backticks")
        .replaceAll("(\\b|^)\\~(.*)\\~(\\b|$)", "tilde")
        .replaceAll("(\\b|^)\\`(.*)\\`(\\b|$)", "tick")
        .replaceAll("(\\b|^)\\\\\\((.*)\\\\\\)(\\b|$)", "backslashparen")
        .replaceAll("\\*", "%");  // am I matching stars wrong?

    return processed;
}


public void test() {
    String example = "_Text_\n" +
            "*text*\n" +
            "~Text~\n" +
            "`Text`\n" +
            "_Text_\n" +     // is it only matching the first one?
            "``` Text ```\n" +
            "\\(Text\\)\n" +
            "~Text~\n";
    System.out.println(process(example));
}

I expected every line to match and be replaced, but only the first line matched. I wondered if that was because it was the first line, so I copied it into the middle, and both copies matched. Then I figured I might have missed something in matching the special characters, so I added the snippet that matches the asterisks and replaces them with a percent sign, and that worked. The output I'm getting is:

underscore
%text%
~Text~
`Text`
underscore
``` Text ```
\(Text\)
~Text~

Any ideas what I might be missing?

Thanks.

  • Prefix all your regexes with (?m) to enable MULTILINE matching. Commented May 15, 2020 at 20:58
  • I used your suggestion and it works as I intended, so thanks. But what I don't understand is why the very first one (underscores) matches at the beginning and in the middle. Can you point me to a reference on why this works? Thanks again! Commented May 15, 2020 at 21:25
  • It doesn't make sense to use any boundary here. Try taking out both ^/$ and \b. And don't use \B either, because if a( is accepted then a_ should be accepted too; the match shouldn't be controlled by a boundary. Commented May 15, 2020 at 22:17
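
To illustrate the follow-up question in the comments above, here is a minimal sketch of why the underscore lines match without (?m) while the star line needs it. The class name MultilineDemo and the helper found are made up for this illustration. The underscore is a word character, so \b matches next to it on any line; * is not, so the pattern has to fall back on ^ and $, which match only at the ends of the whole input unless MULTILINE is enabled.

```java
import java.util.regex.Pattern;

public class MultilineDemo {
    // Hypothetical helper: true if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "_Text_\n*text*\n_Text_\n";

        // "_" is a word character, so \b matches before and after "_Text_"
        // on every line; ^ and $ are never needed here.
        System.out.println(input.replaceAll("(\\b|^)_(.*)_(\\b|$)", "underscore"));

        // "*" is not a word character, so \b fails at the start of "*text*",
        // and without MULTILINE ^ matches only at the start of the input.
        System.out.println(found("(\\b|^)\\*(.*)\\*(\\b|$)", input));      // false

        // With (?m), ^ and $ also match at the start and end of each line.
        System.out.println(found("(?m)(\\b|^)\\*(.*)\\*(\\b|$)", input));  // true
    }
}
```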

1 Answer

If you're using word boundaries, there is no need to put anchors in the alternation, because a word boundary also matches at the start and end of the input when the adjacent character is a word character. So these are actually redundant matches:

(?:^|\b)
(?:\b|$)

and both can simply be replaced by \b.

However, looking at your regex, note that of these delimiters only the underscore is a word character; *, ~ and ` are not. \b therefore cannot be used around those characters; use \B, the inverse of \b, instead.
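
As a small sketch of the \b versus \B point above (the class name BoundaryDemo and the helper mark are invented for this illustration), the same pattern shape is applied to an underscore span and a star span:

```java
public class BoundaryDemo {
    // Hypothetical helper: replaces every match of the regex with "X".
    static String mark(String regex, String input) {
        return input.replaceAll(regex, "X");
    }

    public static void main(String[] args) {
        String input = "_a_\n*b*\n";

        // "_" is a word character: \b matches between "\n" and "_".
        System.out.println(mark("\\b_[^_]+_\\b", input));

        // "*" is not a word character: there is no word boundary between
        // "\n" and "*", so this pattern leaves the input unchanged.
        System.out.println(mark("\\b\\*[^*]+\\*\\b", input));

        // \B matches exactly where \b does not, so the star span is found.
        System.out.println(mark("\\B\\*[^*]+\\*\\B", input));
    }
}
```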

Besides this, some further improvements are possible, such as using a negated character class instead of the greedy .* and removing the unnecessary groups.
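
The difference between greedy .* and a negated character class shows up as soon as one line holds two marked spans. A minimal sketch, with an invented class name GreedyDemo and helper mark, and an input chosen only for illustration:

```java
public class GreedyDemo {
    // Hypothetical helper: replaces every match of the regex with "X".
    static String mark(String regex, String input) {
        return input.replaceAll(regex, "X");
    }

    public static void main(String[] args) {
        String input = "_a_ and _b_";

        // Greedy .* runs to the last underscore on the line,
        // so both spans are swallowed by a single match.
        System.out.println(mark("\\b_.*_\\b", input));

        // [^_]+ cannot cross an underscore, so each span matches separately.
        System.out.println(mark("\\b_[^_]+_\\b", input));
    }
}
```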

Code:

class MyRegex {
    public static void main (String[] args) {
        String example = "_Text_\n" +
                "*text*\n" +
                "~Text~\n" +
                "`Text`\n" +
                "_Text_\n" +     // is it only matching the first one?
                "``` Text ```\n" +
                "\\(Text\\)\n" +
                "~Text~\n";
        System.out.println(process(example));
    }

    private static String process(String input) {
        String processed = input.replaceAll("\\b_[^_]+_\\b", "underscore")
            .replaceAll("\\B\\*[^*]+\\*\\B", "star")
            .replaceAll("\\B```.+?```\\B", "backticks")
            .replaceAll("\\B~[^~]+~\\B", "tilde")
            .replaceAll("\\B`[^`]+`\\B", "tick")
            .replaceAll("\\B\\\\\\(.*?\\\\\\)\\B", "backslashparen");

        return processed;
    }
}

2 Comments

Clearly you're way better on the regex nuances than I am. Thanks for the assist! Quick question: what are the question marks for? Wouldn't .*? be redundant? And wouldn't .+? be the same as .* ? (That last question mark ends my question; it is not part of the regex.) Thanks again.
.+? is slightly more efficient than .+ because it is a lazy (non-greedy) matcher, and if you don't want to match an empty string, .+? will be more efficient than .*?
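
Apart from efficiency, the quantifiers discussed in these comments also differ in what they match. The sketch below is an illustration only (the class name LazyDemo, the helper mark, and the sample inputs are assumptions, not taken from the answer): it compares greedy .+ with lazy .+?, and then .*? with .+? on a span with no content.

```java
public class LazyDemo {
    // Hypothetical helper: replaces every match of the regex with "X".
    static String mark(String regex, String input) {
        return input.replaceAll(regex, "X");
    }

    public static void main(String[] args) {
        String two = "```a``` and ```b```";

        // Greedy .+ extends to the last closing fence: one match in total.
        System.out.println(mark("\\B```.+```\\B", two));

        // Lazy .+? stops at the first closing fence: two separate matches.
        System.out.println(mark("\\B```.+?```\\B", two));

        String empty = "``````"; // opening and closing fence, nothing between

        // .*? accepts empty content, so the six backticks are replaced.
        System.out.println(mark("\\B```.*?```\\B", empty));

        // .+? needs at least one character between the fences: no match.
        System.out.println(mark("\\B```.+?```\\B", empty));
    }
}
```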
