
Here's an input HTML string:

<p>Johnny: My favorite color is pink<br />
Sarah: My favorite color is blue<br />
Johnny: Let's swap genders?<br />
Sarah: OK!<br />
</p>

I want a regex that matches the speaker names (bolded in the original post), that is, any text between ">" (or the beginning of a line) and ":".

I tried the regex (?>)[^>](.+): but it did not work correctly: it matched the parts bolded below, including the <p> tag. I don't want to match any HTML tags:

<p>Johnny: My favorite color is pink<br />
Sarah: My favorite color is blue<br />
Johnny: Let's swap genders?<br />
Sarah: OK!<br />
</p>

I am using Java, with code like this:

Matcher m = Pattern.compile("`(?>)[^>](.+):`", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL).matcher(string); 
  • 4
    In which language are you writing the regex? You would be better off using an HTML parser library/module available in the language of your choice than using handcrafted regexes to parse HTML. Commented May 25, 2011 at 16:37
  • Java. Your suggestion is good, I'll look into that. Thanks. Commented May 25, 2011 at 16:40
  • 8
    Obligatory HTML/Regex warning: stackoverflow.com/questions/1732348/… Commented May 25, 2011 at 16:42
  • 1
    And they still keep coming... Commented May 25, 2011 at 17:01

2 Answers


The following code should work:

String str = "<p>Johnny Smith: My favorite color abc: is pink<br />" +
"Sarah: My favorite color is dark: blue<br />" +
"Johnny: Let's swap: genders?<br />" +
"Sarah: OK: sure!<br />" +
"</p>";

Pattern p = Pattern.compile("(?:>|^)([\\w\\s]+)(?=:)", Pattern.MULTILINE);
Matcher m = p.matcher(str); 
while(m.find()){
    System.out.println(m.group(1));
}

OUTPUT

Johnny Smith
Sarah
Johnny
Sarah
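For reference, here is a minimal sketch of the same pattern wrapped in a reusable method, with each part of the regex commented. The class name SpeakerNames and the method extractSpeakers are hypothetical names chosen for illustration, and the trim() call is an addition that is not in the snippet above: because \s also matches line breaks, a name captured after "<br />" followed by a real newline would otherwise start with that newline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names here are illustrative only.
class SpeakerNames {
    // (?:>|^)    non-capturing: a literal '>' or the start of a line (needs MULTILINE)
    // ([\w\s]+)  group 1: word characters and whitespace, so "Johnny Smith" fits
    // (?=:)      lookahead: a colon must follow, but it is not part of the match
    private static final Pattern SPEAKER =
            Pattern.compile("(?:>|^)([\\w\\s]+)(?=:)", Pattern.MULTILINE);

    static List<String> extractSpeakers(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = SPEAKER.matcher(html);
        while (m.find()) {
            // trim() drops a leading newline that \s may have captured after "<br />"
            names.add(m.group(1).trim());
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<p>Johnny Smith: My favorite color is pink<br />\n"
                + "Sarah: OK: sure!<br />\n"
                + "</p>";
        System.out.println(extractSpeakers(html));
    }
}
```

Text such as " OK:" in the middle of a line is skipped because it is preceded by neither ">" nor a line start. This is still a regex over HTML, so it assumes the simple structure shown in the question.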

1 Comment

That almost worked. If we use the name 'Johnny Smith' instead of 'Johnny', then it won't match.

If you only need to match a word followed by ':', then "\w+:" should be enough. If you also want to allow an optional leading '>', you can try:

    String s = "<p>Johnny: My favorite color is pink<br />" +
        "Sarah: My favorite color is blue<br />" +
        "Johnny: Let's swap genders?<br />" +
        "Sarah: OK!<br />" +
        "</p>";

    Pattern p = Pattern.compile("[>]?(\\w+):");
    Matcher m = p.matcher(s);
    while(m.find()){
        System.out.println(m.start()+" : "+m.group(1));
    }

1 Comment

Thank you, but I saw a potential problem: If the name is "Johnny Smith", then only "Smith" is matched and not "Johnny Smith". Almost there! Thanks!
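To illustrate the problem raised in this comment, here is a small sketch that runs the single-word pattern from the answer next to a possible multi-word variant. The variant is an assumption of mine, not something posted in the thread: it requires ">" or a line start before the name (instead of an optional ">") and allows several words separated by spaces or tabs. The class name NamePatternComparison and the method captures are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison harness; names are illustrative only.
class NamePatternComparison {
    // Pattern from the answer: one word directly before a colon, '>' optional.
    static final Pattern SINGLE_WORD = Pattern.compile("[>]?(\\w+):");

    // Assumed variant: must follow '>' or a line start, skips leading whitespace,
    // then captures one or more words separated by spaces or tabs.
    static final Pattern MULTI_WORD =
            Pattern.compile("(?:^|>)\\s*(\\w+(?:[ \\t]+\\w+)*):", Pattern.MULTILINE);

    // Collects group 1 of every match of p in input.
    static List<String> captures(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "<p>Johnny Smith: My favorite color is pink<br />\n"
                + "Sarah: OK: sure!<br />\n"
                + "</p>";
        System.out.println(captures(SINGLE_WORD, s)); // [Smith, Sarah, OK]
        System.out.println(captures(MULTI_WORD, s));  // [Johnny Smith, Sarah]
    }
}
```

With the single-word pattern, only "Smith" is captured from "Johnny Smith", and a mid-sentence "OK:" is picked up as well. Anchoring on ">" or the line start avoids both in this sample, though it still relies on the markup being as regular as in the question.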
