Java REGEX Capturing too much

Question

I'm trying to implement a simple regex that lets me capture some information from an XML document.

However, my regex captures several tags and returns a very long match. For example, suppose I have something like:

<item>
<title>bla</title>
...
<description>bla</description>
</item>
<item>
<title>bla2</title>
....
<description>bla2, keyword here are blablabla</description>
</item>

I use a regex like:

<item><title>([\\p{L}\\p{N}\\W \\.\\,]*?)</title>.*?<description>[\\p{L}\\p{N} \\.\\,]keyword[\\p{L}\\p{N} \\.\\,]*</description>

There are tags between title and description. When I use that regex, it matches everything from the first title tag up to the first description that contains the word "keyword". So the problem is this part:

</title>.*?<description>

How can I tell my regex that, if the first description tag it finds doesn't contain the keyword, it should move on to the next one and return the result from the second item tag? Or, alternatively, that it should not match across the data between the title tag and the description tag if there is a closing item tag between those two?

I hope I'm explaining myself clearly. Please, ask for clarification if needed.

Edit:

An alternative solution:

 <item><title>([\\p{L}\\p{N}\\W \\.\\,]*?)</title>(?:(?!<item>).)*?<description>[\\p{L}\\p{N} \\.\\,]keyword[\\p{L}\\p{N} \\.\\,]*</description>

This uses (?:(?!<item>).)*? in place of .*?: the negative lookahead (?!<item>) is checked before each character is consumed, so the match cannot run past the start of a new item.
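To make the difference concrete, here is a minimal sketch comparing the plain lazy `.*?` with the lookahead-guarded version. The sample feed, the `<link>` tags standing in for the elided "..." lines, the class and method names, and the simplified `[^<]*` character classes are all assumptions made for illustration; they are not the exact pattern from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemperedDotDemo {
    // Hypothetical sample feed; the <link> tags stand in for the elided "..." lines.
    static final String SAMPLE =
        "<item>\n<title>bla</title>\n<link>http://example.com/1</link>\n"
      + "<description>bla</description>\n</item>\n"
      + "<item>\n<title>bla2</title>\n<link>http://example.com/2</link>\n"
      + "<description>bla2, keyword here are blablabla</description>\n</item>\n";

    // Simplified stand-in for the description part ([^<]* instead of the \p{L} classes).
    static final String TAIL = "<description>[^<]*keyword[^<]*</description>";

    // Plain lazy dot: free to run past </item> into the next item.
    static final Pattern LAZY = Pattern.compile(
        "<title>([^<]*)</title>.*?" + TAIL, Pattern.DOTALL);

    // Guarded dot: each character is consumed only if "<item>" does not start there.
    static final Pattern TEMPERED = Pattern.compile(
        "<title>([^<]*)</title>(?:(?!<item>).)*?" + TAIL, Pattern.DOTALL);

    // Returns the captured title of the first match, or null if nothing matches.
    static String firstTitle(Pattern p, String xml) {
        Matcher m = p.matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println("lazy:     " + firstTitle(LAZY, SAMPLE));
        System.out.println("tempered: " + firstTitle(TEMPERED, SAMPLE));
    }
}
```

With this sample, the lazy pattern captures the title "bla" from the first item even though the keyword sits in the second item's description, while the guarded pattern captures "bla2".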

Why parse XML with regex? Isn't it safer to use an XML parser? Use the right tool for the right job.
– Ferdinand Neman, Commented Sep 2, 2015 at 5:07
I'm facing a dilemma... should I link to that question or not?
– ajb, Commented Sep 2, 2015 at 5:12
You have an academic exercise that tells you that you must use a tool (regexes) for a job it's unsuited for? Not sure I understand. In any case, if there is a regex that does what you want, it will certainly not be "simple", which is what you said you wanted.
– ajb, Commented Sep 2, 2015 at 5:16
Yes, the purpose is to obtain data from RSS feeds using regex. In this particular exercise, I need to filter news by a keyword in the description tag. I'm so close to finding the answer...
– Juan P Reyes, Commented Sep 2, 2015 at 5:23

netblognet · Accepted Answer · 2015-09-03 06:22:06Z

What about this regex?

(<item>[^<]*?<title>(?<title>[^<]*?)<\/title>([^<]|<(?!description))*<description>(?<desc>[^<]*?keyword[^<]*?)<\/description>[^<]*?<\/item>)

It matches every item whose description contains the keyword and captures the title and description in the named groups title and desc. You can then loop over the matches to read the title of each matching item.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Module1 {
  public static void main(String[] args) {
      // Replace this placeholder with the XML text to search.
      String sourcestring = "source string to match with pattern";
      // The named groups "title" and "desc" hold the item's title and description.
      Pattern re = Pattern.compile("(<item>[^<]*?<title>(?<title>[^<]*?)<\\/title>([^<]|<(?!description))*<description>(?<desc>[^<]*?keyword[^<]*?)<\\/description>[^<]*?<\\/item>)", Pattern.DOTALL);
      Matcher m = re.matcher(sourcestring);
      int mIdx = 0;
      while (m.find()) {
          // Group 0 is the whole match; groups 1..groupCount() are the captures.
          for (int groupIdx = 0; groupIdx <= m.groupCount(); groupIdx++) {
              System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
          }
          mIdx++;
      }
  }
}

You can find the results for your example data over here: https://regex101.com/r/gA3nR4/4
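Since the goal stated in the comments is to return the title of each item whose description contains a known keyword, the named group can be read directly instead of printing every group. The following is a rough sketch under some assumptions: the class name, the method names, the sample feed, and the use of `Pattern.quote` to insert the keyword are illustrative additions, and the unnamed groups from the regex above are written as non-capturing here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleFilter {
    // Hypothetical sample feed; the <link> tags stand in for the elided "..." lines.
    static final String SAMPLE =
        "<item>\n<title>bla</title>\n<link>http://example.com/1</link>\n"
      + "<description>bla</description>\n</item>\n"
      + "<item>\n<title>bla2</title>\n<link>http://example.com/2</link>\n"
      + "<description>bla2, keyword here are blablabla</description>\n</item>\n";

    // Builds the pattern for a given keyword; Pattern.quote keeps the keyword literal.
    static Pattern patternFor(String keyword) {
        return Pattern.compile(
            "<item>[^<]*?<title>(?<title>[^<]*?)</title>"
          + "(?:[^<]|<(?!description))*"   // anything except the start of a description tag
          + "<description>(?<desc>[^<]*?" + Pattern.quote(keyword) + "[^<]*?)</description>"
          + "[^<]*?</item>");
    }

    // Collects the titles of all items whose description contains the keyword.
    static List<String> titlesFor(String keyword, String xml) {
        List<String> titles = new ArrayList<>();
        Matcher m = patternFor(keyword).matcher(xml);
        while (m.find()) {
            titles.add(m.group("title"));
        }
        return titles;
    }

    public static void main(String[] args) {
        System.out.println(titlesFor("keyword", SAMPLE));
    }
}
```

On this sample, `titlesFor("keyword", SAMPLE)` yields a list containing only "bla2". Like the regex above, this sketch assumes that the description is the last tag inside each item and that neither title nor description contains nested markup.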

edited Sep 3, 2015 at 6:22

answered Sep 2, 2015 at 6:30


9 Comments

Juan P Reyes Over a year ago

Yes, it's possible. But the exercise requires using regex to find the matching keyword, so I cannot use it.

netblognet Over a year ago

@JuanPReyes What exactly do you want to match? "To find the matching keyword": you can't find the keyword without knowing it. Do you mean "to find the title of the item or items whose description contains the keyword"?

Juan P Reyes Over a year ago

I know the keyword beforehand, so yes, I need to return the title of the news item whose description contains the keyword.

netblognet Over a year ago

@JuanPReyes I've updated the regex in my answer. Please check whether it fits your needs.

Juan P Reyes Over a year ago

Unfortunately, it doesn't. As I commented before, the "..." in the text represents several other tags. I forked the regex you used to run some tests of my own; this is my current problem: regex101.com/r/gA3nR4/1 The regex selects the whole text instead of just the text within the second item tag.
