I'm trying to implement a simple REGEX that lets me capture some information from an XML document.

However, my REGEX captures several tags and returns a very long match. For example, if I have something like:

<item>
<title>bla</title>
...
<description>bla</description>
</item>
<item>
<title>bla2</title>
....
<description>bla2, keyword here are blablabla</description>
</item>

and I use a REGEX like:

<item><title>([\\p{L}\\p{N}\\W \\.\\,]*?)</title>.*?<description>[\\p{L}\\p{N} \\.\\,]keyword[\\p{L}\\p{N} \\.\\,]*</description>

There are other tags between title and description. When I use that REGEX, the match starts at the first title and runs across every tag until it reaches the first description that contains the word "keyword". So the problem is this part:

</title>.*?<description>

How can I tell my REGEX that, if the first description tag it finds doesn't contain the keyword, it should move on to the next item and return the result from the second item tag? Or, put differently, that it should not span the data between the title tag and the description tag if there is a closing item tag between those two?
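
A minimal reproduction of the behavior described above, as a sketch. The class name LazyDotDemo, the method naiveTitle, the FEED constant, and the <link> tags (standing in for the elided "..." lines) are made up for illustration, and the character classes are simplified to [^<]*. Even though .*? is lazy, the match still begins at the first title and simply grows until a description with the keyword is found, so the wrong title is captured:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyDotDemo {
    // Hypothetical feed; the <link> tags stand in for the elided "..." lines.
    static final String FEED =
        "<item>\n<title>bla</title>\n<link>x</link>\n<description>bla</description>\n</item>\n"
      + "<item>\n<title>bla2</title>\n<link>y</link>\n"
      + "<description>bla2, keyword here are blablabla</description>\n</item>\n";

    // Returns the title captured by the naive pattern, or null if nothing matches.
    static String naiveTitle(String xml) {
        Pattern p = Pattern.compile(
            "<title>([^<]*)</title>.*?<description>[^<]*keyword[^<]*</description>",
            Pattern.DOTALL); // DOTALL so that '.' also matches the line breaks
        Matcher m = p.matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Prints "bla": the title of the first item, paired with the second item's description.
        System.out.println(naiveTitle(FEED));
    }
}
```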

I hope I'm explaining myself clearly. Please, ask for clarification if needed.

Edit:

An alternative solution:

 <item><title>([\\p{L}\\p{N}\\W \\.\\,]*?)</title>(?:(?!<item>).)*?<description>[\\p{L}\\p{N} \\.\\,]keyword[\\p{L}\\p{N} \\.\\,]*</description>

This uses (?:(?!<item>).)*? in place of .*?: the negative lookahead is checked before every character, so the match cannot run past the start of a new item.
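
Here is a small sketch of that idea in Java, under a few assumptions: the class TemperedDotDemo, the method titleForKeyword, and the FEED constant are hypothetical names, the <link> tags stand in for the elided "..." lines, and the character classes are simplified to [^<]*. I also allow whitespace between <item> and <title>, since the sample has line breaks there.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemperedDotDemo {
    // Hypothetical feed; the <link> tags stand in for the elided "..." lines.
    static final String FEED =
        "<item>\n<title>bla</title>\n<link>x</link>\n<description>bla</description>\n</item>\n"
      + "<item>\n<title>bla2</title>\n<link>y</link>\n"
      + "<description>bla2, keyword here are blablabla</description>\n</item>\n";

    // Returns the title of the first item whose description contains the keyword, or null.
    static String titleForKeyword(String xml, String keyword) {
        Pattern p = Pattern.compile(
            "<item>\\s*<title>([^<]*)</title>"
          + "(?:(?!<item>).)*?"            // any character, as long as no new <item> starts here
          + "<description>[^<]*" + Pattern.quote(keyword) + "[^<]*</description>",
            Pattern.DOTALL);               // DOTALL so that '.' also matches the line breaks
        Matcher m = p.matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(titleForKeyword(FEED, "keyword")); // prints "bla2"
    }
}
```

The attempt that starts at the first item fails because the lookahead refuses to step onto the second <item>, so the engine retries from the second item and captures its title.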

  • Why parse XML with regex? Isn't it safer to use an XML parser? Use the right tool for the right job. Commented Sep 2, 2015 at 5:07
  • It's not a personal choice. It's for an academic exercise. Commented Sep 2, 2015 at 5:12
  • I'm facing a dilemma... should I link to that question or not? Commented Sep 2, 2015 at 5:12
  • You have an academic exercise that tells you that you must use a tool (regexes) for a job it's unsuited for? Not sure I understand. In any case, if there is a regex that does what you want, it will certainly not be "simple" which is what you said you wanted. Commented Sep 2, 2015 at 5:16
  • Yes, the purpose is to obtain data from RSS feeds using REGEX. In this particular exercise, I need to filter news by a keyword in the description tag. I'm so close to finding the answer... Commented Sep 2, 2015 at 5:23

1 Answer

What about this regex?

(<item>[^<]*?<title>(?<title>[^<]*?)<\/title>([^<]|<(?!description))*<description>(?<desc>[^<]*?keyword[^<]*?)<\/description>[^<]*?<\/item>)

It matches each item whose description contains the keyword and captures both its title and its description. You can then loop over the matches and read the title of each one.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Module1 {
  public static void main(String[] args) {
    // Placeholder: replace with the XML/RSS text to search.
    String sourcestring = "source string to match with pattern";
    // DOTALL only affects '.', which this pattern does not use, so the flag is optional here.
    Pattern re = Pattern.compile("(<item>[^<]*?<title>(?<title>[^<]*?)<\\/title>([^<]|<(?!description))*<description>(?<desc>[^<]*?keyword[^<]*?)<\\/description>[^<]*?<\\/item>)", Pattern.DOTALL);
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    while (m.find()) {
      // Group 0 is the whole match; groups 1..groupCount() are the captures.
      for (int groupIdx = 0; groupIdx <= m.groupCount(); groupIdx++) {
        System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}

You can find the results for your example data over here: https://regex101.com/r/gA3nR4/4
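
If only the titles are needed, the named group can be read directly instead of printing every group. A rough sketch, assuming a made-up three-item feed: the class KeywordTitles, the method matchingTitles, the FEED constant, and the <link> tags (standing in for the "..." lines) are hypothetical; the pattern is the one above, compiled without DOTALL since it contains no '.'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordTitles {
    // Hypothetical feed; the <link> tags stand in for the elided "..." lines.
    static final String FEED =
        "<item>\n<title>bla</title>\n<link>x</link>\n<description>bla</description>\n</item>\n"
      + "<item>\n<title>bla2</title>\n<link>y</link>\n"
      + "<description>bla2, keyword here are blablabla</description>\n</item>\n"
      + "<item>\n<title>bla3</title>\n<link>z</link>\n"
      + "<description>another keyword mention</description>\n</item>\n";

    // Same pattern as in the answer; the groups "title" and "desc" are named captures.
    static final Pattern ITEM = Pattern.compile(
        "(<item>[^<]*?<title>(?<title>[^<]*?)<\\/title>([^<]|<(?!description))*"
      + "<description>(?<desc>[^<]*?keyword[^<]*?)<\\/description>[^<]*?<\\/item>)");

    // Collects the title of every item whose description contains "keyword".
    static List<String> matchingTitles(String xml) {
        List<String> titles = new ArrayList<>();
        Matcher m = ITEM.matcher(xml);
        while (m.find()) {
            titles.add(m.group("title"));
        }
        return titles;
    }

    public static void main(String[] args) {
        System.out.println(matchingTitles(FEED)); // prints [bla2, bla3]
    }
}
```

Note that ([^<]|<(?!description))* stops at the first <description tag, so this relies on every item having a description of its own.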

9 Comments

Yes, it's possible. But the exercise requires using REGEX to find the matching keyword, so I cannot use it.
@JuanPReyes what exactly do you want to match? "to find the matching keyword" - you can't find the keyword without knowing it. Do you mean "to find the item or items title, whose description contains the keyword"?
I know the keyword beforehand, so yes, I need to return the title of the news item whose description contains the keyword.
@JuanPReyes I've updated the regex/my answer. Please check if it fits your needs.
Unfortunately, it doesn't. As I commented before, the "..." in the text represents several other tags. I forked the regex you used to run some tests of my own; this is my current problem: regex101.com/r/gA3nR4/1 The regex selects the whole text instead of just the text inside the second item tag.