1

I have a webpage converted to a string, and I'm trying to extract three numbers from this line of it:

<td class="col_stat">1</td><td class="col_stat">0</td><td class="col_stat">1</td>

From the line above, I can already extract the first '1' using this:

String filePattern = "<td class=\"col_stat\">(.+)</td>";
pattern = Pattern.compile(filePattern);
matcher = pattern.matcher(text);
if (matcher.find()) {
    String number = matcher.group(1);
    System.out.println(number);
}

Now I want to extract the 0 and the last 1 as well, but any time I try to edit the regular expression above, it just prints the complete webpage to the console. Does anyone have any suggestions? Thanks.

0

5 Answers

2

Regex quantifiers are greedy by default, so (.+) matches everything up to the last </td>. Try this instead, which looks only for digits with (\d+) rather than (.+):

String text = 
    "<td class=\"col_stat\">1</td>" + 
    "<td class=\"col_stat\">0</td>" + 
    "<td class=\"col_stat\">1</td>";
String filePattern = "<td class=\"col_stat\">(\\d+)</td>";
Pattern pattern = Pattern.compile(filePattern);
Matcher matcher = pattern.matcher(text);
while (matcher.find())
{
    String number = matcher.group(1);
    System.out.println(number);
}

On a related note, I completely agree with others' suggestions to use a more structured approach to interpreting HTML.

2

Given that using regexps on HTML/XML is a notorious gotcha (see here for the definitive answer), I'd recommend doing this reliably with an HTML parser (e.g. JTidy: although it's an HTML pretty-printer, it also provides a DOM interface to the document).

1
<td class=\"col_stat\">(.+)</td>

This regex is greedy. To make it match only numbers, change it to:

<td class=\"col_stat\">(\\d+?)</td>

That said, I'd rather suggest using XPath for this kind of matching; see Saxon and TagSoup.
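As a rough illustration of the XPath idea, here is a minimal sketch that uses the JDK's built-in javax.xml.xpath API rather than Saxon or TagSoup. It assumes the fragment is already well-formed XML; the wrapping <tr> element, the class name XPathStatsDemo, and the method name extractStats are made up for this example. Real-world HTML is often not well-formed and would need to be cleaned up by an HTML-aware parser first.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathStatsDemo {
    // Returns the text of every <td class="col_stat"> cell, in document order.
    // Assumption: the input is well-formed XML (hypothetical helper, not from the question).
    static List<String> extractStats(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList cells = (NodeList) xpath.evaluate(
                    "//td[@class='col_stat']",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                values.add(cells.item(i).getTextContent().trim());
            }
            return values;
        } catch (XPathExpressionException e) {
            // Thrown for an invalid expression or input that cannot be parsed as XML.
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // The <tr> wrapper is an assumption: an XML document needs a single root element.
        String row = "<tr>"
                + "<td class=\"col_stat\">1</td>"
                + "<td class=\"col_stat\">0</td>"
                + "<td class=\"col_stat\">1</td>"
                + "</tr>";
        System.out.println(extractStats(row)); // prints [1, 0, 1]
    }
}
```

Selecting by element and attribute means the result does not depend on how the cells are laid out across lines, which is the main advantage over a regex here.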

0

This is because your quantifier is greedy: (.+) matches as much as it can. You need a non-greedy (reluctant) quantifier, (.+?), to fix this.

String text = "<td class=\"col_stat\">1</td><td class=\"col_stat\">0</td><td class=\"col_stat\">1</td>";

String filePattern = "<td class=\"col_stat\">(.+?)</td>";
Pattern pattern = Pattern.compile(filePattern);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    String number = matcher.group(1);
    System.out.println(number);
}
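To make the difference concrete, the following sketch runs both forms of the pattern against the same single-line input and returns what the first match captures. The class name GreedyVsReluctantDemo and the helper firstGroup are hypothetical names used only for this comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctantDemo {
    // Returns group 1 of the first match, or null if the regex does not match.
    static String firstGroup(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String row = "<td class=\"col_stat\">1</td>"
                + "<td class=\"col_stat\">0</td>"
                + "<td class=\"col_stat\">1</td>";

        // Greedy: .+ runs to the end of the input, then backtracks to the LAST </td>.
        System.out.println(firstGroup("<td class=\"col_stat\">(.+)</td>", row));
        // prints 1</td><td class="col_stat">0</td><td class="col_stat">1

        // Reluctant: .+? stops at the FIRST </td> that lets the match succeed.
        System.out.println(firstGroup("<td class=\"col_stat\">(.+?)</td>", row));
        // prints 1
    }
}
```

Note that by default . does not match line terminators, so a greedy match can only run to the end of the current line unless the DOTALL flag is set.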

0

Try this regular expression:

<td class="col_stat">(\d+)[^\d]+(\d+)[^\d]+(\d+)

This does the following:

  1. search for your start string
  2. capture a run of digits
  3. skip any non-digits
  4. capture a run of digits
  5. skip any non-digits
  6. capture a run of digits
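The steps above could be applied in Java roughly as follows. This is a sketch only: the class name ThreeStatsDemo and the method extractThree are invented for the example, and the regex is the one from this answer with backslashes and quotes escaped for a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreeStatsDemo {
    // One match, three capture groups, separated by runs of non-digit characters.
    private static final Pattern THREE_STATS = Pattern.compile(
            "<td class=\"col_stat\">(\\d+)[^\\d]+(\\d+)[^\\d]+(\\d+)");

    // Returns the three captured numbers, or an empty list if the pattern does not match.
    static List<String> extractThree(String html) {
        List<String> numbers = new ArrayList<>();
        Matcher matcher = THREE_STATS.matcher(html);
        if (matcher.find()) {
            for (int group = 1; group <= matcher.groupCount(); group++) {
                numbers.add(matcher.group(group));
            }
        }
        return numbers;
    }

    public static void main(String[] args) {
        String row = "<td class=\"col_stat\">1</td>"
                + "<td class=\"col_stat\">0</td>"
                + "<td class=\"col_stat\">1</td>";
        System.out.println(extractThree(row)); // prints [1, 0, 1]
    }
}
```

One caveat with this approach: because the gaps are matched as "anything that is not a digit", a digit anywhere in the markup between the cells (for example in an attribute value) would be captured in place of the intended number.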
