Java RegExp - Extracting only numbers from a webpage

Question

I have a webpage converted to a string, and I'm trying to extract three numbers from this line of it:

<td class="col_stat">1</td><td class="col_stat">0</td><td class="col_stat">1</td>

From the line above, I can already extract the first '1' using this:

String filePattern = "<td class=\"col_stat\">(.+)</td>";
pattern = Pattern.compile(filePattern);
matcher = pattern.matcher(text);
if (matcher.find()) {
    String number = matcher.group(1);
    System.out.println(number);
}

Now I want to extract the 0 and the last 1, but any time I try to edit the regular expression above, it just outputs the complete webpage on the console. Does anyone have any suggestions? Thanks.

Alan Moore · Accepted Answer · 2012-09-04 12:37:20Z

2

Regex quantifiers are greedy by default, so (.+) matches everything up to the last </td>. Try this instead, which looks only for digits with (\d+):

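import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text =
    "<td class=\"col_stat\">1</td>" +
    "<td class=\"col_stat\">0</td>" +
    "<td class=\"col_stat\">1</td>";
// \\d+ can only consume digits, so each match ends at its own </td>
String filePattern = "<td class=\"col_stat\">(\\d+)</td>";
Pattern pattern = Pattern.compile(filePattern);
Matcher matcher = pattern.matcher(text);
// Each call to find() moves on to the next match, so loop instead of using a single if
while (matcher.find())
{
    String number = matcher.group(1);
    System.out.println(number);
}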
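To see why the original pattern misbehaves, here is a minimal sketch that prints what group 1 captures on the first match for each pattern. The class name GreedyDemo and the helper firstCapture are made up for illustration and are not part of the question's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Hypothetical helper: returns group 1 of the first match, or null when nothing matches.
    static String firstCapture(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text =
            "<td class=\"col_stat\">1</td>" +
            "<td class=\"col_stat\">0</td>" +
            "<td class=\"col_stat\">1</td>";

        // Greedy (.+) consumes as much as it can, then backs off only to the LAST </td>.
        System.out.println(firstCapture("<td class=\"col_stat\">(.+)</td>", text));
        // (\d+) cannot step over the '<' of </td>, so it stops inside the first cell.
        System.out.println(firstCapture("<td class=\"col_stat\">(\\d+)</td>", text));
    }
}
```

The first line printed spans all three cells (everything between the first opening tag and the last </td>); the second is just 1.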

On a related note, I completely agree with others' suggestions to use a more structured approach to interpreting HTML.

edited Sep 4, 2012 at 12:37

Alan Moore


answered Sep 4, 2012 at 11:45

Vikdor


Community · Accepted Answer · 2017-05-23 12:27:36Z

2

Given that using regexps on HTML/XML is a notorious gotcha (see here for the definitive answer), I'd recommend doing this reliably using an HTML parser (e.g. JTidy; although it's an HTML pretty-printer, it also provides a DOM interface to the document).

edited May 23, 2017 at 12:27

CommunityBot


answered Sep 4, 2012 at 11:41

Brian Agnew


jdevelop · Accepted Answer · 2012-09-04 11:45:41Z

1

<td class=\"col_stat\">(.+)</td>

This regex is greedy. To make it match only numbers, change it to:

<td class=\"col_stat\">(\\d+?)</td>

I'd rather suggest using XPath for this kind of matching; see Saxon and TagSoup.
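As a rough illustration of the XPath idea, the sketch below uses only the JDK's built-in javax.xml APIs rather than Saxon or TagSoup. It assumes the markup is already well-formed XML, which real web pages often are not, so a real page would first need cleaning by a tool such as TagSoup. The class name XPathCells, the helper colStats and the wrapping <tr> element are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathCells {
    // Hypothetical helper: text of every <td class="col_stat"> in well-formed markup.
    static List<String> colStats(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList cells = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate("//td[@class='col_stat']", doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                values.add(cells.item(i).getTextContent().trim());
            }
            return values;
        } catch (Exception e) {
            // Assumption: malformed input is treated as a caller error.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // The <tr> wrapper is assumed here so the fragment has a single root element.
        String row =
            "<tr>" +
            "<td class=\"col_stat\">1</td>" +
            "<td class=\"col_stat\">0</td>" +
            "<td class=\"col_stat\">1</td>" +
            "</tr>";
        System.out.println(colStats(row)); // prints [1, 0, 1]
    }
}
```

The XPath expression selects cells by element name and attribute, so it does not depend on whitespace or attribute order the way the regex does.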

answered Sep 4, 2012 at 11:45

jdevelop


Marek Dec · Accepted Answer · 2012-09-04 11:49:05Z

0

This is because your quantifier is greedy. You need a non-greedy (reluctant) quantifier to fix this.

String text = "<td class=\"col_stat\">1</td><td class=\"col_stat\">0</td><td class=\"col_stat\">1</td>";

String filePattern = "<td class=\"col_stat\">(.+?)</td>";
Pattern pattern = Pattern.compile(filePattern);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    String number = matcher.group(1);
    System.out.println(number);
}
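One thing worth keeping in mind: a reluctant (.+?) captures whatever is in the cell, not only numbers. The sketch below compares it with (\d+) on a line containing a non-numeric cell. The "n/a" cell is invented for the example and does not appear in the question's page; the class and method names are hypothetical too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantVsDigits {
    // Hypothetical helper: collects group 1 of every match, in order.
    static List<String> captures(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumed input: the middle cell holds text instead of a number.
        String text =
            "<td class=\"col_stat\">1</td>" +
            "<td class=\"col_stat\">n/a</td>" +
            "<td class=\"col_stat\">1</td>";

        // Reluctant .+? stops at the nearest </td> but accepts any characters.
        System.out.println(captures("<td class=\"col_stat\">(.+?)</td>", text)); // [1, n/a, 1]
        // \d+ skips cells whose content is not purely digits.
        System.out.println(captures("<td class=\"col_stat\">(\\d+)</td>", text)); // [1, 1]
    }
}
```

If only numbers are wanted, the digit-based pattern avoids a later NumberFormatException when parsing the captured strings.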

answered Sep 4, 2012 at 11:49

Marek Dec


Philipp · Accepted Answer · 2012-09-04 11:50:01Z

0

Try this regular expression:

<td class="col_stat">(\d+)[^\d]+(\d+)[^\d]+(\d+)

This does the following:

search for your start string
select a chain of decimals
skip any NON-decimals
select a chain of decimals
skip any NON-decimals
select a chain of decimals
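The steps above could be wired into Java roughly as follows. This is a sketch only: the class name ThreeGroups and the method firstThree are invented, and it relies on the markup between the cells containing no digits, since [^\d]+ skips any run of non-digit characters.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreeGroups {
    // Backslashes are doubled because the regex lives inside a Java string literal.
    private static final Pattern THREE_CELLS = Pattern.compile(
        "<td class=\"col_stat\">(\\d+)[^\\d]+(\\d+)[^\\d]+(\\d+)");

    // Hypothetical helper: the three numbers of the first match, or an empty array.
    static int[] firstThree(String text) {
        Matcher m = THREE_CELLS.matcher(text);
        if (!m.find()) {
            return new int[0];
        }
        return new int[] {
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3))
        };
    }

    public static void main(String[] args) {
        String text =
            "<td class=\"col_stat\">1</td>" +
            "<td class=\"col_stat\">0</td>" +
            "<td class=\"col_stat\">1</td>";
        System.out.println(Arrays.toString(firstThree(text))); // prints [1, 0, 1]
    }
}
```

Unlike the looping approaches, this returns all three values from a single find() call, at the cost of hard-coding the number of cells.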

answered Sep 4, 2012 at 11:50

Philipp

