.NET Regex question

Question

I'm trying to parse some data out of a website. The problem is that JavaScript generates the data, so I can't use an HTML parser for it. The string inside the source looks like:

<a href="http:www.domain.compid.php?id=123">

Everything is constant except the id that comes after the =. I don't know how many times the string will occur either. I would appreciate any help, and an explanation of the regex if possible.

If you can pass it to regex, why can't you pass it to a proper parser?
– Jay, Commented Feb 23, 2011 at 2:30
Because the source is garbled by JavaScript unicode escapes such as "\u003A", and HtmlAgilityPack does not work with JavaScript either.
– regexnewb, Commented Feb 23, 2011 at 2:32
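
To illustrate the escapes mentioned in the comment above: one option is to decode literal \uXXXX sequences before matching. This is only a minimal JavaScript sketch, assuming the raw source contains a backslash, a "u" and four hex digits; decodeUnicodeEscapes is a hypothetical helper, and the sample string is made up from the "\u003A" example.

```javascript
// Hypothetical helper: replaces each literal \uXXXX sequence
// (backslash, "u", four hex digits) with the character it encodes.
function decodeUnicodeEscapes(text) {
  return text.replace(/\\u([0-9a-fA-F]{4})/g, function (whole, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

// Assumed sample: the doubled backslash keeps the escape literal in the string.
const raw = "http\\u003A";
const decoded = decodeUnicodeEscapes(raw);
console.log(decoded); // prints "http:"
```

Once decoded, the text can be searched with an ordinary pattern.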

jb. · Accepted Answer · 2011-02-23 03:18:36Z

2

Do you need to save any of it? A blanket regex href="[^"]+"> will match the entire string. If you need to save a specific part, let me know.

EDIT: To save the id, note the parentheses after id=, which form a capture group. To retrieve the captured value, use the match object's Groups property.

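// Requires: using System; using System.Text.RegularExpressions;
string source = "<a href=\"http:www.domain.compid.php?id=123\">";
// [^\"]+ matches everything up to id=; the parentheses capture the id itself.
Regex re = new Regex("href=\"[^\"]+id=([^\"]+)\">");

Match match = re.Match(source);
if (match.Success)
{
    // Groups[0] is the whole match; Groups[1] is the first capture group.
    Console.WriteLine("It's a match!\nI found: {0}", match.Groups[0].Value);
    Console.WriteLine("And the id is {0}", match.Groups[1].Value);
}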

EDIT: example using MatchCollection

MatchCollection mc = re.Matches(source);

foreach (Match m in mc)
{
    // Same as above, but with "m" instead of "match".
    // No Success check is needed: only successful matches are added to the MatchCollection.
    Console.WriteLine("And the id is {0}", m.Groups[1].Value);
}

edited Feb 23, 2011 at 3:18

answered Feb 23, 2011 at 2:30

jb.

8 Comments

regexnewb Over a year ago

@jb Hm, thanks, that looks great. However, I'm not sure how I'm supposed to use the value as a pattern because of the double quotes inside it. I tried a verbatim @ string, but it doesn't compile (Invalid expression term ')', and the same for ^).

jb. Over a year ago

@regexnewb I updated my answer with an example. You would need to escape the " in the regex by doing \"

regexnewb Over a year ago

@jb thanks that works, is it possible to only parse out the value after id = ?

jb. Over a year ago

@regexnewb, sure, so in this case you want to save the "123", right?

regexnewb Over a year ago

@jb thanks, that works for a single match. If I used a MatchCollection, how would I get the id from it? I have multiple links, and I guess I need a MatchCollection to "collect" them all.

Grastveit · Accepted Answer · 2011-02-23 12:07:42Z

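This does the parsing in JavaScript and creates a CSV string:

// Dots and the question mark are escaped so they match literally.
var re = /<a href="http:www\.domain\.compid\.php\?id=(\d+)">/;
var source = document.body.innerHTML;
var result = "result: ";

// Call exec() explicitly; a RegExp object is not callable in standard JavaScript.
var match = re.exec(source);
while (match != null) {
    result += match[1] + ",";
    // Continue searching in the text after the current match.
    source = source.substring(match.index + match[0].length);
    match = re.exec(source);
}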
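
The substring loop above could also be written with a global regex, where each exec() call resumes from the regex's lastIndex. This is only a sketch, assuming the links look exactly like the sample in the question; extractIds and the sample string are hypothetical names for illustration, and in the browser the input would be document.body.innerHTML.

```javascript
// Hypothetical helper: collects every id captured by the pattern.
function extractIds(source) {
  // The "g" flag makes exec() continue from re.lastIndex on each call.
  const re = /href="[^"]+\?id=(\d+)">/g;
  const ids = [];
  let match;
  while ((match = re.exec(source)) !== null) {
    ids.push(match[1]); // match[1] is the first capture group, the digits after id=
  }
  return ids;
}

// Assumed sample input, built from the link format shown in the question.
const sample = '<a href="http:www.domain.compid.php?id=123"><a href="http:www.domain.compid.php?id=456">';
const csv = extractIds(sample).join(",");
console.log(csv); // prints "123,456"
```

Joining the array at the end also avoids the trailing comma that string concatenation leaves behind.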

If the HTML content is not used for anything else on the server, it should be sufficient to send the ids.

EDIT: For performance and reliability, it's probably better to use built-in JavaScript functions (or jQuery) to find the URLs instead of searching the entire content:

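var re = /www\.domain\.compid\.php\?id=(\d+)/;
var anchors = document.getElementsByTagName('a');
var result = "result: ";

for (var i = 0; i < anchors.length; i++) {
    // getAttribute returns null for anchors without an href, so guard before matching.
    var href = anchors[i].getAttribute('href');
    var match = href ? re.exec(href) : null;
    if (match != null) {
        result += match[1] + ",";
    }
}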
