I'm trying to parse some data out of a website. The problem is that the data is generated by JavaScript, so I can't use an HTML parser for it. The string inside the source looks like:

<a href="http:www.domain.compid.php?id=123">

Everything is constant except the id that comes after the =. I also don't know how many times the string will occur. I would appreciate any help, and an explanation of the regex if possible.

  • If you can pass it to regex, why can't you pass it to a proper parser? Commented Feb 23, 2011 at 2:30
  • Because the source is mangled by JavaScript Unicode escapes such as "\u003A", and HtmlAgilityPack does not handle JavaScript either. Commented Feb 23, 2011 at 2:32

2 Answers

Do you need to save any of it? A blanket regex href="[^"]+"> will match the entire string. If you need to save a specific part, let me know.

EDIT: To save the id, note the parentheses after id=, which tell the regex to capture that part. To retrieve it, use the Match object's Groups property.

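using System;
using System.Text.RegularExpressions;

string source = "<a href=\"http:www.domain.compid.php?id=123\">";
// The parentheses around [^\"]+ capture everything after "id=" up to the closing quote.
Regex re = new Regex("href=\"[^\"]+id=([^\"]+)\">");

Match match = re.Match(source);
if (match.Success)
{
    // Groups[0] is the whole match; Groups[1] is the first captured group.
    Console.WriteLine("It's a match!\nI found: {0}", match.Groups[0].Value);
    Console.WriteLine("And the id is {0}", match.Groups[1].Value);
}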

EDIT: example using MatchCollection

MatchCollection mc = re.Matches(source);

foreach (Match m in mc)
{
    // Every Match in the collection is a successful match,
    // so there is no need to check m.Success here.
    Console.WriteLine("And the id is {0}", m.Groups[1].Value);
}

8 Comments

@jb Hm, thanks, that looks great. However, I'm not sure how I am supposed to use the value as a pattern because of the double quotes inside it. I tried a verbatim @ string, but it doesn't compile (Invalid expression term ')', and the same for ^).
@regexnewb I updated my answer with an example. You would need to escape the " in the regex by doing \"
@jb thanks that works, is it possible to only parse out the value after id = ?
@regexnewb, sure, so in this case you want to save the "123", right?
@jb thanks, that works for a single match. If I used a MatchCollection, how would I get the id from it? I have multiple links, and I guess I need a MatchCollection to "collect" them all.

This does the parsing in JavaScript and creates a CSV string:

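// Dots are escaped so that they match only a literal "." character.
var re = /<a href="http:www\.domain\.compid\.php\?id=(\d+)">/;
var source = document.body.innerHTML;
var result = "result: ";

// RegExp objects are not callable; exec() returns the match array or null.
var match = re.exec(source);
while (match != null) {
    result += match[1] + ",";
    // Continue searching after the end of the current match.
    source = source.substring(match.index + match[0].length);
    match = re.exec(source);
}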

If the HTML content is not used for anything else on the server, it should be sufficient to send only the ids.
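
As an alternative to re-slicing the string after every match, a regex with the global flag can walk the whole input in one pass. This is only a minimal sketch, assuming the page source is already available as a plain string; extractIds is a hypothetical helper name, not part of the original answer:

```javascript
// Hypothetical helper: collects every id from links of the form shown in the question.
function extractIds(source) {
    // With the "g" flag, each exec() call resumes from re.lastIndex.
    var re = /href="http:www\.domain\.compid\.php\?id=(\d+)">/g;
    var ids = [];
    var match;
    while ((match = re.exec(source)) !== null) {
        ids.push(match[1]); // match[1] is the captured id
    }
    return ids;
}

// Sample input made up for illustration.
var sample = '<a href="http:www.domain.compid.php?id=123">' +
             '<a href="http:www.domain.compid.php?id=456">';
console.log("result: " + extractIds(sample).join(",")); // prints "result: 123,456"
```

Returning an array and joining it at the end also avoids the trailing comma that string concatenation leaves behind.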

EDIT: For performance and reliability, it's probably better to use built-in JavaScript functions (or jQuery) to find the URLs instead of searching the entire content:

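// Dots are escaped so that they match only a literal "." character.
var re = /www\.domain\.compid\.php\?id=(\d+)/;
var links = document.getElementsByTagName('a');
var result = "result: ";

for (var i = 0; i < links.length; i++) {
    // RegExp objects are not callable; exec() returns the match array or null.
    var match = re.exec(links[i].getAttribute('href'));
    if (match != null) {
        result += match[1] + ",";
    }
}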
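
The asker mentioned in a comment that the raw source contains JavaScript Unicode escapes such as "\u003A". If the text has to be matched before a browser has interpreted it, one option is to decode those escapes first. The following is a rough sketch under that assumption; decodeUnicodeEscapes and the sample string are hypothetical and only illustrate the idea:

```javascript
// Hypothetical helper: turns literal \uXXXX sequences into the characters they encode.
function decodeUnicodeEscapes(text) {
    return text.replace(/\\u([0-9a-fA-F]{4})/g, function (whole, hex) {
        return String.fromCharCode(parseInt(hex, 16));
    });
}

// Assumed sample: "\\u003A" in this literal is the six raw characters \u003A,
// as they might appear in the undecoded page source.
var raw = 'href="http\\u003Awww.domain.compid.php?id=123">';
var decoded = decodeUnicodeEscapes(raw);
var idMatch = /id=(\d+)/.exec(decoded);

console.log(decoded);    // prints href="http:www.domain.compid.php?id=123">
console.log(idMatch[1]); // prints 123
```

This only handles four-digit \uXXXX escapes; other escape forms would need their own handling.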
