I have these strings in a string stream:

"do=whoposted&amp;t=1934067" rel=nofollow>61</A></TD><TD class=alt2 align=middle>5,286</TD></TR><TR><TD id=td_threadstatusicon_1911046 class=alt1><IMG id=thread_statusicon_1911046 border=0 alt="" src="http://url.com/forum/images/statusicon/thread_new.gif"> </TD><TD class=alt2><IMG title=Film border=0 alt=Film src="http://url.com/forum/images/icons/new.png"></TD><TD id=td_threadtitle_1911046 class=alt1 title="http://lulzimg.com/i14/7bd11b.jpg &#10; &#10;Complete name : cool-thread.."><DIV><A id=thread_gotonew_1911046 href="http://url.com/forum/f80/cool-topic-new/"><IMG class=inlineimg title="Go to first new post" border=0 alt="Go to first new post" src="http://url.com/forum/images/buttons/firstnew.gif"></A> [MULTI] <A style="FONT-WEIGHT: bold" id=thread_title_1911046 href="http://url.com/forum/f80/cool-topic-name-1911046/">Cool Topic Name</A> </DIV><DIV class=smallfont><SPAN style="CURSOR: pointer" onclick="window.open('http://url.com/forum/members/u2031889/', '_self')">m3no</SPAN> </DIV></TD><TD class=alt2 title="Replies: 11, Views: 1,554"><DIV style="TEXT-ALIGN: right; WHITE-SPACE: nowrap" class=smallfont>Today <SPAN class=time>08:04 AM</SPAN><BR>by <A href="http://url.com/forum/members/u1131830/" rel=nofollow>karetsos</A> <A "

Currently I use this:

Regex pattern = new Regex ( "<A\\s+href=\"([^\"]*)\">([^\\x00]*?)\\s+id=thread_title_(\\S+)</A>" );

MatchCollection matches = pattern.Matches ( doc.ToString ( ) );

foreach ( Match match in matches )
{
    int id = Convert.ToInt32 ( match.Groups [ 1 ].Value );

    string name = match.Groups [ 3 ].Value;
    string link = match.Groups [ 2 ].Value;

    ...
}

But it doesn't match anything.

All I am trying to extract are:

IDs: 942321, 512147.

Names: "Visible Thread Name", "Cool Thread"

Links: "http://url.com/forum/f80/new-topic-name-942321", "http://url.com/forum/f80/cool-topic-name-512147"

Any ideas on how to fix it?

  • By default, regexes are case-sensitive (a != A). One possible solution is to pass RegexOptions.IgnoreCase as the second parameter to your Regex constructor. Commented Mar 17, 2011 at 15:19
  • Still returns 0 matches. Commented Mar 17, 2011 at 15:23
  • There is something wrong with your regex, between 'thread_title_' and '</A>' you should have 2 groups (id and name), but you only have one. You should rebuild it from scratch and go part by part and see when it doesn't work. Commented Mar 17, 2011 at 15:47

2 Answers

List of issues that I found:

  • By default, regexes are case-sensitive (a != A). One possible solution is to pass RegexOptions.IgnoreCase as the second parameter to your Regex constructor.

  • id=thread...: you seem to be missing the opening " after id=

  • After matching the id you suddenly stop... don't you want to match the name as well in a third group? I guess your regex should end like this:

    id=\"thread_title_([0-9]+)\">([^<]+)</a>
    
  • Oh, and don't close the a tag after the href, because the thread_title_id is still inside the tag:

    href=\"([^\"]*)\">: remove the > at the end

  • In addition, remove that strange [^\\x00]*? group. What's that good for anyway?

  • After capturing the thread_title_id, you need to ignore stuff until the closing >, in order to ignore the style=... attribute.


Full solution (warning, spoiler ahead). The @"..." syntax means you don't need to escape backslashes (but a double quote inside the string has to be written as two double quotes).

Regex pattern = new Regex (@"<a\s+href=""([^""]*)""\s+id=""thread_title_([0-9]+)""[^>]*>([^<]+)</a>", RegexOptions.IgnoreCase);
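For reference, here is a minimal sketch of how this pattern could be used, with RegexOptions.IgnoreCase added as suggested in the first bullet. The HTML string and the names ThreadTitleDemo and Find are made up for illustration. It assumes markup where href comes first and the id is quoted, which is what the pattern expects.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ThreadTitleDemo
{
    // Pattern from the answer, with IgnoreCase so that <A ...> and </A> match too.
    public static readonly Regex Pattern = new Regex(
        @"<a\s+href=""([^""]*)""\s+id=""thread_title_([0-9]+)""[^>]*>([^<]+)</a>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns all thread-title links found in the given HTML.
    public static MatchCollection Find(string html)
    {
        return Pattern.Matches(html);
    }

    public static void Main()
    {
        // Assumed markup (not taken from the real page): href first, quoted id.
        string html =
            "<A href=\"http://url.com/forum/f80/cool-topic-name-512147/\" " +
            "id=\"thread_title_512147\" style=\"font-weight:bold\">Cool Thread</A>";

        foreach (Match match in Find(html))
        {
            string link = match.Groups[1].Value;              // href value
            int id = Convert.ToInt32(match.Groups[2].Value);  // digits after thread_title_
            string name = match.Groups[3].Value;              // link text
            Console.WriteLine("{0} | {1} | {2}", id, name, link);
        }
    }
}
```

Note that the group order differs from the code in the question: here group 1 is the link, group 2 the ID and group 3 the name.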

BTW, for debugging this I used the following tool, which I can recommend and which automatically provides an escaped version:


8 Comments

Thanks, can you please post the full version just so that it's easier to see? I mean just the escaped version if possible.
I tried this with IgnoreCase but didn't work either: "<a\\s+href=\"([^\"]*)\">([^\\x00]*?)\\s+id=\"thread_title_([0-9]+)\">([^<]+)</a>"
@Joan: No, I won't post the full solution, since it would encourage you to just copy&paste it without fully understanding the regex. ... OK, just kidding, here it is. :-) See the edited answer.
Thanks for updating, this is really strange. I get different output than IE's view source shows me. I will have to get a new sample and see how different it is.
@Joan: PS: For parsing "real" HTML, using an HTML parser is usually a better solution than regex. Consider, for example, the HTML Agility Pack.
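On the differing output: in the sample quoted in the question, the id is unquoted and comes before href, so a pattern that expects href first and a quoted id will not match it. If the markup really looks like that sample, a looser pattern along these lines might work. This is only a sketch under that assumption. The names LooseThreadTitleDemo and Find are hypothetical, the HTML is a shortened excerpt of the question's sample, and an HTML parser remains the more robust choice.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LooseThreadTitleDemo
{
    // Two lookaheads locate id and href inside the same tag, in either order.
    // The quotes around the id value are optional; href is assumed to be quoted.
    public static readonly Regex Pattern = new Regex(
        @"<a\b(?=[^>]*\bid=""?thread_title_(?<id>[0-9]+))" +
        @"(?=[^>]*\bhref=""(?<link>[^""]*)"")[^>]*>(?<name>[^<]+)</a>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns all thread-title links found in the given HTML.
    public static MatchCollection Find(string html)
    {
        return Pattern.Matches(html);
    }

    public static void Main()
    {
        // Shortened excerpt of the markup quoted in the question.
        string html =
            "<A id=thread_gotonew_1911046 href=\"http://url.com/forum/f80/cool-topic-new/\">" +
            "<IMG class=inlineimg border=0></A> [MULTI] " +
            "<A style=\"FONT-WEIGHT: bold\" id=thread_title_1911046 " +
            "href=\"http://url.com/forum/f80/cool-topic-name-1911046/\">Cool Topic Name</A>";

        foreach (Match match in Find(html))
        {
            int id = Convert.ToInt32(match.Groups["id"].Value);
            string name = match.Groups["name"].Value;
            string link = match.Groups["link"].Value;
            Console.WriteLine("{0} | {1} | {2}", id, name, link);
        }
    }
}
```

Named groups avoid having to remember which number belongs to which value. The pattern still assumes that no attribute value contains a > character and that the link text contains no nested tags.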

This returns what you need. No need to be overly strict here:

<a.+href=".*topic\-name\-(\S+)\/.+thread_title_(\S+)"

3 Comments

Thanks, btw can you please post the escaped version? I am not sure which characters to escape other than quotes.
You can put an @ in front of your string to make it a verbatim literal, which disables backslash escapes. The one exception is the double quote, which you write as two double quotes. This is used a lot for paths, where there are many backslashes: string aRegex = @"<a.+href="".*topic\-name\-(\S+)\/.+thread_title_(\S+)""";
Thanks will try it when I get back home :O
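To make the two escaping styles concrete, here is a small sketch that writes the pattern from this answer once as a regular literal and once as a verbatim literal, then checks that both yield the same text. The class and constant names are made up for illustration.

```csharp
using System;

public static class VerbatimStringDemo
{
    // Regular literal: every backslash is doubled and every quote is written as \".
    public const string Regular =
        "<a.+href=\".*topic\\-name\\-(\\S+)\\/.+thread_title_(\\S+)\"";

    // Verbatim literal: backslashes are taken as-is, a quote is written as "".
    public const string Verbatim =
        @"<a.+href="".*topic\-name\-(\S+)\/.+thread_title_(\S+)""";

    public static void Main()
    {
        // Both literals describe the same sequence of characters.
        Console.WriteLine(Regular == Verbatim);
        Console.WriteLine(Verbatim);
    }
}
```

Either form can be passed to the Regex constructor. The verbatim form is usually easier to read because the regex backslashes appear exactly as they would in a regex tester.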
