Simple regex question (regex included)

Question

I have these strings in a string stream:

"do=whoposted&amp;t=1934067" rel=nofollow>61</A></TD><TD class=alt2 align=middle>5,286</TD></TR><TR><TD id=td_threadstatusicon_1911046 class=alt1><IMG id=thread_statusicon_1911046 border=0 alt="" src="http://url.com/forum/images/statusicon/thread_new.gif"> </TD><TD class=alt2><IMG title=Film border=0 alt=Film src="http://url.com/forum/images/icons/new.png"></TD><TD id=td_threadtitle_1911046 class=alt1 title="http://lulzimg.com/i14/7bd11b.jpg &#10; &#10;Complete name : cool-thread.."><DIV><A id=thread_gotonew_1911046 href="http://url.com/forum/f80/cool-topic-new/"><IMG class=inlineimg title="Go to first new post" border=0 alt="Go to first new post" src="http://url.com/forum/images/buttons/firstnew.gif"></A> [MULTI] <A style="FONT-WEIGHT: bold" id=thread_title_1911046 href="http://url.com/forum/f80/cool-topic-name-1911046/">Cool Topic Name</A> </DIV><DIV class=smallfont><SPAN style="CURSOR: pointer" onclick="window.open('http://url.com/forum/members/u2031889/', '_self')">m3no</SPAN> </DIV></TD><TD class=alt2 title="Replies: 11, Views: 1,554"><DIV style="TEXT-ALIGN: right; WHITE-SPACE: nowrap" class=smallfont>Today <SPAN class=time>08:04 AM</SPAN><BR>by <A href="http://url.com/forum/members/u1131830/" rel=nofollow>karetsos</A> <A "

Currently I use this:

Regex pattern = new Regex ( "<A\\s+href=\"([^\"]*)\">([^\\x00]*?)\\s+id=thread_title_(\\S+)</A>" );

MatchCollection matches = pattern.Matches ( doc.ToString ( ) );

foreach ( Match match in matches )
{
    int id = Convert.ToInt32 ( match.Groups [ 1 ].Value );

    string name = match.Groups [ 3 ].Value;
    string link = match.Groups [ 2 ].Value;

    ...
}

But it doesn't match anything.

All I am trying to extract are:

IDs: 942321, 512147.

Names: "Visible Thread Name", "Cool Thread"

Links: "http://url.com/forum/f80/new-topic-name-942321", "http://url.com/forum/f80/cool-topic-name-512147"

Any ideas on how to fix it?

By default, regexes are case-sensitive (a != A). One possible solution is to pass RegexOptions.IgnoreCase as the second parameter to your Regex constructor.
– Heinzi, Commented Mar 17, 2011 at 15:19
There is something wrong with your regex: between 'thread_title_' and '</A>' you should have 2 groups (id and name), but you only have one. Rebuild it from scratch, part by part, and see at which point it stops matching.
– Kipotlov, Commented Mar 17, 2011 at 15:47

Heinzi · Accepted Answer · 2011-03-17 16:09:25Z

1

List of issues that I found:

- By default, regexes are case-sensitive (a != A). One possible solution is to pass RegexOptions.IgnoreCase as the second parameter to your Regex constructor.
- id=thread...: you seem to be missing the opening " after id=.
- After matching the id you suddenly stop. Don't you want to match the name as well, in a third group? I guess your regex should end like this:
```
id=\"thread_title_([0-9]+)\">([^<]+)</a>
```
- Oh, and don't close the a tag after the href, because the thread_title id is still inside the tag. In href=\"([^\"]*)\">, remove the > at the end.
- In addition, remove that strange [^\\x00]*? group. What's that good for anyway?
- After capturing the thread_title id, you need to skip everything up to the closing > in order to ignore the style=... attribute.
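
The case-sensitivity point is easy to check in isolation. The following is only a minimal sketch, not part of the original code: the class and method names (`CaseSensitivitySketch`, `HasAnchorStart`) and the one-line test input are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaseSensitivitySketch
{
    // Hypothetical helper: true if the input contains "<a href=" under the given options.
    public static bool HasAnchorStart(string html, RegexOptions options)
    {
        return Regex.IsMatch(html, @"<a\s+href=", options);
    }

    public static void Main()
    {
        // Upper-case tag name, like the markup in the question.
        string html = "<A href=\"http://url.com/forum/\">";

        // Without IgnoreCase, the lower-case "a" in the pattern does not match "A".
        Console.WriteLine(HasAnchorStart(html, RegexOptions.None));
        // With IgnoreCase, the same pattern matches.
        Console.WriteLine(HasAnchorStart(html, RegexOptions.IgnoreCase));
    }
}
```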

Full solution (warning, spoiler ahead). The @"..." verbatim-string syntax means you don't need to escape backslashes (but you do need to escape each quote by doubling it).

Regex pattern = new Regex (@"<a\s+href=""([^""]*)""\s+id=""thread_title_([0-9]+)""[^>]*>([^<]+)</a>");
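
For reference, here is a rough sketch of how that pattern could be wired into the extraction loop, with RegexOptions.IgnoreCase added as suggested above. The class name `ThreadTitleSketch`, the `Extract` helper and the input string are assumptions made for the example; the input is hand-written in the shape the pattern expects (href first, then a quoted id), not taken from the real page.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ThreadTitleSketch
{
    // The pattern from the answer, plus IgnoreCase so "<a" also matches "<A".
    private static readonly Regex Pattern = new Regex(
        @"<a\s+href=""([^""]*)""\s+id=""thread_title_([0-9]+)""[^>]*>([^<]+)</a>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns one (id, name, link) tuple per matching anchor.
    public static List<(int Id, string Name, string Link)> Extract(string html)
    {
        var results = new List<(int Id, string Name, string Link)>();
        foreach (Match m in Pattern.Matches(html))
        {
            // Group 1 = href value, group 2 = numeric id, group 3 = link text.
            results.Add((int.Parse(m.Groups[2].Value), m.Groups[3].Value, m.Groups[1].Value));
        }
        return results;
    }

    public static void Main()
    {
        // Made-up input: href first, quoted id, one extra attribute before ">".
        string html = "<A href=\"http://url.com/forum/f80/cool-topic-name-512147/\" " +
                      "id=\"thread_title_512147\" style=\"font-weight: bold\">Cool Thread</A>";

        foreach (var t in Extract(html))
            Console.WriteLine($"{t.Id} | {t.Name} | {t.Link}");
    }
}
```

Note that the group order here is link, id, name, so the indexes in the question's loop (which reads the id from group 1) would need to change accordingly.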

BTW, for debugging this I used the following tool, which I can recommend and which automatically provides an escaped version:

http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
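
One caveat, judging only from the sample pasted in the question: there the id attribute is unquoted (id=thread_title_1911046) and appears before href, so a pattern that expects href first and a quoted id would not match that exact text. Below is a sketch of a more tolerant variant, assuming the markup looks like the posted sample. The class name `PostedSampleSketch` and the `Extract` helper are hypothetical, and the pattern is an illustration rather than a tested fix for the real page.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PostedSampleSketch
{
    // Assumed attribute order: id (quotes optional) somewhere before href, in the same tag.
    // [^>]*? keeps the match inside a single opening tag.
    private static readonly Regex Pattern = new Regex(
        @"<a\s[^>]*?\bid=""?thread_title_([0-9]+)""?[^>]*?\bhref=""([^""]*)""[^>]*>([^<]+)</a>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns one (id, name, link) tuple per matching anchor.
    public static List<(int Id, string Name, string Link)> Extract(string html)
    {
        var results = new List<(int Id, string Name, string Link)>();
        foreach (Match m in Pattern.Matches(html))
        {
            // Group 1 = numeric id, group 2 = href value, group 3 = link text.
            results.Add((int.Parse(m.Groups[1].Value), m.Groups[3].Value, m.Groups[2].Value));
        }
        return results;
    }

    public static void Main()
    {
        // The thread-title anchor copied from the sample in the question.
        string html = @"<A style=""FONT-WEIGHT: bold"" id=thread_title_1911046 " +
                      @"href=""http://url.com/forum/f80/cool-topic-name-1911046/"">Cool Topic Name</A>";

        foreach (var t in Extract(html))
            Console.WriteLine($"{t.Id} | {t.Name} | {t.Link}");
    }
}
```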

edited Mar 17, 2011 at 16:09

answered Mar 17, 2011 at 15:55

Heinzi


8 Comments

Joan Venge Over a year ago

Thanks, can you please post the full version just so that it's easier to see? I mean just the escaped version if possible.

Joan Venge Over a year ago

I tried this with IgnoreCase but didn't work either: "<a\\s+href=\"([^\"]*)\">([^\\x00]*?)\\s+id=\"thread_title_([0-9]+)\">([^<]+)</a>"

Heinzi Over a year ago

@Joan: No, I won't post the full solution, since it would encourage you to just copy&paste it without fully understanding the regex. ... OK, just kidding, here it is. :-) See the edited answer.

Joan Venge Over a year ago

Thanks for updating, this is really strange. I get different output than IE's view source shows me. I will have to get a new sample and see how different it is.

Heinzi Over a year ago

@Joan: PS: For parsing "real" HTML, using an HTML parser is usually a better solution than regex. Consider, for example, the HTML Agility Pack.


Joachim VR · Accepted Answer · 2011-03-17 15:27:09Z

1

This returns what you need. No need to be overly strict here:

<a.+href=".*topic\-name\-(\S+)\/.+thread_title_(\S+)"

answered Mar 17, 2011 at 15:27

Joachim VR


3 Comments

Joan Venge Over a year ago

Thanks, btw can you please post the escaped version? I am not sure which characters to escape other than quotes.

Joachim VR Over a year ago

You can put an @ in front of your string, which makes it a verbatim string: backslashes are taken literally and escape sequences are disabled (except that a double quote must be written as two double quotes). This is used a lot in paths, where there are a lot of backslashes: string aRegex = @"<a.+href="".*topic\-name\-(\S+)\/.+thread_title_(\S+)""";
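
As a quick check of the verbatim-string rule, the sketch below writes the same pattern text once as a regular string (backslashes and quotes escaped with a backslash) and once as a verbatim string (backslashes as-is, quotes doubled). The class and field names are made up for the example.

```csharp
using System;

public static class VerbatimStringSketch
{
    // Regular string literal: \\ for a backslash, \" for a quote.
    public static readonly string Escaped = "<a\\s+href=\"([^\"]*)\"";

    // Verbatim string literal: backslashes are literal, a quote is written as "".
    public static readonly string Verbatim = @"<a\s+href=""([^""]*)""";

    public static void Main()
    {
        // Both literals produce the same characters.
        Console.WriteLine(Verbatim);
        Console.WriteLine(Escaped == Verbatim);
    }
}
```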

Joan Venge Over a year ago

Thanks will try it when I get back home :O
