Processing HTML document with C#

Question

I have a few hundred static HTML files that need to be processed.

They contain links like this:

 <a href="http://www.mysite.com/">Link</a>

I need to add ?ref=self to any URL that begins with http://www.mysite.com, so that the link above becomes:

<a href="http://www.mysite.com/?ref=self">Link</a>

However, I do not know whether the URL will be written as http://www.mysite.com or http://www.mysite.com/, and it could also link to a subdirectory.

What's the most efficient way to do this in C#?

I asked myself the same question and gave your question an upvote.
– jgauffin, Commented Aug 22, 2010 at 6:20

strager · Accepted Answer · 2010-08-22 08:30:21Z

1

"What's the most efficient way to do this in C#?"

1. Look for the next occurrence of the string http://www.mysite.com, starting after the previous insertion.
2. If it doesn't exist, go to 7.
3. Look for the next ".
4. If it doesn't exist, error.
5. Insert ?ref=self before the ".
6. Go to 1.
7. Return.
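
The steps above could be sketched in C# roughly as follows. This is only an illustration, not the answerer's code: the class and method names (`LinkTagger`, `AppendRef`) are hypothetical, and it assumes, as the steps do, that every matching URL is terminated by a double quote.

```csharp
using System;
using System.Text;

public static class LinkTagger
{
    // Prefix and suffix come from the question; the names here are hypothetical.
    private const string Prefix = "http://www.mysite.com";
    private const string Suffix = "?ref=self";

    public static string AppendRef(string html)
    {
        var result = new StringBuilder(html.Length + 64);
        int position = 0;

        while (true)
        {
            // Steps 1-2: find the next occurrence of the prefix, or stop looping.
            int start = html.IndexOf(Prefix, position, StringComparison.Ordinal);
            if (start < 0)
            {
                break;
            }

            // Steps 3-4: find the double quote that ends the URL, or report an error.
            int quote = html.IndexOf('"', start + Prefix.Length);
            if (quote < 0)
            {
                throw new FormatException("No closing double quote after offset " + start + ".");
            }

            // Step 5: copy the text up to the quote, then add the suffix before it.
            result.Append(html, position, quote - position);
            result.Append(Suffix);

            // Step 6: resume at the quote so the same URL is not matched again.
            position = quote;
        }

        // Step 7: copy whatever is left and return.
        result.Append(html, position, html.Length - position);
        return result.ToString();
    }
}
```

Like the steps it follows, this sketch appends ?ref=self blindly: a URL that already has a query string or a fragment, or one that appears outside an href attribute, would need extra handling.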

This can be accomplished with the following regular expression substitution:

s#http://www\.mysite\.com[^"]*#&?ref=self#g
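
In C#, a substitution like this might be written with `Regex.Replace`. The sketch below is an assumption about how one could translate it, not part of the original answer: the class name `RegexLinkTagger` is hypothetical, the dots are escaped so that they match literally, and `$&` is the .NET substitution for the whole match.

```csharp
using System.Text.RegularExpressions;

public static class RegexLinkTagger
{
    // Dots are escaped to match literally; [^"]* runs up to the closing double quote.
    private static readonly Regex SiteUrl =
        new Regex(@"http://www\.mysite\.com[^""]*", RegexOptions.Compiled);

    public static string AppendRef(string html)
    {
        // "$&" inserts the entire matched URL, followed by the literal suffix.
        return SiteUrl.Replace(html, "$&?ref=self");
    }
}
```

For a few hundred files, one option is to read each file with `File.ReadAllText`, pass the contents through `AppendRef`, and write the result back with `File.WriteAllText`.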

A nicer (more expressive) way would be to use an HTML parser and XPath.
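
As one possible illustration of the parser approach, the sketch below uses Html Agility Pack, a third-party NuGet package that the answer does not name; choosing it is an assumption, and the class name `ParserLinkTagger` is hypothetical. Because it reads the href attribute through the parser, it does not depend on which quote character surrounds the value.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

public static class ParserLinkTagger
{
    public static string AppendRef(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // XPath: every <a> element that has an href attribute.
        // SelectNodes returns null when nothing matches.
        var anchors = document.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return html;
        }

        foreach (HtmlNode anchor in anchors)
        {
            string href = anchor.GetAttributeValue("href", string.Empty);
            if (href.StartsWith("http://www.mysite.com", StringComparison.OrdinalIgnoreCase))
            {
                // Same blind append as the string-based approaches.
                anchor.SetAttributeValue("href", href + "?ref=self");
            }
        }

        // The serialized markup may differ cosmetically from the input.
        return document.DocumentNode.OuterHtml;
    }
}
```

The trade-off is an extra dependency and a full parse of each file, in exchange for touching only real href attributes.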

edited Aug 22, 2010 at 8:30

answered Aug 22, 2010 at 5:43

strager

3 Comments

Timwi Over a year ago

Bug: The href attribute could be in single quotes ☺

strager Over a year ago

@Timwi, That's not a bug. The OP clearly stated what the expected input is (which didn't include '), and that efficiency was a factor (so they say...).

Timwi Over a year ago

I don’t see where he stated that. The OP clearly stated that the expected input is HTML. He neither stated that it is a special subset of HTML, nor did he state that his examples are exhaustive. If I hadn’t commented, he might not have realised that any of his HTML files could contain href attributes with single quotes and that your algorithm would silently skip them.

Community · Accepted Answer · 2017-05-23 12:07:08Z

1

Parsing HTML can be tricky because real-world HTML often contains poorly formed tags and attributes. I suggest using an existing HTML parsing library to do the heavy lifting, or using XSLT to transform valid (X)HTML into your desired output.

The question "What is the best way to parse html in C#?" has some good links to HTML parsing libraries for C#.

edited May 23, 2017 at 12:07

Community Bot

answered Aug 22, 2010 at 5:39

jscharf

3 Comments

jgauffin Over a year ago

An HTML parsing library is like taking a cannon to a duck hunt in this case.

strager Over a year ago

@jgauffin, I don't see how. It's definitely an appropriate solution.

jgauffin Over a year ago

Because the URIs are quite easy to find and replace in this case.

bjhamltn · Accepted Answer · 2010-08-22 08:55:12Z

0

You could use Page.Request.UrlReferrer to detect where the request came from.

answered Aug 22, 2010 at 8:55

bjhamltn
