
I need to use a regex to search through an HTML file and replace href="pagename" with href="pages/pagename".

The href could also be formatted like HREF = 'pagename'.

I do not want to replace any hrefs (upper or lowercase) that begin with http, ftp, mailto, javascript, or #.

I am using C# to develop this little app.

  • So... you want someone else to write your regex for you? Why not just take the time to learn it for yourself? Commented May 2, 2011 at 17:47
  • I very rarely use regex. I was pushed for time on this little number, so I thought I'd send out my first SOS. Commented May 2, 2011 at 18:10

3 Answers

3

HTML manipulation through Regex is not recommended since HTML is not a "regular language." I'd highly recommend using the HTML Agility Pack instead. That gives you a DOM interface for HTML.


1 Comment

+1/2 if I could. While you are right about HTML not being a regular language, the given problem does not require a parser; this specific case can be done with regex.
0

I have not tested this against many cases, but it worked for this one:

var str = "href='page' href = 'www.goo' href='http://' href='ftp://'";
Console.WriteLine(Regex.Replace(str, @"href ?= ?(('|"")([a-z0-9_#.-]+)('|""))", "x", RegexOptions.IgnoreCase));

Result:

"x x href='http://' href='ftp://'"

You had better keep backup files before running this :P
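The snippet above replaces each match with "x" only to show which hrefs are caught. A minimal sketch of the actual rewrite follows, assuming the same kind of character class. The class name HrefPrefixDemo and the method AddPrefix are hypothetical names chosen for illustration. It also drops # from the character class so that in-page anchors are left alone, and uses the backreference \2 so the closing quote has to be the same as the opening one.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefPrefixDemo
{
    // Group 1: "href" plus the optional spaces around '='.
    // Group 2: the opening quote; \2 requires the same quote to close the value.
    // Group 3: the page name (letters, digits, '_', '.', '-'; '#' is left out on purpose).
    private static readonly Regex RelativeHref = new Regex(
        @"(href ?= ?)('|"")([a-z0-9_.-]+)\2",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: inserts "pages/" in front of every matched page name.
    public static string AddPrefix(string html)
    {
        return RelativeHref.Replace(html, "${1}${2}pages/${3}${2}");
    }

    public static void Main()
    {
        var str = "href='page' href = 'www.goo' href='http://' href='ftp://'";
        // Prints: href='pages/page' href = 'pages/www.goo' href='http://' href='ftp://'
        Console.WriteLine(AddPrefix(str));
    }
}
```

Values containing a colon or slash never match the character class, which is why the http:// and ftp:// links pass through unchanged.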


0

There are lots of caveats when using find/replace with HTML and XML. The problem is that many variations of syntax are permitted (and many which are not permitted still work!).

But, you seem to want something like this:

search for

([Hh][Rr][Ee][Ff]\s*=\s*['"])(\w+)(['"])

This means:

  • [Hh]: any one of the characters in the square brackets (likewise for [Rr], [Ee] and [Ff]), followed by
  • \s*: any amount of whitespace (maybe none),
  • =: a literal equals sign,
  • \s*: any more whitespace,
  • ['"]: either quote type,
  • \w+: a word made of letters, digits and underscores (without any slashes, dots or hyphens; if you want to include .html then use [.\w]+ instead),
  • and ['"]: another quote of either kind.

replace with

$1pages/$2$3

This means the contents of the first set of brackets, then pages/, then the contents of the second and third sets of brackets.
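As a rough sketch of how the search and replace strings fit together in C#, the following passes them to Regex.Replace. The class name PagesPrefixer and the method Rewrite are hypothetical, and the sample HTML is made up for illustration. The pattern is written as a verbatim string (@"..."), in which "" stands for a single double quote.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PagesPrefixer
{
    // The search pattern from this answer, written as a C# verbatim string.
    private const string Pattern = @"([Hh][Rr][Ee][Ff]\s*=\s*['""])(\w+)(['""])";

    // Hypothetical helper: $1 is "href=" plus the opening quote, $2 the page name, $3 the closing quote.
    public static string Rewrite(string html)
    {
        return Regex.Replace(html, Pattern, "$1pages/$2$3");
    }

    public static void Main()
    {
        // Prints: <a HREF = 'pages/pagename'>x</a> <a href="http://example.com">y</a>
        Console.WriteLine(Rewrite("<a HREF = 'pagename'>x</a> <a href=\"http://example.com\">y</a>"));
    }
}
```

The http link is skipped because \w+ cannot match the colon, so the closing quote is never reached. The same goes for mailto:, javascript: and # values.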

You will need to put the first string in @" quotes, and also escape the double-quotes as "". Note that it won't do anything even vaguely intelligent, like making sure the quotes match. Warning: try never to use the "any character" symbol (.) in this kind of regex, as it is greedy and will grab large sections of text, past the next quotation mark, possibly up to the last quotation mark on the line (or in the whole file if RegexOptions.Singleline is set)!
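A stricter variant can address both caveats. This is only a sketch under some assumptions: the excluded prefixes are the ones listed in the question, and StrictHrefRewriter and Rewrite are hypothetical names. It uses a negated character class ([^'"]+) instead of the dot, a backreference (\2) so the quotes must match, and a negative lookahead to skip the excluded prefixes explicitly, which also lets values with dots and slashes through.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StrictHrefRewriter
{
    // Group 1: "href" and '=' with optional whitespace; group 2: the opening quote.
    // (?!...) rejects values that start with an excluded scheme or with '#'.
    // [^'"]+ stops at the first quote; \2 requires the closing quote to equal the opening one.
    private static readonly Regex Href = new Regex(
        @"(href\s*=\s*)(['""])(?!(?:https?|ftp|mailto|javascript):|#)([^'""]+)\2",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: prefixes every remaining href value with "pages/".
    public static string Rewrite(string html)
    {
        return Href.Replace(html, "${1}${2}pages/${3}${2}");
    }

    public static void Main()
    {
        // Greedy dot: the group swallows text beyond the first closing quote.
        Console.WriteLine(Regex.Match("href='a' id='b'", @"href='(.+)'").Groups[1].Value); // a' id='b

        Console.WriteLine(Rewrite("<a HREF = 'pagename'>"));             // <a HREF = 'pages/pagename'>
        Console.WriteLine(Rewrite("<a href=\"sub/page.html\">"));        // <a href="pages/sub/page.html">
        Console.WriteLine(Rewrite("<a href='#top'>"));                   // unchanged
        Console.WriteLine(Rewrite("<a href=\"MAILTO:a@example.com\">")); // unchanged
    }
}
```

This still is not an HTML parser: unquoted attribute values, values that contain a quote character, and hrefs inside comments or scripts are not handled.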

See a regex tutorial for more info, e.g. http://www.regular-expressions.info/dotnet.html

