2

How would I use regex to parse the following:

<b>HelloWorld</b>
<p>This is a test</p>
<a href="myUrl">Google</a>

All HTML tags need to be removed and the URLs extracted from hyperlink tags. The result should be:

HelloWorld
This is a test
myUrl

2 Answers

8

I know this isn't the answer you expect, but you shouldn't try to parse HTML with regular expressions. HTML is far too complicated for regexes, and all sorts of things can go wrong. It is very hard to write a regex that parses HTML reliably; I'm not even sure it's possible.
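To illustrate one way this goes wrong, here is a minimal Python sketch. The pattern, the `naive_strip` helper, and the `tricky` input are illustrative assumptions, not something taken from the question. It shows a common first attempt that works on the simple sample but breaks as soon as an attribute value contains a `>` character.

```python
import re

# A common first attempt: treat anything between '<' and '>' as a tag.
NAIVE_TAG = re.compile(r"<[^>]+>")


def naive_strip(html):
    """Remove everything the naive pattern considers a tag."""
    return NAIVE_TAG.sub("", html)


simple = "<b>HelloWorld</b>"
# Hypothetical input: a '>' inside a quoted attribute value is legal HTML.
tricky = '<a title="a > b" href="myUrl">Google</a>'

print(naive_strip(simple))  # the simple case works
print(naive_strip(tricky))  # part of the tag leaks into the output
```

The pattern stops at the first `>` it sees, so the rest of the opening tag is left behind as if it were text. Comments, scripts, and malformed markup cause similar problems.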

Use something like Beautiful Soup (Python) or the HTML Agility Pack (.NET) instead. Alternatively, you can build your own parser with a parser generator.
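As a rough sketch of the parser approach, the following uses `html.parser.HTMLParser` from Python's standard library rather than Beautiful Soup or the HTML Agility Pack, simply so that it runs without extra installs. The class name `TextAndLinkExtractor` and the `extract` helper are hypothetical names chosen for this example. It assumes that, as in the question, the link text should be dropped and replaced by the `href` value.

```python
from html.parser import HTMLParser


class TextAndLinkExtractor(HTMLParser):
    """Collects text outside links and the href of each <a> tag."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            href = dict(attrs).get("href")
            if href:
                self.items.append(href)

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Skip whitespace-only chunks and the visible text of links.
        text = data.strip()
        if text and not self._in_link:
            self.items.append(text)


def extract(html):
    """Return the text fragments and link URLs in document order."""
    parser = TextAndLinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


sample = """<b>HelloWorld</b>
<p>This is a test</p>
<a href="myUrl">Google</a>"""

print("\n".join(extract(sample)))
```

For the sample in the question this yields `HelloWorld`, `This is a test`, and `myUrl`, one per line. The same idea carries over to the HTML Agility Pack in .NET: walk the parsed nodes instead of matching raw text.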


8 Comments

The futility of telling people "don't do HTML with regex" never ceases to amaze me. Stack Overflow is full of this advice, as is the rest of the internet. As if no one ever reads or believes it. Anyway, you have my vote. :)
Tomalak: A lot of areas covered on Stack Overflow have these typical recurring questions, and that's why having per-tag FAQs on Stack Overflow would be great. stackoverflow.uservoice.com/pages/1722-general/suggestions/…
Nobody reads an FAQ, that's more or less a fact. If people read or googled before asking, the number of questions per day would drop drastically.
Tomalak: That's true, but as far as I understand, the idea is to have a well-written piece of text we can direct the occasional question askers to instead of having to explain it all the time or looking up a similar question with a good answer.
I guess that people would go for the rep rather than pointing the OP to an FAQ. If anyone in the thread solves the immediate problem of the OP, their answer will be accepted instead of a "boring" FAQ pointer, however correct it may be.
1

You should use a parser for this. Regexes just won't do. You could use recursive regex patterns, but I don't think they're supported by the .NET regex engine.
