0

How do I parse HTML using regular expressions in C#?

For example, given HTML code

<s2> t1 </s2>  <img src='1.gif' />  <span> span1 <span/>

I am trying to obtain

1. <s2>
2. t1
3. </s2>
4. <img src='1.gif' />
5. <span>
6. span1
7. <span/>

How do I do this using regular expressions in C#?

In my case, the HTML input is not well-formed XML (it is not XHTML), so I cannot use an XML parser for this.

4
  • How to rewrite what regular expression? Please rephrase your question. Your samples are unreadable. Commented Oct 15, 2009 at 1:54
  • @Michael Petrotta, I have edited my post. Commented Oct 15, 2009 at 1:56
  • Your question doesn't make sense. You say you want to parse HTML, but the example code you posted isn't HTML. Commented Oct 15, 2009 at 2:41
  • Canonical question: RegEx match open tags except XHTML self-contained tags Commented Nov 11, 2014 at 0:03

5 Answers

6

Regular expressions are a very poor way to parse HTML. If you can guarantee that your input will be well-formed XML (i.e. XHTML), you can use XmlReader to read the elements and then print them out however you like.
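
For the well-formed case only, here is a minimal sketch of the XmlReader approach. The class and method names (XhtmlTokens, Read) are made up for illustration, and it assumes the input is a valid XML fragment, so the asker's sample would need its last tag written as </span> rather than <span/>.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class XhtmlTokens
{
    // Returns start tags, text and end tags in document order.
    // Throws XmlException if the input is not well-formed.
    public static List<string> Read(string xhtml)
    {
        var settings = new XmlReaderSettings
        {
            // Allow several top-level elements instead of a single root.
            ConformanceLevel = ConformanceLevel.Fragment,
            // Skip whitespace-only nodes between elements.
            IgnoreWhitespace = true
        };
        var tokens = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Read IsEmptyElement before moving onto the attributes.
                        bool empty = reader.IsEmptyElement;
                        var tag = new StringBuilder("<" + reader.Name);
                        // Attributes are re-quoted with single quotes; values
                        // that themselves contain ' are not handled here.
                        while (reader.MoveToNextAttribute())
                            tag.Append(" " + reader.Name + "='" + reader.Value + "'");
                        tag.Append(empty ? " />" : ">");
                        tokens.Add(tag.ToString());
                        break;
                    case XmlNodeType.Text:
                        tokens.Add(reader.Value.Trim());
                        break;
                    case XmlNodeType.EndElement:
                        tokens.Add("</" + reader.Name + ">");
                        break;
                }
            }
        }
        return tokens;
    }
}
```

Because the reader rejects anything that is not well-formed, this does not help with the malformed input described in the question.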

3 Comments

In my case, the input is NOT well-formed xml.
Then you're in for a very complex problem, in general... HTML parsing with all of its implied elements, optional end tags, etc. is no fun. However, you might be able to leverage an existing library, such as... codeplex.com/htmlagilitypack
No, regular expressions are not "a poor way to parse HTML", because that would imply that regular expressions can parse HTML at all, which is not the case. It is mathematically proven that regular expressions cannot parse HTML. In fact, pretty much every college student has to prove this at some point during a homework assignment or exam or something.
4

This has already been answered literally dozens of times, but it bears repeating: regular expressions can only parse regular languages; that is why they are called regular expressions. HTML is not a regular language (as probably every college student in the last decade has proved at least once), and therefore it cannot be parsed by regular expressions.

3

You might want to try the Html Agility Pack, http://www.codeplex.com/htmlagilitypack. It even handles malformed HTML.

0

I used this regex in C#, and it works. Thanks for all your answers.

<([^<]*)>|([^<]*)

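For reference, a minimal sketch of how the pattern above might be applied in C#. The class and method names (RegexTokenizer, Split) are made up for illustration, and trimming and dropping empty matches is an assumption about the desired output, not something stated in the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class RegexTokenizer
{
    // The pattern from the answer: either a <...> tag or a run of non-'<' text.
    private static readonly Regex Token = new Regex(@"<([^<]*)>|([^<]*)");

    public static List<string> Split(string html)
    {
        var tokens = new List<string>();
        foreach (Match m in Token.Matches(html))
        {
            // The second alternative can match whitespace or the empty
            // string, so trim each match and skip the empty ones.
            string piece = m.Value.Trim();
            if (piece.Length > 0)
                tokens.Add(piece);
        }
        return tokens;
    }
}
```

On the sample from the question, RegexTokenizer.Split returns the seven pieces listed there. The pattern is fragile, though: for `<b>1 > 0</b>` the first match is `<b>1 >`, because `[^<]*` is greedy and also consumes `>` characters.
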
2 Comments

It works with the data you've tested it with. If that's all the data you ever need to process with it, then fine.
If not: now you've got two problems.
-3

You might want to simply use string functions, treating < and > as the delimiters for parsing.
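
A minimal sketch of that idea, assuming every < opens a tag that ends at the next >. The class and method names (ScanTokenizer, Split) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of any library.
public static class ScanTokenizer
{
    public static List<string> Split(string html)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < html.Length)
        {
            int end;
            if (html[pos] == '<')
            {
                // Tag: runs up to and including the next '>'.
                int close = html.IndexOf('>', pos);
                end = close < 0 ? html.Length : close + 1;
            }
            else
            {
                // Text: runs up to (not including) the next '<'.
                int open = html.IndexOf('<', pos);
                end = open < 0 ? html.Length : open;
            }
            // Trim surrounding whitespace and skip whitespace-only pieces.
            string piece = html.Substring(pos, end - pos).Trim();
            if (piece.Length > 0)
                tokens.Add(piece);
            pos = end;
        }
        return tokens;
    }
}
```

This produces the seven pieces from the question for the sample input, but it splits a tag in the wrong place as soon as an attribute value contains >, for example `<a title="x>y">`.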
