How do I parse HTML using regular expressions in C#?

Question

For example, given the HTML code:

<s2> t1 </s2>  <img src='1.gif' />  <span> span1 <span/>

I am trying to obtain:

1. <s2>
2. t1
3. </s2>
4. <img src='1.gif' />
5. <span>
6. span1
7. <span/>

How do I do this using regular expressions in C#?

In my case, the HTML input is not well-formed XML (it is not XHTML), so I cannot use an XML parser for this.

How to rewrite what regular expression? Please rephrase your question. Your samples are unreadable.
– Michael Petrotta, Commented Oct 15, 2009 at 1:54
Your question doesn't make sense. You say you want to parse HTML, but the example code you posted isn't HTML.
– Jörg W Mittag, Commented Oct 15, 2009 at 2:41
Canonical question: RegEx match open tags except XHTML self-contained tags
– Peter Mortensen, Commented Nov 11, 2014 at 0:03

bobbymcr · Accepted Answer · 2009-10-15 01:57:00Z

6

Regular expressions are a very poor way to parse HTML. If you can guarantee that your input will be well-formed XML (i.e. XHTML), you can use XmlReader to read the elements and then print them out however you like.

3 Comments

Mike108 Over a year ago

In my case, the input is NOT well-formed xml.

bobbymcr Over a year ago

Then you're in for a very complex problem, in general... HTML parsing with all of its implied elements, optional end tags, etc. is no fun. However, you might be able to leverage an existing library, such as... codeplex.com/htmlagilitypack

Jörg W Mittag Over a year ago

No, regular expressions are not "a poor way to parse HTML", because that would imply that regular expressions can parse HTML at all, which is not the case. It is mathematically proven that regular expressions cannot parse HTML. In fact, pretty much every college student has to prove this at some point during a homework assignment or exam or something.

Jörg W Mittag · Accepted Answer · 2009-10-15 02:36:56Z

4

This has already been answered literally dozens of times, but it bears repeating: regular expressions can only parse regular languages; that's why they are called regular expressions. HTML is not a regular language (as probably every college student in the last decade has proved at least once), and therefore cannot be parsed by regular expressions.

nickytonline · Accepted Answer · 2009-10-15 02:12:52Z

3

You might want to try the Html Agility Pack, http://www.codeplex.com/htmlagilitypack. It even handles malformed HTML.
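
For what it's worth, here is a minimal sketch of what using the Html Agility Pack might look like for the input in the question. It assumes the `HtmlAgilityPack` NuGet package is referenced; the class and method names (`AgilityPackDemo`, `DescribeNodes`) are made up for illustration. Note that the library builds a node tree rather than a flat token list, so this walks the tree and reports elements and non-empty text nodes instead of reproducing closing tags.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the "HtmlAgilityPack" NuGet package is installed

public static class AgilityPackDemo
{
    // Hypothetical helper: lists every element and non-empty text node in document order.
    public static List<string> DescribeNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant parser; does not require well-formed XML

        var lines = new List<string>();
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            if (node.NodeType == HtmlNodeType.Element)
            {
                lines.Add("element: " + node.Name);
            }
            else if (node.NodeType == HtmlNodeType.Text)
            {
                // Skip whitespace-only text between tags.
                string text = node.InnerText.Trim();
                if (text.Length > 0)
                {
                    lines.Add("text: " + text);
                }
            }
        }
        return lines;
    }

    public static void Main()
    {
        string html = "<s2> t1 </s2>  <img src='1.gif' />  <span> span1 <span/>";
        foreach (string line in DescribeNodes(html))
        {
            Console.WriteLine(line);
        }
    }
}
```

How the trailing `<span/>` ends up in the tree depends on the library's recovery rules, so it is worth printing the result for your own input before relying on it.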

Mike108 · Accepted Answer · 2009-10-15 03:05:06Z

0

I used this regex in C#, and it works. Thanks for all your answers.

<([^<]*)>|([^<]*)
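
For anyone who wants to try it, a rough sketch of how this pattern could be applied with `Regex.Matches` follows. The class and method names (`RegexTokenDemo`, `Tokenize`) are invented for the example, and the trimming and skipping of empty matches is an assumption about the desired output, since the second alternative also matches whitespace-only and empty runs.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexTokenDemo
{
    // Pattern from the answer: either a "<...>" run, or a run of characters other than "<".
    private static readonly Regex TokenPattern = new Regex(@"<([^<]*)>|([^<]*)");

    // Hypothetical helper: returns the trimmed, non-empty matches in order.
    public static List<string> Tokenize(string html)
    {
        var tokens = new List<string>();
        foreach (Match match in TokenPattern.Matches(html))
        {
            // The second alternative can match empty or whitespace-only text; skip those.
            string token = match.Value.Trim();
            if (token.Length > 0)
            {
                tokens.Add(token);
            }
        }
        return tokens;
    }

    public static void Main()
    {
        string html = "<s2> t1 </s2>  <img src='1.gif' />  <span> span1 <span/>";
        foreach (string token in Tokenize(html))
        {
            Console.WriteLine(token);
        }
    }
}
```

On the sample input this yields the seven pieces listed in the question. One limit worth knowing: a literal `>` in text, as in `<a> 1 > 0 </a>`, gets folded into the preceding tag (the first token becomes `<a> 1 >`), because `[^<]*` backtracks to the last `>` before the next `<`.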

2 Comments

Robert Rossney Over a year ago

It works with the data you've tested it with. If that's all the data you ever need to process with it, then fine.

Peter Hoffmann Over a year ago

If not: now you've got two problems.

junmats · Accepted Answer · 2009-10-15 02:33:43Z

-3

You might want to simply use string functions, treating < and > as the delimiters when parsing.
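
A minimal sketch of that idea, using only `IndexOf` and `Substring`, might look like the following. The names (`StringScanDemo`, `Split`) are hypothetical, and treating an unterminated `<` as one token running to the end of the input is an assumption made just to keep the loop simple.

```csharp
using System;
using System.Collections.Generic;

public static class StringScanDemo
{
    // Hypothetical helper: splits the input into "<...>" pieces and the text between them.
    public static List<string> Split(string html)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < html.Length)
        {
            int end;
            if (html[pos] == '<')
            {
                // Tag: runs up to and including the next '>'.
                int close = html.IndexOf('>', pos);
                // Assumption: an unterminated tag swallows the rest of the input.
                end = close < 0 ? html.Length : close + 1;
            }
            else
            {
                // Text: runs up to, but not including, the next '<'.
                int open = html.IndexOf('<', pos);
                end = open < 0 ? html.Length : open;
            }

            string token = html.Substring(pos, end - pos).Trim();
            if (token.Length > 0)
            {
                tokens.Add(token);
            }
            pos = end; // always advances by at least one character
        }
        return tokens;
    }

    public static void Main()
    {
        string html = "<s2> t1 </s2>  <img src='1.gif' />  <span> span1 <span/>";
        foreach (string token in Split(html))
        {
            Console.WriteLine(token);
        }
    }
}
```

This only looks at the two delimiter characters, so it will still break on things like a `>` inside an attribute value, comments, or script content.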
