0

I'm trying to parse some HTML and return an array containing the values inside each element.

For example, if I pass the markup below into a function:

var element = "td";
var html = "<tr><td>1</td><td>2</td></tr>";
return Regex.Split(html, string.Format("<{0}*.>(.*?)</{0}>", element));

I expect to get back an array: { 1, 2 }

What does my regex need to look like? Currently my array comes back with far too many elements, and my regex skills are lacking.

2
  • 6
    Parsing (X)HTML with RegEx!?!!!!??? That joke never gets old, does it? Commented Sep 27, 2010 at 20:37
  • 2
    Before you continue down this path, read this (edit - dtb beat me to it) Commented Sep 27, 2010 at 20:39

3 Answers

6

Do not parse HTML using regular expressions.

Instead, you should use the HTML Agility Pack.

For example:

HtmlDocument doc = new HtmlDocument();
doc.Parse(str);

IEnumerable<string> cells = doc.DocumentNode.Descendants("td").Select(td => td.InnerText);
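
For reference, here is a minimal, self-contained sketch of the same idea wrapped in a function that matches the question's input. It assumes the HtmlAgilityPack NuGet package is installed and uses `LoadHtml`, the string-loading method in current releases of the package. The class name `CellExtractor` and the method name `GetInnerTexts` are made up for illustration and are not part of the library.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class CellExtractor
{
    // Hypothetical helper: returns the inner text of every <element> node in the markup.
    public static string[] GetInnerTexts(string html, string element)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse the markup from a string

        return doc.DocumentNode
                  .Descendants(element)           // all nodes with the given tag name, at any depth
                  .Select(node => node.InnerText) // text content with the tags stripped
                  .ToArray();
    }

    public static void Main()
    {
        var cells = GetInnerTexts("<tr><td>1</td><td>2</td></tr>", "td");
        Console.WriteLine(string.Join(", ", cells));
    }
}
```

Note that `InnerText` returns the text as it appears in the source, so entities such as `&amp;` may still need decoding depending on your input.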


1

You really should not use regex to parse HTML. HTML is not a regular language, so a regex cannot reliably interpret it, for example when elements are nested. You should use a parser.

C# has HTML parsers for this.
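
To make the point concrete, here is a small sketch using only `System.Text.RegularExpressions`. The pattern and the sample markup are made up for illustration. A lazy pattern stops at the first closing tag it finds, so it cannot pair an opening tag with its matching close once elements nest.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedTagDemo
{
    // Hypothetical helper: returns what a naive lazy pattern captures
    // as the "content" of the first <div>, or null if nothing matches.
    public static string FirstDivContent(string html)
    {
        Match match = Regex.Match(html, "<div>(.*?)</div>");
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        // The outer <div> really contains "a<div>b</div>c", but the regex
        // stops at the first </div> it encounters.
        Console.WriteLine(FirstDivContent("<div>a<div>b</div>c</div>"));
        // prints: a<div>b
    }
}
```

Switching to a greedy `.*` does not fix this; it merely fails on different input, such as two sibling elements on the same line.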


0

The method for loading the HTML has changed since the original answer. It is now one of the following, depending on the source:

// From File
var doc = new HtmlDocument();
doc.Load(filePath);

// From String
var doc = new HtmlDocument();
doc.LoadHtml(html);

// From Web
var url = "http://html-agility-pack.net/";
var web = new HtmlWeb();
var doc = web.Load(url);

However, if you follow the documentation at the link above, you should be fine :)
