6

Assume that we have the following HTML strings.

string A = " <table width=325><tr><td width=325>test</td></tr></table>";
string B = " <<table width=325><tr><td width=325>test</td></table>";

How can we validate A or B in C# according to HTML specifications?

A should return true whereas B should return false.

5
  • You could parse the string and count all of the < and > characters. If either count is odd, you could assume it's invalid (for this case). Commented Sep 22, 2011 at 18:43
  • I think both of the HTML strings you have provided are completely incorrect according to the spec. Commented Sep 22, 2011 at 18:49
  • There are a number of tags in HTML4 and HTML5 that do not require a closing tag for valid HTML: optgroup, option, p, tbody, td, tr, tfoot, thead, th. Commented Sep 22, 2011 at 18:53
  • I fixed the </tr> on string A. Commented Sep 22, 2011 at 19:03
  • Yes as @alex said, counting < and > does not work at all. Commented Sep 22, 2011 at 19:04

3 Answers

15

For this specific case you can use HTML Agility Pack to check whether the HTML is well formed or contains closing tags that were never opened.

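using System;
using HtmlAgilityPack;

var htmlDoc = new HtmlDocument();

// Load the fragment; the first </u> has no matching opening tag.
htmlDoc.LoadHtml(
    "WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,");

// ParseErrors lists the problems the parser recorded while loading.
foreach (var error in htmlDoc.ParseErrors)
{
    // Prints: TagNotOpened
    Console.WriteLine(error.Code);
    // Prints: Start tag <u> was not found
    Console.WriteLine(error.Reason);
}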

Checking a HTML string for unopened tags
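
Since the question asks for a true/false result, the error list can be folded into a single check. The following is only a minimal sketch: `HtmlCheck` and `HasNoParseErrors` are hypothetical names, not part of HTML Agility Pack, and it assumes the HtmlAgilityPack NuGet package is referenced. It reports only what the parser records in `ParseErrors`, which is more lenient than full validation against the HTML specification.

```csharp
using System.Linq;
using HtmlAgilityPack;

static class HtmlCheck
{
    // Hypothetical helper: returns true when HTML Agility Pack
    // recorded no parse errors while loading the string.
    public static bool HasNoParseErrors(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ParseErrors is an IEnumerable<HtmlParseError>; an empty
        // sequence means the parser did not flag anything.
        return !doc.ParseErrors.Any();
    }
}
```

Calling `HtmlCheck.HasNoParseErrors(A)` and `HtmlCheck.HasNoParseErrors(B)` would give one boolean per string. Whether that matches the desired true/false for A and B depends on what the parser chooses to flag, so it is worth checking against your own inputs.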


6 Comments

+1 for using an existing library instead of trying to hack it together yourself.
Just a warning: whilst the above example works, this: "<p>example<p>" doesn't trigger any parser errors.
This also does not throw any parser errors: <script type="text/javascript" src="/Scripts/clipboard.min.js" <="" script="">
@AndrewBullock That is because it indeed is valid HTML.
@JeremeGuenther Can you link to the github bug ticket that you surely created when you stumbled over this issue?
1

One starting point is to check whether it is well-formed XML.

By the way, I think both of your examples are incorrect, as you've left out the </tr> from both.
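
A minimal sketch of such a check, using `XDocument.Parse` from `System.Xml.Linq`. `XmlCheck` and `IsWellFormedXml` are hypothetical names chosen for illustration. Note that this tests XML well-formedness, not HTML validity: XML requires quoted attribute values, so string A from the question is rejected as well because of `width=325`.

```csharp
using System.Xml;
using System.Xml.Linq;

static class XmlCheck
{
    // Hypothetical helper: returns true only if the markup parses
    // as well-formed XML (a stricter rule set than HTML).
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            // Unquoted attributes, unclosed elements, stray '<', etc.
            return false;
        }
    }
}
```

With this helper, both A and B from the question come back false, while a version of A with quoted attributes (`width="325"`) comes back true. That makes it usable as a first filter only when the input is expected to be XHTML-style markup.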

5 Comments

HTML is not valid XML, XHTML is.
@alex, that's true, but at least it could be a start, as all other checks would be really hard.
There are a number of tags in HTML4 and HTML5 that do not require a closing tag for valid HTML: optgroup, option, p, tbody, td, tr, tfoot, thead, th.
@thekip As long as there is one single HTML element with no closing tag, the entire XML-validity check would be pointless. And there's a pretty good chance there is one.
XML is much stricter than HTML, and XmlReader does not give proper results.
0

http://web.archive.org/web/20110820163031/http://markbeaton.com/SoftwareInfo.aspx?ID=81a0ecd0-c41c-48da-8a39-f10c8aa3f931

GitHub link: https://github.com/markbeaton/TidyManaged

The author has written a .NET wrapper for HTML Tidy. I haven't used it, but it may be what you are looking for.
