I've noticed a lot of JavaScript snippets (and element attributes) creeping into my supposedly "text node" extractions. There have also been cases where malformed HTML caused the whole parse to fail. So I'm looking to replace the htmlparser library in my project with something more robust.
Do you need to do a full parse of the HTML? If you're just looking for specific values within the contents (a specific tag or attribute), then a simple regular expression might be enough, and could very well be faster.
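To make the regex suggestion concrete, here is a minimal sketch. It assumes Python and assumes the value you're after is the page title; neither is stated in the question, and `extract_title` is just a name made up for illustration.

```python
import re

# Assumed target: the contents of the first <title> element.
# DOTALL lets the title span several lines; IGNORECASE handles <TITLE>.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the stripped text of the first <title>, or None if there is none."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    # The unclosed <p> does not matter, since nothing here builds a tree.
    print(extract_title("<html><head><TITLE> Hello </TITLE></head><p>unclosed"))
```

Because nothing is parsed into a tree, malformed markup elsewhere in the page can't make this fail. The flip side is that it knows nothing about comments, scripts, or nesting, so it probably only suits pulling out one well-delimited value, not general text-node extraction.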