
I need to create a data index of HTML pages provided to a service by essentially grabbing all text on them and putting them in a string to go into a storage system.

If this were GUI based, I would simply Ctrl+A on the HTML page, copy it, then go to Notepad and Ctrl+V. Simples. If I can do it via good old point n' click, then surely there must be a way to do it programmatically, but I'm struggling to find anything useful.

The HTML docs in question are currently loaded for rendering using the System.Windows.Controls.WebBrowser class, so I wonder if it's somehow possible to grab the text from there?

I'm going to keep hunting, but any pointers would be very appreciated.

Note: We don't want the HTML source code, and would also really rather not have to parse all the source code to get the text unless we absolutely have to.

  • So you're saying you have the full html document as a string, but you want to get only the text nodes, and not use any of the html tags? Commented Oct 21, 2010 at 15:53
  • No, we currently have the HTML documents in a directory and use the webBrowser.Navigate() call to preview them in the GUI before indexing. I'd rather not muck around with TextReaders though, plus that would grab the HTML tags. Ideally, we do indeed only want the text nodes. Commented Oct 21, 2010 at 16:18

2 Answers

1

If I understand your problem correctly, you will have to do a bit of work to get the data.

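// Assumes 'browser' is your existing System.Windows.Forms.WebBrowser and that
// the page has finished loading (e.g. in the DocumentCompleted handler).
// Requires a reference to Microsoft.mshtml.
HtmlDocument doc = browser.Document;  // The document currently loaded in the browser
string content =
    ((mshtml.HTMLDocumentClass)doc.DomDocument).documentElement.innerText;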

That last line gives you the text as the browser rendered it, without the HTML tags.
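If the control is the WPF System.Windows.Controls.WebBrowser from the question rather than the Windows Forms one, a similar approach may work. The following is only a rough sketch: it assumes a project reference to Microsoft.mshtml, that the page has already finished loading, and that the loaded content is an HTML document. The class and method names (BrowserTextExtractor, GetVisibleText) are made up for illustration.

```csharp
using System.Windows.Controls;   // WPF WebBrowser
using mshtml;                    // assumes a reference to Microsoft.mshtml

// Hypothetical helper, not part of any library.
public static class BrowserTextExtractor
{
    // Returns the rendered text of the page currently loaded in the control,
    // or an empty string if no HTML document (or body) is available yet.
    public static string GetVisibleText(WebBrowser browser)
    {
        // In WPF, Document is typed as object; for HTML pages it can be
        // cast to the MSHTML document interface.
        var doc = browser.Document as IHTMLDocument2;
        if (doc == null || doc.body == null)
        {
            return string.Empty;
        }

        // innerText holds the text content without the markup.
        return doc.body.innerText ?? string.Empty;
    }
}
```

One way to use it would be to call it from the control's LoadCompleted handler, so the document exists before it is read:

```csharp
webBrowser.LoadCompleted += (sender, e) =>
{
    string text = BrowserTextExtractor.GetVisibleText(webBrowser);
    // hand 'text' to your indexing / storage code here
};
```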


1 Comment

This is helpful. The only issue is that we're currently using a Controls.WebBrowser, not a Forms.WebBrowser. However, it seems by far the best bet so far, so I'll see what I can do with it. Thank you :)
0

This looks like it might be quite helpful.
