
The only solution I could find was using:

            mshtml.HTMLDocument htmldocu = new mshtml.HTMLDocument();
            htmldocu.createDocumentFromUrl(url, "");

I am not sure about the performance, but it should be better than loading the HTML file in a WebBrowser and then grabbing the HtmlDocument from there. In any case, that code does not work on my machine: the application crashes when it tries to execute the second line.

Does anyone have an approach to achieve this efficiently, or any other way to do it?

NOTE: Please understand that I need the HtmlDocument object for DOM processing. I do not need the HTML string.

  • Did you find any solution to this? Commented Mar 13, 2012 at 11:31

2 Answers


Use the DownloadString method of the WebClient class, e.g.:

WebClient client = new WebClient();
string reply = client.DownloadString("http://www.google.com");

After this code runs, reply contains the HTML markup returned by http://www.google.com.

See also: WebClient.DownloadString on MSDN.


1 Comment

The idea is to get the HtmlDocument object for DOM parsing, not the HTML string. WebClient just returns the HTML string, not an HtmlDocument.

In an attempt to answer your actual question from four years ago (at the time of posting this answer), I'm providing a working solution. I wouldn't be surprised if you have since found another way to do this, so this is mostly for other people searching for a similar solution. Keep in mind, however, that this approach:

  1. is somewhat obsolete (the actual use of HtmlDocument)
  2. is not the best way to handle HTML DOM parsing (the preferred solution is to use HtmlAgilityPack, CsQuery, or some other real parser rather than regular expressions)
  3. is extremely hacky and therefore not the safest or most compatible way to do it
  4. is something you really should not be doing

Additionally, keep in mind that HtmlDocument is really just a managed wrapper around the MSHTML COM document (exposed through IHTMLDocument2), so it is technically slower than using a COM wrapper directly, but I completely understand the use case simply for ease of coding.

If you're cool with all of the above, here's how to accomplish what you want.

using System;
using System.Reflection;
using System.Windows.Forms;

public class HtmlDocumentFactory
{
  private static Type htmlDocType = typeof(System.Windows.Forms.HtmlDocument);
  private static Type htmlShimManagerType = null;
  private static object htmlShimSingleton = null;
  private static ConstructorInfo docCtor = null;

  public static HtmlDocument Create()
  {
    if (htmlShimManagerType == null)
    {
      // get a type reference to HtmlShimManager
      htmlShimManagerType = htmlDocType.Assembly.GetType(
        "System.Windows.Forms.HtmlShimManager"
        );
      // locate the necessary private constructor for HtmlShimManager
      var shimCtor = htmlShimManagerType.GetConstructor(
        BindingFlags.NonPublic | BindingFlags.Instance, null, new Type[0], null
        );
      // create a new HtmlShimManager object and keep it for the rest of the
      // assembly instance
      htmlShimSingleton = shimCtor.Invoke(null);
    }

    if (docCtor == null)
    {
      // get the only constructor for HtmlDocument (which is marked as private)
      docCtor = htmlDocType.GetConstructors(
        BindingFlags.NonPublic | BindingFlags.Instance
        )[0];
    }

    // create an instance of the MSHTML HTMLDocument COM class via its class ID
    // (HtmlDocument wraps it through the IHTMLDocument2 interface)
    object htmlDoc2Inst = Activator.CreateInstance(Type.GetTypeFromCLSID(
      new Guid("25336920-03F9-11CF-8FD0-00AA00686F13")
      ));
    var argValues = new object[] { htmlShimSingleton, htmlDoc2Inst };
    // create a new HtmlDocument without involving WebBrowser
    return (HtmlDocument)docCtor.Invoke(argValues);
  }
}

To use it:

var htmlDoc = HtmlDocumentFactory.Create();
htmlDoc.Write("<html><body><div>Hello, world!</div></body></html>");
Console.WriteLine(htmlDoc.Body.InnerText);
// output:
// Hello, world!

I have not tested this code directly; I translated it from an old PowerShell script that needed the same functionality you're requesting. If it fails, let me know. The functionality is there, but the code might need very minor tweaking to get working.
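
For comparison, here is a minimal sketch of the parser-based route mentioned in point 2, assuming the third-party HtmlAgilityPack NuGet package is installed. The class name AgilityPackExample and the helper FirstDivText are made up for illustration. Note that this gives you HtmlAgilityPack's own HtmlDocument type, not System.Windows.Forms.HtmlDocument, so it only helps if any DOM object will do.

```csharp
using System;

public static class AgilityPackExample
{
    // Hypothetical helper: returns the inner text of the first <div>,
    // or null when the markup contains no <div> element.
    public static string FirstDivText(string html)
    {
        // Fully qualified to avoid confusion with System.Windows.Forms.HtmlDocument.
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        // XPath query against the parsed tree; SelectSingleNode returns null on no match.
        var div = doc.DocumentNode.SelectSingleNode("//div");
        return div == null ? null : div.InnerText;
    }

    public static void Main()
    {
        Console.WriteLine(
            FirstDivText("<html><body><div>Hello, world!</div></body></html>"));
    }
}
```

The HTML string passed to LoadHtml could just as well come from WebClient.DownloadString, as in the other answer; no WebBrowser control or COM object is involved.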
