I have written code that uses HtmlAgilityPack to collect the ids and XPaths of the elements on a page, given its URL. I want to reuse that code, but the website I need to run it against has only one URL: the content changes while the URL stays the same. I can navigate to every page I need, but how do I download the HTML source for the current page in C# without using the URL?

internal Dictionary<string, string> GetIDsAndXPaths(string url)
{
    var web = new HtmlWeb();
    var webidsAndXPaths = new Dictionary<string, string>();
    var doc = web.Load(url);
    var nodes = doc.DocumentNode.SelectNodes("//*[@id]");
    if (nodes == null) return webidsAndXPaths;
    // more code to get ids and such
    return webidsAndXPaths;
}
  • What's wrong with using the URL? Commented Apr 28, 2014 at 21:07
  • Is the page a SPA that uses JavaScript to dynamically populate the content on the page? Commented Apr 28, 2014 at 21:08
  • Yes, it is using javascript to dynamically create page content in an .aspx webpage. Commented Apr 28, 2014 at 21:19
  • For some sites a page is a collection of html fragments that are dynamically generated or selected (using javascript). Often, the URL will not reflect the changed page content. In these cases it may be impossible to find specific source as you are asking. Commented Apr 28, 2014 at 21:22

1 Answer

You could use WebDriver to navigate to the view whose source you want. Once the browser is showing that view, read the rendered markup from driver.PageSource and hand that string to HtmlAgilityPack instead of loading from a URL. In the code below, the markup is held in the variable named "page".

internal Dictionary<string, string> GetIDsAndXPaths()
{
    var webidsAndXPaths = new Dictionary<string, string>();
    // driver is assumed to be a Selenium IWebDriver field that has already navigated to the page
    var page = driver.PageSource; // Gets the source of the page last loaded by the browser

    // HtmlWeb.Load expects a URL or file path, not markup, so parse the string directly
    var doc = new HtmlDocument();
    doc.LoadHtml(page);
    var nodes = doc.DocumentNode.SelectNodes("//*[@id]");
    if (nodes == null) return webidsAndXPaths;
    // more code to get ids and such
    return webidsAndXPaths;
}
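For context, here is a rough end-to-end sketch of the same idea. It assumes the Selenium.WebDriver, Selenium.Support and HtmlAgilityPack NuGet packages and a local Chrome install. The class name RenderedPageScraper, the Scrape method, the 10-second timeout and the loop that fills the dictionary are illustrative guesses, not part of the original code. It waits until at least one element with an id is present before reading the source, since the content is built by JavaScript.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

internal static class RenderedPageScraper
{
    // Parses markup that is already in memory; no URL or temp file is involved.
    internal static Dictionary<string, string> GetIDsAndXPaths(string html)
    {
        var idsAndXPaths = new Dictionary<string, string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var nodes = doc.DocumentNode.SelectNodes("//*[@id]");
        if (nodes == null) return idsAndXPaths; // no element carries an id

        foreach (var node in nodes)
        {
            // Assumption: map id -> XPath. Indexer assignment keeps the
            // last node if the page repeats an id instead of throwing.
            idsAndXPaths[node.Id] = node.XPath;
        }
        return idsAndXPaths;
    }

    // Hypothetical driver code: opens the single URL, waits for the
    // JavaScript-built content, then parses the rendered source.
    internal static Dictionary<string, string> Scrape(string startUrl)
    {
        IWebDriver driver = new ChromeDriver();
        try
        {
            driver.Navigate().GoToUrl(startUrl);

            // Click through to the view you need here, using whatever
            // FindElement(...).Click() calls the site requires.

            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
            wait.Until(d => d.FindElements(By.XPath("//*[@id]")).Count > 0);

            return GetIDsAndXPaths(driver.PageSource);
        }
        finally
        {
            driver.Quit(); // always close the browser
        }
    }
}
```

Keeping the parsing step separate from the browser step means GetIDsAndXPaths can be tried on any HTML string without launching a browser.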