How to load html source without the URL?

Question

I have written code that uses HtmlAgilityPack to collect the ids and XPaths of a page, given its URL. I want to reuse that code, but the website I want to run it against has only one URL: the content changes while the URL stays the same. I can navigate to every page I need, but how do I download the HTML source for the current page in C# without using the URL?

internal Dictionary<string, string> GetIDsAndXPaths(string url)
{
    var web = new HtmlWeb();
    var webidsAndXPaths = new Dictionary<string, string>();
    var doc = web.Load(url);
    var nodes = doc.DocumentNode.SelectNodes("//*[@id]");
    if (nodes == null) return webidsAndXPaths;
    // more code to get ids and such
    return webidsAndXPaths;
}

Is the page a SPA that uses JavaScript to dynamically populate its content?
– Claies, Commented Apr 28, 2014 at 21:08
Yes, it uses JavaScript to dynamically create page content in an .aspx web page.
– SteveT, Commented Apr 28, 2014 at 21:19
For some sites, a page is a collection of HTML fragments that are dynamically generated or selected using JavaScript. Often the URL will not reflect the changed page content. In these cases it may be impossible to find the specific source you are asking for.
– Jasen, Commented Apr 28, 2014 at 21:22

Jonathan Kittell · Accepted Answer · 2014-04-28 21:48:14Z

1

You could use the WebDriver to navigate to the page whose source you want. Once the WebDriver is on that page, have it return the page source through driver.PageSource, held here in the variable named "page". Write that source to a local file and pass the file path to web.Load. Note that web.Load expects a URL or file path, not the HTML text itself.

internal Dictionary<string, string> GetIDsAndXPaths()
{
    var web = new HtmlWeb();
    var webidsAndXPaths = new Dictionary<string, string>();
    // driver is assumed to be an existing WebDriver instance already on the target page
    var page = driver.PageSource; // Gets the source of the page last loaded by the browser

    // Save the source to a local file so HtmlWeb can load it by path
    const string path = @"C:\temp\myHtml.html";
    using (var sw = new StreamWriter(path, false))
    {
        sw.Write(page);
    }
    var doc = web.Load(path); // load the saved file, not the HTML string
    var nodes = doc.DocumentNode.SelectNodes("//*[@id]");
    if (nodes == null) return webidsAndXPaths;
    // more code to get ids and such
    return webidsAndXPaths;
}
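
As a possible alternative to the temporary file, HtmlAgilityPack's HtmlDocument.LoadHtml can parse an HTML string directly. The sketch below is not part of the original answer. It assumes a Selenium IWebDriver that is already on the target page, and the class name PageSourceParser and the id-to-XPath mapping in the loop are illustrative guesses at what "more code to get ids and such" does.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;
using OpenQA.Selenium;

internal static class PageSourceParser // hypothetical helper class
{
    // driver is assumed to have already navigated to the page of interest
    internal static Dictionary<string, string> GetIDsAndXPaths(IWebDriver driver)
    {
        var idsAndXPaths = new Dictionary<string, string>();

        // Parse the HTML string in memory instead of writing it to disk first
        var doc = new HtmlDocument();
        doc.LoadHtml(driver.PageSource);

        var nodes = doc.DocumentNode.SelectNodes("//*[@id]");
        if (nodes == null) return idsAndXPaths;

        foreach (var node in nodes)
        {
            // Assumed mapping: element id -> XPath; a later duplicate id overwrites an earlier one
            idsAndXPaths[node.Id] = node.XPath;
        }
        return idsAndXPaths;
    }
}
```

This avoids the hard-coded C:\temp path and the extra file I/O, at the cost of passing the driver in explicitly.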

edited Apr 28, 2014 at 21:48

answered Apr 28, 2014 at 21:33
