How do I find the text within a div in the source of a web page using C#

Question

How can I get the HTML code from a website, save it, and find some text by using a LINQ expression?

I'm using the following code to get the source of a web page:


public static String code(string Url)
{
    HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(Url);
    myRequest.Method = "GET";
    WebResponse myResponse = myRequest.GetResponse();
    StreamReader sr = new StreamReader(myResponse.GetResponseStream(),
        System.Text.Encoding.UTF8);
    string result = sr.ReadToEnd();
    sr.Close();
    myResponse.Close();
    
    return result;
}

How do I find the text within a div in the source of the web page?

Depends how smart search should be. A simple Contains call might be "good enough."
– ashes999, Commented May 20, 2013 at 3:41
Look into using HTMLAgility pack, Fizzler or CSQuery to get the div/text once you have the HTML, anything else is too error prone.
– jammykam, Commented May 20, 2013 at 3:43
@GeorgeDuckett That doesn't look like a duplicate of this question, the question you link to is only about retrieving the source, this question is also about querying the DOM.
– Mark Rotteveel, Commented May 20, 2013 at 8:15
@Mark: Sorry you're quite right, missed the text at the bottom.
– George Duckett, Commented May 20, 2013 at 8:17

Santosh Panda · Accepted Answer · 2013-05-20 04:38:30Z

190

You can use the WebClient class to simplify the task:

using System.Net;

using (WebClient client = new WebClient())
{
    string htmlCode = client.DownloadString("http://somesite.com/default.html");
}

answered May 20, 2013 at 4:38

Santosh Panda

3 Comments

Dave Chandler Over a year ago

Any idea why I get this error? 'System.Net.WebClient': type used in a using statement must be implicitly convertible to 'System.IDisposable'

user3916429 Over a year ago

For the using requirement Clearly shown for everyone to use : +1

Himanshu Patel Over a year ago

For those who are getting an HTTP 403 error, add client.Headers.Add("user-agent", "Fiddler"); and replace Fiddler with any text you want.

Toni · Accepted Answer · 2021-04-20 06:19:23Z

120

To get the HTML code from a website, you can use code like this:

using System;
using System.IO;
using System.Net;
using System.Text;

string urlAddress = "http://google.com";
string data = null; // stays null unless the server answers 200 OK

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    if (response.StatusCode == HttpStatusCode.OK)
    {
        // Use the charset the server reports; fall back to UTF-8 when it sends none
        Encoding encoding = String.IsNullOrWhiteSpace(response.CharacterSet)
            ? Encoding.UTF8
            : Encoding.GetEncoding(response.CharacterSet);

        using (Stream receiveStream = response.GetResponseStream())
        using (StreamReader readStream = new StreamReader(receiveStream, encoding))
        {
            data = readStream.ReadToEnd();
        }
    }
}

This will give you the HTML returned by the website. Finding text with LINQ is not that easy, though. A regular expression might seem like a better fit, but regular expressions do not play well with HTML.
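To illustrate why LINQ on its own is awkward here: LINQ to XML can find a div, but only when the page happens to be well-formed XML (XHTML), which most real pages are not. The following is a minimal sketch under that assumption. DivTextFinder and FindDivText are hypothetical names, and the sample markup is made up for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class DivTextFinder
{
    // Hypothetical helper: returns the text of the first <div> with the given id,
    // or null when no such div exists.
    // Assumption: the markup is well-formed XML; XDocument.Parse throws otherwise.
    public static string FindDivText(string xhtml, string id)
    {
        XDocument doc = XDocument.Parse(xhtml);
        return doc.Descendants("div")
                  .Where(d => (string)d.Attribute("id") == id)
                  .Select(d => d.Value) // Value concatenates the text of all child nodes
                  .FirstOrDefault();
    }

    public static void Main()
    {
        string sample =
            "<html><body><div id=\"test\">Hello <b>world</b></div></body></html>";
        Console.WriteLine(FindDivText(sample, "test")); // prints "Hello world"
    }
}
```

An unclosed tag such as a bare <br> is enough to make XDocument.Parse throw, and a document that declares the XHTML namespace would need namespace-qualified element names, so this only suits markup you control.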

edited Apr 20, 2021 at 6:19

Toni

answered May 20, 2013 at 3:47

SyntaxError

3 Comments

Lightning3 Over a year ago

The Idea of using regex for html or XML is VERY bad coding practice... Going in Your Way - we should use goto keyword everywhere...

Mathieu VIALES Over a year ago

Actually, using regex to search for a precise thing within HTML code can be a very decent solution. Attempting to build an HTML parser/interpreter based on regex, on the other hand, would be pure madness. It all depends on the context and the actual task that needs to be performed, but saying that "regex never plays well with HTML" simply isn't a global, unalienable truth. stackoverflow.com/a/1733489/6838730

Sam Hobbs Over a year ago

@MathieuVIALES programmers can get in a trap where they think something is simple and then it becomes complicated but they have too much investment in their first choice.

David Klempfner · Accepted Answer · 2019-01-13 05:57:21Z

41

The best thing to use is HTMLAgilityPack. You can also look into using Fizzler or CSQuery, depending on your needs, for selecting the elements from the retrieved page. Using LINQ or regular expressions is just too error prone, especially when the HTML can be malformed, be missing closing tags, have nested child elements, etc.

You need to stream the page into an HtmlDocument object and then select your required element.

// Call the page and get the generated HTML
var doc = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlNode.ElementsFlags["br"] = HtmlAgilityPack.HtmlElementFlag.Empty;
doc.OptionWriteEmptyNodes = true;

try
{
    var webRequest = WebRequest.Create(pageUrl);
    // Dispose the response and its stream even if loading throws
    using (var response = webRequest.GetResponse())
    using (Stream stream = response.GetResponseStream())
    {
        doc.Load(stream);
    }
}
catch (System.UriFormatException uex)
{
    Log.Fatal("There was an error in the format of the url: " + pageUrl, uex);
    throw;
}
catch (System.Net.WebException wex)
{
    Log.Fatal("There was an error connecting to the url: " + pageUrl, wex);
    throw;
}

// Get the div by id, then read its inner HTML (use InnerText for the text only)
string testDivSelector = "//div[@id='test']";
var divNode = doc.DocumentNode.SelectSingleNode(testDivSelector);
// SelectSingleNode returns null when no element matches
var divString = divNode != null ? divNode.InnerHtml : null;

[EDIT] Actually, scrap that. The simplest method is to use FizzlerEx, an updated jQuery/CSS3-selectors implementation of the original Fizzler project.

Code sample directly from their site:

using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;

//get the page
var web = new HtmlWeb();
var document = web.Load("http://example.com/page.html");
var page = document.DocumentNode;

//loop through all div tags with item css class
foreach(var item in page.QuerySelectorAll("div.item"))
{
    var title = item.QuerySelector("h3:not(.share)").InnerText;
    var date = DateTime.Parse(item.QuerySelector("span:eq(2)").InnerText);
    var description = item.QuerySelector("span:has(b)").InnerHtml;
}

I don't think it can get any simpler than that.

edited Jan 13, 2019 at 5:57

David Klempfner

answered May 20, 2013 at 4:14

jammykam

5 Comments

Jamshaid K. Over a year ago

What if I want to invoke a specific button on the web page? @jammykam

jammykam Over a year ago

You can't do that with a screen scraper afaik, you would have to use something like Selenium to invoke the button.

Juan Carlos Oropeza Over a year ago

How do you install FizzlerEx? I checked the link and there is a .zip, but I don't see any installer.

IOviSpot Over a year ago

FizzlerEx link dead. Also, the github page seems outdated as hell, but is it?

jammykam Over a year ago

@wEight Yes, seems to be dead, stick with HTML Agility Pack (html-agility-pack.net)

Tickseeker · Accepted Answer · 2017-07-20 05:38:35Z

8

I am using AngleSharp and have been very satisfied with it.

Here is a simple example of how to fetch a page:

var config = Configuration.Default.WithDefaultLoader();
var document = await BrowsingContext.New(config).OpenAsync("https://www.google.com");

Now you have the web page in the document variable, and you can query it with LINQ or other methods. For example, to get a string value from an HTML table:

var someStringValue = document.All.Where(m =>
        m.LocalName == "td" &&
        m.HasAttribute("class") &&
        m.GetAttribute("class").Contains("pid-1-bid")
    ).ElementAt(0).TextContent.ToString();

To use CSS selectors please see AngleSharp examples.
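As a rough sketch of the CSS-selector route, assuming AngleSharp 0.10 or later (where HtmlParser lives in the AngleSharp.Html.Parser namespace and exposes ParseDocument). To stay self-contained it parses a string instead of fetching a page. The class name CssSelectorSketch, the helper FindText, and the sample markup are made up for illustration.

```csharp
using System;
using AngleSharp.Html.Parser;

public static class CssSelectorSketch
{
    // Hypothetical helper: returns the text of the first element matching
    // the CSS selector, or null when nothing matches.
    public static string FindText(string html, string selector)
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(html);
        // QuerySelector returns null when no element matches
        var match = document.QuerySelector(selector);
        return match?.TextContent;
    }

    public static void Main()
    {
        string html =
            "<table><tr><td class=\"pid-1-bid\">1.2345</td></tr></table>";
        // Same cell the LINQ query above looks for, expressed as a selector
        Console.WriteLine(FindText(html, "td.pid-1-bid"));
    }
}
```

The same QuerySelector call should work on the document returned by OpenAsync, and a selector such as div#test would target a div by its id.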

edited Jul 20, 2017 at 5:38

answered Jul 19, 2017 at 11:23

Tickseeker

KyleMit · Accepted Answer · 2019-06-17 13:49:47Z

6

Here's an example of using the HttpWebRequest class to fetch a URL:

private void button1_Click(object sender, EventArgs e)
{
    string url = TextBox_url.Text;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    // Dispose the response and reader even if reading throws
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        richTextBox1.Text = sr.ReadToEnd();
    }
}

edited Jun 17, 2019 at 13:49

KyleMit♦

answered Jun 20, 2016 at 1:16

Mohamed Sayed

1 Comment

A J Over a year ago

you should add code in your answer instead of an image.

Ghan · Accepted Answer · 2020-08-19 15:31:35Z

4

You can use WebClient to download the HTML for any URL. Once you have the HTML, you can use a third-party library like HtmlAgilityPack to look up values in it, as in the code below:

public static string GetInnerHtmlFromDiv(string url)
{
    string html;
    using (var wc = new WebClient())
    {
        html = wc.DownloadString(url);
    }

    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    // Replace <div id here> with the id of the div you are looking for
    HtmlAgilityPack.HtmlNode element = doc.DocumentNode.SelectSingleNode("//div[@id='<div id here>']");

    // SelectSingleNode returns null when nothing matches
    return element != null ? element.InnerHtml : null;
}

answered Aug 19, 2020 at 15:31

Ghan

youssef · Accepted Answer · 2016-12-10 21:03:47Z

Try this solution. It works fine.

try
{
    string url = textBox1.Text;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.Load(sr);

        // SelectNodes returns null (not an empty collection) when nothing matches
        var aTags = doc.DocumentNode.SelectNodes("//a");
        if (aTags != null)
        {
            foreach (var aTag in aTags)
            {
                richTextBox1.Text += aTag.InnerHtml + "\n";
            }
        }
    }
}
catch (Exception ex)
{
    MessageBox.Show("Failed to retrieve related keywords." + ex);
}
