
I'm using the script below to retrieve HTML from a URL.

string webURL = @"https://nl.wiktionary.org/wiki/" + word.ToLower();
using (WebClient client = new WebClient())
{
    string htmlCode = client.DownloadString(webURL);
}

The variable word can be any word. When there is no wiki page for the word to be retrieved, the code fails with a 404 error, while retrieving the same URL with a browser opens a wiki page saying there is no page for this item yet.

What I want is for the code to always get the HTML, even when the wiki page says there is no info yet. I do not want to work around the 404 error with a try/catch.

Does anyone have an idea why this is not working with a WebClient?

2 Comments

  • A little bit off topic, but why not HttpClient instead of WebClient? Commented Jul 7, 2017 at 13:07
  • I guess you have to add client.Headers. Commented Jul 7, 2017 at 13:09
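As the first comment hints, HttpClient does not throw on a 404 by default: the status code can be inspected and the response body read either way, which avoids the try/catch entirely. A minimal sketch of that approach (the word value here is just an example, not from the original post):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        string word = "voorbeeld"; // example word, not from the original post

        using (var client = new HttpClient())
        {
            // GetAsync does not throw on a 404 status; only network-level
            // failures raise an exception.
            HttpResponseMessage response =
                await client.GetAsync("https://nl.wiktionary.org/wiki/" + word.ToLower());

            Console.WriteLine((int)response.StatusCode); // 200 or 404

            // The body is readable regardless of the status code,
            // so the custom "no page yet" HTML is available too.
            string htmlCode = await response.Content.ReadAsStringAsync();
            Console.WriteLine(htmlCode.Length);
        }
    }
}
```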

2 Answers


Try this. You can catch the 404 error content in a try/catch block.

        var word = Console.ReadLine();
        string webURL = @"https://nl.wiktionary.org/wiki/" + word.ToLower();

        using (WebClient client = new WebClient())
        {
            try
            {
                string htmlCode = client.DownloadString(webURL);
            }
            catch (WebException exception)
            {
                string responseText = string.Empty;

                // The 404 response body is still available on the exception.
                var responseStream = exception.Response?.GetResponseStream();

                if (responseStream != null)
                {
                    using (var reader = new StreamReader(responseStream))
                    {
                        responseText = reader.ReadToEnd();
                    }
                }

                Console.WriteLine(responseText);
            }
        }

        Console.ReadLine();

1 Comment

Note that WebException.Response by default is limited to 64 kB. If you need to read more, you need to set HttpWebRequest.DefaultMaximumErrorResponseLength. (Thanks to stackoverflow.com/a/43842761/72809)

Since this wiki server uses case-sensitive URL mapping, just don't modify the case of the URL you harvest (remove ".ToLower()" from your code).

Example, lowercased:
https://nl.wiktionary.org/wiki/categorie:onderwerpen_in_het_nynorsk
Result: HTTP 404 (Not Found)

Original (unmodified) case:
https://nl.wiktionary.org/wiki/Categorie:Onderwerpen_in_het_Nynorsk
Result: HTTP 200 (OK)

Also, keep in mind that most (if not all) wiki servers (including this one) generate custom 404 pages, so in a browser they look like "normal" pages, but they are nevertheless served with a 404 HTTP status code.
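Because of this, a caller can both record that the page was a soft 404 and still read the custom page's HTML. A minimal sketch, assuming the WebException carries an HttpWebResponse (which it does for HTTP protocol errors):

```csharp
using System;
using System.IO;
using System.Net;

class StatusCheck
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            try
            {
                // Lowercased URL, which this server maps to a 404.
                Console.WriteLine(client.DownloadString(
                    "https://nl.wiktionary.org/wiki/categorie:onderwerpen_in_het_nynorsk"));
            }
            catch (WebException ex) when (ex.Response is HttpWebResponse http)
            {
                // The custom "no page yet" page arrives with a 404 status.
                Console.WriteLine((int)http.StatusCode);

                // Its HTML is still readable from the response stream.
                using (var reader = new StreamReader(http.GetResponseStream()))
                {
                    Console.WriteLine(reader.ReadToEnd().Length);
                }
            }
        }
    }
}
```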

1 Comment

Thx Pavel, so it looks like I will have to use a try/catch after all.
