
I am attempting to scrape data from a website using the following code:

XML::htmlParse(GET("https://www.cagematch.net/?id=1&nr=283492"))

However, I receive the following error message:

Error in curl::curl_fetch_memory(url, handle = handle) : 
  Error while processing content unencoding: incorrect data check

I have checked the robots.txt file and scraping is allowed. I can scrape other websites without any problem.

Is the problem on their server or am I overlooking something? Is there code that will allow me to bypass this error?

Any help would be appreciated!

1 Answer


XML::htmlParse expects HTML as a character string (or the name of a file or URL). You are passing it an object of class "response" from the httr package, and XML doesn't know what to do with that object.

However, the error you are getting is raised in curl::curl_fetch_memory, so it appears to be a curl error that occurs during the request, before XML is involved at all. Depending on your platform, it may be easiest to just use a different method to obtain the html.

Instead you can try:

url <- "https://www.cagematch.net/?id=1&nr=283492"
XML::htmlParse(paste(readLines(url), collapse = "\n"))
#> <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
#> <html xmlns="http://www.w3.org/1999/xhtml">
#> <head>
#> <link href="/2k16/css/2k16.css?20200712" rel="stylesheet" type="text/css">
#> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
#> <meta http-equiv="Content-Language" content="en">
#> <meta http-equiv="Content-Security-Policy-Report-Only" content="require-trusted-types-for 'script'; default-src 'self' fonts.gstatic.com; script-src 'self' fonts.gstatic.com; connect-src 'self'; img-src 'self' www.paypalobjects.com; style-src 'unsafe-inline';base-uri 'self';form-action 'self';object-src 'none'">
#> <meta name="viewport" content="width=1120">
#> <meta name="description" content="Internet Wrestling Database">
#> <meta name="keywords" content="wrestling,wwe,raw,smackdown,wrestlemania,aew,dynamite,impact,tna,wcw,ecw,roh,wwf,njpw,ajpw,puroresu,wrestling database,wrestling news,wrestler,superstar">
#> <meta name="author" content="Philip Kreikenbohm">
#> <title>ATP « Events Database « CAGEMATCH - The Internet Wrestling Database</title>
#> <script language="JavaScript" id="erasable" type="text/javascript" defer>
#> window.onload = function() { loadComments("commentBox", "1", "283492", "", "en"); }
#> </script>
#> </head>
#> <body class="TemplateBody">
#>         <div class="LayoutUserAccount LayoutWidth">
#> <a href="https://www.cagematch.net/de/"><img src="/2k16/img/german.png" class="LayoutLanguage" alt="Deutsch" title="Deutsch"></a><a href="https://www.cagematch.net/en/"><img src="/2k16/img/english.png" class="LayoutLanguage" alt="English" title="English"></a><div class="UserHeader">Not logged in or registered. | <a href="?id=872">Log In</a> | <a href="?id=871">Register</a> | <a href="?id=879">Password lost?</a>
#> </div>
#> </div>
#>         <div class="LayoutHeader">
#>             <div class="LayoutMainHeader LayoutWidth">
#>                 <div class="LayoutLogo">
#> <a href="?id="><img class="HeaderLogoLeft" src="/2k16/img/header/header2.webp" alt="CAGEMATCH Logo" style="width:570px;height:100px;" title="CAGEMATCH"></a>
#> </div>
#>                 <div class="LayoutSearch"><form action="" method="get" id="Search">
#> <input name="id" type="hidden" value="666"><input type="text" name="search" class="HeaderSearchInput" value="Search the site..." onclick="changeStateHeaderSearchBar(this,'en')" onblur="changeStateHeaderSearchBar(this,'en')"><input type="submit" class="HeaderSearchButton" value="Search">
#> </form></div>

...etc
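
If you would rather keep using httr once the request itself succeeds, the point about character strings can be sketched as follows. This is only a minimal sketch: parse_response is a hypothetical helper name, not part of httr or XML, and the ISO-8859-1 encoding is an assumption taken from the page's meta tag shown above. The last lines run the same parsing step on a small inline string so the example works without network access.

```r
library(XML)

# Hypothetical helper (not part of httr or XML): pull the body out of an
# httr "response" object as text, then parse that text as HTML.
# The encoding is an assumption based on the page's declared charset.
parse_response <- function(resp) {
  html <- httr::content(resp, as = "text", encoding = "ISO-8859-1")
  XML::htmlParse(html, asText = TRUE)
}

# The same parsing step on an inline string, so it runs offline.
demo_html <- "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"
doc <- XML::htmlParse(demo_html, asText = TRUE)

# Extract the page title from the parsed document.
page_title <- XML::xpathSApply(doc, "//title", XML::xmlValue)
```

With a successful request, usage would be parse_response(httr::GET(url)). Setting asText = TRUE makes it explicit that the input is HTML content rather than a file name or URL.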