
While this code for scraping prices from a webshop has worked perfectly fine for me over the last few months, today I got the following error message:

Error in curl::curl_fetch_memory(url, handle = handle) : 
  Could not resolve host: NA

The code I use is as follows:

This part builds the full URLs:

library(dplyr)    # filter, %>%
library(rvest)    # session, session_jump_to, read_html, html_nodes
library(stringr)  # str_extract

#Scrape Galaxus
vec_galaxus <- vector()
i <- 0

input_galaxus <- input %>%
  filter(`Galaxus Artikel` != 0)

input_galaxus2 <- paste0('https://www.galaxus.ch/', input_galaxus$`Galaxus Artikel`)

This is the scraping loop:

sess <- session(input_galaxus2[1])            #to start the session
for (j in input_galaxus2){
  sess <- sess %>% session_jump_to(j)         #jump to URL

  i <- i + 1
  try(vec_galaxus[i] <- read_html(sess) %>%   #can read directly from sess
        html_nodes('.sc-1aeovxo-1.gvrGle') %>%
        html_text() %>%
        str_extract("[0-9]+") %>%
        as.integer())
  Sys.sleep(runif(1, min = 0.2, max = 0.5))
}

where part of my input "input_galaxus2" looks like this:

c("https://www.galaxus.ch/15758734", "https://www.galaxus.ch/7362734", 
"https://www.galaxus.ch/12073455", "https://www.galaxus.ch/20841274", 
"https://www.galaxus.ch/20589944 ", "https://www.galaxus.ch/13595276", 
"https://www.galaxus.ch/16255768", "https://www.galaxus.ch/6296373", 
"https://www.galaxus.ch/14513900", "https://www.galaxus.ch/14465626", 
"https://www.galaxus.ch/10592707", "https://www.galaxus.ch/19958785", 
"https://www.galaxus.ch/9858343", "https://www.galaxus.ch/14513913")

Does anybody know why this code suddenly gives me the above error message? Thanks in advance for your responses!

  • Momentary network glitch? Commented Jul 12, 2022 at 11:42
  • I don't really know. When I look at the resulting vector, which contains the prices of the products I want to scrape, it seems to have worked for the first 4 products. Commented Jul 12, 2022 at 12:18

2 Answers


If it were a different error, I'd suspect throttling, but this error message does not really support that. Still, to rule out throttling (and hitting too-many-requests limits on the server), try introducing a delay between pulls, perhaps a few seconds or a minute, just to see if that resolves things.
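Separately, since the message names the failed host as literally NA, it may be worth checking whether any element of input_galaxus2 is missing or malformed before the loop runs. A quick sanity check (a sketch on my part, not something your code shows) could look like:

    # if an element is NA, curl can end up trying to resolve the literal
    # string "NA" as a hostname, which matches the error you see
    which(is.na(input_galaxus2))
    which(!grepl("^https?://", trimws(input_galaxus2)))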

Here's a method that will allow you to repeat your code until all URLs are pulled without error. Note that this may also need the "delay" I suggested above in order not to anger the server admins on the remote end (or firewall or whatever).

  1. Create a list in which we'll store the results. Run this code only once; all the remaining steps should be repeatable without consequence.

    out <- vector("list", length(input_galaxus2))
    
  2. Prep the session. This may be repeatable, depending on whether you have authentication or other session attributes.

    sess <- session(input_galaxus2[1])             #to start the session
    
  3. Iterate over the not-yet-filled elements of your URL vector and query as needed. If you get any errors, feel free to wait a little and re-run this code. A URL that has already succeeded will not be re-attempted, so repeat as needed; eventually (assuming the failures are intermittent and all URLs are valid) you will get all results.

    I don't think you need read_html in this pipe, but I'm not testing for fear of "slashdotting" the website. The point of this answer is to suggest a mechanism that allows you to reattempt efficiently.

    empties <- which(sapply(out, is.null))
    for (i in empties) {
      res <- tryCatch({
        sess %>%
          session_jump_to(input_galaxus2[i]) %>%
          html_nodes('.sc-1aeovxo-1.gvrGle') %>%
          html_text() %>%
          str_extract("[0-9]+") %>%
          as.integer()
      }, error = function(e) e)
      if (inherits(res, "error")) {
        warning(sprintf("failed (%i, %s): %s", i, input_galaxus2[i], conditionMessage(res)))
        # optional
        Sys.sleep(3)
      } else out[[i]] <- res
    }
    

    Note: this assumes that a NULL value means the previous attempt failed, was interrupted, or ... was not attempted. If NULL can be a valid and successful return value from your pull, then you should likely prefill out with some other "canary" value: choose something that you are more confident will "never" appear in real results, and change how you define empties above.
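
    For example, a minimal sketch of that prefill, assuming -1 can never be a real price:

        out <- rep(list(-1L), length(input_galaxus2))   # canary value instead of NULL
        empties <- which(sapply(out, identical, -1L))   # "empty" now means "still the canary"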


1 Comment

Hey, thanks for your answer. Although I will probably alter the code in the near future, this definitely helped in the short run.

Using purrr::map instead of a loop, without any Sys.sleep():

library(tidyverse)
library(rvest)

df <- tibble(
  links = c("https://www.galaxus.ch/15758734", "https://www.galaxus.ch/7362734", 
            "https://www.galaxus.ch/12073455", "https://www.galaxus.ch/20841274", 
            "https://www.galaxus.ch/20589944 ", "https://www.galaxus.ch/13595276", 
            "https://www.galaxus.ch/16255768", "https://www.galaxus.ch/6296373", 
            "https://www.galaxus.ch/14513900", "https://www.galaxus.ch/14465626", 
            "https://www.galaxus.ch/10592707", "https://www.galaxus.ch/19958785", 
            "https://www.galaxus.ch/9858343", "https://www.galaxus.ch/14513913")
)

get_prices <- function(link) {
  link %>% 
    read_html() %>%
    html_nodes(".sc-1aeovxo-1.gvrGle") %>%
    html_text2() %>% 
    str_remove_all("–")   # drop the trailing en dash Galaxus appends to prices
}

df %>%
  mutate(price = map(links, get_prices) %>%
           as.numeric())

# A tibble: 14 × 2
   links                              price
   <chr>                              <dbl>
 1 "https://www.galaxus.ch/15758734"   17.8
 2 "https://www.galaxus.ch/7362734"   500. 
 3 "https://www.galaxus.ch/12073455"  173  
 4 "https://www.galaxus.ch/20841274"  112  
 5 "https://www.galaxus.ch/20589944 "  25.4
 6 "https://www.galaxus.ch/13595276"  313  
 7 "https://www.galaxus.ch/16255768"   40  
 8 "https://www.galaxus.ch/6296373"    62.9
 9 "https://www.galaxus.ch/14513900"  539  
10 "https://www.galaxus.ch/14465626"  466. 
11 "https://www.galaxus.ch/10592707"   63.5
12 "https://www.galaxus.ch/19958785"   NA  
13 "https://www.galaxus.ch/9858343"     7.3
14 "https://www.galaxus.ch/14513913"  617  
