library(rvest)

df <- data.frame(Links = c("Qmobile_Noir-M6", "Qmobile_Noir-A1", "Qmobile_Noir-E8"))

for(i in 1:3) {
  webpage <- read_html(paste0("https://www.whatmobile.com.pk/", df$Links[i]))
  data <- webpage %>%
    html_nodes(".specs") %>%
    .[[1]] %>% 
    html_table(fill = TRUE)
}

I want the loop to work for all 3 values in df$Links, but the code above only keeps the last one. The downloaded data should also be identifiable by model (perhaps a new column with the model name).

2 Answers

The problem is in how you're structuring your for loop. It's much easier just to not use one in the first place, though, as R has great support for iterating over lists, like lapply and purrr::map. One version of how you could structure your data:

library(tidyverse)
library(rvest)

base_url <- "https://www.whatmobile.com.pk/"

models <- tibble(model = c("Qmobile_Noir-M6", "Qmobile_Noir-A1", "Qmobile_Noir-E8"),
                 link = paste0(base_url, model),
                 page = map(link, read_html))

model_specs <- models %>% 
    mutate(node = map(page, html_node, '.specs'),
           specs = map(node, html_table, header = TRUE, fill = TRUE),
           specs = map(specs, set_names, c('var1', 'var2', 'val1', 'val2'))) %>% 
    select(model, specs) %>% 
    unnest()

model_specs
#> # A tibble: 119 x 5
#>              model      var1       var2
#>              <chr>     <chr>      <chr>
#>  1 Qmobile_Noir-M6     Build         OS
#>  2 Qmobile_Noir-M6     Build Dimensions
#>  3 Qmobile_Noir-M6     Build     Weight
#>  4 Qmobile_Noir-M6     Build        SIM
#>  5 Qmobile_Noir-M6     Build     Colors
#>  6 Qmobile_Noir-M6 Frequency    2G Band
#>  7 Qmobile_Noir-M6 Frequency    3G Band
#>  8 Qmobile_Noir-M6 Frequency    4G Band
#>  9 Qmobile_Noir-M6 Processor        CPU
#> 10 Qmobile_Noir-M6 Processor    Chipset
#> # ... with 109 more rows, and 2 more variables: val1 <chr>, val2 <chr>

The data is still pretty messy, but at least it's all there.
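One way to tidy it up a little further is to keep only one category/spec/value column each. The sample below uses a tiny hand-made stand-in for model_specs (the values are hypothetical), and it assumes, as a guess, that var1 is the section, var2 the spec name, and val2 the value; the real table's column meanings may differ:

```r
# Hypothetical two-row stand-in with the same shape as model_specs
model_specs <- data.frame(
  model = "Qmobile_Noir-M6",
  var1  = c("Build", "Build"),
  var2  = c("OS", "Weight"),
  val1  = c("Qmobile Noir M6", "Qmobile Noir M6"),
  val2  = c("Android v6.0", "148 g"),
  stringsAsFactors = FALSE
)

# Keep one column per role; in a tidyverse pipeline this would be transmute()
tidy_specs <- data.frame(
  model    = model_specs$model,
  category = model_specs$var1,
  spec     = model_specs$var2,
  value    = model_specs$val2,
  stringsAsFactors = FALSE
)
```

Whether val1 or val2 (or a combination) is the right value column is something to check against the actual scraped rows.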


11 Comments

I guess header = TRUE is removing the first row? Also, creating the models vector would become difficult if df had many more records.
The first row here is a description that is not as nested; put header = FALSE if you want to keep it (though it will be repeated because of the later row structure). The right way to create the model vector would be to scrape it from a menu.
header = FALSE gives # A tibble: 122 x 5. Thanks for your time.
When I increase the models to 10, it gives an error; any idea? Error in mutate_impl(.data, dots) : Evaluation error: no applicable method for 'html_table' applied to an object of class "xml_missing".
That error means html_node found no '.specs' node on one of the pages, so html_table gets an xml_missing object. You could tack on an alternative selector separated by a comma, or replace html_table with possibly(html_table, as_data_frame(matrix(NA, ncol = 4))); the replacement returned when it errors has to have the same number of columns for set_names.
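The possibly() fallback idea can be sketched without any network access. possibly() wraps a function so that errors return a default value; the same behaviour in base R is a tryCatch wrapper (parse_specs below is a hypothetical stand-in for html_table that errors on a missing node):

```r
# Placeholder with the same number of columns as the real specs table,
# so a later set_names() call still has four columns to name
fallback <- as.data.frame(matrix(NA, ncol = 4))

# Stand-in for html_table(): errors when the node is missing
parse_specs <- function(node) {
  if (is.null(node)) stop("no applicable method for 'html_table'")
  node
}

# Equivalent to purrr::possibly(parse_specs, otherwise = fallback)
safe_parse <- function(node) {
  tryCatch(parse_specs(node), error = function(e) fallback)
}

safe_parse(NULL)  # returns the 4-column NA placeholder instead of erroring
```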

It is capturing all three values, but it writes over them with each loop. That's why you only see one value, and that value is from the last page.

You need to initialise a variable before you go into your loop; I suggest a list, so you can store the data from each successive iteration. So something like:

final_table <- list()

for (i in 1:3) {
  webpage <- read_html(paste0("https://www.whatmobile.com.pk/", df$Links[i]))
  data <- webpage %>%
    html_nodes(".specs") %>%
    .[[1]] %>%
    html_table(fill = TRUE)

  final_table[[i]] <- data.frame(data, stringsAsFactors = FALSE)
}

In this way, new data is appended to the list with each iteration.

3 Comments

Growing a list with a for loop is really slow in R due to memory allocation. In this case it's likely that other parts of the code will be slower to the point where it doesn't matter, but it's still not a good idea to use this approach in other contexts. Some of the pain can be avoided by preallocation, e.g. final_table <- vector(mode = 'list', length = 3), though if what you're assigning is large, it may be insufficient.
I am doing something similar to the OP and I use this method, and it is slow; I assume it is the way R allocates memory. My list, however, collects around 5,000 to 7,000 items by looping through around 400 separate URLs. Per your suggestion, should I preallocate when I first initialize the list, i.e. final_table <- vector(mode = 'list', length = 7000)?
Really the best way to do it is to not use a for loop at all: lapply or purrr::map (and variants) handle the iteration internally in a well-structured way that avoids the confusion that caused this problem. It won't necessarily be a lot faster in this case, though, as the bottleneck is likely the speed of the internet connection and servers. There's no simple way to vectorize the connections, so web scraping is for the moment somewhat bound to be slow.
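The preallocation point can be illustrated without any scraping; the growth pattern is the same regardless of where the data comes from (toy one-row data frames here, not the real pages):

```r
n <- 1000

# Growing: starting from an empty list, each assignment extends it,
# which historically could reallocate the whole list repeatedly
grown <- list()
for (i in 1:n) grown[[i]] <- data.frame(id = i)

# Preallocated: the list's length is fixed up front, slots are just filled
prealloc <- vector(mode = "list", length = n)
for (i in 1:n) prealloc[[i]] <- data.frame(id = i)

identical(grown, prealloc)  # TRUE: same result, cheaper allocation
```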
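As a sketch of that suggestion, the original loop could be rewritten with lapply: build the URLs first, then apply one scraping function per URL, naming the result by model so the source stays identifiable. The final call is left commented out because it needs network access, and it assumes the site structure from the question:

```r
links <- c("Qmobile_Noir-M6", "Qmobile_Noir-A1", "Qmobile_Noir-E8")
urls  <- paste0("https://www.whatmobile.com.pk/", links)

# Fetch one page and return its first .specs table (requires the rvest package)
scrape_specs <- function(url) {
  page   <- rvest::read_html(url)
  tables <- rvest::html_nodes(page, ".specs")
  rvest::html_table(tables[[1]], fill = TRUE)
}

# One data frame per model, named by model; run when online:
# final_table <- setNames(lapply(urls, scrape_specs), links)
```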
