library(rvest)

df <- data.frame(Links = c("Qmobile_Noir-M6", "Qmobile_Noir-A1", "Qmobile_Noir-E8"))

for(i in 1:3) {
  webpage <- read_html(paste0("https://www.whatmobile.com.pk/", df$Links[i]))
  data <- webpage %>%
    html_nodes(".specs") %>%
    .[[1]] %>% 
    html_table(fill = TRUE)
}

I want the loop to work for all 3 values in df$Links, but the code above only keeps the last one. The downloaded data should also be identifiable by model (perhaps a new column with the model name).

2 Answers

The problem is in how you're structuring your for loop. It's much easier just to not use one in the first place, though, as R has great support for iterating over lists, like lapply and purrr::map. One version of how you could structure your data:

library(tidyverse)
library(rvest)

base_url <- "https://www.whatmobile.com.pk/"

models <- tibble(model = c("Qmobile_Noir-M6", "Qmobile_Noir-A1", "Qmobile_Noir-E8"),
                 link = paste0(base_url, model),
                 page = map(link, read_html))

model_specs <- models %>% 
    mutate(node = map(page, html_node, '.specs'),
           specs = map(node, html_table, header = TRUE, fill = TRUE),
           specs = map(specs, set_names, c('var1', 'var2', 'val1', 'val2'))) %>% 
    select(model, specs) %>% 
    unnest()

model_specs
#> # A tibble: 119 x 5
#>              model      var1       var2
#>              <chr>     <chr>      <chr>
#>  1 Qmobile_Noir-M6     Build         OS
#>  2 Qmobile_Noir-M6     Build Dimensions
#>  3 Qmobile_Noir-M6     Build     Weight
#>  4 Qmobile_Noir-M6     Build        SIM
#>  5 Qmobile_Noir-M6     Build     Colors
#>  6 Qmobile_Noir-M6 Frequency    2G Band
#>  7 Qmobile_Noir-M6 Frequency    3G Band
#>  8 Qmobile_Noir-M6 Frequency    4G Band
#>  9 Qmobile_Noir-M6 Processor        CPU
#> 10 Qmobile_Noir-M6 Processor    Chipset
#> # ... with 109 more rows, and 2 more variables: val1 <chr>, val2 <chr>

The data is still pretty messy, but at least it's all there.
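One way to tidy it up a little further is to keep only one category/spec/value column each. The sample below uses a tiny hand-made stand-in for model_specs (the values are hypothetical), and it assumes, as a guess, that var1 is the section, var2 the spec name, and val2 the value; the real table's column meanings may differ:

```r
# Hypothetical two-row stand-in with the same shape as model_specs
model_specs <- data.frame(
  model = "Qmobile_Noir-M6",
  var1  = c("Build", "Build"),
  var2  = c("OS", "Weight"),
  val1  = c("Qmobile Noir M6", "Qmobile Noir M6"),
  val2  = c("Android v6.0", "148 g"),
  stringsAsFactors = FALSE
)

# Keep one column per role; in a tidyverse pipeline this would be transmute()
tidy_specs <- data.frame(
  model    = model_specs$model,
  category = model_specs$var1,
  spec     = model_specs$var2,
  value    = model_specs$val2,
  stringsAsFactors = FALSE
)
```

Whether val1 or val2 (or a combination) is the right value column is something to check against the actual scraped rows.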


11 Comments

I guess header = TRUE is removing the first row? Also, creating the models vector would become difficult if df had many more records.
The first row here is a description that is not as nested; put header = FALSE if you want to keep it (though it will be repeated because of the later row structure). The right way to create the model vector would be to scrape it from a menu.
header = FALSE gives # A tibble: 122 x 5. Thanks for your time.
When I increase the models to 10, it gives an error; any idea? Error in mutate_impl(.data, dots) : Evaluation error: no applicable method for 'html_table' applied to an object of class "xml_missing".
That error means html_node found no '.specs' node on one of the pages, so html_table gets an xml_missing object. You could tack on an alternative selector separated by a comma, or replace html_table with possibly(html_table, as_data_frame(matrix(NA, ncol = 4))); the replacement returned when it errors has to have the same number of columns for set_names.
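The possibly() fallback idea can be sketched without any network access. possibly() wraps a function so that errors return a default value; the same behaviour in base R is a tryCatch wrapper (parse_specs below is a hypothetical stand-in for html_table that errors on a missing node):

```r
# Placeholder with the same number of columns as the real specs table,
# so a later set_names() call still has four columns to name
fallback <- as.data.frame(matrix(NA, ncol = 4))

# Stand-in for html_table(): errors when the node is missing
parse_specs <- function(node) {
  if (is.null(node)) stop("no applicable method for 'html_table'")
  node
}

# Equivalent to purrr::possibly(parse_specs, otherwise = fallback)
safe_parse <- function(node) {
  tryCatch(parse_specs(node), error = function(e) fallback)
}

safe_parse(NULL)  # returns the 4-column NA placeholder instead of erroring
```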

It is capturing all three values, but it writes over them with each loop. That's why you only see one value, and that value is from the last page.

You need to initialise a variable before you go into your loop; I suggest a list, so you can store the data from each successive iteration. So something like:

final_table <- list()

for (i in 1:3) {
  webpage <- read_html(paste0("https://www.whatmobile.com.pk/", df$Links[i]))
  data <- webpage %>%
    html_nodes(".specs") %>%
    .[[1]] %>%
    html_table(fill = TRUE)

  final_table[[i]] <- data.frame(data, stringsAsFactors = FALSE)
}

In this way, new data is appended to the list with each iteration.

3 Comments

Growing a list with a for loop is really slow in R due to memory allocation. In this case it's likely that other parts of the code will be slower to the point where it doesn't matter, but it's still not a good idea to use this approach in other contexts. Some of the pain can be avoided by preallocation, e.g. final_table <- vector(mode = 'list', length = 3), though if what you're assigning is large, it may be insufficient.
I am doing something similar to the OP and I use this method, and it is slow; I assume it is the way R allocates memory. My list, however, collects around 5,000 to 7,000 items by looping through around 400 separate URLs. Per your suggestion, should I preallocate when I first initialize the list, i.e. final_table <- vector(mode = 'list', length = 7000)?
Really the best way to do it is to not use a for loop at all: lapply or purrr::map (and variants) handle the iteration internally in a well-structured way that avoids the confusion that caused this problem. It won't necessarily be a lot faster in this case, though, as the bottleneck is likely the speed of the internet connection and servers. There's no simple way to vectorize the connections, so web scraping is for the moment somewhat bound to be slow.
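The preallocation point can be illustrated without any scraping; the growth pattern is the same regardless of where the data comes from (toy one-row data frames here, not the real pages):

```r
n <- 1000

# Growing: starting from an empty list, each assignment extends it,
# which historically could reallocate the whole list repeatedly
grown <- list()
for (i in 1:n) grown[[i]] <- data.frame(id = i)

# Preallocated: the list's length is fixed up front, slots are just filled
prealloc <- vector(mode = "list", length = n)
for (i in 1:n) prealloc[[i]] <- data.frame(id = i)

identical(grown, prealloc)  # TRUE: same result, cheaper allocation
```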
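As a sketch of that suggestion, the original loop could be rewritten with lapply: build the URLs first, then apply one scraping function per URL, naming the result by model so the source stays identifiable. The final call is left commented out because it needs network access, and it assumes the site structure from the question:

```r
links <- c("Qmobile_Noir-M6", "Qmobile_Noir-A1", "Qmobile_Noir-E8")
urls  <- paste0("https://www.whatmobile.com.pk/", links)

# Fetch one page and return its first .specs table (requires the rvest package)
scrape_specs <- function(url) {
  page   <- rvest::read_html(url)
  tables <- rvest::html_nodes(page, ".specs")
  rvest::html_table(tables[[1]], fill = TRUE)
}

# One data frame per model, named by model; run when online:
# final_table <- setNames(lapply(urls, scrape_specs), links)
```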
