
I have a dataset on a GitHub page. I imported it into RStudio as a CSV file and created an array of URLs called "StoryLink". Now I want to scrape data from each of these web pages, so I created a for loop, assigned the collected data to a variable called "articleText", and converted the parsed page to a variable called "ArticlePage".

My problem is that even though I created a for loop, it only scrapes the last web page (the 6th article) in the list of URLs. How do I scrape all the URLs?

library(rvest)
library(dplyr)

GitHubpoliticsconversions <- "https://raw.githubusercontent.com/lukanius007/web_scraping_politics/main/politics_conversions.csv"

CSVFile <- read.csv(GitHubpoliticsconversions, header = TRUE, sep = ",")

StoryLink <- c(pull(CSVFile, 4))   # the 4th column holds the article URLs

page <- {}

for (i in 1:6) {
  page[i] <- c(StoryLink[i])

  ArticlePage <- read_html(page[i])

  articleText <- ArticlePage %>% html_elements(".lead , .article__title") %>% html_text()
  PoliticalArticles <- c(articleText)   # reassigned on every pass
}

This is the result I got from this code, but I need the same from all of the web pages:

> PoliticalArticles
[1] "Wie es zur Hausdurchsuchung bei Finanzminister Blümel kam"
[2] "Die Novomatic hatte den heutigen Finanzminister 2017 um Hilfe bei Problemen im Ausland gebeten – und eine Spende für die ÖVP angeboten. Eine solche habe er nicht angenommen, sagt Blümel."
Comment: Does this help? stackoverflow.com/a/27153589/6851825 You are currently creating one object, PoliticalArticles, and overwriting it with each loop iteration. At the end it holds only the most recently assigned value.
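The overwriting described in that comment is easy to reproduce in isolation. A minimal sketch (the variable name here is just for illustration):

x <- NULL
for (i in 1:3) {
  x <- i * 10   # each pass replaces the previous value of x
}
x
# [1] 30  -- only the final iteration survives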

1 Answer


You need to store your retrieved website data in a data structure that can grow progressively, e.g. a list.

You can assign elements to a (previously created) list inside a for loop by using i as the list index. In the example below we simply store the result of each 2*i calculation in data_list. Results can then be retrieved by accessing the list element, e.g. data_list[1].

data_list <- list()

for (i in 1:10) {
  data_list[[i]] <- 2 * i   # store each iteration's result under index i
}

data_list

data_list[1]
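One detail worth knowing when reading results back: data_list[1] (single brackets) returns a one-element list containing the value, whereas data_list[[1]] (double brackets) returns the stored value itself.

data_list[1]    # a list of length 1
# [[1]]
# [1] 2
data_list[[1]]  # the value itself
# [1] 2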

In your example, you can do exactly the same. N.B. I have slightly altered and simplified your code: I iterate over your website list directly, so i is each web URL. Results are then stored, as outlined above, in a list that progressively grows in size and can be accessed via pages[1], or by the respective URL, e.g. pages["https://www.diepresse.com/5958204"].

library(rvest)
library(dplyr)

GitHubpoliticsconversions <- "https://raw.githubusercontent.com/lukanius007/web_scraping_politics/main/politics_conversions.csv"

CSVFile <- read.csv(GitHubpoliticsconversions, header = TRUE, sep = ",")

StoryLink <- c(pull(CSVFile, 4))

pages <- list()

for (i in StoryLink) {
  ArticlePage <- read_html(i)

  articleText <- ArticlePage %>% html_elements(".lead , .article__title") %>% html_text()
  pages[[i]] <- c(articleText)   # stored under the URL as its name, so nothing is overwritten
}
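Once the loop has finished, the results can be pulled out individually. A short usage sketch (the index and URL below are illustrative, taken from the example above):

length(pages)                                  # one element per scraped URL
pages[[1]]                                     # title and lead of the first article
pages[["https://www.diepresse.com/5958204"]]   # or look an article up by its URL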

1 Comment

Thank you very much. This way is perfect because not only can I extract the text as a whole but also each article separately, and the list is allowed to grow progressively. I am new to programming, and thank you very much for clarifying this in an understandable way.
