
I have a dataset on a GitHub page. I imported it into RStudio as a CSV file and created an array of URLs called "StoryLink". Now I want to scrape data from each of these web pages, so I created a for loop, assigned the collected data to a variable called "articleText", and converted the parsed page to a variable called "ArticlePage".

My problem is that even though I created a for loop, it only scrapes the last web page (the 6th article) in the list of URLs. How do I scrape all the URLs?

library(rvest)
library(dplyr)

GitHubpoliticsconversions <- "https://raw.githubusercontent.com/lukanius007/web_scraping_politics/main/politics_conversions.csv"

CSVFile <- read.csv(GitHubpoliticsconversions, header = TRUE, sep = ",")

StoryLink <- c(pull(CSVFile, 4))   # the 4th column holds the article URLs

page <- {}

for (i in 1:6) {
  page[i] <- c(StoryLink[i])

  ArticlePage <- read_html(page[i])

  articleText <- ArticlePage %>% html_elements(".lead , .article__title") %>% html_text()
  PoliticalArticles <- c(articleText)   # reassigned on every pass
}

This is the result I got from this code, but I need the same from all of the web pages:

> PoliticalArticles
[1] "Wie es zur Hausdurchsuchung bei Finanzminister Blümel kam"
[2] "Die Novomatic hatte den heutigen Finanzminister 2017 um Hilfe bei Problemen im Ausland gebeten – und eine Spende für die ÖVP angeboten. Eine solche habe er nicht angenommen, sagt Blümel."
Comment: Does this help? stackoverflow.com/a/27153589/6851825 You are currently creating one object, PoliticalArticles, and overwriting it with each loop iteration. At the end it holds only the most recently assigned value.
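The overwriting described in that comment is easy to reproduce in isolation. A minimal sketch (the variable name here is just for illustration):

x <- NULL
for (i in 1:3) {
  x <- i * 10   # each pass replaces the previous value of x
}
x
# [1] 30  -- only the final iteration survives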

1 Answer


You need to store your retrieved website data in a data structure that can grow progressively, e.g. a list.

You can assign elements to a (previously created) list inside a for loop by using i as the list index. In the example below we simply store the result of each 2*i calculation in data_list. Results can then be retrieved by accessing the list element, e.g. data_list[1].

data_list <- list()

for (i in 1:10) {
  data_list[[i]] <- 2 * i   # store each iteration's result under index i
}

data_list

data_list[1]
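One detail worth knowing when reading results back: data_list[1] (single brackets) returns a one-element list containing the value, whereas data_list[[1]] (double brackets) returns the stored value itself.

data_list[1]    # a list of length 1
# [[1]]
# [1] 2
data_list[[1]]  # the value itself
# [1] 2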

In your example, you can do exactly the same. N.B. I have slightly altered and simplified your code: I iterate over your website list directly, so i is each web URL. Results are then stored, as outlined above, in a list that progressively grows in size and can be accessed via pages[1], or by the respective URL, e.g. pages["https://www.diepresse.com/5958204"].

library(rvest)
library(dplyr)

GitHubpoliticsconversions <- "https://raw.githubusercontent.com/lukanius007/web_scraping_politics/main/politics_conversions.csv"

CSVFile <- read.csv(GitHubpoliticsconversions, header = TRUE, sep = ",")

StoryLink <- c(pull(CSVFile, 4))

pages <- list()

for (i in StoryLink) {
  ArticlePage <- read_html(i)

  articleText <- ArticlePage %>% html_elements(".lead , .article__title") %>% html_text()
  pages[[i]] <- c(articleText)   # stored under the URL as its name, so nothing is overwritten
}
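Once the loop has finished, the results can be pulled out individually. A short usage sketch (the index and URL below are illustrative, taken from the example above):

length(pages)                                  # one element per scraped URL
pages[[1]]                                     # title and lead of the first article
pages[["https://www.diepresse.com/5958204"]]   # or look an article up by its URL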

1 Comment

Thank you very much. This way is perfect because not only can I extract the text as a whole but also each article separately, and the list is allowed to grow progressively. I am new to programming, and thank you very much for clarifying this in an understandable way.
