I'm scraping some newspaper sites for articles related to the subject "the fourth industrial revolution".
The script is supposed to open the site, log in, search for "fjerde industrielle revolution" ("fourth industrial revolution" in Danish), make all search results accessible, put all the headlines into a vector, iterate over the headlines in a function, and get the articles behind them.
I can open the results one by one and scrape them, but I need the script to iterate over them in a function and scrape all the articles.
If you see places in the code that could be improved, please let me know.
Thank you in advance for any feedback. Anders
```
library(rvest)
library(RSelenium)
library(dplyr)
library(tidyverse)
library(tidytext)
rD <- rsDriver(browser = "firefox")
driver <- rD[["client"]]
driver$navigate("https://www.berlingske.dk/")
# Accept only the necessary cookies ("Kun nødvendige")
Sys.sleep(1)
element <- driver$findElement(using = "link text", "Kun nødvendige")
Sys.sleep(1)
element$clickElement()
# Click "LOG IND" (log in)
Sys.sleep(1)
element <- driver$findElement("link text", "LOG IND")
#element$highlightElement()
element$clickElement()
Sys.sleep(1)
# Find the username field
element <- driver$findElement(using = "id", "email")
Sys.sleep(1)
# Enter the username
element$sendKeysToElement(list("*******@hotmail.com"))
Sys.sleep(1)
# Find the password field
element <- driver$findElement(using = "id", "password")
# Enter the password and submit
element$sendKeysToElement(list("**********", key = "enter"))
Sys.sleep(1)
#Find menu-icon and click it
element <- driver$findElement(using = "css selector", ".lp_nav_menu > ul:nth-child(3) > li:nth-child(5) > a:nth-child(1)")
element$clickElement()
Sys.sleep(1)
# Select the search input box
element <- driver$findElement(using = "css", "#site-search")
#element$clickElement()
# Send the search text to the input box and submit
Sys.sleep(1)
element$sendKeysToElement(list("fjerde industrielle revolution", key = "enter"))
Sys.sleep(1)
# Click "load more" until all search results are shown
tryCatch({
  Sys.sleep(1)
  suppressMessages({
    loadmore <- driver$findElement("css selector", "button.btn:nth-child(1)")
    while (loadmore$isElementDisplayed()[[1]]) {
      loadmore$clickElement()
      Sys.sleep(1)
      loadmore <- driver$findElement("css selector", "button.btn:nth-child(1)")
    }
  })
}, error = function(e) {
  NA_character_
})
# Get headlines - works, makes a character vector of unique headlines
# (note: unique() takes no second argument here; piping headers into
# unique(element) would pass `element` as the `incomparables` argument)
element <- driver$findElements(using = "css selector", "h4:nth-child(2) > a:nth-child(1)")
headers <- unlist(lapply(element, function(x) x$getElementText())) %>% unique()
#Opens the first link
element <- driver$findElement(using="css selector", "h4:nth-child(2) > a:nth-child(1)")
element$clickElement()
```
This code is used after a search result has been opened:
```
#Finds and gets headline
artikel1_overskrift <- driver$findElement(using="css", value=".article-header__title")
artikel1_overskrift <- artikel1_overskrift$getElementText()
#Finds and gets the intro
artikel_indledning <- driver$findElement(using="css", value="#articleHeader > p")
artikel_indledning <- artikel_indledning$getElementText()
#Finds and gets element holding date etc.
artikel_dato.m.m. <- driver$findElement(using="css", value=".col-lg-11")
artikel_dato.m.m. <- artikel_dato.m.m.$getElementText()
#Finds and gets body of article
artikel1_body <- driver$findElement(using = "css", value="#articleBody")
artikel1_body <- artikel1_body$getElementText()
```
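
For the iteration, it may help to wrap the four per-article steps above in a single function. This is only a sketch: the function name `scrape_article` and the returned tibble layout are my own suggestions, while the CSS selectors are the ones taken from the code above.

```r
# Sketch: the per-article scraping steps as one reusable function.
# `scrape_article` and the tibble layout are suggested names;
# the CSS selectors are copied from the code above.
scrape_article <- function(driver) {
  get_text <- function(selector) {
    el <- driver$findElement(using = "css", value = selector)
    unlist(el$getElementText())
  }
  tibble::tibble(
    overskrift = get_text(".article-header__title"),  # headline
    indledning = get_text("#articleHeader > p"),      # intro
    dato_m_m   = get_text(".col-lg-11"),              # date etc.
    body       = get_text("#articleBody")             # article body
  )
}
```

With this in place, each opened article page needs only one call, `scrape_article(driver)`, and the results stack neatly with `dplyr::bind_rows()`.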
Edit:

I want this function to iterate over a list containing the headers, comparing them to the headlines of the search results (i.e. the links to the articles), but R throws an error:

    comparison (1) is possible only for atomic and list types

I've tried a tibble, data_frame, list and character/atomic vector with no change.
Does somebody have a suggestion as to what could be causing the error?
```
for (i in seq_along(headers)) {
  if (headers[i] == driver$findElement(using = "css selector", value = ".teaser__title-link")) {
    element <- driver$findElement(using = "css selector", value = ".teaser__title-link")
    element$clickElement()
  } else {
    print("No luck!")
  }
}
```
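
For what it's worth, the error comes from the comparison itself: `driver$findElement()` returns a webElement object, not text, so `headers[i] == driver$findElement(...)` compares a string to a non-atomic object; you would need `$getElementText()` before comparing. Clicking a result also navigates away from the results page, which breaks the loop on the next iteration. A sketch of one way around both problems, assuming the result-link selector from above and using `scrape_article()` as a placeholder for whatever per-article scraping code you use: collect each result's URL up front, then navigate to the URLs one by one.

```r
# Sketch: gather the href of every search result first, then visit each URL.
# The selector is the one from the question; `scrape_article` is a
# placeholder for the per-article scraping code.
links <- driver$findElements(using = "css selector",
                             "h4:nth-child(2) > a:nth-child(1)")
urls <- unique(unlist(lapply(links, function(x) x$getElementAttribute("href"))))

artikler <- vector("list", length(urls))
for (i in seq_along(urls)) {
  driver$navigate(urls[[i]])
  Sys.sleep(1)
  artikler[[i]] <- scrape_article(driver)  # your scraping code here
}
```

Because the URLs are saved before any navigation happens, losing the results page no longer matters, and there is no string-to-element comparison at all.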