
I am trying to build a data frame of color IDs, descriptions, and dates from this site, which takes day and month input through dropdown menus and returns what I think is a dynamically generated JavaScript page. I'm new to coding and thought this would be a fun toy project. I'd like to use RSelenium to automate the dropdown selection and rvest to scrape the generated content. The data frame structure I'm hoping for looks like:

description, date, meta
"paragraph about birthday", Jun 01, "DAFFODIL PANTONE 17-1512 POWERFUL KNOWING EXPRESSIVE"

As a first step, I'm using a for loop to iterate through each month of the year for a single day; from there I plan to work up to every day of every month.

I'm stuck on simply getting the loop to iterate through each month and retrieve the content. I could use some conceptual help on this part of the task first, and I appreciate any insight!

library(RSelenium)
library(rvest)
library(tidyverse)
library(xml2)

## first run: docker run -d -p 4445:4444 selenium/standalone-chrome
## open a new connection to Chrome
remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
                                 port = 4445L,
                                 browserName = "chrome")

remDr$open()
remDr$navigate("https://www.pantone.com/pages/iphone/iphone_colorstrology.html#___1__") #Entering our URL gets the browser to navigate to the page
remDr$screenshot(display = TRUE) 

#### create list of month/days
 month_day<- read_html(remDr$getPageSource()[[1]])
 page_i <- month_day %>%
   html_nodes(".list") %>%
   html_children() %>% 
   html_text()

months <- page_i[1:12]
months <- (paste("'", months,"'", sep=''))
days <- page_i[13:43]
days <- as.numeric(days)


## create an object for month xpath elements
for (m in months){
  elements <- paste0("//option[contains(text(),",months,")]")
}

## attempt at loop

total <- data.frame()

for (e in elements){
remDr$navigate("https://www.pantone.com/pages/iphone/iphone_colorstrology.html#___1__") 
      print(e)
      month <- remDr$findElement(using = 'xpath', e)
      month$clickElement()
      day <- remDr$findElement(using = 'xpath', "//select[@id='lstDay']//option[5]") ## arbitrarily picking the 5th of each month
      day$clickElement()
      submit <- remDr$findElement(using = 'xpath', "/html[1]/body[1]/form[1]/div[1]/a[1]")
      submit$clickElement()
      html <- read_html(remDr$getPageSource()[[1]])
      description <- html %>%  html_nodes(xpath = "//tr//tr[2]//td[1]") %>% html_text() %>% gsub("^\\s+|\\s+$", "", .)
      meta <- html %>% html_nodes(xpath = "//td[@id='tdBg']") %>%  html_text() %>% gsub("^\\s+|\\s+$", "", .) 
      date <- html %>% html_nodes(xpath = "//td[@id='bgHeaderDate']//div") %>%  html_text() %>% gsub("^\\s+|\\s+$", "", .)
      df <- data.frame(cbind(description,meta,date))
      total <- rbind(total, df)
}

I'm not getting any errors, but the results are unexpected each time. The loop either repeats a single month/day combination (e.g. Jan 05 twelve times) or returns a few combinations several times each (e.g. Jan 05 three times, Apr 05 three times, etc.).

  • Does it have to be with selenium? (I get that this is a project for you to learn) Commented Jul 14, 2019 at 0:27
  • Absolutely does not have to be selenium. If there's a pure R or tidyverse solution, I'm all ears. Commented Jul 14, 2019 at 0:58

2 Answers


I will come back and update this to pick up on my suggestions. Navigate to that page, open the dev tools in a browser such as Chrome with F12, and go to the Network tab. Then select a month and day and hit View Now. You will see traffic appear in the Network tab. The page makes a POST XHR request to get the content you see after clicking the view icon.


The POST request itself is very simple and has a body (form) consisting of the month and the day you selected:


So, you can mimic that POST request and then parse the response. An example for the date you mentioned could be:

library(rvest)

body <- list('month' = 6,'day' = 1)
url <- 'https://www.pantone.com/pages/iphone/iphone_colorstrology_results.aspx'
page <- html_session(url) %>%
  rvest:::request_POST(url, body = body, encode = "form") %>%
  read_html()

date <- page %>% html_node('table table td') %>% html_text() %>% 
  gsub('^\\s+|\\s+$|[\r\n\t]', '', .)
description <- page %>% html_node('tr:nth-of-type(2) div') %>% html_text() %>% 
  gsub('^\\s+|\\s+$|[\r\n\t]', '', .)
meta <- page %>% html_nodes('#tdBg span') %>% html_text()

df <- data.frame(date, description, meta)
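
Note that rvest:::request_POST is an internal (unexported) rvest function, so it may change or disappear between package versions. Below is a minimal sketch of the same request sent with httr::POST directly. It assumes the endpoint and the CSS selectors above still behave as described; the helper name fetch_colorstrology is hypothetical, and collapsing the meta spans into one string is my assumption about the desired one-row shape.

```r
# Sketch only: same POST as above, sent with httr instead of rvest internals.
# fetch_colorstrology is a hypothetical helper name, not part of any package.
fetch_colorstrology <- function(month, day) {
  url <- "https://www.pantone.com/pages/iphone/iphone_colorstrology_results.aspx"
  resp <- httr::POST(url, body = list(month = month, day = day), encode = "form")
  httr::stop_for_status(resp)  # stop with an error on HTTP failure codes
  page <- xml2::read_html(httr::content(resp, as = "text", encoding = "UTF-8"))

  # Trim leading/trailing whitespace and drop line breaks and tabs
  clean <- function(x) gsub("^\\s+|\\s+$|[\r\n\t]", "", x)

  data.frame(
    date = clean(rvest::html_text(rvest::html_node(page, "table table td"))),
    description = clean(rvest::html_text(rvest::html_node(page, "tr:nth-of-type(2) div"))),
    # Assumption: join all spans into one string so each date yields one row
    meta = paste(rvest::html_text(rvest::html_nodes(page, "#tdBg span")), collapse = " "),
    stringsAsFactors = FALSE
  )
}
```

Calling fetch_colorstrology(6, 1) would then request the same date as the example above.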

Now, and this is what I will revisit later, the above could be converted into a function that returns a list or data frame, and the results could then be combined into a final data frame. You could generate each body in advance and pass it as an argument to the function. I would look at using a Session object (an HTTP session) for the efficiency of re-using the current connection. The month and day could be updated in the form body during a loop or nested loop, depending on how they are to be generated. I am new to R and know it doesn't have dictionaries, but perhaps it has named lists, or some such, whereby you can scrape the month-to-value associations from the original page to use in looping. I would welcome learning from more experienced R people how the above might be achieved, as there are some gaps in my R knowledge that stop me from completing this today. Someone may post an answer along similar lines, which would be helpful.
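
On the question of dictionaries: R does have named vectors and named lists, which can serve as a simple lookup. The following is a small sketch, assuming the dropdown uses full English month names and sends the values 1 to 12 (the example body above sends month = 6 for June); month_lookup is a name chosen for illustration.

```r
# month.name is a built-in base R constant: "January", ..., "December".
month_lookup <- setNames(seq_along(month.name), month.name)

month_lookup[["June"]]   # look up a month's number by its name
names(month_lookup)[6]   # reverse direction: the name for month number 6

# Build a POST body from a month name and a day
body <- list(month = month_lookup[["June"]], day = 1)
```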


Generating the POST request bodies:

Looking at the dropdowns, they cover a standard year, so you can generate the required POST bodies in a nested for loop. I use 1 to 12 for the months and lubridate to return the number of days in each month for a standard year:

library(lubridate)

for(i in seq(1,12)){
  date <- as.Date(gsub('placeholder',i, "2019-placeholder-01"), "%Y-%m-%d")
  days <- days_in_month(date)[[1]]
  for(j in seq(1,days)){
    body = list('month' = i,'day' = j)
    # pass body to function or add to an iterable for later looping
  }
}
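
As an alternative sketch in base R only (no lubridate), the bodies can be collected into a list up front by enumerating the dates of one year. This assumes a non-leap year is what the site expects; 2019 is used only as a convenient example, and all_days and bodies are illustrative names.

```r
# Every calendar day of a non-leap example year
all_days <- seq(as.Date("2019-01-01"), as.Date("2019-12-31"), by = "day")

# Extract month and day numbers as integers
months <- as.integer(format(all_days, "%m"))
days <- as.integer(format(all_days, "%d"))

# One list(month =, day =) per date, ready to pass as a POST body
bodies <- Map(function(m, d) list(month = m, day = d), months, days)
```

Each element of bodies can then be handed to the request function in a loop or with lapply.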

3 Comments

Thank you for the suggestions. I like the clean rvest pattern and recommendations. Avoiding XPath is a nice approach as well. I am still trying to understand how to format the month/day combinations and the actual loop. Will keep trying.
I haven’t finished with it yet. I just need to teach myself how to do the next bit in R; I can do it in other languages. I think I have the main pieces.
The day/month combinations are already correct for doing the POST in the loop. I just need to complete the conversion of the top bit to a function that can work with Session.

Found a reasonable solution! It's not perfect, but it gets me a lot closer than I was before. I ended up writing a function per @QHarr's suggestion and using their rvest pattern:

library(rvest)

colorstrology <- function(i,j){

  body <- list('month' = i,'day' = j)
  url <- 'https://www.pantone.com/pages/iphone/iphone_colorstrology_results.aspx'
  page <- html_session(url) %>%
    rvest:::request_POST(url, body = body, encode = "form") %>%
    read_html()

  date <- page %>% html_node('table table td') %>% html_text() %>% 
    gsub('^\\s+|\\s+$|[\r\n\t]', '', .)
  description <- page %>% html_node('tr:nth-of-type(2) div') %>% html_text() %>% 
    gsub('^\\s+|\\s+$|[\r\n\t]', '', .)
  meta <- page %>% html_nodes('#tdBg span') %>% html_text()

  df <- data.frame(date, description, meta)
}



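months <- 1:12
days <- 1:31  # note: this also requests dates that don't exist, e.g. Feb 30

df <- data.frame()  # start empty; one result is appended per request
for (m in months){
  for (d in days){
    temp <- colorstrology(m,d)
    df <- rbind(df, temp)  # append so rows stay in calendar order
  }
}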
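
A possible refinement, sketched under assumptions: request only dates that exist, and skip any request that fails instead of stopping the whole run. collect_results and fake_fetch are hypothetical names; fetch can be any function(month, day) that returns a data frame, such as colorstrology() above. The demonstration uses a stand-in fetcher so it runs without contacting the site.

```r
# collect_results is a hypothetical helper: it calls fetch(month, day) for
# each pair and row-binds whatever data frames come back.
collect_results <- function(months, days, fetch) {
  rows <- Map(function(m, d) {
    tryCatch(fetch(m, d), error = function(e) NULL)  # failed request -> NULL
  }, months, days)
  rows <- Filter(Negate(is.null), rows)  # drop the failed requests
  do.call(rbind, rows)
}

# Valid month/day pairs for a non-leap example year (no Feb 30 and the like)
all_days <- seq(as.Date("2019-01-01"), as.Date("2019-12-31"), by = "day")
months <- as.integer(format(all_days, "%m"))
days <- as.integer(format(all_days, "%d"))

# Offline demonstration with a stand-in fetcher instead of the real site
fake_fetch <- function(m, d) data.frame(month = m, day = d)
demo <- collect_results(months[1:3], days[1:3], fake_fetch)
```

With the real function, the call would be collect_results(months, days, colorstrology), which sends 365 requests, so adding a pause such as Sys.sleep() between calls may be considerate.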


