How to scrape web data in R that requires clicking a link?

Question

I'm new to web scraping and wanted to learn by scraping the Keurig website for fun, extracting information about some of the K-Cups for sale. My goal is to go to the K-Cups page, click on every K-Cup, and extract details such as whether it is caffeinated, the roast color, and maybe the origin. I can tackle the extraction later; right now I'm having trouble finding the right CSS selector, or a way to automate clicking every item to get the extra info. I did this:

library(rvest)
keurig <- read_html("http://www.keurig.com/beverages/k-cup-pods")
# Grab the CSS Nodes from the website
keurig.html <- html_nodes(keurig, ".keurig_card")
keurig.text <- html_text(keurig.html)
# Print the text
keurig.text

I ended up getting a lot of tab and new line characters with some of the coffee names in between. How exactly would I scrape this data to grab the info about every k-cup?

If it is unstructured data (I cannot access Keurig.com at the moment), you would have to use regular expressions. See here for a good introduction: stackoverflow.com/documentation/r/1123/… I guess that regular expressions combined with CSS selectors might work quite well.
– Jan, Commented Jul 31, 2017 at 4:20

R. Schifini · Accepted Answer · 2017-07-31 05:21:32Z


Use this to get the links for every item:

library(rvest)
keurig <- read_html("http://www.keurig.com/beverages/k-cup-pods")
keurig.html <- html_nodes(keurig, ".product_name")
links <- html_attr(keurig.html, name = "href")

The links to every item are on the elements with class product_name. Once you have those nodes, extract their href attribute.

Result (first four shown):

 [1] "/Beverages/Coffee/Regular/Breakfast-Blend-Coffee/p/Breakfast-Blend-Coffee-K-Cup-Green-Mountain"                          
 [2] "/Beverages/Coffee/Regular/Dark-Magic%C2%AE-Extra-Bold-Coffee/p/Dark-Magic-Extra-Bold-Coffee-K-Cup-Green-Mountain"        
 [3] "/Beverages/Coffee/Regular/The-Original-Donut-Shop%C2%AE-Coffee/p/Original-Donut-Shop-Extra-Bold-Coffee-K-Cup-CP"         
 [4] "/Beverages/Coffee/Regular/Nantucket-Blend%C2%AE-Coffee/p/Nantucket-Blend-Coffee-K-Cup-Green-Mountain"
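
The same nodes also hold the product names, and the stray tab and newline characters mentioned in the question can be dropped with html_text(trim = TRUE). A minimal sketch, assuming markup like the inline snippet below; the snippet is a hypothetical stand-in for the listing page, and the real page may be structured differently:

```r
library(rvest)

# Hypothetical markup standing in for the listing page; the real page may differ.
listing <- read_html('
  <div class="keurig_card">
    <a class="product_name" href="/Beverages/Coffee/Regular/Breakfast-Blend-Coffee/p/Breakfast-Blend-Coffee-K-Cup-Green-Mountain">
        Breakfast Blend Coffee
    </a>
  </div>')

nodes <- html_nodes(listing, ".product_name")

# One row per product: the trimmed name plus its relative link
products <- data.frame(
  name = html_text(nodes, trim = TRUE),
  href = html_attr(nodes, "href"),
  stringsAsFactors = FALSE
)
products
```

Keeping the name and the link side by side makes it easier to match whatever is scraped from each details page back to its product.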

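Then use paste0 to create the link to each item's details page. The hrefs start with "/", so they are relative to the site root and should be appended to the domain, not to the listing-page URL:

paste0("http://www.keurig.com", 
       "/Beverages/Coffee/Regular/Breakfast-Blend-Coffee/p/Breakfast-Blend-Coffee-K-Cup-Green-Mountain")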

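As an alternative to pasting strings by hand, xml2::url_absolute() resolves each href against the page it was found on. The sketch below also shows one possible way to visit every details page. The helper get_details() and the selector ".product_details" are hypothetical placeholders, not taken from the real site, so inspect a details page to find the selector that actually holds the roast, caffeine and origin fields:

```r
library(xml2)
library(rvest)

# Listing-page URL from the answer; the href is one of the results shown above
base_url <- "http://www.keurig.com/beverages/k-cup-pods"
links <- c("/Beverages/Coffee/Regular/Breakfast-Blend-Coffee/p/Breakfast-Blend-Coffee-K-Cup-Green-Mountain")

# Resolve each root-relative href against the page it was found on
detail_urls <- url_absolute(links, base_url)

# Hypothetical helper: ".product_details" is a placeholder selector,
# not one confirmed to exist on the real site.
get_details <- function(url, selector = ".product_details") {
  Sys.sleep(1)  # pause between requests to be polite to the server
  page <- read_html(url)
  html_text(html_nodes(page, selector), trim = TRUE)
}

# Not run here, since it needs network access:
# details <- lapply(detail_urls, get_details)
```

Because the hrefs begin with "/", they resolve to URLs directly under the site root rather than under the listing-page path.
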


2 Comments

Rollo99 Over a year ago

I tried to use this, but when I try to read the text inside the link with html_text I get the following error: Error in UseMethod("xml_text") : no applicable method for 'xml_text' applied to an object of class "character". Any idea how I can fix this?

R. Schifini Over a year ago

@Rollo99 Try linkText = html_text(keurig.html)
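
The error in the comment above most likely arises because html_attr() returns a plain character vector, while html_text() expects nodes. A small sketch on a hypothetical one-link snippet (the markup and the "/p/example" href are made up for illustration):

```r
library(rvest)

# Hypothetical one-link page, used only to illustrate the two object types
page  <- read_html('<a class="product_name" href="/p/example">Example Pod</a>')
nodes <- html_nodes(page, ".product_name")

links <- html_attr(nodes, "href")  # character vector of hrefs
class(links)

# Calling html_text() on the character vector raises an error; capture it
err <- tryCatch(html_text(links), error = function(e) e)
inherits(err, "error")

# Calling it on the nodes themselves returns the link text
link_text <- html_text(nodes)
link_text
```

To read text from the page a link points to, rather than the link's own label, the URL has to be passed to read_html() first and html_text() applied to nodes selected from that page.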
