I'm new to web scraping and wanted to learn by scraping the Keurig website for fun, extracting information about some of the K-Cups for sale. My goal is to go to the K-Cups page, click through to every K-Cup, and extract details such as whether it is caffeinated, the roast color, and maybe the origin. I can tackle that part later; right now I'm having trouble finding the right CSS selector, or a way to automate visiting every item to get the extra info. I did this:

library(rvest)
keurig <- read_html("http://www.keurig.com/beverages/k-cup-pods")
# Grab the CSS Nodes from the website
keurig.html <- html_nodes(keurig, ".keurig_card")
keurig.text <- html_text(keurig.html)
# Print the text
keurig.text

I ended up with a lot of tab and newline characters, with some of the coffee names in between. How would I scrape this data to grab the info about every K-Cup?

If it is unstructured data (I cannot access keurig.com at the moment), you would have to use regular expressions. See here for a good introduction: stackoverflow.com/documentation/r/1123/… I'd guess that regular expressions combined with CSS selectors would work quite well. Commented Jul 31, 2017 at 4:20
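
For the tab and newline clutter specifically, a small regular expression is often enough once the nodes are selected. A minimal sketch, using an inline HTML snippet as a stand-in for the real page (the markup and product names below are made up for illustration; only the .keurig_card class comes from the question):

```r
library(rvest)

# Hypothetical stand-in for the listing page; the real markup may differ.
page <- read_html('<div class="keurig_card">
      Breakfast Blend
        Coffee
  </div>
  <div class="keurig_card">\t\tDark Magic\n\n</div>')

cards <- html_nodes(page, ".keurig_card")
raw_text <- html_text(cards)

# Collapse every run of whitespace (tabs, newlines, spaces) into one space,
# then strip leading and trailing whitespace.
clean_text <- trimws(gsub("\\s+", " ", raw_text))
clean_text
```

This only tidies the text; it does not separate the individual fields inside a card, which would still need more specific selectors.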

1 Answer

Use this to get the links for every item:

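library(rvest)
keurig <- read_html("http://www.keurig.com/beverages/k-cup-pods")
# Select the nodes that carry the class "product_name"
keurig.html <- html_nodes(keurig, ".product_name")
# Pull the link out of each node's href attribute
links <- html_attr(keurig.html, name = "href")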

The nodes holding the link to each item have the class product_name. Once you have the nodes, extract their href attribute.

Result (first four shown):

 [1] "/Beverages/Coffee/Regular/Breakfast-Blend-Coffee/p/Breakfast-Blend-Coffee-K-Cup-Green-Mountain"                          
 [2] "/Beverages/Coffee/Regular/Dark-Magic%C2%AE-Extra-Bold-Coffee/p/Dark-Magic-Extra-Bold-Coffee-K-Cup-Green-Mountain"        
 [3] "/Beverages/Coffee/Regular/The-Original-Donut-Shop%C2%AE-Coffee/p/Original-Donut-Shop-Extra-Bold-Coffee-K-Cup-CP"         
 [4] "/Beverages/Coffee/Regular/Nantucket-Blend%C2%AE-Coffee/p/Nantucket-Blend-Coffee-K-Cup-Green-Mountain"

Then use paste0 to create the link to each K-Cup's details page. The links start with "/", so prepend the site root rather than the listing page URL:

paste0("http://www.keurig.com",
       "/Beverages/Coffee/Regular/Breakfast-Blend-Coffee/p/Breakfast-Blend-Coffee-K-Cup-Green-Mountain")
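
Because the extracted links begin with "/", they are relative to the site root, not to the listing page. A sketch of resolving them with xml2::url_absolute and then visiting each details page follows. The helper name scrape_details and its selector argument are hypothetical, and ".product-details" is only a placeholder; the real details pages would need to be inspected to pick a selector.

```r
library(rvest)
library(xml2)

listing_url <- "http://www.keurig.com/beverages/k-cup-pods"

# One of the relative links from the result above; in practice this would be
# the full vector returned by html_attr().
links <- c(
  "/Beverages/Coffee/Regular/Breakfast-Blend-Coffee/p/Breakfast-Blend-Coffee-K-Cup-Green-Mountain"
)

# Resolve each link against the listing page URL. A leading "/" replaces the
# whole path, so the result starts at the site root.
detail_urls <- url_absolute(links, listing_url)

# Hypothetical helper: read one details page and return the trimmed text of
# the nodes matching `selector`.
scrape_details <- function(url, selector) {
  page <- read_html(url)
  html_text(html_nodes(page, selector), trim = TRUE)
}

# Example call (needs network access; the selector is a placeholder):
# details <- lapply(detail_urls, scrape_details, selector = ".product-details")
```

Looping with lapply keeps one result per page, so a page that yields nothing for the selector shows up as an empty entry instead of silently shifting the others.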

2 Comments

I tried to use this, but when I try to read the text inside the link with html_text I get the following error: Error in UseMethod("xml_text") : no applicable method for 'xml_text' applied to an object of class "character". Any idea how I can fix this?
@Rollo99 Try linkText = html_text(keurig.html)
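
The error above arises because html_attr returns a plain character vector, while html_text expects nodes. A small sketch with made-up markup (only the product_name class is taken from the answer) shows the difference:

```r
library(rvest)

# Hypothetical snippet mimicking two product links; the real markup may differ.
page <- read_html('<a class="product_name" href="/p/breakfast-blend">Breakfast Blend</a>
  <a class="product_name" href="/p/dark-magic">Dark Magic</a>')

nodes <- html_nodes(page, ".product_name")

links <- html_attr(nodes, "href")  # character vector of link targets
link_text <- html_text(nodes)      # visible text, taken from the nodes

class(links)
# html_text(links) would fail here: links holds strings, not nodes.
# To read text from a linked page, download it first with read_html().
```

In other words, html_text(keurig.html) gives the text of the links on the listing page, whereas the text of each details page requires a separate read_html call per URL.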
