1

I am using R with RStudio and trying to scrape data from a specific webpage using the rvest package. Below is a partial screenshot of the webpage, with the values I want to scrape circled in red.

screenshot

I am completely new to HTML and its elements, and I am having a hard time figuring out how to use the relevant HTML tags in rvest. Using Chrome DevTools, I have been able to find where each of the items I need is located in the HTML code.

I am providing the tags relevant to each item below:

Table Headers:

<thead style="width: 547px; top: 0px; z-index: auto;" class="">
<tr class="hprt-table-header">
<th class="hprt-table-header-cell -first" style="width: 134px;">
Accommodation Type
</th>
<th class="hprt-table-header-cell hprt-table-header-price" style="width: 89px;">
Today's Price
</th>
<th class="hprt-table-header-cell hprt-table-header-policies" style="width: 146px;">
Your Choices</th>

Standard Queen Room:

<a class="hprt-roomtype-link" href="#RD27576901" data-room-id="27576901" id="room_type_id_27576901" data-room-name="" data-et-click="">
<span class="hprt-roomtype-icon-link ">
Standard Queen Room
</span>

MUR 13,097:

<div class="bui-price-display__value prco-inline-block-maker-helper prco-f-font-heading " aria-hidden="true" data-et-mouseenter="
customGoal:cCcCcCDUfcXIFbcDcbNXGDJae:2
goal:desktop_room_list_price_column_hover_over_price
">
MUR&nbsp;13,097
</div>

All-Inclusive:

" id="b_tt_holder_5" aria-describedby="materialized_tooltip_1n6pi">
<span class="bicon-allinclusive mp-icon meal-plan-icon"></span>
<span class="ungreen_keep_green">
All-Inclusive
</span>

Superior Queen Room:

<a class="hprt-roomtype-link" href="#RD27576902" data-room-id="27576902" id="room_type_id_27576902" data-room-name="" data-et-click="">
<span class="hprt-roomtype-icon-link ">
Superior Queen Room
</span>

MUR 14,266:

<div class="bui-price-display__value prco-inline-block-maker-helper prco-f-font-heading " aria-hidden="true" data-et-mouseenter="
customGoal:cCcCcCDUfcXIFbcDcbNXGDJae:2
goal:desktop_room_list_price_column_hover_over_price
">
MUR&nbsp;14,266
</div>

All-Inclusive:

" id="b_tt_holder_9" aria-describedby="materialized_tooltip_n2p5s">
<span class="bicon-allinclusive mp-icon meal-plan-icon"></span>
<span class="ungreen_keep_green">
All-Inclusive
</span>

I would like to transform the output into a data frame as follows:

 Accommodation Type      Today's Price  Your Choices
 Standard Queen Room      MUR 13,097    All-Inclusive
 Superior Queen Room      MUR 14,266    All-Inclusive 

My R code currently stands as follows:

if (!require(rvest)) install.packages('rvest')

library(rvest)

url1 <- read_html("https://www.booking.com/hotel/mu/tamassa.html?aid=356980;label=gog235jc-1DCAsonQFCE2hlcml0YWdlLWF3YWxpLWdvbGZIM1gDaJ0BiAEBmAExuAEXyAEM2AED6AEB-AECiAIBqAIDuAKiwqmEBsACAdICJGFkMTQ3OGU4LTUwZDMtNGQ5ZS1hYzAxLTc0OTIyYTRiZDIxM9gCBOACAQ;sid=729aafddc363c28a2c2c7379d7685d87;all_sr_blocks=36363601_246990918_2_85_0;checkin=2021-09-04;checkout=2021-09-05;dest_id=-1354779;dest_type=city;dist=0;from_beach_key_ufi_sr=1;group_adults=2;group_children=0;hapos=1;highlighted_blocks=36363601_246990918_2_85_0;hp_group_set=0;hpos=1;no_rooms=1;room1=A%2CA;sb_price_type=total;sr_order=popularity;sr_pri_blocks=36363601_246990918_2_85_0__29200;srepoch=1619681695;srpvid=51c8354f03be0097;type=total;ucfs=1&")     

Any help would be highly appreciated.

2 Answers

1

This is not a complete solution, as this is a rather complex task.

In general, you select HTML tags/nodes with html_nodes() by passing a CSS selector built from their class or id attributes. An id is prefixed with # and a class with ., e.g. ".hprt-table-header" (as used in the code below). In your snippets the ids are specific to individual elements, so the classes are the more useful handle. The code for extracting the text is very similar for each chunk of information you are after; just modify the code below accordingly. A harder problem is working out which rows have more than one value for the "prices" and "choices".
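To make the difference between class and id selectors concrete, here is a minimal sketch that runs against a small inline HTML fragment instead of the live page. The fragment is a simplified stand-in assembled from the snippets in the question, not the real page structure. It also assumes rvest 1.0.0 or later, where html_elements() is the newer name for html_nodes() and html_text2() is available.

```r
library(rvest)

# Simplified stand-in for the real page (an assumption for illustration only).
page <- minimal_html('
  <table>
    <thead><tr class="hprt-table-header">
      <th class="hprt-table-header-cell">Accommodation Type</th>
      <th class="hprt-table-header-cell">Your Choices</th>
    </tr></thead>
    <tbody><tr>
      <td><a class="hprt-roomtype-link" id="room_type_id_27576901"><span class="hprt-roomtype-icon-link">Standard Queen Room</span></a></td>
      <td><div class="bui-price-display__value">MUR&nbsp;13,097</div></td>
    </tr></tbody>
  </table>')

# Class selector: a leading "." matches every element carrying that class.
headers <- page %>%
  html_elements(".hprt-table-header-cell") %>%
  html_text2()

# Id selector: a leading "#" matches the single element with that id.
room <- page %>%
  html_element("#room_type_id_27576901") %>%
  html_text2()

# html_text2() trims whitespace and, by default, turns &nbsp; into a plain space.
price <- page %>%
  html_element(".bui-price-display__value") %>%
  html_text2()

print(headers)
print(room)
print(price)
```

Because html_text2() already normalises whitespace, the strsplit()/unlist() clean-up used with html_text() is usually not needed in this variant.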

library(rvest)
#> Loading required package: xml2

url1 <- read_html("https://www.booking.com/hotel/mu/tamassa.html?aid=356980;label=gog235jc-1DCAsonQFCE2hlcml0YWdlLWF3YWxpLWdvbGZIM1gDaJ0BiAEBmAExuAEXyAEM2AED6AEB-AECiAIBqAIDuAKiwqmEBsACAdICJGFkMTQ3OGU4LTUwZDMtNGQ5ZS1hYzAxLTc0OTIyYTRiZDIxM9gCBOACAQ;sid=729aafddc363c28a2c2c7379d7685d87;all_sr_blocks=36363601_246990918_2_85_0;checkin=2021-09-04;checkout=2021-09-05;dest_id=-1354779;dest_type=city;dist=0;from_beach_key_ufi_sr=1;group_adults=2;group_children=0;hapos=1;highlighted_blocks=36363601_246990918_2_85_0;hp_group_set=0;hpos=1;no_rooms=1;room1=A%2CA;sb_price_type=total;sr_order=popularity;sr_pri_blocks=36363601_246990918_2_85_0__29200;srepoch=1619681695;srpvid=51c8354f03be0097;type=total;ucfs=1&")     

Table Headers

url1 %>% 
  html_nodes(".hprt-table-header") %>% 
  html_text() %>% 
  strsplit("\n") %>% 
  unlist() %>% 
  .[. != ""] %>% 
  gsub("\n", "", .) %>% 
  .[-5]
#> [1] "Accommodation Type" "Sleeps"             "Today's price"     
#> [4] "Your choices"       "Quantity"

Room Type

url1 %>% 
  html_nodes(".hprt-roomtype-icon-link") %>% 
  html_text() %>% 
  strsplit("\n") %>% 
  unlist() %>% 
  .[. != ""]
#> [1] "Standard Queen Room" "Superior Queen Room" "Deluxe Family Room" 
#> [4] "Triple Room"

Price

url1 %>% 
  html_nodes(".bui-price-display__value") %>% 
  html_text() %>% 
  strsplit("\n") %>% 
  unlist() %>% 
  .[. != ""] %>% 
  gsub("\n", "", .) 
#> [1] "US$325" "US$241" "US$354" "US$270" "US$532" "US$447" "US$398" "US$313"
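The vectors above have different lengths (4 room types, 8 prices), which is the "more than one value per room" problem mentioned earlier. One possible way around it, sketched below under stated assumptions, is to walk the table row by row and use html_element() (singular), which returns one result per row and NA where nothing matches. The inline HTML is a hypothetical, simplified layout in which the room name appears only on the first row of each room; the real page may be structured differently.

```r
library(rvest)

# Hypothetical, simplified markup: the room name is present only on the first
# row of each room, and later rows of the same room carry just another price.
page <- minimal_html('
  <table id="hprt-table"><tbody>
    <tr><td><span class="hprt-roomtype-icon-link">Standard Queen Room</span></td>
        <td><div class="bui-price-display__value">US$325</div></td></tr>
    <tr><td><div class="bui-price-display__value">US$241</div></td></tr>
    <tr><td><span class="hprt-roomtype-icon-link">Superior Queen Room</span></td>
        <td><div class="bui-price-display__value">US$354</div></td></tr>
    <tr><td><div class="bui-price-display__value">US$270</div></td></tr>
  </tbody></table>')

rows <- page %>% html_elements("#hprt-table tbody tr")

# One value per row, NA where the row has no match, so both vectors
# have the same length as `rows`.
room  <- rows %>% html_element(".hprt-roomtype-icon-link") %>% html_text2()
price <- rows %>% html_element(".bui-price-display__value") %>% html_text2()

# Carry the last seen room name down into the rows where it is missing.
# This assumes the very first row always has a room name.
seen <- cumsum(!is.na(room))
room <- room[!is.na(room)][seen]

rooms <- data.frame(room = room, price = price, stringsAsFactors = FALSE)
print(rooms)
```

The fill-down step is plain base R; tidyr::fill() would be an alternative if tidyr is already in use.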

Note that before scraping large amounts of data from a website, you should confirm that you are not putting yourself in legal jeopardy.


0

Here is a solution that retrieves the table of prices and then performs some data cleaning.

It still requires some additional clean-up, but the majority is done.

library(rvest)
library(dplyr)
library(stringr)

url1 <- read_html("https://www.booking.com/hotel/mu/tamassa.html?aid=356980;label=gog235jc-1DCAsonQFCE2hlcml0YWdlLWF3YWxpLWdvbGZIM1gDaJ0BiAEBmAExuAEXyAEM2AED6AEB-AECiAIBqAIDuAKiwqmEBsACAdICJGFkMTQ3OGU4LTUwZDMtNGQ5ZS1hYzAxLTc0OTIyYTRiZDIxM9gCBOACAQ;sid=729aafddc363c28a2c2c7379d7685d87;all_sr_blocks=36363601_246990918_2_85_0;checkin=2021-09-04;checkout=2021-09-05;dest_id=-1354779;dest_type=city;dist=0;from_beach_key_ufi_sr=1;group_adults=2;group_children=0;hapos=1;highlighted_blocks=36363601_246990918_2_85_0;hp_group_set=0;hpos=1;no_rooms=1;room1=A%2CA;sb_price_type=total;sr_order=popularity;sr_pri_blocks=36363601_246990918_2_85_0__29200;srepoch=1619681695;srpvid=51c8354f03be0097;type=total;ucfs=1&") 

output <- url1 %>% 
   html_nodes(xpath = './/table[@id="hprt-table"]')  %>% 
   html_table() %>% .[[1]]

    
# Fix column name
colnames(output)[5] <- "Quantity"

# Clean up columns
# Keep only the first line in the two columns that repeat information
output2 <- output %>%
  mutate_at(c("Accommodation Type", "Today's price"), ~ str_extract(., ".*\n"))
# Collapse repeated whitespace and newlines
answer <- output2 %>% mutate_all(str_squish)

answer
# A tibble: 8 x 5
  `Accommodation Ty… Sleeps           `Today's price` `Your choices`                                                                   Quantity                                                 
  <chr>              <chr>            <chr>           <chr>                                                                            <chr>                                                    
1 Triple Room        Max persons: 3   US$398          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$398) 2 (US$795) 3 (US$1,193) 4 (US$…
2 Triple Room        Max persons: 1 … US$313          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$313) 2 (US$626) 3 (US$939) 4 (US$1,…
3 Standard Queen Ro… Max persons: 2   US$325          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$325) 2 (US$650) 3 (US$976) 4 (US$1,…
4 Standard Queen Ro… Max persons: 1 … US$241          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$241) 2 (US$481) 3 (US$722) 4 (US$96…
5 Superior Queen Ro… Max persons: 2   US$354          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$354) 2 (US$708) 3 (US$1,063) 4 (US$…
6 Superior Queen Ro… Max persons: 1 … US$270          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$270) 2 (US$539) 3 (US$809) 4 (US$1,…
7 Deluxe Family Room Max persons: 2   US$532          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$532) 2 (US$1,064) 3 (US$1,596) 4 (U…
8 Deluxe Family Room Max persons: 1 … US$447          All-Inclusive FREE cancellation before 23:59 on 27 August 2021 More details on … Select rooms 0 1 (US$447) 2 (US$895) 3 (US$1,342) 4 (US$…
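As one example of the remaining clean-up, the price strings can be split into a currency label and a numeric amount. The sketch below uses base R only and a few sample strings in the formats shown in this thread. The vector and column names (prices, currency, amount, cleaned) are illustrative and are not part of the answer's code.

```r
# Sample price strings in the formats that appear above.
prices <- c("US$398", "US$1,193", "MUR 13,097")

# Drop everything except digits and the decimal point, then convert to numeric.
amount <- as.numeric(gsub("[^0-9.]", "", prices))

# Drop digits and separators to leave the currency label.
currency <- trimws(gsub("[0-9.,]", "", prices))

cleaned <- data.frame(currency = currency, amount = amount,
                      stringsAsFactors = FALSE)
print(cleaned)
```

This assumes a comma is always a thousands separator and a point a decimal separator; prices formatted for other locales would need different handling.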

5 Comments

I get the following error when running the code after "output": Error in tables(.) : could not find function "tables"
When I change it to "table", I get the following error message: Error in order(y) : unimplemented type 'list' in 'orderVector1'
Thanks for taking the time to look into my issue. However, I can't seem to find your update. I am still getting the same error.
@user3115933, yes, I needed to click save on the edit. Sorry, my mistake. The line with "tables" was not necessary; it was a leftover from an earlier draft.
Yes. Thanks a lot!
