I have been trying to scrape information from a URL in R using the rvest package:

url <-'https://eprocure.gov.in/cppp/tendersfullview/id%3DNDE4MTY4MA%3D%3D/ZmVhYzk5NWViMWM1NTdmZGMxYWYzN2JkYTU1YmQ5NzU%3D/MTUwMjk3MTg4NQ%3D%3D'

but I am not able to correctly identify the XPath, even after using the selector plugin.

The code I am using to fetch the first table is as follows:

detail_data <- read_html(url)
detail_data_raw <- html_nodes(detail_data,
                              xpath = '//*[@id="edit-t-fullview"]/table[2]/tbody/tr[2]/td/table')
detail_data_fine <- html_table(detail_data_raw)

When I run the above code, detail_data_raw comes back as {xml_nodeset (0)}, and consequently detail_data_fine is an empty list().

The information I am interested in scraping is under the headers:

Organisation Details

Tender Details

Critical Dates

Work Details

Tender Inviting Authority Details

Any help or ideas on what is going wrong and how to rectify it are welcome.

  • Check your URL first. It seems like you have a bad URL. Commented Aug 18, 2017 at 5:04
  • The URL works just fine when I paste it into my browser. Is there something more I need to do to check this? Commented Aug 18, 2017 at 5:06
  • I would recommend using RSelenium, as this site returns dynamic HTML markup. The markup you parse with your function and the markup you get when you click the link are not the same. RSelenium should overcome this issue, as it simulates user behaviour. Commented Aug 18, 2017 at 5:33

2 Answers


Your example URL isn't working for anyone, but if you're looking to get the data for a particular tender, then:

library(rvest)
library(stringi)
library(tidyverse)

pg <- read_html("https://eprocure.gov.in/mmp/tendersfullview/id%3D2262207")

html_nodes(pg, xpath=".//table[@class='viewtablebg']/tr/td[1]") %>% 
  html_text(trim=TRUE) %>% 
  stri_replace_last_regex("\\s+:$", "") %>% 
  stri_replace_all_fixed(" ", "_") %>% 
  stri_trans_tolower() -> tenders_cols

html_nodes(pg, xpath=".//table[@class='viewtablebg']/tr/td[2]") %>% 
  html_text(trim=TRUE) %>% 
  as.list() %>% 
  set_names(tenders_cols) %>% 
  flatten_df() %>% 
  glimpse()
## Observations: 1
## Variables: 15
## $ organisation_name            <chr> "Delhi Jal Board"
## $ organisation_type            <chr> "State Govt. and UT"
## $ tender_reference_number      <chr> "Short NIT. No.20 (Item no.1) EE ...
## $ tender_title                 <chr> "Short NIT. No.20 (Item no.1)"
## $ product_category             <chr> "Civil Works"
## $ tender_fee                   <chr> "Rs.500"
## $ tender_type                  <chr> "Open/Advertised"
## $ epublished_date              <chr> "18-Aug-2017 05:15 PM"
## $ document_download_start_date <chr> "18-Aug-2017 05:15 PM"
## $ bid_submission_start_date    <chr> "18-Aug-2017 05:15 PM"
## $ work_description             <chr> "Replacement of settled deep sewe...
## $ pre_qualification            <chr> "Please refer Tender documents."
## $ tender_document              <chr> "https://govtprocurement.delhi.go...
## $ name                         <chr> "EXECUTIVE ENGINEER (NORTH)-II"
## $ address                      <chr> "EXECUTIVE ENGINEER (NORTH)-II\r\...

Seems to work just fine without installing Python and using Selenium.
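If you need to pull several tenders, the same extraction can go into a small helper. This is only a sketch: read_tender() is a hypothetical name, and the id%3D<number> URL pattern is assumed from the example URL above.

# Hypothetical convenience wrapper around the extraction above; the
# "id%3D<number>" URL pattern is assumed from the example URL.
read_tender <- function(tender_id) {

  pg <- read_html(sprintf("https://eprocure.gov.in/mmp/tendersfullview/id%%3D%s", tender_id))

  # left-hand labels, cleaned into snake_case column names
  html_nodes(pg, xpath=".//table[@class='viewtablebg']/tr/td[1]") %>% 
    html_text(trim=TRUE) %>% 
    stri_replace_last_regex("\\s+:$", "") %>% 
    stri_replace_all_fixed(" ", "_") %>% 
    stri_trans_tolower() -> tender_cols

  # left-hand values, named by the labels above
  html_nodes(pg, xpath=".//table[@class='viewtablebg']/tr/td[2]") %>% 
    html_text(trim=TRUE) %>% 
    as.list() %>% 
    set_names(tender_cols) %>% 
    flatten_df()

}

read_tender("2262207")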


3 Comments

The links still seem to be working fine for me... But your solution does work, assuming the data on the Central e-procurement portal is the same as on the mission mode projects portal.
We are still missing the columns on the right-hand side of the individual tables, like Product Sub-Category, EMD etc. Any suggestions?
Likely. I provided an example answer that should be easy to extend upon (see the sketch below these comments). It's likely your link is working for you due to cookies/sessions/etc.
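For the right-hand columns, here is a hedged sketch of one way to extend the extraction above. It assumes the second label/value pair sits in td[3]/td[4] of the same viewtablebg rows, which should be verified against the live markup.

# Assumption: rows carrying a second label/value pair hold it in td[3]/td[4];
# filtering on td[4] keeps the label and value vectors the same length.
rows_rhs <- html_nodes(pg, xpath=".//table[@class='viewtablebg']/tr[td[4]]")

html_nodes(rows_rhs, xpath="./td[3]") %>% 
  html_text(trim=TRUE) %>% 
  stri_replace_last_regex("\\s+:$", "") %>% 
  stri_replace_all_fixed(" ", "_") %>% 
  stri_trans_tolower() -> rhs_cols

html_nodes(rows_rhs, xpath="./td[4]") %>% 
  html_text(trim=TRUE) %>% 
  as.list() %>% 
  set_names(rhs_cols) %>% 
  flatten_df()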

Have a look at 'dynamic web scraping'. Typically, when you enter the URL in your browser, it sends a GET request to the host server. The host server builds an HTML page with all the data in it and sends it back to you. On dynamic pages, the server just sends you an HTML template which, once opened, runs JavaScript in your browser, and that JavaScript then retrieves the data that populates the template.

I would recommend scraping this page using Python and the Selenium library. Selenium gives your program the ability to wait until the JavaScript has run in your browser and retrieved the data. See below a query I had on the same concept, and a very helpful reply:

BeautifulSoup parser can't access html elements
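If you want the browser-automation route but prefer to stay in R, RSelenium covers the same ground as Python's Selenium bindings. A minimal sketch, assuming rsDriver() can start a compatible local browser driver (that setup step is often the fiddly part):

library(RSelenium)
library(rvest)

# Start a browser session; assumes a compatible browser/driver is installed.
drv <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- drv$client

remDr$navigate("https://eprocure.gov.in/mmp/tendersfullview/id%3D2262207")
Sys.sleep(5)  # crude wait so any JavaScript has time to render the page

# Hand the rendered markup back to rvest and reuse the extraction from the other answer
pg <- read_html(remDr$getPageSource()[[1]])

remDr$close()
drv$server$stop()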

3 Comments

I would recommend not suggesting another programming language when the OP posted this under the R tag and when R has equal or better scraping & HTML processing capability than the language you suggested.
Is there any alternative to Selenium? I am facing problems starting the Selenium server. I tried both via R and the Windows option provided in the vignette.
I had a look and don't see any. Perhaps use this tutorial - r-bloggers.com/scraping-with-selenium - as a template, then tweak it to your purposes.
