I have been trying to scrape information from a URL in R using the rvest package:

url <-'https://eprocure.gov.in/cppp/tendersfullview/id%3DNDE4MTY4MA%3D%3D/ZmVhYzk5NWViMWM1NTdmZGMxYWYzN2JkYTU1YmQ5NzU%3D/MTUwMjk3MTg4NQ%3D%3D'

but I am not able to correctly identify the XPath, even after using the selector plugin.

The code I am using to fetch the first table is as follows:

detail_data <- read_html(url)
detail_data_raw <- html_nodes(detail_data,
                              xpath = '//*[@id="edit-t-fullview"]/table[2]/tbody/tr[2]/td/table')
detail_data_fine <- html_table(detail_data_raw)

When I run the above code, detail_data_raw comes back as {xml_nodeset (0)}, and consequently detail_data_fine is an empty list().

The information I am interested in scraping is under the headers:

Organisation Details

Tender Details

Critical Dates

Work Details

Tender Inviting Authority Details

Any help or ideas on what is going wrong and how to rectify it are welcome.

  • Check your URL first. It seems like you have a bad URL. Commented Aug 18, 2017 at 5:04
  • The URL works just fine when I paste it into my browser. Is there something more I need to do to check this? Commented Aug 18, 2017 at 5:06
  • I would recommend using RSelenium, as this site returns dynamic HTML markup. The markup you parse with your function and the markup you get when you click the link are not the same. RSelenium should overcome this issue, as it simulates user behaviour. Commented Aug 18, 2017 at 5:33

2 Answers


Your example URL isn't working for anyone, but if you're looking to get the data for a particular tender, then:

library(rvest)
library(stringi)
library(tidyverse)

pg <- read_html("https://eprocure.gov.in/mmp/tendersfullview/id%3D2262207")

html_nodes(pg, xpath=".//table[@class='viewtablebg']/tr/td[1]") %>% 
  html_text(trim=TRUE) %>% 
  stri_replace_last_regex("\\s+:$", "") %>% 
  stri_replace_all_fixed(" ", "_") %>% 
  stri_trans_tolower() -> tenders_cols

html_nodes(pg, xpath=".//table[@class='viewtablebg']/tr/td[2]") %>% 
  html_text(trim=TRUE) %>% 
  as.list() %>% 
  set_names(tenders_cols) %>% 
  flatten_df() %>% 
  glimpse()
## Observations: 1
## Variables: 15
## $ organisation_name            <chr> "Delhi Jal Board"
## $ organisation_type            <chr> "State Govt. and UT"
## $ tender_reference_number      <chr> "Short NIT. No.20 (Item no.1) EE ...
## $ tender_title                 <chr> "Short NIT. No.20 (Item no.1)"
## $ product_category             <chr> "Civil Works"
## $ tender_fee                   <chr> "Rs.500"
## $ tender_type                  <chr> "Open/Advertised"
## $ epublished_date              <chr> "18-Aug-2017 05:15 PM"
## $ document_download_start_date <chr> "18-Aug-2017 05:15 PM"
## $ bid_submission_start_date    <chr> "18-Aug-2017 05:15 PM"
## $ work_description             <chr> "Replacement of settled deep sewe...
## $ pre_qualification            <chr> "Please refer Tender documents."
## $ tender_document              <chr> "https://govtprocurement.delhi.go...
## $ name                         <chr> "EXECUTIVE ENGINEER (NORTH)-II"
## $ address                      <chr> "EXECUTIVE ENGINEER (NORTH)-II\r\...

Seems to work just fine without installing Python and using Selenium.
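If you need to pull several tenders, the same extraction can go into a small helper. This is only a sketch: read_tender() is a hypothetical name, and the id%3D<number> URL pattern is assumed from the example URL above.

# Hypothetical convenience wrapper around the extraction above; the
# "id%3D<number>" URL pattern is assumed from the example URL.
read_tender <- function(tender_id) {

  pg <- read_html(sprintf("https://eprocure.gov.in/mmp/tendersfullview/id%%3D%s", tender_id))

  # left-hand labels, cleaned into snake_case column names
  html_nodes(pg, xpath=".//table[@class='viewtablebg']/tr/td[1]") %>% 
    html_text(trim=TRUE) %>% 
    stri_replace_last_regex("\\s+:$", "") %>% 
    stri_replace_all_fixed(" ", "_") %>% 
    stri_trans_tolower() -> tender_cols

  # left-hand values, named by the labels above
  html_nodes(pg, xpath=".//table[@class='viewtablebg']/tr/td[2]") %>% 
    html_text(trim=TRUE) %>% 
    as.list() %>% 
    set_names(tender_cols) %>% 
    flatten_df()

}

read_tender("2262207")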


3 Comments

The links still seem to be working fine for me... But your solution does work, assuming the data on the Central e-procurement portal is the same as on the mission mode projects portal.
We are still missing the columns on the right-hand side of the individual tables, like Product Sub-Category, EMD etc. Any suggestions?
Likely. I provided an example answer that should be easy to extend upon (see the sketch below these comments). It's likely your link is working for you due to cookies/sessions/etc.
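For the right-hand columns, here is a hedged sketch of one way to extend the extraction above. It assumes the second label/value pair sits in td[3]/td[4] of the same viewtablebg rows, which should be verified against the live markup.

# Assumption: rows carrying a second label/value pair hold it in td[3]/td[4];
# filtering on td[4] keeps the label and value vectors the same length.
rows_rhs <- html_nodes(pg, xpath=".//table[@class='viewtablebg']/tr[td[4]]")

html_nodes(rows_rhs, xpath="./td[3]") %>% 
  html_text(trim=TRUE) %>% 
  stri_replace_last_regex("\\s+:$", "") %>% 
  stri_replace_all_fixed(" ", "_") %>% 
  stri_trans_tolower() -> rhs_cols

html_nodes(rows_rhs, xpath="./td[4]") %>% 
  html_text(trim=TRUE) %>% 
  as.list() %>% 
  set_names(rhs_cols) %>% 
  flatten_df()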

Have a look at 'dynamic web scraping'. Typically, when you enter the URL in your browser, it sends a GET request to the host server. The host server builds an HTML page with all the data in it and sends it back to you. On dynamic pages, the server just sends you an HTML template which, once opened, runs JavaScript in your browser, and that JavaScript then retrieves the data that populates the template.

I would recommend scraping this page using Python and the Selenium library. Selenium gives your program the ability to wait until the JavaScript has run in your browser and retrieved the data. See below a query I had on the same concept, and a very helpful reply:

BeautifulSoup parser can't access html elements
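If you want the browser-automation route but prefer to stay in R, RSelenium covers the same ground as Python's Selenium bindings. A minimal sketch, assuming rsDriver() can start a compatible local browser driver (that setup step is often the fiddly part):

library(RSelenium)
library(rvest)

# Start a browser session; assumes a compatible browser/driver is installed.
drv <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- drv$client

remDr$navigate("https://eprocure.gov.in/mmp/tendersfullview/id%3D2262207")
Sys.sleep(5)  # crude wait so any JavaScript has time to render the page

# Hand the rendered markup back to rvest and reuse the extraction from the other answer
pg <- read_html(remDr$getPageSource()[[1]])

remDr$close()
drv$server$stop()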

3 Comments

I would recommend not suggesting another programming language when the OP posted this under the R tag and when R has equal or better scraping & HTML processing capability than the language you suggested.
Is there any alternative to Selenium? I am facing problems starting the Selenium server. I tried both via R and the Windows option provided in the vignette.
I had a look and don't see any. Perhaps use this tutorial - r-bloggers.com/scraping-with-selenium - as a template, then tweak it to your purposes.
