I'm trying to extract the text written inside a comment in the following HTML snippet (this is only part of the page's code):

<table id="datalist1" cellspacing="0" border="0" style="border-width:1px;border-style:solid;width:100%;border-collapse:collapse;">
<tr>
    <td style="font-size:7pt;">
                                            <table width="100%" border="0" cellspacing="0" cellpadding="0">
                                                <tr align="left">
                                                    <td width="50%" class="subhead1">
                                                        <!-- <b>IE CODE : 0514026049</b> --> ' I want text inside this comment

                                                    </td>
                                                    <td rowspan="9" valign="top">
                                                        <span id="datalist1_ctl00_lbl_p"></span>
                                                    </td>
                                                </tr>

I am trying the following approach:

1) Get the XPath of the element.

2) Read the web page.

3) Go to the comment node.

4) Extract the text in the comment.

  library(rvest)
  library(xml2)

  url <- 'http://agriexchange.apeda.gov.in/ExportersDirectory/exporters_list.aspx?letter=Z'
  webpage <- read_html(url)

  # XPath of the comment element I want to grab:
  # //*[@id="datalist1"]/tbody/tr[1]/td/table/tbody/tr[1]/td[1]/comment()

  webpage %>%
    html_nodes(xpath = '//*[@id="datalist1"]/tbody/tr[1]/td/table/tbody/tr[1]/td[1]/comment()') %>%
    html_text()
  # character(0)  <- this is the output

But the above code returns an empty character vector (character(0)). Since I have never used XPath, I don't know whether this is even the correct way to go about it.

I'll have to run this for all comment elements. In short, my question is: how do I extract comments from HTML code?

  • Try removing tbody from the XPath (/table/tbody/tr[1] --> /table//tr[1]), as it can be added to the DOM by the browser. Commented Feb 8, 2018 at 14:01
  • ...and since you're now looking for an XPath solution, you might want to check my answer to your previous question again :) Commented Feb 8, 2018 at 14:06
  • Yes! When I checked the source code of the site, tbody wasn't there. I'll try it without tbody. Commented Feb 8, 2018 at 14:23
  • Do you just want all comments in an HTML document, or is there some specific rule for which ones you want? It's difficult to tell from your example. Commented Feb 8, 2018 at 16:28
  • I wanted all comments with <b> tags hidden in them. Commented Feb 9, 2018 at 9:17
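
The tbody point raised in these comments can be checked offline. The following is a minimal sketch, assuming an inline snippet modelled on the HTML in the question; the `html` string and the variable names are illustrative and not taken from the live page. It compares the path that includes tbody with the same path without it, using xml2 directly.

```r
library(xml2)

# Hypothetical snippet modelled on the question's HTML; like the raw page
# source, it contains no <tbody> elements.
html <- '<table id="datalist1"><tr><td>
  <table><tr align="left">
    <td class="subhead1"><!-- <b>IE CODE : 0514026049</b> --></td>
    <td><span id="datalist1_ctl00_lbl_p"></span></td>
  </tr></table>
</td></tr></table>'

doc <- read_html(html)

# Path with tbody, as a browser would report it: nothing matches,
# because the parsed source has no tbody nodes.
with_tbody <- xml_find_all(
  doc, '//*[@id="datalist1"]/tbody/tr[1]/td/table/tbody/tr[1]/td[1]/comment()'
)

# Same path with the tbody steps removed: the comment node is found.
without_tbody <- xml_find_all(
  doc, '//*[@id="datalist1"]/tr[1]/td/table/tr[1]/td[1]/comment()'
)

length(with_tbody)
trimws(xml_text(without_tbody))
```

If the live page behaves like this snippet, dropping tbody from the original XPath should be enough to reach the comment node.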

2 Answers

1

Maybe this can help you:

library(rvest)
library(magrittr)  # provides extract2(), which rvest does not re-export

webpage %>%
  html_nodes(xpath = '//*[@id="datalist1"]') %>%
  extract2(1) %>% html_nodes("tr") %>%
  extract2(1) %>% html_nodes("td") %>%
  extract2(2) %>% html_nodes(xpath = '//comment()') %>%
  extract2(15) %>% html_text()
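
A note on why a positional index such as 15 is needed there: an XPath starting with `//` is evaluated from the document root even when it is applied to a narrowed-down node, so it returns every comment in the document. A relative path (`.//comment()`) stays inside the current node. The sketch below illustrates the difference on a small hypothetical snippet; the `html` string and variable names are made up for illustration.

```r
library(xml2)

# Hypothetical two-block document, each block holding one comment
html <- '<div id="a"><!-- first --></div><div id="b"><!-- second --></div>'
doc <- read_html(html)

# Narrow down to the second block
b <- xml_find_first(doc, '//div[@id="b"]')

# Absolute path: ignores the context node and searches the whole document
all_comments <- xml_text(xml_find_all(b, '//comment()'))

# Relative path: only comments that are descendants of `b`
own_comments <- xml_text(xml_find_all(b, './/comment()'))

length(all_comments)
trimws(own_comments)
```

With the relative form, the selection no longer depends on how many comments appear earlier in the page.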

0
library(rvest)
library(tidyverse)
library(stringi)  # stri_replace_all_regex() is not attached by tidyverse

pg <- read_html("http://agriexchange.apeda.gov.in/ExportersDirectory/exporters_list.aspx?letter=Z")

html_nodes(pg, xpath=".//comment()[contains(., 'IE CODE')]/../../..") %>% # target the comment then back up to the table
  map_df(~{

    # extract the <td> (column 1)
    html_nodes(.x, xpath=".//td[1]") %>% 
      html_text(trim=TRUE) %>% 
      str_replace_all("[[:space:]]+", " ") -> tmp

    # add in the comment to the "missing" <td> value
    html_node(.x, xpath=".//comment()") %>% 
      html_text() %>% 
      stri_replace_all_regex("<b>|</b>", "") -> tmp[1]

    # set it up for data frame-ing
    set_names(as.list(tmp), sprintf("X%s", 1:8))

  })
## # A tibble: 196 x 8
##                        X1                      X2                                                                           X3
##                     <chr>                   <chr>                                                                        <chr>
##  1  IE CODE : 0514026049           Z A M PRODUCTS                                          54 DAROOD GRAN SHAHPEER GATE MEERUT
##  2  IE CODE : AQDPV0923E                Z CONNECT             H-302, AIRFORCE NAVAL, ATHIPALAYAM PIRIVU, GANAPATHY, COIMBATORE
##  3  IE CODE : 2912000459        Z K INTERNATIONAL                           MUGHALPURA IST NEAR ISMAIL BEG KI MASJID MORADABAD
##  4  IE CODE : 0307069753  Z K R INTERNATIONAL CO.            4084, PLAZA SHOPPING CENTRE,104/142, SHERIF DEVJI STREET, MUMBAI,
##  5  IE CODE : 3117507531          Z S ENTERPRISES  SURVEY NO 12,PLOT NO.64,FLAT NO 1, KAUSARBAUGH NIBM ROAD KONDHWA KHURD PUNE
##  6  IE CODE : 0500009503               Z. EXPORTS                                 T-283, NEAR GURUDWARA BHAIJI B AHATA KIDARA,
##  7  IE CODE : 0713030658        Z. K. MANGO MANDI                              APMC YARD, RMC CHANNAPATNA, RAMANAGARA DISTRICT
##  8  IE CODE : 0599037351             Z.A. CRAFTS,                      A-56, GALI NO. 6, CHOUHAN BANGER, NEW SEELAM PUR, DELHI
##  9  IE CODE : 0609001353        Z.B.INTERNATIONAL 1ST FLOOR,25TH MILE STONE,AGRA MATHURA ROAD,VILL CHUMURA, POST-FARAH MATHURA
## 10  IE CODE : 0501009256             Z.D. EXPORTS             J-51, EXTENSION, STREET NO. 12/3, RAMESH PARK, LAXMI NAGAR DELHI
## # ... with 186 more rows, and 5 more variables: X4 <chr>, X5 <chr>, X6 <chr>, X7 <chr>, X8 <chr>
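
For the narrower goal stated in the question's comments (all comments that hide a <b> tag), the same idea can be tried without a network connection. This is only a sketch under assumptions: the `html` string is a hypothetical three-row table, and the tag stripping uses base R `gsub()` rather than the stringi call above.

```r
library(xml2)

# Hypothetical table: two rows hide an IE code in a comment, one does not
html <- '<table>
  <tr><td><!-- <b>IE CODE : 0514026049</b> --></td><td>Z A M PRODUCTS</td></tr>
  <tr><td><!-- <b>IE CODE : 2912000459</b> --></td><td>Z K INTERNATIONAL</td></tr>
  <tr><td><!-- layout note --></td><td>ignored</td></tr>
</table>'
doc <- read_html(html)

# Select only comment nodes whose text contains an opening <b> tag
comments <- xml_find_all(doc, "//comment()[contains(., '<b>')]")

# Drop the <b>/</b> markup and the padding spaces around the comment text
ie_codes <- trimws(gsub("</?b>", "", xml_text(comments)))
ie_codes
```

The `contains()` predicate does the filtering inside the XPath, so unrelated comments never reach the R side.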
