Extracting comments from HTML code using XPath

Question

I'm trying to extract the text written inside a comment in the following HTML snippet (this is only part of the page's code):

<table id="datalist1" cellspacing="0" border="0" style="border-width:1px;border-style:solid;width:100%;border-collapse:collapse;">
<tr>
    <td style="font-size:7pt;">
                                            <table width="100%" border="0" cellspacing="0" cellpadding="0">
                                                <tr align="left">
                                                    <td width="50%" class="subhead1">
                                                        <!-- <b>IE CODE : 0514026049</b> --> ' I want text inside this comment

                                                    </td>
                                                    <td rowspan="9" valign="top">
                                                        <span id="datalist1_ctl00_lbl_p"></span>
                                                    </td>
                                                </tr>

I am trying the following approach:

1) Get the XPath of the element.

2) Read the web page.

3) Go to the comment node.

4) Extract the text in the comment.

  library(rvest)
  library(xml2)

  url <- 'http://agriexchange.apeda.gov.in/ExportersDirectory/exporters_list.aspx?letter=Z'
  webpage <- read_html(url)

  # XPath of the comment element I want to grab:
  # //*[@id="datalist1"]/tbody/tr[1]/td/table/tbody/tr[1]/td[1]/comment()

  webpage %>%
      html_nodes(xpath = '//*[@id="datalist1"]/tbody/tr[1]/td/table/tbody/tr[1]/td[1]/comment()') %>%
      html_text()
  # character(0)   <- this is the output

But the above code returns an empty character vector (character(0)). Since I have never used XPath, I don't know whether this is even the correct way to go about it.

I'll have to run this for all comment elements. In short, my question is: how do I extract comments from HTML code?

Try removing tbody from the XPath (/table/tbody/tr[1] --> /table//tr[1]), as it can be added to the DOM by the browser.
– Andersson, Commented Feb 8, 2018 at 14:01
...and as you're now looking for an XPath solution, you might want to check my answer to your previous question again :)
– Andersson, Commented Feb 8, 2018 at 14:06
Yes! When I checked the source code of the site, tbody wasn't there. I'll try it without tbody.
– Digvijay, Commented Feb 8, 2018 at 14:23
Do you just want all comments in an HTML document, or is there some specific rule for which ones you want? It's difficult to tell from your example.
– hrbrmstr, Commented Feb 8, 2018 at 16:28
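
The tbody point raised in the comments can be checked offline. The sketch below is only an illustration: the `html` string is a cut-down, hypothetical stand-in for the page source (not the real page), and the variable names are made up for this example. It assumes the xml2 parser leaves the markup as written and does not insert tbody the way a browser does.

```r
library(xml2)

# Cut-down, hypothetical stand-in for the page source (note: no <tbody>)
html <- '<table id="datalist1">
  <tr><td>
    <table>
      <tr><td class="subhead1"><!-- <b>IE CODE : 0514026049</b> --></td></tr>
    </table>
  </td></tr>
</table>'
doc <- read_html(html)

# Browser-style path that includes tbody: matches nothing in the parsed source
with_tbody <- xml_find_all(
  doc, '//*[@id="datalist1"]/tbody/tr[1]/td/table/tbody/tr[1]/td[1]/comment()'
)

# The same path with the tbody steps removed
without_tbody <- xml_find_all(
  doc, '//*[@id="datalist1"]/tr[1]/td/table/tr[1]/td[1]/comment()'
)

length(with_tbody)                 # 0
trimws(xml_text(without_tbody))    # "<b>IE CODE : 0514026049</b>"
```

On the real page the nesting may differ from this stand-in, so the tbody-free path should be checked against the actual source rather than copied as-is.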

MrSmithGoesToWashington · Accepted Answer · 2018-02-08 14:03:05Z

Maybe this can help:

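library(magrittr)  # extract2() comes from magrittr

# Note: '//comment()' is an absolute path, so it searches the whole document
# regardless of the node it is called on; 15 is presumably the position of the
# wanted comment among all comments on the page.
webpage %>%
  html_nodes(xpath = '//*[@id="datalist1"]') %>%
  extract2(1) %>% html_nodes("tr") %>%
  extract2(1) %>% html_nodes("td") %>%
  extract2(2) %>% html_nodes(xpath = '//comment()') %>% extract2(15) %>% html_text()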
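
One thing worth knowing when chaining node selections like this: an XPath that starts with `//` is evaluated from the document root even when it is applied to a sub-node, whereas `.//` stays inside the current node. A minimal sketch on a made-up two-div document (the ids `a` and `b` are hypothetical, chosen only for this illustration):

```r
library(xml2)

# Hypothetical toy document with one comment in each <div>
doc <- read_html('<div id="a"><!-- first --></div><div id="b"><!-- second --></div>')
b <- xml_find_first(doc, '//div[@id="b"]')

# Absolute path: searches the whole document, ignoring that we started at <div id="b">
absolute <- trimws(xml_text(xml_find_all(b, "//comment()")))

# Relative path: searches only the descendants of <div id="b">
relative <- trimws(xml_text(xml_find_all(b, ".//comment()")))

absolute  # "first" "second"
relative  # "second"
```

So with `.//comment()` the index needed after narrowing down to a cell would count only the comments inside that cell, not every comment on the page.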

hrbrmstr · Accepted Answer · 2018-02-08 16:42:16Z

library(rvest)
library(tidyverse)
library(stringi)  # for stri_replace_all_regex(), which tidyverse does not attach

pg <- read_html("http://agriexchange.apeda.gov.in/ExportersDirectory/exporters_list.aspx?letter=Z")

html_nodes(pg, xpath=".//comment()[contains(., 'IE CODE')]/../../..") %>% # target the comment then back up to the table
  map_df(~{

    # extract the <td> (column 1)
    html_nodes(.x, xpath=".//td[1]") %>% 
      html_text(trim=TRUE) %>% 
      str_replace_all("[[:space:]]+", " ") -> tmp

    # add in the comment to the "missing" <td> value
    html_node(.x, xpath=".//comment()") %>% 
      html_text() %>% 
      stri_replace_all_regex("<b>|</b>", "") -> tmp[1]

    # set it up for data frame-ing
    set_names(as.list(tmp), sprintf("X%s", 1:8))

  })
## # A tibble: 196 x 8
##                        X1                      X2                                                                           X3
##                     <chr>                   <chr>                                                                        <chr>
##  1  IE CODE : 0514026049           Z A M PRODUCTS                                          54 DAROOD GRAN SHAHPEER GATE MEERUT
##  2  IE CODE : AQDPV0923E                Z CONNECT             H-302, AIRFORCE NAVAL, ATHIPALAYAM PIRIVU, GANAPATHY, COIMBATORE
##  3  IE CODE : 2912000459        Z K INTERNATIONAL                           MUGHALPURA IST NEAR ISMAIL BEG KI MASJID MORADABAD
##  4  IE CODE : 0307069753  Z K R INTERNATIONAL CO.            4084, PLAZA SHOPPING CENTRE,104/142, SHERIF DEVJI STREET, MUMBAI,
##  5  IE CODE : 3117507531          Z S ENTERPRISES  SURVEY NO 12,PLOT NO.64,FLAT NO 1, KAUSARBAUGH NIBM ROAD KONDHWA KHURD PUNE
##  6  IE CODE : 0500009503               Z. EXPORTS                                 T-283, NEAR GURUDWARA BHAIJI B AHATA KIDARA,
##  7  IE CODE : 0713030658        Z. K. MANGO MANDI                              APMC YARD, RMC CHANNAPATNA, RAMANAGARA DISTRICT
##  8  IE CODE : 0599037351             Z.A. CRAFTS,                      A-56, GALI NO. 6, CHOUHAN BANGER, NEW SEELAM PUR, DELHI
##  9  IE CODE : 0609001353        Z.B.INTERNATIONAL 1ST FLOOR,25TH MILE STONE,AGRA MATHURA ROAD,VILL CHUMURA, POST-FARAH MATHURA
## 10  IE CODE : 0501009256             Z.D. EXPORTS             J-51, EXTENSION, STREET NO. 12/3, RAMESH PARK, LAXMI NAGAR DELHI
## # ... with 186 more rows, and 5 more variables: X4 <chr>, X5 <chr>, X6 <chr>, X7 <chr>, X8 <chr>
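
If only the codes themselves are needed, the `contains()` filter from the answer above can be used on its own. The following is a rough sketch, not a tested scraper for the live site: the `html` string is a small hypothetical stand-in built from the first two rows of the output above, and it assumes every wanted comment has the form `IE CODE : <alphanumeric code>`.

```r
library(xml2)

# Hypothetical stand-in for the page: two wanted comments and one unrelated one
html <- '<table>
  <tr><td><!-- <b>IE CODE : 0514026049</b> --></td><td>Z A M PRODUCTS</td></tr>
  <tr><td><!-- <b>IE CODE : AQDPV0923E</b> --></td><td>Z CONNECT</td></tr>
  <tr><td><!-- unrelated note --></td><td>other</td></tr>
</table>'
doc <- read_html(html)

# Keep only the comments whose text mentions "IE CODE"
comments <- xml_find_all(doc, "//comment()[contains(., 'IE CODE')]")
raw <- xml_text(comments)

# Pull out the alphanumeric code that follows "IE CODE :"
codes <- sub(".*IE CODE[[:space:]]*:[[:space:]]*([A-Za-z0-9]+).*", "\\1", raw)

codes  # "0514026049" "AQDPV0923E"
```

Capturing the code with a regular expression avoids having to strip the `<b>` tags separately, at the cost of relying on the assumed comment format.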
