Using R and regex to parse data from unstructured html using XML library

Question

My goal is to count how many news articles were published on each date, using an HTML file downloaded from a news service. Step 1: extract the dates from the HTML file. Step 2: calculate the frequency of articles on each date.

I am struggling to parse the data as the file seems relatively unstructured, although I am no XML expert. My process has been as follows:

library(XML)
test <- htmlParse('Xi.html')
rt <- xmlRoot(test)

table(names(rt))

This yields:

body head 
   1    1

table(unlist(xmlApply(rt,names)))

which returns:

a     b    br  font    hr  meta style table 
1     1     3     2     2     1     2   661 
 text title 
    3     1

So it seems the majority of the information is in the tables. However, these are not structured in a way that could be retrieved via readHTMLTable(): the data is presented on separate lines, but the columns are effectively concatenated with no separation of the text.

nodeset <- getNodeSet(test,"//table")
head(nodeset)

gives

[[1]]
<table border="0" cellpadding="0" cellspacing="0" width="100%">
  <tr bgcolor="#f1f1f1">
    <td align="left" height="36">
                <img src="http://XXXXX.gif"/></td>
  </tr>
</table> 

[[2]]
<table width="100%" style="table-layout:fixed;">
  <tr><td width="30px" valign="top"><font size="2">1. </font></td>
<td><font size="3">港人喜見黃金馬車 馳向中英關係黃金時代</font> 

<font size="2" face="Arial">[Ta Kung Pao] 2015-10-27    B21 通識新世代   中英社評    </font> </td>
</tr>
  <tr><td colspan="2">
<table width="100%"/></td>
</tr>
</table> 

[[3]]
<table width="100%"/> 
[[4]]
<table width="100%" style="table-layout:fixed;">
  <tr><td width="30px" valign="top"><font size="2">2. </font></td>
<td><font size="3">High-level exchanges between China and ROK</font> 

<font size="2" face="Arial">[China Daily] 2015-10-27        Asia-Pacific        </font> </td>
</tr>
  <tr><td colspan="2">
<table width="100%"/></td>
</tr>
</table> 

[[5]]
<table width="100%"/>

So rather than trying to extract data by somehow creating a dataframe, I think my only option would be to use a regex to extract the dates from the whole text. I thought a first step to doing this might be to perform a string split on the list after the "]" where all dates in the file are located, so I tried:

b <- unlist(strsplit(test,"]"))

But this returns the error:

Error in strsplit(test, "]") : non-character argument

I'm appreciative of any help to put me on the right track.

All dates are in the below format:

2015-10-27

Olaf Dietsche · Accepted Answer · 2015-10-27 08:10:47Z


I know R only a little bit, but strsplit expects a character vector (a string). You give it test, which is the result from htmlParse: a parsed document object, some sort of tree, not text. That is why R reports "non-character argument".
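One way to get from the parsed tree to plain text is to ask the XML package for the text content of the nodes that carry the dates. The following is only a sketch: the inline html string is a hypothetical stand-in for 'Xi.html', shaped like the tables shown in the question, and the XPath //font[@face='Arial'] is an assumption based on that output rather than something checked against the real file.

```r
library(XML)

# Hypothetical stand-in for 'Xi.html', modelled on the tables in the question
html <- '<html><body>
<table><tr><td><font size="2">1. </font></td>
<td><font size="3">Headline one</font>
<font size="2" face="Arial">[Ta Kung Pao] 2015-10-27    B21    </font></td></tr></table>
<table><tr><td><font size="2">2. </font></td>
<td><font size="3">High-level exchanges between China and ROK</font>
<font size="2" face="Arial">[China Daily] 2015-10-27        Asia-Pacific        </font></td></tr></table>
</body></html>'

# asText = TRUE tells htmlParse the argument is HTML content, not a file name
doc <- htmlParse(html, asText = TRUE)

# xmlValue() returns a node's text content as a string;
# xpathSApply() applies it to every node matched by the XPath expression
meta <- xpathSApply(doc, "//font[@face='Arial']", xmlValue)

# meta is now a character vector, so strsplit, grep and friends accept it
print(is.character(meta))
```

With the real file, htmlParse('Xi.html') as in the question would replace the asText call.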

R's regular expressions are either extended or Perl-like. To match all dates no matter what, you can use

\d\d\d\d-\d\d-\d\d

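Looking through strsplit's manual, it seems to be the wrong tool for extracting the dates. You should rather look into grep. Note that backslashes must be doubled inside an R string literal, and that htmltext here stands for a character vector holding the page text (for example, one element per line). Something like

dates <- grep("\\d\\d\\d\\d-\\d\\d-\\d\\d", htmltext, value = TRUE)

might work. With value = TRUE, grep returns the elements of htmltext that contain a date, not the bare dates; to pull out only the date substrings, regmatches together with gregexpr can be used.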
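To get the bare dates and then the per-date counts from the question's Step 2, something along these lines could work in base R. This is a minimal sketch: htmltext is filled with hypothetical sample lines (the "[Example Source]" entry is invented for illustration), and the pattern assumes every date has the 2015-10-27 shape described in the question.

```r
# Hypothetical sample text standing in for the page content
htmltext <- c(
  "[Ta Kung Pao] 2015-10-27    B21",
  "[China Daily] 2015-10-27        Asia-Pacific",
  "[Example Source] 2015-10-26    A1",   # invented entry, for illustration only
  "no date on this line"
)

# Backslashes are doubled inside an R string literal
date_pattern <- "\\d{4}-\\d{2}-\\d{2}"

# Step 1: gregexpr() locates every match, regmatches() extracts the matched substrings
dates <- unlist(regmatches(htmltext, gregexpr(date_pattern, htmltext)))

# Step 2: count how many articles fall on each date
date_counts <- table(dates)
print(date_counts)
```

Unlike grep with value = TRUE, this returns only the matched text, and elements without a date simply contribute nothing.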

edited Oct 27, 2015 at 8:10

answered Oct 27, 2015 at 7:52
