My goal is to count how many news articles were published on each date, using an HTML file downloaded from a news service. Step 1: extract the dates from the HTML file. Step 2: calculate the number of articles for each date.
I am struggling to parse the data as the file seems relatively unstructured, although I am no XML expert. My process has been as follows:
library(XML)
test <- htmlParse('Xi.html')
rt <- xmlRoot(test)
table(names(rt))
This yields:
body head
   1    1
table(unlist(xmlApply(rt,names)))
which returns:
    a     b    br  font    hr  meta style table
    1     1     3     2     2     1     2   661
 text title
    3     1
So it seems most of the information is in the tables. However, they are not structured in a way that readHTMLTable() could retrieve: each article sits in its own table, BUT within it the fields are effectively concatenated into a single run of text with no separator between them.
nodeset <- getNodeSet(test,"//table")
head(nodeset)
gives
[[1]]
<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr bgcolor="#f1f1f1">
<td align="left" height="36">
<img src="http://XXXXX.gif"/></td>
</tr>
</table>
[[2]]
<table width="100%" style="table-layout:fixed;">
<tr><td width="30px" valign="top"><font size="2">1. </font></td>
<td><font size="3">港人喜見黃金馬車 馳向中英關係黃金時代</font>
<font size="2" face="Arial">[Ta Kung Pao] 2015-10-27 B21 通識新世代 中英社評 </font> </td>
</tr>
<tr><td colspan="2">
<table width="100%"/></td>
</tr>
</table>
[[3]]
<table width="100%"/>
[[4]]
<table width="100%" style="table-layout:fixed;">
<tr><td width="30px" valign="top"><font size="2">2. </font></td>
<td><font size="3">High-level exchanges between China and ROK</font>
<font size="2" face="Arial">[China Daily] 2015-10-27 Asia-Pacific </font> </td>
</tr>
<tr><td colspan="2">
<table width="100%"/></td>
</tr>
</table>
[[5]]
<table width="100%"/>
So rather than trying to build a data frame from the tables, I think my only option is to use a regex to extract the dates from the whole text. Every date in the file directly follows a "]", so I thought a first step might be to split the text at that character. I tried:
b <- unlist(strsplit(test,"]"))
But this returns the error:
Error in strsplit(test, "]") : non-character argument
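As far as I can tell, the error arises because test is a parsed document object rather than a character vector, and strsplit() only accepts character input. A minimal sketch of one way around this, assuming the source/date line always sits in a font element with face="Arial" as in the output above, is to pull the text out with xmlValue() first. The inline sample_html string and the names doc, meta, parts and after_bracket are hypothetical stand-ins so the snippet runs without Xi.html:

```r
library(XML)

# Hypothetical inline sample mirroring the table structure shown above;
# with the real file this would be htmlParse('Xi.html').
sample_html <- '<html><body>
<table><tr><td><font size="3">Title one</font>
<font size="2" face="Arial">[Ta Kung Pao] 2015-10-27 B21 </font></td></tr></table>
<table><tr><td><font size="3">Title two</font>
<font size="2" face="Arial">[China Daily] 2015-10-27 Asia-Pacific </font></td></tr></table>
</body></html>'
doc <- htmlParse(sample_html, asText = TRUE)

# xmlValue() returns the text content of each matched node as a character string
# (assumption: the date line is always in <font face="Arial">)
meta <- xpathSApply(doc, "//font[@face='Arial']", xmlValue)

# strsplit() works here because meta is a character vector
parts <- strsplit(meta, "]", fixed = TRUE)

# Keep the piece after the bracket and trim surrounding whitespace
after_bracket <- trimws(sapply(parts, `[`, 2))
```

With the text in a character vector, either strsplit() or a regex can be applied to it directly.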
I'm appreciative of any help to put me on the right track.
All dates are in the below format:
2015-10-27
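Given that fixed YYYY-MM-DD format, both steps could be sketched in base R roughly as follows. This is only an illustration: meta is a hypothetical character vector standing in for text already extracted from the file, and date_pattern, dates and freq are names chosen for the example.

```r
# Hypothetical character vector standing in for the extracted text lines
meta <- c("[Ta Kung Pao] 2015-10-27 B21",
          "[China Daily] 2015-10-27 Asia-Pacific",
          "[China Daily] 2015-10-26 Business")

# Four digits, dash, two digits, dash, two digits
date_pattern <- "[0-9]{4}-[0-9]{2}-[0-9]{2}"

# Step 1: gregexpr() locates every match in each element,
# regmatches() extracts them, unlist() flattens to one vector
dates <- unlist(regmatches(meta, gregexpr(date_pattern, meta)))

# Step 2: count the number of articles per date
freq <- table(as.Date(dates))
```

The pattern only checks the shape of the string, so as.Date() would fail on an impossible date such as 2015-13-45; that seems acceptable here if all dates in the file are valid.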