My goal is to count news articles per date from a downloaded HTML news service file. Step 1: extract the dates from the HTML file. Step 2: calculate the number of articles on each date.

I am struggling to parse the data as the file seems relatively unstructured, although I am no XML expert. My process has been as follows:

library(XML)
test <- htmlParse('Xi.html')
rt <- xmlRoot(test)

table(names(rt))

This yields:

body head 
   1    1

table(unlist(xmlApply(rt,names)))

which returns:

a     b    br  font    hr  meta style table 
1     1     3     2     2     1     2   661 
 text title 
    3     1 

So it seems the majority of the information is in the tables. However, these are not structured in a way that could be retrieved via readHTMLTable(): each entry is on its own line, but the columns are effectively concatenated with no separator in the text.

nodeset <- getNodeSet(test,"//table")
head(nodeset)

gives

[[1]]
<table border="0" cellpadding="0" cellspacing="0" width="100%">
  <tr bgcolor="#f1f1f1">
    <td align="left" height="36">
                <img src="http://XXXXX.gif"/></td>
  </tr>
</table> 

[[2]]
<table width="100%" style="table-layout:fixed;">
  <tr><td width="30px" valign="top"><font size="2">1. </font></td>
<td><font size="3">港人喜見黃金馬車 馳向中英關係黃金時代</font> 

<font size="2" face="Arial">[Ta Kung Pao] 2015-10-27    B21 通識新世代   中英社評    </font> </td>
</tr>
  <tr><td colspan="2">
<table width="100%"/></td>
</tr>
</table> 

[[3]]
<table width="100%"/> 
[[4]]
<table width="100%" style="table-layout:fixed;">
  <tr><td width="30px" valign="top"><font size="2">2. </font></td>
<td><font size="3">High-level exchanges between China and ROK</font> 

<font size="2" face="Arial">[China Daily] 2015-10-27        Asia-Pacific        </font> </td>
</tr>
  <tr><td colspan="2">
<table width="100%"/></td>
</tr>
</table> 

[[5]]
<table width="100%"/> 

So rather than trying to extract data by somehow creating a dataframe, I think my only option would be to use a regex to extract the dates from the whole text. I thought a first step to doing this might be to perform a string split on the list after the "]" where all dates in the file are located, so I tried:

b <- unlist(strsplit(test,"]"))

But this returns the error:

Error in strsplit(test, "]") : non-character argument

I'm appreciative of any help to put me on the right track.

All dates are in the below format:

2015-10-27

1 Answer

I know R only a little bit, but strsplit expects a character vector. You give it test, which is the result of htmlParse: a parsed document object (some sort of tree), not a string.
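
To make that concrete, here is a minimal sketch of turning the parsed document into plain character strings that a regex can work on. It assumes the XML package from the question. The inline HTML is a made-up stand-in for Xi.html, modelled on the node output shown in the question, and the XPath filter on the font attributes is only a guess at where the dates live in the real file.

```r
library(XML)

# Hypothetical stand-in for 'Xi.html', modelled on the nodes shown in the question
html <- '<html><body>
<table><tr><td><font size="3">High-level exchanges between China and ROK</font>
<font size="2" face="Arial">[China Daily] 2015-10-27 Asia-Pacific</font></td></tr></table>
</body></html>'

# asText = TRUE parses the string itself instead of treating it as a file name
doc <- htmlParse(html, asText = TRUE)

# xmlValue pulls the text content out of every node matched by the XPath,
# giving a character vector with one element per matching <font> node
htmltext <- xpathSApply(doc, "//font[@size='2'][@face='Arial']", xmlValue)
```

After this, htmltext is an ordinary character vector, so string functions such as strsplit no longer complain about a non-character argument.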


R's regular expressions are either extended or Perl-like. To match all dates no matter what, you can use

\d\d\d\d-\d\d-\d\d

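Looking through strsplit's manual, it seems to be the wrong tool for extracting the dates. grep with value = TRUE is not quite right either, since it returns every element that contains a date rather than the dates themselves, so you should rather look into regmatches together with gregexpr. Note that backslashes must be doubled inside an R string literal. Assuming htmltext is a character vector holding the page text, something like

dates <- unlist(regmatches(htmltext, gregexpr("\\d{4}-\\d{2}-\\d{2}", htmltext)))

might work and should return the dates.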
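
For step 2 of the question (the frequency per date), base R's table should be enough once the dates are in a character vector. The following is a minimal sketch using only base R. The sample strings are shortened from the output in the question, and the third one is a hypothetical extra row added so that the counts differ.

```r
# Sample text; in practice htmltext would hold the text extracted from the file
htmltext <- c(
  "[Ta Kung Pao] 2015-10-27    B21",
  "[China Daily] 2015-10-27        Asia-Pacific",
  "[China Daily] 2015-10-26        Asia-Pacific"   # hypothetical extra row
)

# Four digits, dash, two digits, dash, two digits
pattern <- "[0-9]{4}-[0-9]{2}-[0-9]{2}"

# gregexpr locates every match in each string; regmatches extracts the
# matched substrings; unlist flattens the per-string results
dates <- unlist(regmatches(htmltext, gregexpr(pattern, htmltext)))

# Count how many articles fall on each date
freq <- table(dates)

# Optional: the same counts as a data frame with columns 'dates' and 'Freq'
freq_df <- as.data.frame(freq, stringsAsFactors = FALSE)
```

If the dates need to be sorted or plotted chronologically, as.Date(freq_df$dates) converts them, since they are already in the ISO yyyy-mm-dd format.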
