How to convert HTML lists into a data frame in R?

Question

I have pieces of HTML that I need to convert to values in a dataframe.

For example, this piece of HTML:

<div class="header">
<h3>title 1</h3>
</div>
<div class="content">
<ul>
<li>info1</li>
<li>info2
</li>
<li>info3
</li>
</ul>
</div>
<div class="header">
<h2>title 2</h2>
</div>
<div class="content">
<ul>
<li>info4</li>
<li>info5
</li>
<li>info6
</li>
</ul>
</div>

I want it to be changed into a dataframe like:

    Title  Info
1 title 1 info1
2 title 1 info2
3 title 1 info3
4 title 2 info4
5 title 2 info5
6 title 2 info6

I tried functions in the XML package and the tm.plugin.webmining package. I also tried the code from this page: http://tonybreyal.wordpress.com/2011/11/18/htmltotext-extracting-text-from-html-via-xpath/ So far I haven't found a function that does what I want. Does anyone have an idea how to deal with this problem?

MrFlick · Accepted Answer · 2014-07-25 18:28:38Z

3

I think the HTML parser in the XML package will help here. Let's assume that the HTML input you've shown above is stored in a character variable called intext. We can then process your data with

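library(XML)

# parse the HTML string (asText = TRUE: intext holds markup, not a file name)
hh <- htmlParse(intext, asText = TRUE)

# use XPath to extract the data: one title per header div,
# one character vector of list items per content div
titles <- xpathSApply(hh, "//div[@class='header']/*/text()", xmlValue)
info <- xpathApply(hh, "//div[@class='content']/ul", function(x)
    # strip only leading/trailing whitespace so items containing spaces survive
    gsub("^\\s+|\\s+$", "", xpathSApply(x, "./li/text()", xmlValue)))

# pair each title with its items, then stack the pieces into one matrix
do.call(rbind, Map(cbind, titles, info))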

This returns

     [,1]      [,2]   
[1,] "title 1" "info1"
[2,] "title 1" "info2"
[3,] "title 1" "info3"
[4,] "title 2" "info4"
[5,] "title 2" "info5"
[6,] "title 2" "info6"

which is a matrix that you can easily turn into a data.frame if you like.
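
As a rough sketch of that last step, the matrix can be converted by naming its columns explicitly. The matrix `m` below is rebuilt by hand so the snippet runs without the XML package, and the column names `Title` and `Info` are taken from the question; neither is part of the code above.

```r
# Hand-built stand-in for the matrix returned by do.call(rbind, ...) above
m <- cbind(rep(c("title 1", "title 2"), each = 3),
           paste0("info", 1:6))

# Name the columns explicitly; keep the values as character rather than factor
df <- data.frame(Title = m[, 1], Info = m[, 2], stringsAsFactors = FALSE)
print(df)
```

With the real result, the same `data.frame()` call should work on whatever variable the matrix was assigned to.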


2 Comments

rdatasculptor Over a year ago

Thank you! It works perfectly on this example. I thought I had made up a good example, but on my real data it doesn't work quite that well. The thing is, I tried to parse my public LinkedIn page. Because that is way too much HTML to put in here, I came up with the example.

MrFlick Over a year ago

Unfortunately there is no magic extractAllTheDataIWantInTheFormatIWant() function. If the data is well structured, you at least might be able to come up with a few rules to extract the parts you want. Here we assume that there are equal numbers of "header" and "content" divs. With the headers, we extract the text from the child node (which in your example was an h3 or h2 tag). With the content nodes, we find the unordered lists and then extract the text from the li elements.
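
The equal-count assumption can be checked explicitly before the pieces are combined. Below is a minimal base R sketch, with `titles` and `info` written out by hand to stand in for the XPath results; the helper name `combine_sections` is hypothetical and not part of any package.

```r
# Stand-ins for the XPath results: one title per header div,
# one character vector of items per content div.
titles <- c("title 1", "title 2")
info <- list(c("info1", "info2", "info3"),
             c("info4", "info5", "info6"))

# Hypothetical helper: fail early if headers and content blocks do not pair up.
combine_sections <- function(titles, info) {
  if (length(titles) != length(info)) {
    stop("found ", length(titles), " headers but ",
         length(info), " content blocks")
  }
  data.frame(
    Title = rep(titles, times = lengths(info)),  # repeat each title once per item
    Info = unlist(info),
    stringsAsFactors = FALSE
  )
}

result <- combine_sections(titles, info)
print(result)
```

On real pages where the counts differ, the error points at the mismatch instead of silently recycling titles against the wrong lists.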
