I have pieces of HTML that I need to convert to values in a dataframe.

For example, this piece of HTML:

<div class="header">
<h3>title 1</h3>
</div>
<div class="content">
<ul>
<li>info1</li>
<li>info2
</li>
<li>info3
</li>
</ul>
</div>
<div class="header">
<h2>title 2</h2>
</div>
<div class="content">
<ul>
<li>info4</li>
<li>info5
</li>
<li>info6
</li>
</ul>
</div>

I want it to be changed into a dataframe like:

    Title  Info
1 title 1 info1
2 title 1 info2
3 title 1 info3
4 title 2 info4
5 title 2 info5
6 title 2 info6

I tried functions in the XML package and the tm.plugin.webmining package. I also tried the code mentioned on this page: http://tonybreyal.wordpress.com/2011/11/18/htmltotext-extracting-text-from-html-via-xpath/ So far I haven't found a function that does what I want. Does anyone have an idea how to deal with this problem?

1 Answer

I think the HTML parsing in the XML library will help here. Let's assume that the HTML input you've shown above is stored in a variable called intext. We can then process your data with

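library(XML)

# Parse the HTML string; asText = TRUE marks the input as content, not a file name
hh <- htmlParse(intext, asText = TRUE)

# Use XPath to extract the data
# One title per "header" div (the text of its h2/h3 child)
titles <- xpathSApply(hh, "//div[@class='header']/*/text()", xmlValue)
# One character vector of items per "content" list; trimws() strips the
# newlines around each item without removing spaces inside it
info <- xpathApply(hh, "//div[@class='content']/ul", function(x)
    trimws(xpathSApply(x, "./li/text()", xmlValue)))

# Merge the results: pair each title with its items, then stack the pieces
do.call(rbind, Map(cbind, titles, info))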

This returns

     [,1]      [,2]   
[1,] "title 1" "info1"
[2,] "title 1" "info2"
[3,] "title 1" "info3"
[4,] "title 2" "info4"
[5,] "title 2" "info5"
[6,] "title 2" "info6"

which is a matrix that you can easily turn into a data.frame if you like.
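
As a minimal sketch of that conversion: the variable name m is hypothetical and is filled with literal values here so the snippet runs on its own; in practice it would hold the result of the do.call(rbind, ...) call. The column names Title and Info are taken from the question.

```r
# Hypothetical stand-in for the matrix returned by do.call(rbind, ...)
m <- cbind(
  c("title 1", "title 1", "title 1", "title 2", "title 2", "title 2"),
  c("info1", "info2", "info3", "info4", "info5", "info6")
)

# Convert to a data.frame, keeping the values as character strings
df <- as.data.frame(m, stringsAsFactors = FALSE)

# Name the columns as in the desired output from the question
names(df) <- c("Title", "Info")
df
```

stringsAsFactors = FALSE keeps both columns as plain character vectors, which is the default from R 4.0.0 onwards but not in earlier versions.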

2 Comments

Thank you! It works perfectly on this example. I thought I had made up a good example, but on my real data it doesn't work quite as well. The thing is, I am trying to parse my public LinkedIn page. Because that is far too much HTML to post here, I came up with this example instead.
Unfortunately there is no magic extractAllTheDataIWantInTheFormatIWant() function. If the data is well structured, you might at least be able to come up with a few rules to extract the parts you want. Here we assume that there are equal numbers of "header" and "content" divs. For the headers, we extract the text from the child node (which in your example was an h3 or h2 tag). For the content nodes, we find the unordered lists (ul) and then extract the text from the li elements.
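
If the number of "header" and "content" divs cannot be guaranteed to match, one possible variation (a sketch only, not tested against real LinkedIn markup) is to start from each "content" div and look up its nearest preceding "header" sibling. The sample intext, the variable names, and the assumption that headers and contents are sibling divs are all illustrative.

```r
library(XML)

# Small illustrative input; real pages will be structured differently
intext <- '<div class="header"><h3>title 1</h3></div>
<div class="content"><ul><li>info1</li><li>info2
</li></ul></div>
<div class="header"><h2>title 2</h2></div>
<div class="content"><ul><li>info3</li></ul></div>'

hh <- htmlParse(intext, asText = TRUE)

# Visit every "content" div and pair it with the closest "header" div before it
contents <- getNodeSet(hh, "//div[@class='content']")
rows <- lapply(contents, function(node) {
  title <- xpathSApply(node, "preceding-sibling::div[@class='header'][1]", xmlValue)
  items <- xpathSApply(node, "./ul/li", xmlValue)
  # Skip content blocks that have no header or no list items
  if (length(title) == 0 || length(items) == 0) return(NULL)
  data.frame(Title = trimws(title), Info = trimws(items), stringsAsFactors = FALSE)
})

# Drop the skipped blocks and stack the rest into one data.frame
result <- do.call(rbind, Filter(Negate(is.null), rows))
result
```

Because each content block finds its own title, a missing or extra header no longer shifts every later row, which is the failure mode of pairing two independently extracted vectors.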
