How to extract text from HTML using R

Question

I need to extract a block of text from each of a set of Google results obtained using

require(XML)
require(RCurl)
input <- "R%statistical%Software"
url <- paste("https://www.google.com/search?q=\"",
             input, "\"", sep = "")

CAINFO = paste(system.file(package="RCurl"), "/CurlSSL/ca-bundle.crt", sep = "")
script <- getURL(url, followlocation = TRUE, cainfo = CAINFO)
doc <- htmlParse(script)

in the R package XML

An extract of the parsed HTML document is as follows:

</ul></div>
</div>
</div>
<span class="st">R, also called GNU S, is a strongly functional language and environment to <br>
statistically explore data sets, make many graphical displays of data from custom<br>
 <b>...</b></span><br>
</div>
<table class="slk" cellpadding="0" cellspacing="0" style="border-collapse:collapse;margin-top:1px">
<tr class="mslg">
<td style="padding-left:23px;vertical-align:top"><div class="sld">

In this example I need to extract the following text for each result returned:

"R, also called GNU S, is a strongly functional language and environment to
statistically explore data sets, make many graphical displays of data from custom
"

I have had a go with some of the functions in the XML package for R, but I don't think I understand enough about HTML and XML. The text will vary for each result returned, so it's actually the

<span class="st">

field (?) that I need to extract. As you have probably guessed, I am not familiar with HTML or XML, so any recommendations for a good tutorial or book that would give me enough of an overview to solve this kind of problem would be most welcome. Thanks.

Can you post a link to the file you are parsing?

– jlhoward, Commented Feb 15, 2014 at 8:14

jlhoward · Accepted Answer · 2014-02-15 08:45:31Z

4

This returns a list, result, containing the text from every span tag with class="st" (there are 7 in your document).

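library(XML)

input <- "R%statistical%Software"
url <- paste0("http://www.google.com/search?q=", input)
doc <- htmlParse(url)  # htmlParse fetches and parses the http URL directly
# XPath: select every <span> whose class attribute is exactly "st", then take its text
result <- lapply(doc['//span[@class="st"]'], xmlValue)
result[1]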
# [[1]]
# [1] "R, also called GNU S, is a strongly functional language and environment to \nstatistically explore data sets, make many graphical displays of data from custom\n ..."

Note that using http instead of https greatly simplifies retrieval of the document: htmlParse can read the URL directly, so RCurl and the certificate bundle are not needed.
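If you want to experiment with the XPath part without hitting Google, here is a minimal sketch that parses an inline HTML string instead. The html string is a made-up stand-in based on the extract in the question (the second snippet and the "other" span are invented for illustration), and it uses xpathSApply rather than lapply so the result is a character vector instead of a list.

```r
library(XML)

# Hypothetical stand-in for the downloaded page, modelled on the extract in the question
html <- '<html><body>
<div><span class="st">R, also called GNU S, is a strongly functional language and environment to <br>
statistically explore data sets, make many graphical displays of data from custom<br>
<b>...</b></span><br></div>
<div><span class="st">A second, made-up snippet.</span></div>
<span class="other">Not a result snippet.</span>
</body></html>'

# asText = TRUE tells htmlParse that the argument is HTML content, not a file name or URL
doc <- htmlParse(html, asText = TRUE)

# Apply xmlValue to every <span class="st"> node and simplify to a character vector
snippets <- xpathSApply(doc, '//span[@class="st"]', xmlValue)

# Collapse the line breaks and repeated whitespace left around the <br> tags
snippets <- trimws(gsub("\\s+", " ", snippets))

length(snippets)
```

Only the two span elements with class="st" are matched; the span with class="other" is ignored. Note that @class="st" is an exact match on the attribute value, so a span with several classes (for example class="st foo") would need something like contains(@class, "st") instead.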

edited Feb 15, 2014 at 8:45

answered Feb 15, 2014 at 8:18
