I need to extract a block of text from each of a set of Google results obtained using

library(XML)
library(RCurl)

input <- "R%statistical%Software"
# Build the search URL, wrapping the query in double quotes
url <- paste("https://www.google.com/search?q=\"",
             input, "\"", sep = "")

# Path to the CA certificate bundle under the RCurl installation directory
CAINFO <- paste(system.file(package = "RCurl"), "/CurlSSL/ca-bundle.crt", sep = "")
script <- getURL(url, followlocation = TRUE, cainfo = CAINFO)
doc <- htmlParse(script)

in the R package XML

An extract of the parsed HTML document follows:

</ul></div>
</div>
</div>
<span class="st">R, also called GNU S, is a strongly functional language and environment to <br>
statistically explore data sets, make many graphical displays of data from custom<br>
 <b>...</b></span><br>
</div>
<table class="slk" cellpadding="0" cellspacing="0" style="border-collapse:collapse;margin-top:1px">
<tr class="mslg">
<td style="padding-left:23px;vertical-align:top"><div class="sld">

In this example I need to extract the following text for each result returned

"R, also called GNU S, is a strongly functional language and environment to
statistically explore data sets, make many graphical displays of data from custom
"

I have had a go with some of the functions in the XML package for R, but I don't think I understand enough about HTML and XML. The text will vary for each result returned, so it's actually the contents of the

<span class="st">

element that I need to extract. As you have probably guessed, I am not familiar with HTML or XML, so any recommendations for a good tutorial or book that would give me enough of an overview to solve these kinds of problems would be most welcome. Thanks

  • Can you post a link to the file you are parsing? Commented Feb 15, 2014 at 8:14

1 Answer

This returns a list, result, containing the text of every span element with class="st" (there are 7 in your document).

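library(XML)

input <- "R%statistical%Software"
url <- paste0("http://www.google.com/search?q=", input)
doc <- htmlParse(url)   # htmlParse reads the http URL directly
# Select every <span class="st"> node and extract its text
result <- lapply(doc['//span[@class="st"]'], xmlValue)
result[1]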
# [[1]]
# [1] "R, also called GNU S, is a strongly functional language and environment to \nstatistically explore data sets, make many graphical displays of data from custom\n ..."

Note that using http instead of https greatly simplifies retrieval of the document: htmlParse can read the URL directly, so RCurl and the certificate bundle are not needed.
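If the page has already been downloaded as text (for example the script string returned by getURL in the question), the same XPath query can be run on that string. The following is only a minimal sketch under some assumptions: extract_snippets is a hypothetical helper name, not a function from the XML package, and sample_html is a cut-down copy of the excerpt in the question, used so the example runs without a network connection. It also assumes the class attribute is exactly "st"; a span carrying several classes would not match this XPath.

```r
library(XML)

# Hypothetical helper (not part of the XML package): takes HTML as a
# character string and returns the text of every <span class="st">.
extract_snippets <- function(html_text) {
  # asText = TRUE tells htmlParse the argument is content, not a file or URL
  doc <- htmlParse(html_text, asText = TRUE)
  # xpathSApply applies xmlValue to each matching node and simplifies
  # the result to a character vector when it can
  snippets <- xpathSApply(doc, '//span[@class="st"]', xmlValue)
  # Collapse line breaks and runs of whitespace into single spaces
  trimws(gsub("[[:space:]]+", " ", snippets))
}

# A cut-down version of the excerpt from the question
sample_html <- '<html><body><div>
<span class="st">R, also called GNU S, is a strongly functional language and environment to <br>
statistically explore data sets, make many graphical displays of data from custom<br>
 <b>...</b></span><br>
</div>
<div><span class="other">not a result snippet</span></div>
</body></html>'

snippets <- extract_snippets(sample_html)
print(snippets)
```

The whitespace clean-up is optional; it only removes the "\n" characters that the <br> line breaks leave in the raw xmlValue output.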
