I am trying to scrape a website, but I can't get past this encoding issue:

library(XML)  # provides htmlParse() and xpathSApply()

# putting together the url:
search_str <- "allintitle:amphibian richness OR diversity"
url <- paste("http://scholar.google.at/scholar?q=",
             search_str, "&hl=en&num=100&as_sdt=1,5&as_vis=1", sep = "")

# get content and parse it:
doc <- htmlParse(url)

# encoding issue, like here:
xpathSApply(doc, '//div[@class="gs_a"]', xmlValue)

  [1] "M Vences, M Thomas… - …  of the Royal  …, 2005 - rstb.royalsocietypublishing.org"             
  [2] "PB Pearman - Conservation Biology, 1997 - Wiley Online Library"                                     
  [3] "D Vallan - Biological Conservation, 2000 - Elsevier"                                                
  [4] "LB Buckley, W Jetz - Proceedings of the Royal  …, 2007 - rspb.royalsocietypublishing.org"         
  [5] "Mà Rodríguez, JA Belmontes, BA Hawkins - Acta Oecologica, 2005 - Elsevier"                        
  [6] "TJC Beebee - Biological Conservation, 1997 - Elsevier"                                              
  [7] "D Vallan - Journal of Tropical Ecology, 2002 - Cambridge Univ Press"                                
  [8] "MO Rödel, R Ernst - Ecotropica, 2004 - gtoe.de" 
# ...

any pointers?

> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252   
[3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Austria.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RCurl_1.91-1.1 bitops_1.0-4.1 XML_3.9-4.1   

loaded via a namespace (and not attached):
[1] tools_2.15.1

> getOption("encoding")
[1] "native.enc"

1 Answer

This worked to some degree for me:

doc <- htmlParse(url, encoding = "UTF-8")
head(xpathSApply(doc, '//div[@class="gs_a"]', xmlValue))
#[1] "M Vences, M Thomas… - …  of the Royal  …, 2005 - rstb.royalsocietypublishing.org"        
#[2] "PB Pearman - Conservation Biology, 1997 - Wiley Online Library"                          
#[3] "D Vallan - Biological Conservation, 2000 - Elsevier"                                     
#[4] "LB Buckley, W Jetz - Proceedings of the Royal  …, 2007 - rspb.royalsocietypublishing.org"
#[5] "MÁ Rodríguez, JA Belmontes, BA Hawkins - Acta Oecologica, 2005 - Elsevier"               
#[6] "TJC Beebee - Biological Conservation, 1997 - Elsevier"   

However, an entry such as

xpathSApply(doc, '//div[@class="gs_a"]', xmlValue)[[81]]

still displayed incorrectly on my Windows box.

Switching the font to DotumChe in the GUI preferences, however, made it display correctly, so it may just be a display issue rather than a parsing one.
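
One way to tell a display problem from a parsing problem is to look at the code points of a string instead of its printed form. The following is only a sketch: the sample string is a hypothetical stand-in for one parsed entry (nothing is fetched from the site), and relabelling the bytes as latin1 is an assumption about how the original garbling arises.

```r
# Hypothetical sample standing in for one parsed entry (not fetched from the site)
x <- "MO R\u00f6del, R Ernst"

# Code points of the correctly parsed string; 246 (U+00F6) is the o-umlaut
cp_ok <- utf8ToInt(x)

# Simulate the original problem: take the UTF-8 bytes and label them as
# latin1, roughly what happens when the parser is not told the encoding
bytes <- charToRaw(x)
wrong <- rawToChar(bytes)
Encoding(wrong) <- "latin1"
wrong <- enc2utf8(wrong)
cp_wrong <- utf8ToInt(wrong)

# A correctly parsed string has a single code point for the umlaut,
# a mis-decoded one has two (195 and 182) in its place
246L %in% cp_ok
all(c(195L, 182L) %in% cp_wrong)
length(cp_wrong) - length(cp_ok)
```

If utf8ToInt() on a real entry returns the expected code points, the parse is probably fine and only the console font or locale is getting in the way.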

1 Comment

thanks! setting the encoding parameter helped. only some characters are not parsed/displayed correctly, like <U+0096>, <U+8BBA>, <U+6587>, <U+1ED1>, <U+1EC7>.. setting GUI-fonts didn't change this for me..
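
As far as I know, output such as <U+8BBA> is how R on Windows prints a character that the native code page (1252 here) cannot represent, so the string itself may still be intact. A rough sketch to check this, assuming a hypothetical sample made of two of the code points listed above: write the string to a UTF-8 file and read it back, bypassing the console.

```r
# Hypothetical sample built from two of the code points mentioned above
x <- "\u8bba\u6587"
cp <- utf8ToInt(x)

# Write the raw UTF-8 bytes to a temporary file, skipping any
# conversion to the native encoding
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeLines(enc2utf8(x), con, useBytes = TRUE)
close(con)

# Read it back, declaring the content to be UTF-8
back <- readLines(tmp, encoding = "UTF-8")

# TRUE means the characters survived the round trip unchanged
identical(utf8ToInt(back), cp)
```

Opening the temporary file in a UTF-8 aware editor should then show the characters properly, even if the R console does not.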
