character encoding error not resolved by specifying encoding

Question

I am trying to extract text from a Spanish-language source in R and am running into a character encoding problem that is not resolved by explicitly specifying the encoding in htmlParse, as recommended here.

library(XML)
library(httr)
url <- "http://www3.hcdn.gov.ar//folio-cgi-bin/om_isapi.dll?E1=&E11=&E12=&E13=&E14=&E15=&E16=&E17=&E18=&E2=&E3=&E5=ley&E6=&E7=&E9=&headingswithhits=on&infobase=proy.nfo&querytemplate=Consulta%20de%20Proyectos%20Parlamentarios&record={4EBB}&recordswithhits=on&softpage=Document42&submit=ejecutar%20"
doc <- htmlParse(rawToChar(GET(url)$content), encoding = "windows-1252")
text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
text[77]

The 77th element, which includes an accented i (í), shows the offending characters. The fourth line jumps through some additional hoops that I need in order to read this source. The document itself claims to be encoded in "windows-1252". Specifying "latin1", or any of the several other encodings I have tried, is no better. In my actual application, I have already downloaded many of these files and am reading them locally with readLines, and I can tell that the error is not present after reading the file into R, so the problem must be in htmlParse. Also, just accepting the encoding error and correcting it afterwards does not seem to be an option, as R does not even recognize the characters it is spitting out if I copy and paste them back into a script.
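
For reference, one way to check where the damage occurs is to compare the raw bytes and the encoding mark of a suspect string at each stage (after reading, after parsing). The following is only a minimal sketch in base R; the `suspect` variable and the word it holds are made up for illustration and stand in for an element such as text[77].

```r
# Hypothetical stand-in for one extracted string: "política" stored as
# UTF-8 bytes (í = 0xC3 0xAD) but carrying no encoding mark.
suspect <- rawToChar(as.raw(c(0x70, 0x6f, 0x6c, 0xc3, 0xad, 0x74, 0x69, 0x63, 0x61)))

Encoding(suspect)    # the mark R has attached to the string
charToRaw(suspect)   # the underlying bytes, independent of the mark
validUTF8(suspect)   # TRUE if the bytes form valid UTF-8
```

If the bytes around the accented character are c3 ad, the data are intact UTF-8 and only the mark or the display is off; a single ed byte would instead point to windows-1252 or latin1 data.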

farmkid · Accepted Answer · 2016-11-14 20:00:10Z

Here is a quick fix that may work after you bring the file into R:

Encoding(text) <- "UTF-8"

Changing the encoding to "UTF-8" makes Spanish-language files a lot more usable.
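
As a rough illustration of when this helps: assigning to Encoding() only changes the mark on a string, not its bytes, so it works when the strings are already UTF-8 but unmarked. Bytes that really are windows-1252 would need converting with iconv() instead. The sketch below uses hand-built byte vectors; the example word and the variable names are assumptions for illustration, not taken from the question's data.

```r
# Case 1: the bytes are already UTF-8 but unmarked, so relabel them.
utf8_bytes <- as.raw(c(0x70, 0x6f, 0x6c, 0xc3, 0xad, 0x74, 0x69, 0x63, 0x61))
a <- rawToChar(utf8_bytes)
Encoding(a) <- "UTF-8"   # bytes untouched, only the mark changes

# Case 2: the bytes are windows-1252 (í = 0xED), so convert them.
# The encoding name may need to be spelled "CP1252" on some platforms.
cp1252_bytes <- as.raw(c(0x70, 0x6f, 0x6c, 0xed, 0x74, 0x69, 0x63, 0x61))
b <- iconv(rawToChar(cp1252_bytes), from = "windows-1252", to = "UTF-8")

# Compare the resulting bytes of the two strings.
identical(charToRaw(a), charToRaw(b))
```

Relabelling windows-1252 bytes as "UTF-8" would not repair them, which is why checking the bytes first is worthwhile.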
