I am trying to extract text from a Spanish-language source in R, and I am running into a character encoding problem that is not resolved by explicitly specifying the encoding within htmlParse, as recommended here.

library(XML)
library(httr)
url <- "http://www3.hcdn.gov.ar//folio-cgi-bin/om_isapi.dll?E1=&E11=&E12=&E13=&E14=&E15=&E16=&E17=&E18=&E2=&E3=&E5=ley&E6=&E7=&E9=&headingswithhits=on&infobase=proy.nfo&querytemplate=Consulta%20de%20Proyectos%20Parlamentarios&record={4EBB}&recordswithhits=on&softpage=Document42&submit=ejecutar%20"
doc <- htmlParse(rawToChar(GET(url)$content),encoding="windows-1252")
text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
text[77]

The 77th element, which includes an accented i (í), contains the offending characters. The fourth line of the code reflects some additional hoops I have to jump through to read this source. The document itself claims to be encoded in "windows-1252". Specifying "latin1", or any of several other encodings I have tried, is no better.

In my actual application, I have already downloaded many of these files and am reading them locally using readLines. I can tell that the error is not present after reading the file into R, so the problem must be in htmlParse. Simply accepting the encoding error and correcting it ex post does not seem to be an option either, as R does not even recognize the characters it is spitting out if I try to copy and paste them back into a script.

1 Answer

Here is a quick fix that may work after you bring the text into R:

Encoding(text) <- "UTF-8"

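Marking the strings as "UTF-8" makes Spanish text a lot more usable. Note that Encoding() only changes how R interprets the bytes that are already there; it does not convert them, so this helps only when the text coming out of the parser is in fact UTF-8 but is not marked as such.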
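To illustrate the difference between re-labelling and converting, here is a minimal sketch. The strings are built from raw bytes purely for demonstration (they are not taken from the page in the question), and "CP1252" is assumed to be the name your platform's iconv accepts for windows-1252.

```r
# "cía" as UTF-8 bytes: "í" is the two bytes 0xC3 0xAD.
utf8_bytes <- as.raw(c(0x63, 0xc3, 0xad, 0x61))
x <- rawToChar(utf8_bytes)

# Re-label only: the bytes stay the same, R just reads them as UTF-8.
Encoding(x) <- "UTF-8"
same_bytes <- identical(charToRaw(x), utf8_bytes)

# "cía" as windows-1252 bytes: "í" is the single byte 0xED.
# Here re-labelling would be wrong; the bytes have to be converted.
y <- rawToChar(as.raw(c(0x63, 0xed, 0x61)))
y_utf8 <- iconv(y, from = "CP1252", to = "UTF-8")

# After conversion both strings hold the same UTF-8 bytes.
identical(charToRaw(y_utf8), charToRaw(x))
```

If Encoding(text) <- "UTF-8" fixes the display, the bytes were most likely UTF-8 all along. If it does not, converting with iconv() from the encoding the file was really saved in may be the better route.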
