10

I am trying to change the encoding of a column in a dataframe.

stri_enc_mark(data_updated$text)
#   [1] "UTF-8" "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "UTF-8" "UTF-8" "UTF-8"
#  [10] "ASCII" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8"
#  [19] "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "UTF-8" "ASCII" "ASCII"
#  [28] "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "ASCII" "UTF-8" "UTF-8" "ASCII"

When I try to convert it, it does not throw an error, but still has no effect on the vector:

d <- enc2utf8(data_updated$text)
stri_enc_mark(d)
#   [1] "UTF-8" "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "UTF-8" "UTF-8" "UTF-8"
#  [10] "ASCII" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8"
#  [19] "ASCII" "UTF-8" "ASCII" "UTF-8" "ASCII" "UTF-8" "UTF-8" "ASCII" "ASCII"
#  [28] "ASCII" "ASCII" "UTF-8" "ASCII" "ASCII" "ASCII" "UTF-8" "UTF-8" "ASCII"

Any suggestions?

I am on Windows 7, 32-bit. Adding a data snippet.

> Encoding(data_updated$text[1:35])
 [1] "UTF-8"   "unknown" "unknown" "UTF-8"   "unknown" "unknown" "UTF-8"  
 [8] "UTF-8"   "UTF-8"   "unknown" "unknown" "UTF-8"   "unknown" "UTF-8"  
[15] "unknown" "UTF-8"   "unknown" "UTF-8"   "unknown" "UTF-8"   "unknown"
[22] "UTF-8"   "unknown" "UTF-8"   "UTF-8"   "unknown" "unknown" "unknown"
[29] "unknown" "UTF-8"   "unknown" "unknown" "unknown" "UTF-8"   "UTF-8"

The data looks like this:

> data_updated$text[1:35]
 [1] "RT @satpalpandey: Majlis started in Sirsa Ashram.\nInform others too.\nLive @ http://t.co/zGXWATGajX\nIVR Airtel 55252\nReliance 56300403\n\n#MSG…"
 [2] "Deal Talks for Here Mapping Service Expose Reliance on Location Data, via @nytimes #mapping #dilemma  http://t.co/wGdiS5OlRq"                      
 [3] "http://t.co/UZIyX1Rk7W The popping linksexploaded!! http://t.co/KpNntm1dH7 :) http://t.co/oku91uVxZ8"                                              
 [4] "RT @davidsunaria90: Wtch LIVE Mjlis Now\n http://t.co/GXNhe3eY7Y\nIVR Airtel: 55252\nReliance: 56300403\nYoutube Link : http://t.co/YewOVcz8bb\n…" 
 [5] "Reliance Jio Infocomm: Indian carrier raises $750 million loan for 4G rollout  http://t.co/B2aWlkmwXz"                                             
 [6] "RT @SurjeetInsan: Majlis started in Sirsa Ashram.\nLive @ http://t.co/PR6W5tzZes\nIVR Airtel 55252\nReliance 56300403\n\n#MSGPlsSaveTheEarth"      
 [7] "\"Deal Talks for Here Mapping Service Expose Reliance on Location Data\" by MARK SCOTT and MIKE ISAAC via NYT Techno… http://t.co/kyxTYIxks5"      
 [8] "RT @satpalpandey: Majlis started in Sirsa Ashram.\nInform others too.\nLive @ http://t.co/zGXWATGajX\nIVR Airtel 55252\nReliance 56300403\n\n#MSG…"
 [9] "RT @jaameinsan: Watch LIVE Majlis Now\n http://t.co/nPQegnLXPa\nIVR Airtel: 55252\nReliance: 56300403\nYoutube Link : http://t.co/txXMtw3zFP\n#M…" 
[10] "\"Deal Talks for Here Mapping Service Expose Reliance on Location Data\" by MARK SCOTT and MIKE ISAAC via NYT Technology"

These are tweets, and I think the "http://" links are dictating the encoding here, given that they contain strings like "wGdiS5OlRq". For analysis I had removed these links using regular expressions, but I need the raw tweets in order to store them in a database. MongoDB has no problem with them, but an RDBMS throws errors.
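
In the sample printed above, the elements marked "UTF-8" (1, 4, 7, 8 and 9) are the ones containing the truncation character "…", which is not ASCII, while the t.co links consist of ASCII characters only. A minimal sketch for checking which strings contain non-ASCII bytes, using only base R, might look like this; the two-element `tweets` vector is a made-up stand-in for `data_updated$text`, not the original data.

```r
# Hypothetical two-element sample modelled on the tweets above
tweets <- c(
  "Live @ http://t.co/zGXWATGajX #MSG\u2026",  # ends in U+2026 (ellipsis)
  "Expose Reliance http://t.co/wGdiS5OlRq"     # link only, pure ASCII
)

# TRUE where any byte is above 0x7F, i.e. the string is not pure ASCII
has_non_ascii <- vapply(
  tweets,
  function(s) any(as.integer(charToRaw(s)) > 127L),
  logical(1),
  USE.NAMES = FALSE
)

has_non_ascii
Encoding(tweets)
```

If the flags line up with the "UTF-8" marks, the links are probably not what drives the encoding mark.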

5
  • It would help to have a reproducible example. It would also be helpful to know what OS you are on and what Encoding() returns for those vectors. It's possible that if there are not any non-ascii characters in the string it will just return ASCII. Commented May 14, 2015 at 4:58
  • This is also a classic example of a problem that could be simplified. You have 36 data points. You need 2 to show off this problem - data_updated$text[1:2] would be plenty to show that nothing changes from ASCII to UTF-8. Commented May 14, 2015 at 5:39
  • If the real problem is that the RDBMS is throwing errors, then it would be better to describe that problem. The encoding of strings that only have ASCII characters shouldn't cause a problem. Commented May 14, 2015 at 6:16
  • The data table I am porting the data to is UTF-8 encoded, so I think it does not accept ASCII. The error says, "expected UTF-8". Commented May 14, 2015 at 8:27
  • But something that's ASCII encoded is also UTF-8 encoded. There would be nothing different in the bytes of the two strings. You can't tell the difference. How is this mystery function checking? Commented May 14, 2015 at 14:08
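
The point made in the last comment can be checked directly in base R. The following is a small sketch, with a made-up vector `x`, showing that `enc2utf8()` leaves a pure-ASCII string byte-for-byte unchanged and that R never attaches a "UTF-8" mark to it, even though the string is valid UTF-8.

```r
# Hypothetical input: one pure-ASCII string, one with a non-ASCII character
x <- c("plain ascii", "caf\u00e9")

y <- enc2utf8(x)

Encoding(x)   # the ASCII element is reported as "unknown"
Encoding(y)   # still "unknown" for the ASCII element after conversion

# The bytes of the ASCII string are the same before and after
identical(charToRaw(x[1]), charToRaw(y[1]))

# Both elements are nevertheless valid UTF-8
validUTF8(y)
```

So a mark of "ASCII" or "unknown" on such strings does not mean that the conversion failed.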

3 Answers

9

In case someone is still stuck: I used Encoding().

  for (col in colnames(mydataframe)) {
    Encoding(mydataframe[[col]]) <- "UTF-8"
  }

2 Comments

I got "Error in Encoding<-(*tmp*, value = "UTF-8") : a character vector argument expected" with this solution
You can try this solution to resolve the error: stackoverflow.com/questions/33731891/…
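
The error quoted in the first comment is what `Encoding<-` raises when it is given something that is not a character vector, such as a numeric column. One possible way around it, sketched here without reference to the linked answer, is to skip non-character columns. The data frame below is made up for illustration. Note that `Encoding<-` only declares an encoding; it does not convert any bytes.

```r
# Hypothetical data frame with one numeric and one character column
mydataframe <- data.frame(
  id   = 1:2,
  text = c("plain ascii", "caf\u00e9"),
  stringsAsFactors = FALSE
)

for (col in colnames(mydataframe)) {
  # Encoding<- only accepts character vectors, so leave other columns alone
  if (is.character(mydataframe[[col]])) {
    Encoding(mydataframe[[col]]) <- "UTF-8"
  }
}

Encoding(mydataframe$text)
```

Pure-ASCII strings still report "unknown" afterwards, because R does not mark ASCII strings with an encoding.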
2

It appears that we can use the iconv() function to convert the encoding after converting the vector to a factor and then back to a character vector. It is a bit strange, to be honest.
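
Assuming the function meant here is base R's `iconv()`, a minimal sketch of a conversion on a character vector looks like the following. The latin1 input string is made up, and the source encoding has to be known in advance; if the column is a factor, it would need `as.character()` first.

```r
# Hypothetical latin1 input: "cafe" with e-acute stored as the single byte 0xE9
x <- "caf\xe9"
Encoding(x) <- "latin1"

# Re-encode the bytes from latin1 to UTF-8
y <- iconv(x, from = "latin1", to = "UTF-8")

charToRaw(x)   # e-acute is one byte in latin1
charToRaw(y)   # e-acute is two bytes (c3 a9) in UTF-8
Encoding(y)
```

Unlike `Encoding<-`, this changes the underlying bytes rather than only the declared mark.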

1

I found stringi::stri_enc_toascii() pretty useful; it solved my problem.

I posted a case in "How to handle example data in R Package that has UTF-8 marked strings".
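
As a rough illustration of what this does, assuming the stringi package is installed, the sketch below runs `stri_enc_toascii()` on a made-up vector. The conversion is lossy: non-ASCII characters are replaced by a substitute character, so it may not suit cases where the raw text has to be preserved.

```r
library(stringi)

# Hypothetical input: one pure-ASCII string, one ending in U+2026 (ellipsis)
x <- c("plain ascii", "truncated tweet\u2026")

ascii_only <- stri_enc_toascii(x)

stri_enc_mark(ascii_only)      # encoding marks after conversion
stri_enc_isascii(ascii_only)   # checks that only ASCII bytes remain
ascii_only[2] == x[2]          # the non-ASCII string has been altered
```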
