0

I used R and DBI to query data from a PostgreSQL but got corrupted results with weird characters. I check both server encoding and client encoding, and they are all UTF-8. Here's what I did

  1. Connect to the db

    library(DBI)

    conn <- dbConnect( drv = RPostgreSQL::PostgreSQL(), host = *, dbname = , user = , password = * )

  2. Fetch the result

    result_df <- dbGetQuery( conn, "select description from some_table where user_id = 1234" )

  3. Result What I got is ã¯ã˜ã‚ã¾ã—ã¦1008(ã¨ãƒ¼ã‚„orã¨ã…ーや)ã¨ã„ã„ã¾ã™ã€‚ 表ç¾ã™ã‚‹ã®ãŒå¤§å¥½ããªå­ã§ã™!! 感性ãŒäººã¨ã‚ºãƒ¬ã¦ã¾ã™(笑) ãŠæ­Œã‚’ã‚‚ã£ã¨ã†ã¾ããªã‚ŠãŸã„ã§ã™ã€‚ while the actual text is はじめまして1008(とーやorとぅーや)といいます。 表現するのが大好きな子です!! 感性が人とズレてます(笑) お歌をもっとうまくなりたいです。

I test again by reading the text and then print it back

x <- 'はじめまして1008(とーやorとぅーや)といいます。 表現するのが大好きな子です!! 感性が人とズレてます(笑) お歌をもっとうまくなりたいです。'
print(x)
>> はじめまして1008(とーやorとぅーや)といいます。 表現するのが大好きな子です!! 感性が人とズレてます(笑) お歌をもっとうまくなりたいです。

There's nothing wrong with the encoding in this case, so I think the problem happened when the data is returned from the database into R, but I don't know how to fix it.

4
  • How does it look when you do the query with psql from R host machine? Commented Jun 15, 2017 at 4:59
  • it looked like this ã¯ã˜ã‚ã¾ã—ã¦1008(ã¨ãƒ¼ã‚„orã¨ã…ーや)ã¨ã„ã„ã¾ã™ã€‚ 表ç¾ã™ã‚‹ã®ãŒå¤§å¥½ããªå­ã§ã™!! 感性ãŒäººã¨ã‚ºãƒ¬ã¦ã¾ã™(笑) ãŠæ­Œã‚’ã‚‚ã£ã¨ã†ã¾ããªã‚ŠãŸã„ã§ã™ã€ Commented Jun 15, 2017 at 5:29
  • So it does seem to have nothing to do with R. What does psql -l show and SHOW client_encoding;. Do you have any conversion going on in pg_conversion? Commented Jun 15, 2017 at 10:54
  • When I start a new databse with default en_US.utf8 and paste your text to a new table with a text column, it works and pg_dump shows your text correctly as do select. What locale do you have on your machines? They might interfere. I have sv_SE.UTF-8 on my machine. Commented Jun 15, 2017 at 13:30

1 Answer 1

1


RPostgreSQL does not set string encodings properly, see this issue.

So in the meantime you have to declare the encoding yourself. With dplyr, you can do it like this:

suppressPackageStartupMessages({
  library(dplyr)
  library(RPostgreSQL)
})

my_con <- 
  dbConnect(
    PostgreSQL(),
    user = my_username,
    password = my_password,
    host = my_host,
    dbname = my_dbname
  )

my_table <- tribble(
  ~user_id,    ~description,
  "1234",     "はじめまして1008(とーやorとぅーや)といいます。 表現するのが大好きな子です!! 感性が人とズレてます(笑) お歌をもっとうまくなりたいです。"
)

copy_to(my_con, my_table)

result_df <- dbGetQuery(my_con, "SELECT description FROM my_table WHERE user_id = '1234'" )

declare_utf8 <- function(x) {
  Encoding(x) <- "UTF-8"
  x
}

result_df %>% mutate_if(is.character, declare_utf8) %>% pull()
#> [1] "はじめまして1008(とーやorとぅーや)といいます。 表現するのが大好きな子です!! 感性が人とズレてます(笑) お歌をもっとうまくなりたいです。"

dbDisconnect(my_con)
#> [1] TRUE

pull() is not necessary here. I just use it because your description would not print right as a data.frame() on my machine due to another character encoding bug, so I am pulling it out as a vector.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.