I have a list of two data frames in R.
Each data frame contains several character and numeric columns. One of the columns is a company name column (for example, Company_Name).

The target database only supports UTF-8 encoding.
When I upload the tables, some company names with special characters get corrupted. For example, a value like:

L′ORÉAL PARIS

turns into something like:

L′OR@AL PARIS

Similar distortions happen for other names with accents or special characters.

Before writing to the database, I try to convert all character columns in R to UTF-8:

library(stringi)

make_utf8 <- function(df) {
  # Convert every character column to UTF-8; leave other column types untouched
  df[] <- lapply(df, function(col) {
    if (is.character(col)) stri_enc_toutf8(col) else col
  })
  df
}

After doing that, I check the encodings of the columns. However, I still see a mix of reported encodings such as UTF-8, ASCII, and unknown for different character columns.
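For reference, this is roughly how I check the encodings (using `stri_enc_mark()` from stringi; the sample values here are made up):

```r
library(stringi)

# Made-up sample: one pure-ASCII name, one accented name
x <- c("ACME CORP", "L'OR\u00c9AL PARIS")

Encoding(x)       # base R's declared encodings: "unknown" "UTF-8"
stri_enc_mark(x)  # stringi's view:              "ASCII"   "UTF-8"
```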

I know that ASCII is technically a subset of UTF-8, but even after these conversions and checks, the database issue remains: company names still get corrupted once they are loaded into the database.

My questions are:

  1. Is there a way in R to reliably force all character columns in these data frames to be valid UTF-8 strings, so that I don’t end up with mixed or unknown encodings?

  2. Is it normal that R still reports some character columns as ASCII or unknown even when the strings should be valid UTF-8?

  3. What is the recommended way to prepare text data in R for uploading to a UTF-8-only database, to avoid this kind of corruption?

3 Replies

Post your actual code. The problem is in the application or terminal settings, not the database or Unicode.

If you see Ã© the data is properly stored but the application displays the valid UTF8 string incorrectly, using a Latin1 encoding. The letter É is represented by the bytes 0xC3 0x89 in UTF8. If you try to read or display those bytes as if they were Latin1 you'll get Ã‰ (strictly, that rendering is Windows-1252; plain Latin1 maps 0x89 to an invisible control character).
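You can see this in a plain R session with base functions only:

```r
# É (U+00C9) encoded as UTF-8 is the byte pair 0xC3 0x89
charToRaw(enc2utf8("\u00c9"))   # c3 89

# Take those same two bytes, declare them Latin-1, and convert:
bad <- rawToChar(as.raw(c(0xc3, 0x89)))
Encoding(bad) <- "latin1"
enc2utf8(bad)   # two characters now: "Ã" followed by control char U+0089
```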

You don't have to force anything. "Forcing" always causes problems and can even lose data. If you don't try to force anything and make sure files, scripts, locales etc are all Unicode, nothing will get mixed up.

UTF8 isn't some kind of escaping. This page is UTF8. The 7-bit US-ASCII range is also valid UTF8 by design. UTF8 uses the exact same bytes for the 7-bit US-ASCII range of characters (0x00-0x7F) and 2 or more bytes for characters above this. All accented characters are outside that range and so use 2 or more bytes.

Somehow a non-Unicode encoding gets hard-coded somewhere. Perhaps you run the script on a machine with a non-Unicode LANG or LC_CTYPE setting; a locale like fr_FR.ISO8859-1 would cause this problem. Perhaps your user profile or terminal was configured to use a non-Unicode codepage.
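A quick way to check what your R session is actually running under (the values shown in the comments are what a healthy UTF-8 setup looks like, not necessarily yours):

```r
Sys.getlocale("LC_CTYPE")                      # e.g. "en_US.UTF-8"; anything non-UTF-8 is suspect
Sys.getenv(c("LANG", "LC_ALL", "LC_CTYPE"))    # locale environment variables seen by R
l10n_info()                                    # $`UTF-8` should be TRUE on a UTF-8 setup
```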

Or whatever displays the data doesn't use Unicode. Or the data was exported to a file, then read as Latin1 instead of UTF8.

Perhaps the data was saved as CSV without a BOM, and whatever read the file treated it as Latin1, causing the mangling. The other application should read that file as UTF8. Or you should include a BOM at the start of the file. Sure, UTF8 data shouldn't carry a BOM, but short of reading the whole file and checking for invalid byte sequences, there's no other way to tell whether a file is UTF8.
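If the CSV route is the problem, one sketch in base R is to write through a UTF-8 connection and prepend the BOM by hand (the file name here is made up):

```r
df <- data.frame(Company_Name = "L'OR\u00c9AL PARIS")

con <- file("companies.csv", open = "w", encoding = "UTF-8")
writeLines("\ufeff", con, sep = "")    # BOM: U+FEFF, stored on disk as EF BB BF
write.csv(df, con, row.names = FALSE)  # write.csv accepts an open connection
close(con)
```

If you prefer a package, `readr::write_excel_csv()` is documented to write UTF-8 with a BOM for the same reason.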

Just a note: your function can be shortened to df[] = lapply(df, \(x) if (is.character(x)) stri_enc_toutf8(x) else x). Also, do not forget to assign! If you use make_utf8() as is, you need to do X = make_utf8(df=X) to overwrite X. That said, I think we should focus on the import process; please show the code you use to write to the database (as well as a snippet of the data).
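Spelled out, the shortened version plus the assignment looks like this (stringi assumed; the example data frame is made up, in your case it's one of the data frames in your list):

```r
library(stringi)

make_utf8 <- function(df) {
  # \(x) is the R >= 4.1 lambda shorthand for function(x)
  df[] <- lapply(df, \(x) if (is.character(x)) stri_enc_toutf8(x) else x)
  df
}

X <- data.frame(Company_Name = "L'OR\u00c9AL PARIS")
X <- make_utf8(X)  # without this assignment the conversion is silently discarded
```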
