I have a list of two data frames in R.
Each data frame contains several character and numeric columns. One of the columns holds company names (for example, Company_Name).
The target database only supports UTF-8 encoding.
When I upload the tables, some company names with special characters get corrupted. For example, a value like:
L′ORÉAL PARIS
turns into something like:
L′OR@AL PARIS
Similar distortions happen for other names with accents or special characters.
Before writing to the database, I try to convert all character columns in R to UTF-8:
```r
library(stringi)

# Convert every character column of a data frame to UTF-8
make_utf8 <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.character(col)) stri_enc_toutf8(col) else col
  })
  df
}
```
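Applied to the whole list, it looks roughly like this (`tables` is a placeholder name for my list of two data frames):

```r
# 'tables' stands in for the actual list of two data frames
tables <- lapply(tables, make_utf8)
```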
After doing that, I check the encodings of the columns. However, I still see a mix of reported encodings such as UTF-8, ASCII, and unknown for different character columns.
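The check is roughly the following sketch, using `Encoding()` from base R and `stri_enc_mark()` from stringi (the sample data frame here is made up for illustration):

```r
library(stringi)

# Toy example: one pure-ASCII value, one value with an accented character
df <- data.frame(
  name = c("ACME CORP", "CAF\u00c9 X"),
  stringsAsFactors = FALSE
)

# Declared encoding per string: "unknown", "UTF-8", "latin1", or "bytes".
# Pure-ASCII strings are typically reported as "unknown".
Encoding(df$name)

# stringi's detected encoding mark: "ASCII", "UTF-8", "latin1", or "native"
stri_enc_mark(df$name)
```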
I know that ASCII is technically a subset of UTF-8, but even after these conversions and checks, the database issue remains: company names still get corrupted once they are loaded into the database.
My questions are:

1. Is there a way in R to reliably force all character columns in these data frames to be valid UTF-8 strings, so that I don't end up with mixed or unknown encodings?
2. Is it normal that R still reports some character columns as ASCII or unknown even when the strings should be valid UTF-8?
3. What is the recommended way to prepare text data in R for uploading to a UTF-8-only database, to avoid this kind of corruption?