
I am writing some software that takes rows from an XLS file and inserts them into a database.

In OpenOffice, a cell looks like this:

Brunner Straße, Parzelle

I am using the ExcelFormat library from CodeProject.

int type = cell->Type();
cout << "Cell contains " << type << endl;
const char* cellCharPtr = cell->GetString();
if (cellCharPtr != 0) {
  value.assign(cellCharPtr);
  cout << "normal string -> " << value << endl;
}

When fetched with the library, the string is returned as a char* (cell->Type() returns STRING, not WSTRING), and on the console it looks like this:

normal string -> Brunner Stra�e, Parzelle
hex string -> 42 72 75 6e 6e 65 72 20 53 74 72 61 ffffffdf 65 2c 20 50 61 72 7a 65 6c 6c 65 

I insert it into the database using the MySQL C++ connector like so:

prep_stmt = con -> prepareStatement ("INSERT INTO "
                  + tablename 
                  + "(crdate, jobid, imprownum, impid, impname, imppostcode, impcity, impstreet, imprest, imperror, imperrorstate)"
                  + " VALUES(?,?,?,?,?,?,?,?,?,?,?)");

<...snip...>

prep_stmt->setString(8,vals["street"]);

<...snip...>

prep_stmt->execute();

After insertion into the database, which uses the utf8_general_ci collation, it looks like this:

Brunner Stra

which is annoying.

How do I make sure that whatever locale the file is in gets transformed to utf-8 when the string is retrieved from the xls file?

This is going to run as a backend for a web service where clients can upload their own Excel files, so "change the encoding of the file in LibreOffice" won't work, I'm afraid.

  • Would you please print the hex value of the byte array of the string? Commented Jan 23, 2013 at 9:54
  • Updated question with hex value. Commented Jan 23, 2013 at 10:07
  • ffffffdf obviously is not ASCII, and it's not UTF-8 either. I'd bet on Latin-1, but sign-extended. Commented Jan 23, 2013 at 10:26
  • Could you also include the code that inserts the string into the DB? The hex value looks like iso-8859-1, but the utf8_general_ci collation seems to be improperly truncated by \0s. Commented Jan 23, 2013 at 10:28
  • Ok I've added the db insertion code Commented Jan 23, 2013 at 10:42
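
The ffffffdf in the hex dump is consistent with a signed char being widened to int before printing. A minimal sketch of a dump that avoids the sign extension by casting each byte to unsigned char first; to_hex is a hypothetical helper name, not part of ExcelFormat:

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: format each byte of s as two lowercase hex digits,
// separated by single spaces.
std::string to_hex(const std::string& s) {
    std::ostringstream os;
    os << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i != 0) os << ' ';
        // Cast to unsigned char before widening so that bytes >= 0x80
        // are not sign-extended to values like ffffffdf.
        os << std::setw(2)
           << static_cast<unsigned>(static_cast<unsigned char>(s[i]));
    }
    return os.str();
}
```

With this, the byte for ß should print as df rather than ffffffdf.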

1 Answer


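Your input seems to be encoded in Latin-1 (the byte 0xdf is ß in that encoding), so you need to set the MySQL connection charset to latin1. A lone 0xdf byte is not valid UTF-8, which would explain why the stored value stops at "Stra" if the connection currently treats the data as UTF-8.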

I'm not familiar with the API you are using to connect to MySQL. In other APIs you'd add charset=latin1 to the connection URL or call an API function to set the connection encoding.

Alternatively, you can recode the input to UTF-8 before feeding it to MySQL.
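
A minimal sketch of that recoding step, assuming the bytes really are ISO-8859-1, where every byte value equals its Unicode code point. latin1_to_utf8 is a hypothetical helper, not part of ExcelFormat or the MySQL connector:

```cpp
#include <string>

// Hypothetical helper: convert ISO-8859-1 (Latin-1) bytes to UTF-8.
// Assumes the input is Latin-1; if it is actually Windows-1252, the
// bytes 0x80-0x9F would need a lookup table instead.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte becomes two
    for (char ch : in) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            // ASCII is encoded identically in UTF-8.
            out.push_back(static_cast<char>(c));
        } else {
            // Code points U+0080..U+00FF take two bytes: 110xxxxx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

In the question's code this could be applied to the value before binding it, along the lines of prep_stmt->setString(8, latin1_to_utf8(vals["street"])). Files from other locales would need a different conversion, so the source encoding probably has to be configurable or detected.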


1 Comment

I recoded the input based on a configuration parameter, so I'll accept this.
