Possible Duplicate:
Excel to CSV with UTF8 encoding

Scenario: I have an Excel file containing a large amount of global customer data. I do not know what encoding was used when the file was created.

Question: How can I determine the character encoding used in the Excel file so that I can import it correctly into another piece of software?

  • I guess that your problem is discussed and answered in superuser.com/questions/280603/… Commented Nov 5, 2012 at 15:42
  • @JüriRuut Not really, this question is the other way around. And I'd like a canonical answer on this as well, so +1 to the question. Commented Nov 5, 2012 at 15:55
  • @deceze: then it would be "export data from Excel"? Commented Nov 5, 2012 at 15:56
  • @JüriRuut I'm assuming he means "reading an .xls file using some library in some programming language". Then it all makes sense... Sam, correct this assumption if I'm wrong. Commented Nov 5, 2012 at 15:59
  • @deceze - you are spot-on! In order to import the file correctly I first need to know how it was originally encoded. If you import it and just assume a certain character set was used, you could end up with bad data: certain characters being lost or unintentionally replaced with other characters. Commented Nov 5, 2012 at 16:52

1 Answer

For Excel 2010 (the .xlsx format) it should be UTF-8. From Microsoft's documentation at http://msdn.microsoft.com/en-us/library/bb507946:

"The basic document structure of a SpreadsheetML document consists of the Sheets and Sheet elements, which reference the worksheets in the Workbook. A separate XML file is created for each Worksheet. For example, the SpreadsheetML for a workbook that has two worksheets named MySheet1 and MySheet2 is located in the Workbook.xml file and is shown in the following code example.

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?> 
<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
    <sheets>
        <sheet name="MySheet1" sheetId="1" r:id="rId1" /> 
        <sheet name="MySheet2" sheetId="2" r:id="rId2" /> 
    </sheets>
</workbook>

The worksheet XML files contain one or more block level elements such as SheetData. sheetData represents the cell table and contains one or more Row elements. A row contains one or more Cell elements. Each cell contains a CellValue element that represents the value of the cell. For example, the SpreadsheetML for the first worksheet in a workbook, that only has the value 100 in cell A1, is located in the Sheet1.xml file and is shown in the following code example.

<?xml version="1.0" encoding="UTF-8" ?> 
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
    <sheetData>
        <row r="1">
            <c r="A1">
                <v>100</v> 
            </c>
        </row>
    </sheetData>
</worksheet>

"
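The XML parts quoted above live inside the .xlsx file itself, which is a ZIP package, so their encoding declarations can be read directly. The following is a minimal sketch, not a definitive method: the function name `xml_part_encodings` is made up for illustration, and it only reports what each part declares. It does not apply to the legacy binary .xls format, which is not a ZIP archive.

```python
import re
import zipfile

# Matches the encoding pseudo-attribute of an XML declaration in raw bytes.
_DECLARATION = re.compile(rb'<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']')


def xml_part_encodings(xlsx):
    """Map each XML part of an .xlsx package to its declared encoding.

    `xlsx` may be a path or a binary file-like object (hypothetical helper).
    Parts without a declaration map to None; per the XML specification such
    parts are UTF-8 unless they start with a UTF-16 byte order mark.
    """
    if not zipfile.is_zipfile(xlsx):
        raise ValueError("not a ZIP package; legacy .xls files are not supported")

    encodings = {}
    with zipfile.ZipFile(xlsx) as package:
        for name in package.namelist():
            if not name.endswith((".xml", ".rels")):
                continue  # skip images and other binary parts
            with package.open(name) as part:
                head = part.read(200)  # the declaration sits at the very start
            if head[:2] in (b"\xff\xfe", b"\xfe\xff"):
                encodings[name] = "UTF-16"  # byte order mark present
                continue
            match = _DECLARATION.search(head)
            encodings[name] = match.group(1).decode("ascii") if match else None
    return encodings
```

Calling `xml_part_encodings("customers.xlsx")` (a placeholder file name) returns a dictionary keyed by part name, such as `xl/workbook.xml`. Renaming a copy of the file to .zip and opening the parts in a text editor shows the same declarations.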

Detection of cell encodings:

https://metacpan.org/pod/Spreadsheet::ParseExcel::Cell

http://forums.asp.net/t/1608228.aspx/1

4 Comments

how are you supposed to find these XML files for a given Excel file?
I am wondering if this is still an accurate way to determine the character encoding of an Excel sheet then, because I have a sheet containing international characters that are only supported by UTF-16, but the XML clearly labels it as encoding="UTF-8". Is this encoding referring to something besides the text contained in the sheet?
@user5359531 "I have a sheet containing international characters that are only supported by UTF-16" - If I understand correctly, UTF-8, UTF-16, and UTF-32 all support every Unicode character; they just encode each one differently. (UTF-8 uses 1, 2, 3, or 4 bytes per character, UTF-16 uses 2 or 4 bytes, and UTF-32 always uses 4 bytes.)
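
The byte counts in the comment above can be checked with a short sketch. The helper name `encoded_lengths` and the sample characters are chosen here purely for illustration.

```python
def encoded_lengths(char):
    """Return the byte length of one character in UTF-8, UTF-16, and UTF-32.

    The "-le" codec variants are used so that no byte order mark is
    prepended, which would otherwise inflate the counts.
    """
    return (
        len(char.encode("utf-8")),
        len(char.encode("utf-16-le")),
        len(char.encode("utf-32-le")),
    )


samples = [
    "A",           # U+0041, basic Latin letter
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001f600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

for char in samples:
    utf8, utf16, utf32 = encoded_lengths(char)
    print(f"U+{ord(char):04X}: UTF-8={utf8} UTF-16={utf16} UTF-32={utf32}")
```

Every sample encodes in all three forms; only the number of bytes differs.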
