Proper encoding for fixed-length storage of Unicode strings?

Question

I'm going to be working on software (in C#) that needs to read/write Unicode strings (specifically English, German, Spanish and Arabic) to a hardware device. The firmware developer tells me that his code expects to store each string as a fixed-length byte array in one binary file so he can quickly access any string using an index (index * length = starting offset, then read the fixed-length number of bytes). I understand that .NET internally uses UTF-16, which I believe is technically a variable-length encoding (depending on the value of the Unicode code point). I'm fairly certain that English, German and Spanish would all use two bytes per character when encoded using UTF-16, but I'm not so sure about Arabic. It looks like there might be some Arabic characters that could require three bytes each in UTF-16, and that would seem to break the firmware developer's plan to store the strings as a fixed length.

First, can anyone confirm my understanding of the variable-length nature of UTF-8/UTF-16 encodings? And second, although it would waste a lot of space, is UTF-32 (fixed-size, each character represented using 4 bytes) the best option for ensuring that each string could be stored as a fixed length? Thanks!

I'm not sure I see the problem. So long as your encoded strings don't exceed the fixed field length, what problem is introduced by using a variable-length encoding? You have to agree on some way to mark the boundary between the end of the string and any remaining unused bytes, but that problem applies to using a fixed-length encoding as well.
– anton.burger, Commented Dec 5, 2012 at 16:51
Also, UTF-16 could well be fine. According to Wikipedia, most of Arabic fits in the Basic Multilingual Plane, meaning you would use one 16-bit code unit per code point most of the time. Failing that, you use 2 code units (for a total of 4 bytes, but never 3 in UTF-16). Probably best to know exactly which ranges you need to represent though. en.wikipedia.org/wiki/Arabic_script_in_Unicode
– anton.burger, Commented Dec 5, 2012 at 16:55
There is no such thing as fixed length in Unicode. See "length" in utf8everywhere.org
– Pavel Radzivilovsky, Commented Dec 6, 2012 at 16:49

McDowell · Accepted Answer · 2012-12-05 16:36:59Z

Unicode terminology:

Each entry in the Unicode character set is a code point
Encoded code points consist of one or more code units in a transformation format (UTF-8 uses 8-bit code units; UTF-16 uses 16-bit code units)
The user-visible grapheme might consist of a sequence of code points
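To make the three terms concrete, here is a small Python sketch (Python because its `len` counts code points directly; the helper name `utf16_code_units` is made up for this illustration). It shows one grapheme built from two code points, and one code point that needs two UTF-16 code units.

```python
import unicodedata

def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
clef = "\U0001D11E"      # MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane

# Same user-visible grapheme, different number of code points.
print(len(composed), len(decomposed))        # 1 2

# One code point, but two UTF-16 code units (a surrogate pair).
print(len(clef), utf16_code_units(clef))     # 1 2

# NFC normalization recombines the two code points into one.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Normalizing to NFC before storing may save space for some text, but it is not guaranteed to reduce every grapheme to a single code point.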

So:

A code point in UTF-8 is 1, 2, 3 or 4 octets wide
A code point in UTF-16 is 2 or 4 octets wide
A code point in UTF-32 is 4 octets wide
The number of graphemes rendered on the screen might be fewer than the number of code points
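The widths listed above can be checked directly. The following Python sketch encodes one sample character per language from the question, plus two extra characters chosen here to show the 3-octet and 4-octet cases; the function name `octet_widths` and the sample set are illustrative assumptions, not part of the original answer.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")  # explicit byte order, so no BOM is emitted

def octet_widths(ch: str) -> dict:
    """Map each encoding name to the number of octets used for one code point."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}

samples = [
    ("English a, U+0061", "a"),
    ("German sharp s, U+00DF", "\u00df"),
    ("Spanish n-tilde, U+00F1", "\u00f1"),
    ("Arabic beh, U+0628", "\u0628"),
    ("Euro sign, U+20AC", "\u20ac"),
    ("Emoji, U+1F600 (non-BMP)", "\U0001F600"),
]

for label, ch in samples:
    w = octet_widths(ch)
    print(f"{label:28} UTF-8={w['utf-8']} UTF-16={w['utf-16-le']} UTF-32={w['utf-32-le']}")
```

For these samples, every character from the four languages takes 2 octets in UTF-16, and only the non-BMP emoji takes 4. No code point takes 3 octets in UTF-16.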

So, if you want to support the entire Unicode range, you need to make the fixed-length strings a multiple of 32 bits regardless of which of these UTFs you choose as the encoding. (I'm assuming unused bytes will be set to 0x0, appended as padding on write and trimmed on read.)
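A minimal sketch of the zero-padded, fixed-length layout described above, written in Python for brevity. The field width `FIELD_SIZE`, the choice of UTF-16 little-endian, and the helper names are all assumptions made for illustration; the real values have to be agreed with the firmware developer. In C#, `Encoding.Unicode.GetBytes` and `Encoding.Unicode.GetString` would play the role of `encode` and `decode` here.

```python
import io

FIELD_SIZE = 32          # hypothetical field width in bytes (a multiple of 4)
ENCODING = "utf-16-le"   # assumed encoding and byte order

def pack_field(text: str, size: int = FIELD_SIZE, encoding: str = ENCODING) -> bytes:
    """Encode text and pad it with zero bytes to exactly `size` bytes."""
    data = text.encode(encoding)
    if len(data) > size:
        # Reject rather than truncate: cutting bytes could split a code point.
        raise ValueError(f"{len(data)} encoded bytes do not fit in {size}")
    return data.ljust(size, b"\x00")

def unpack_field(field: bytes, encoding: str = ENCODING) -> str:
    """Decode a field, then trim the padding as U+0000 characters."""
    # Trimming raw zero bytes first would corrupt UTF-16 ("a" is 61 00).
    return field.decode(encoding).rstrip("\x00")

def read_string(stream, index: int, size: int = FIELD_SIZE, encoding: str = ENCODING) -> str:
    """Random access by index: starting offset = index * size."""
    stream.seek(index * size)
    return unpack_field(stream.read(size), encoding)

# "Hello", "Gruesse" with umlaut and sharp s, "Senor" with n-tilde, Arabic "marhaba"
strings = ["Hello", "Gr\u00fc\u00dfe", "Se\u00f1or", "\u0645\u0631\u062d\u0628\u0627"]
blob = io.BytesIO(b"".join(pack_field(s) for s in strings))

print(len(blob.getvalue()))                # 128 (4 fields of 32 bytes)
print(read_string(blob, 3) == strings[3])  # True
```

The variable-length nature of the encoding does not get in the way here: only the field is fixed-length, and the limit that matters is the encoded byte count, not the number of characters. This scheme also assumes the stored strings never contain U+0000 themselves.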

In terms of communicating length restrictions via a user interface, you'll probably want to decide on some compromise based on a code unit size and the typical customer, rather than trying to find the width of the most complicated grapheme you can build.
