UTF8 string to byte[] with each character as single byte

Question

I would like to take input from the user as a UTF-8 string, detect the language of the string, and store the string as a compressed byte[]. If the characters are not all from the same language, the input is not valid. Once I have valid input, I would like to store it as a byte array.

If the user enters a string with non-English characters, each character occupies more than one byte in UTF-8. So I would like to store the language of the string once, and then store each character in a single byte. I guess this is possible by storing only each character's offset from the start code point of that language: since all characters are from the same language, the range may be small enough to fit in a single byte. This is how I intend to compress each character into a single byte.

Is this a correct approach? If so, how can I detect the language of the characters in the string?

We can't detect or determine the character encoding of a byte array; we have to know it or guess. It looks to me as if you are mixing up language and character encoding, so I'm a little confused. (A character does not have a language.)
– Andreas Dolk, Commented Aug 11, 2012 at 14:27
For example, these characters (どうしようま) are Japanese. I would store the start code point for that language and then compress the byte[] by storing each character's offset from that start code point, instead of the entire code point, which wouldn't fit in a single byte.
– Rajat Gupta, Commented Aug 11, 2012 at 14:33
I'm converting a UTF-8 string to byte[], and I guess that by looking at each character of that string I could tell the language from the code point range of that character. (I don't need to determine the character encoding of the byte[], as I know it was converted from a UTF-8 string. To get the string back from the byte[], I would first uncompress it using the language recorded during compression and then restore the UTF-8 string from the uncompressed byte[].)
– Rajat Gupta, Commented Aug 11, 2012 at 14:44

Bobulous · Accepted Answer · 2012-08-11 15:36:00Z

Take a look at the Character.UnicodeBlock class, which provides the static methods of(char) and of(int) to detect the Unicode block of a character. This will tell you whether a character is, for example, from the ARABIC block or from the BASIC_LATIN block.
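As a rough sketch of how that lookup might be used, the following checks whether every code point in a string falls in the same Unicode block. The class name BlockCheck and the method singleBlock are made up for illustration; only Character.UnicodeBlock.of(int) comes from the JDK. The first test string is the Hiragana text from the question's comments (どうしようま), written with Unicode escapes; the second mixes kanji and Hiragana (日本語です).

```java
import java.lang.Character.UnicodeBlock;
import java.util.Optional;

public class BlockCheck {

    /**
     * Hypothetical helper: returns the one Unicode block shared by every
     * code point in the text, or an empty Optional if the text is empty,
     * mixes blocks, or contains a code point outside any defined block.
     */
    public static Optional<UnicodeBlock> singleBlock(String text) {
        UnicodeBlock first = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);          // handles surrogate pairs
            UnicodeBlock block = UnicodeBlock.of(cp);
            if (block == null) {
                return Optional.empty();           // not in any defined block
            }
            if (first == null) {
                first = block;
            } else if (first != block) {
                return Optional.empty();           // two different blocks seen
            }
            i += Character.charCount(cp);
        }
        return Optional.ofNullable(first);
    }

    public static void main(String[] args) {
        // Hiragana-only string from the question's comments
        System.out.println(singleBlock("\u3069\u3046\u3057\u3088\u3046\u307e"));
        // Kanji followed by Hiragana: ordinary Japanese, but two blocks
        System.out.println(singleBlock("\u65e5\u672c\u8a9e\u3067\u3059"));
    }
}
```

Note that this identifies a block, not a language: the second string is perfectly ordinary Japanese, yet it is rejected because it spans two blocks.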

However, notice that there are several *LATIN* blocks, and many languages need to use characters from several blocks. So working out what language is being provided to you is going to be very hard work. I can think of no way to automatically detect this.

Also bear in mind that many Unicode blocks are enormous, and there's no way that you'll be able to fit all valid characters from a single language into just one byte. (Take a look at the Unicode 6.1 Character Code Charts to appreciate just how vast Unicode is.) So, honestly, you are not going to be able to compress every character into a single byte.
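To get a feel for the sizes involved, here is a small sketch that counts how many code points Java assigns to a given block by scanning the whole code point range. BlockSize, blockSize and fitsInOneByte are illustrative names, not JDK API. A block can be indexed with a one-byte offset only if it spans at most 256 code points.

```java
import java.lang.Character.UnicodeBlock;

public class BlockSize {

    /** Counts the code points that Character.UnicodeBlock.of maps to the given block. */
    public static int blockSize(UnicodeBlock target) {
        int count = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (UnicodeBlock.of(cp) == target) {
                count++;
            }
        }
        return count;
    }

    /** A one-byte offset can address at most 256 distinct code points. */
    public static boolean fitsInOneByte(UnicodeBlock block) {
        return blockSize(block) <= 256;
    }

    public static void main(String[] args) {
        UnicodeBlock[] samples = {
            UnicodeBlock.BASIC_LATIN,
            UnicodeBlock.HIRAGANA,
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
        };
        for (UnicodeBlock block : samples) {
            System.out.println(block + ": " + blockSize(block)
                    + " code points, fits in one byte: " + fitsInOneByte(block));
        }
    }
}
```

Small blocks such as HIRAGANA pass the check, but CJK_UNIFIED_IDEOGRAPHS, which everyday Japanese and Chinese text depends on, does not.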

UTF-8 is the result of years of internationalization standards, and it's probably the best option for any software which needs to represent multiple languages. Trying to produce something more efficient will probably cost you a huge amount of time, and result in only small gains.
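For comparison, plain UTF-8 storage is only a couple of lines in Java. This is a minimal sketch with made-up names (Utf8RoundTrip, encode, decode); the JDK calls used are String.getBytes(Charset) and the String(byte[], Charset) constructor. The non-ASCII sample is again the Hiragana string from the question's comments, written with Unicode escapes.

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {

    /** Encodes the text as UTF-8 bytes. */
    public static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes UTF-8 bytes back into a String. */
    public static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String ascii = "hello";
        String hiragana = "\u3069\u3046\u3057\u3088\u3046\u307e";

        // ASCII characters take one byte each in UTF-8
        System.out.println(ascii.length() + " chars -> " + encode(ascii).length + " bytes");
        // Each Hiragana character takes three bytes in UTF-8
        System.out.println(hiragana.length() + " chars -> " + encode(hiragana).length + " bytes");
        // Decoding restores the original string without any language tag
        System.out.println(decode(encode(hiragana)).equals(hiragana));
    }
}
```

Because the charset is stated explicitly on both sides, the round trip is lossless and needs no per-language bookkeeping.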

1 Comment

Stuart Marks Over a year ago

+1. I'd like to emphasize that the OP's assumption that the characters of a particular language fit within a byte is incorrect.
