
I have a scenario in which some special characters are stored in a database (Sybase) in the system's default encoding, and I have to fetch this data and send it to a third party in UTF-8 encoding using a Java program.

There is a precondition that the data sent to the third party must not exceed a defined maximum size. Since, upon conversion to UTF-8, a single character may be encoded as 2 or 3 bytes, my logic is that after getting the data from the database I must encode it as UTF-8 and then split the result. The following are my observations:
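One way to split on a byte budget without ever cutting a character in half is to drive the encoding through a `CharsetEncoder` with a fixed-size output buffer; the encoder stops at a character boundary when the buffer fills. This is a minimal sketch of that idea (the class and method names are mine, not from the question); it assumes `maxBytes` is at least 4, the widest UTF-8 character, so each pass makes progress:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class Utf8Splitter {
    // Split s into pieces whose UTF-8 encoding is at most maxBytes each.
    // The encoder never emits a partial character, so every chunk is valid UTF-8.
    // Assumes maxBytes >= 4 (the longest UTF-8 sequence) so the loop always advances.
    public static List<byte[]> split(String s, int maxBytes) {
        List<byte[]> chunks = new ArrayList<>();
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(s);
        while (in.hasRemaining()) {
            ByteBuffer out = ByteBuffer.allocate(maxBytes);
            enc.encode(in, out, true);   // stops on OVERFLOW, at a char boundary
            out.flip();
            byte[] chunk = new byte[out.remaining()];
            out.get(chunk);
            chunks.add(chunk);
        }
        return chunks;
    }
}
```

Concatenating the decoded chunks reproduces the original string, which is a cheap way to test the splitter.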

When any special character, such as a Chinese or Greek character or any character outside the ASCII range, is encountered and I convert it to UTF-8, a single character may be represented by more than one byte.
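This is easy to observe directly: ASCII characters encode to 1 byte in UTF-8, Greek letters to 2, and most Chinese characters to 3. A small check (the escapes are Ω and 中):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Widths {
    public static void main(String[] args) {
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);       // ASCII: 1 byte
        System.out.println("\u03A9".getBytes(StandardCharsets.UTF_8).length);  // Greek Omega: 2 bytes
        System.out.println("\u4E2D".getBytes(StandardCharsets.UTF_8).length);  // Chinese zhong: 3 bytes
    }
}
```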

So how can I be sure that the conversion is correct? For the conversion I am using the following:

// fetch the data from the database into a String
String s = getDataFromDatabase(); // placeholder for the actual JDBC call

// encode the whole string as a UTF-8 byte array
byte[] b = s.getBytes("UTF-8");

// create a new String, as my split logic operates on strings
String newString = new String(b, "UTF-8");

But when I output this newString to the console I get ? for the special characters.

So I have some doubts:

  • If my conversion logic is wrong, how can I correct it?
  • After converting to UTF-8, can I double-check whether the conversion is OK, i.e. whether it is the correct message to send to the third party? I assume that if the message is not human-readable after conversion then there is some problem with the conversion.
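One cheap self-check, independent of any console or font, is a round trip: encode to UTF-8 bytes, decode them back, and compare with the original string. UTF-8 can represent every well-formed Java string, so the round trip should recover it exactly; it only fails for broken input such as an unpaired surrogate, which the encoder replaces with '?'. A minimal sketch (class and method names are mine):

```java
import java.nio.charset.StandardCharsets;

public class RoundTripCheck {
    // Returns true if s survives an encode/decode round trip through UTF-8,
    // i.e. the byte array really does carry the same text.
    public static boolean roundTripsThroughUtf8(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8).equals(s);
    }
}
```

Note this only verifies the bytes are a faithful UTF-8 encoding; whether they *display* correctly depends on the viewer's charset and font, which is a separate issue.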

I would like to hear some points of view from the experts out there.

Please do let me know if any further info is needed from my side.

  • That seems to be a problem with your console rather than the conversion, which, as far as I can tell, is okay. Have you tried writing it to a text file instead of the console and opening it with a text editor? Commented Jan 17, 2011 at 19:56
  • Have you tried outputting the original string? The font used by your console may not contain these characters. Commented Jan 17, 2011 at 20:17
  • The task of converting from Unicode to UTF-8 and getting the characters to display properly is not without its issues. A contact found a solution last year. I'll ask him how he managed to get this working. Commented Jan 17, 2011 at 20:23

5 Answers


You say you're writing the Unicode to a text file, but that requires a conversion from Unicode.

But a conversion to what? That depends on how you open the file.

For example, System.out.println(myUnicodeString) will convert the Unicode to the encoding that System.out was constructed with, most likely your platform's default encoding. If you're running Windows, then this is likely to be windows-1252.
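If you want the console stream itself to emit UTF-8, you can wrap `System.out` in a `PrintStream` with an explicit charset (the `PrintStream(OutputStream, boolean, Charset)` constructor shown here requires Java 10 or later; on older versions use the `String` charset-name overload). The terminal must also be configured for UTF-8, or you will still see garbage:

```java
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class Utf8Console {
    public static void main(String[] args) {
        // Encode to UTF-8 instead of the platform default (e.g. windows-1252).
        PrintStream out = new PrintStream(System.out, true, StandardCharsets.UTF_8);
        out.println("\u4E2D\u6587 \u03A9");  // Chinese and Greek sample text
    }
}
```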

If you tell Java to use UTF-8 encoding when it writes to a file, you'll get a file containing UTF-8:

PrintWriter pw = new PrintWriter(new OutputStreamWriter(
        new FileOutputStream("filename.txt"), "UTF-8"));
pw.println(myUnicodeString);
pw.close();

Java strings are Unicode, but not all Java components support the full Unicode range, especially AWT components and lightweight Swing components. So you may have perfectly good strings but get junk in your console output.


Thanks all for your replies.

As suggested by some of you, I already tried writing it to a text file; however, in the text file I also got ? for my special characters. So I have the following observations:

a) Encoding is a two-fold process: first you change the string from one encoding to another at the byte level, and then you also need a font that contains the characters of the new character set.

b) If we are encoding a string, we are really encoding its bytes. In the current scenario, I am copying double quotes from MS Word and inserting them into a Sybase database. After fetching the data from the database and writing it to a text file, I get the same ? for the double quotes; however, if I copy the same data directly from the database into MS Word or EditPlus, I can see the actual characters, so I cannot make sense of this problem. As per my understanding, during encoding we should be concerned only with the byte values, which are the real representation, and not with the String object we construct from those byte arrays. However, if my encoded information is not human-readable, how can the other party validate and read it? (I am guessing they would be reading bytes, but if some junk character like ? has been introduced in place of a special character during UTF-8 encoding, isn't that a loss of information?)

I would really appreciate your views on these observations, and on the correct approach to follow from here.
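For what it's worth, a literal ? byte usually means the text was at some point *encoded* with a charset that cannot represent the character: `String.getBytes` replaces unmappable characters with the charset's default replacement, which is '?'. The MS Word curly quote (U+201C) is a good test case, since UTF-8 can represent it but ISO-8859-1 cannot. A small demonstration of that difference:

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    public static void main(String[] args) {
        String curly = "\u201C";  // left curly double quote, as pasted from MS Word
        // UTF-8 represents it as three bytes (E2 80 9C);
        // ISO-8859-1 cannot map it, so getBytes substitutes a single '?' (3F).
        byte[] utf8 = curly.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = curly.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length + " bytes vs " + latin1.length + " byte");
    }
}
```

Once a '?' has been substituted, the original character is gone for good, so the fix is to keep every hop (database driver, file writer, console) on a charset that can actually hold the data.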

1 Comment

The ? only tells you that the program you're using to view your text is also unhappy. You can't tell what's really going on until you see the bits. Use a hexdump tool to view some sample text.

Please use a hex editor to verify whether your output is correctly formatted UTF-8. There is no other way to tell for sure whether what you see is correct or not.
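If a hex editor isn't handy, a few lines of Java do the same job: print the raw UTF-8 bytes of a string so the encoding can be inspected independently of any font or console settings. A sketch (the helper name is mine); e.g. the curly quote U+201C should come out as `E2 80 9C`:

```java
import java.nio.charset.StandardCharsets;

public class HexDump {
    // Render the UTF-8 encoding of s as space-separated hex byte values.
    public static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X ", b & 0xFF));  // mask to unsigned
        }
        return sb.toString().trim();
    }
}
```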

And read this if you haven't already: http://www.joelonsoftware.com/articles/Unicode.html


Use this for a proper conversion; this one is from ISO-8859-1 to UTF-8:

public String to_utf8(String fieldvalue) throws UnsupportedEncodingException {
    String fieldvalue_utf8 = new String(fieldvalue.getBytes("ISO-8859-1"), "UTF-8");
    return fieldvalue_utf8;
}
