UTF-16 Encoding in Java versus C#

Question

I am trying to encode a string as UTF-16 and compute an MD5 hash of the resulting bytes. Strangely, Java and C# return different results for the same input.

The following is the piece of code in Java:

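import java.security.MessageDigest;

public class Main {
    public static void main(String[] args) {
        String str = "preparar mantecado con coca cola";
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            // Encode with the generic "UTF-16" charset name, then hash the bytes
            digest.update(str.getBytes("UTF-16"));
            byte[] hash = digest.digest();
            String output = "";
            for (byte b : hash) {
                // Adding 0x100 and dropping the first digit zero-pads each byte to two hex digits
                output += Integer.toString((b & 0xff) + 0x100, 16).substring(1);
            }
            System.out.println(output);
        } catch (Exception e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
        }
    }
}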

The output for this is: 249ece65145dca34ed310445758e5504

The following is the piece of code in C#:

public static string GetMD5Hash()
{
    string input = "preparar mantecado con coca cola";
    System.Security.Cryptography.MD5CryptoServiceProvider x = new System.Security.Cryptography.MD5CryptoServiceProvider();
    // Encoding.Unicode is UTF-16 in little-endian byte order
    byte[] bs = System.Text.Encoding.Unicode.GetBytes(input);
    bs = x.ComputeHash(bs);
    System.Text.StringBuilder s = new System.Text.StringBuilder();
    foreach (byte b in bs)
    {
        // "x2" already produces two lowercase hex digits
        s.Append(b.ToString("x2"));
    }
    string output = s.ToString();
    Console.WriteLine(output);
    return output;
}

The output for this is: c04d0f518ba2555977fa1ed7f93ae2b3

I am not sure why the outputs differ. How can I change the code above so that both versions return the same output?

Compare your byte arrays first. If they differ in even a single bit, the hashes will be completely different. There may be a BOM in the UTF-16 encoding, or the byte order may be little-endian in one case and big-endian in the other.
– maaartinus, Commented Jan 25, 2011 at 12:32
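
The byte comparison suggested in this comment can be sketched in Java as follows. This is a minimal illustration, not code from the thread: the class name Utf16Dump, the helper toHex, and the short sample string "pr" are assumptions chosen to keep the dump readable.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Dump {
    // Hypothetical helper: formats a byte array as space-separated lowercase hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "pr"; // assumed sample; any string works
        // Generic UTF-16: big-endian with a leading byte-order mark (fe ff)
        System.out.println("UTF-16   : " + toHex(s.getBytes(StandardCharsets.UTF_16)));
        // Explicit big-endian: no byte-order mark
        System.out.println("UTF-16BE : " + toHex(s.getBytes(StandardCharsets.UTF_16BE)));
        // Explicit little-endian: no byte-order mark, bytes of each code unit swapped
        System.out.println("UTF-16LE : " + toHex(s.getBytes(StandardCharsets.UTF_16LE)));
    }
}
```

For "pr" the three lines show fe ff 00 70 00 72, then 00 70 00 72, then 70 00 72 00, so the same text yields three different byte sequences and therefore three different digests.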

Nordic Mainframe · Accepted Answer · 2011-01-25 12:31:56Z

35

UTF-16 != UTF-16.

In Java, getBytes("UTF-16") returns a big-endian representation prefixed with a byte-order mark (BOM). C#'s System.Text.Encoding.Unicode.GetBytes returns a little-endian representation without a BOM. I can't check your code from here, but I think you'll need to specify the conversion precisely.

Try getBytes("UTF-16LE") in the Java version.
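
A minimal sketch of that change, assuming Java 7 or later for StandardCharsets. The class name Md5Utf16Le and the helper md5Hex are illustrative names, not part of the original code. If the byte-order explanation above is right, the printed digest should equal the C# value quoted in the question, but that has not been verified here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Utf16Le {
    // Hypothetical helper: MD5 digest of the given bytes as 32 lowercase hex digits.
    static String md5Hex(byte[] data) {
        try {
            byte[] hash = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5, so this is unexpected
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String str = "preparar mantecado con coca cola";
        // Name the byte order explicitly: little-endian, no byte-order mark
        System.out.println(md5Hex(str.getBytes(StandardCharsets.UTF_16LE)));
    }
}
```

Using the StandardCharsets constant instead of the string "UTF-16LE" also avoids the checked UnsupportedEncodingException that getBytes(String) declares.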


2 Comments

debracey Over a year ago

It's worth noting that if you look at the output in eclipse, it still doesn't match what Visual Studio shows you. But strangely it does work...

Jacek Cz Over a year ago

In 2015 I tested this with Java 8 and .NET 4.0.x using Polish-language text, and it seems to work as you wrote: the bytes are identical in both languages and have no BOM prefix. Another thing worth testing: Java integer arithmetic overflows silently (convenient for hashing), whereas C# throws on overflow in a checked context.

Mark McKenna · Answer · 2011-01-25 12:33:39Z

5

The first thing I can find, and this might not be the only problem, is that C#'s Encoding.Unicode.GetBytes() is little-endian, while Java's natural byte order is big-endian.


Neonamu · Answer · 2011-01-25 12:36:45Z

0

You could use System.Text.Encoding.Unicode.GetString(byte[]) to convert the bytes back to a string. That way you can be sure that everything happens in the Unicode encoding.
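
The same round-trip check can be sketched on the Java side. This is an illustration under assumptions: the class name Utf16RoundTrip, the helper decode, and the sample string "pr" do not come from the thread. It decodes little-endian bytes once with the matching charset and once with the wrong byte order.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16RoundTrip {
    // Hypothetical helper: decodes bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "pr"; // assumed sample
        byte[] littleEndian = original.getBytes(StandardCharsets.UTF_16LE);

        // Decoding with the charset used for encoding restores the text
        String same = decode(littleEndian, StandardCharsets.UTF_16LE);
        System.out.println(same.equals(original)); // prints "true"

        // Decoding with the opposite byte order swaps each code unit's bytes,
        // so 'p' (U+0070) is read as U+7000 and the text no longer matches
        String swapped = decode(littleEndian, StandardCharsets.UTF_16BE);
        System.out.println(swapped.equals(original)); // prints "false"
    }
}
```

A round trip that restores the original string confirms that both sides agree on the byte order. It does not by itself show which byte order was used, so comparing the raw bytes remains the more direct check.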
