On the webapp server, when I encode "médicaux_Jérôme.txt" using java.net.URLEncoder, I get the following string:

me%CC%81dicaux_Je%CC%81ro%CC%82me.txt

On my backend server, encoding the same string gives:

m%C3%A9dicaux_J%C3%A9r%C3%B4me.txt

Can someone help me understand why the same input produces different output? Also, how can I get a standardized output each time I encode the same string?

1 Answer

If you don't specify the character encoding, the outcome depends on the platform's default.

See the java.net.URLEncoder javadocs:

encode(String s)

Deprecated

The resulting string may vary depending on the platform's default encoding. Instead, use the encode(String,String) method to specify the encoding.

So, use the suggested method and specify the encoding:

String urlEncodedString = URLEncoder.encode(stringToBeUrlEncoded, "UTF-8");

About different representations for the same string, if you specified "UTF-8":

The two URL-encoded strings in the question are encoded differently, but they decode to the same visible text (the two results are canonically equivalent in Unicode, although their code points differ), so there is nothing inherently wrong there. Pasting both into a URL decoding tool shows the same file name.

This happens because the same text can be URL-encoded in more than one way, especially when it contains accented characters: Unicode can represent an accented letter either as a single precomposed character or as a base letter followed by a combining accent, which is precisely what happens in your case.

In your case, the first string encodes é as e followed by U+0301 (latin small letter e + combining acute accent), resulting in e%CC%81. The second encodes é as the single character U+00E9 (latin small letter e with acute), resulting in %C3%A9; there are two % escapes because the character takes two bytes in UTF-8.
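To make the difference concrete, here is a small sketch that decodes both strings from the question and compares the results. The class name DecodeComparison and the helper decode are made up for this illustration; only URLDecoder and Normalizer come from the JDK.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.text.Normalizer;

public class DecodeComparison {
    // Hypothetical helper: decode a percent-encoded string as UTF-8.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String fromWebapp = decode("me%CC%81dicaux_Je%CC%81ro%CC%82me.txt");
        String fromBackend = decode("m%C3%A9dicaux_J%C3%A9r%C3%B4me.txt");

        // The decomposed form carries three extra combining marks.
        System.out.println(fromWebapp.length());           // 22
        System.out.println(fromBackend.length());          // 19
        System.out.println(fromWebapp.equals(fromBackend)); // false

        // After composing (NFC), the two strings are identical.
        String composed = Normalizer.normalize(fromWebapp, Normalizer.Form.NFC);
        System.out.println(composed.equals(fromBackend));   // true
    }
}
```

So the two strings render the same but are not equal for String.equals until one of them is normalized.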

Again, there is no problem with either representation. They correspond to two Unicode normalization forms: the first is decomposed (NFD), the second composed (NFC). Mac OS X is known to produce the decomposed form, notably in file names; in the end, it depends on whatever produced the string. In your case, the two servers may run different JREs or, if the file name was user generated, the user may have used a different OS (or tool) that produced that form.
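If you need the same encoded output regardless of where the string came from, one option is to normalize it to a single form before encoding. The following is a minimal sketch, assuming NFC is the form you want; the class NormalizedUrlEncoder and the method encodeNfc are hypothetical names, not part of any library.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.text.Normalizer;

public class NormalizedUrlEncoder {
    // Hypothetical helper: compose the input (NFC), then URL-encode it as UTF-8.
    static String encodeNfc(String raw) {
        String composed = Normalizer.normalize(raw, Normalizer.Form.NFC);
        try {
            return URLEncoder.encode(composed, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same file name, written with combining accents and with precomposed letters.
        String decomposed = "me\u0301dicaux_Je\u0301ro\u0302me.txt";
        String precomposed = "m\u00e9dicaux_J\u00e9r\u00f4me.txt";

        System.out.println(encodeNfc(decomposed));  // m%C3%A9dicaux_J%C3%A9r%C3%B4me.txt
        System.out.println(encodeNfc(precomposed)); // m%C3%A9dicaux_J%C3%A9r%C3%B4me.txt
    }
}
```

Passing Normalizer.Form.NFD instead would standardize on the e%CC%81 style; what matters is that both servers apply the same form.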

5 Comments

Thanks for your answer, but I have already specified UTF-8 encoding at both places!
Their VMs must be using different implementations, then. You see, the two URL-encoded strings you gave, although encoded differently, represent the same unencoded text, so there is nothing inherently wrong there. Try pasting both into an online URL decoder. You'll see they are the same.
I understand that and totally agree with what you are saying, but I am unable to understand what may be causing the issue. Is there any way I can find that out?
Can you tell the VMs used in both environments?
By VM I mean JRE: are they the same version on both OSs? Are there any differences between the OSs (French locale vs English locale)? What produced the original (before encoding) strings? Were they in a database? A file? User input? If a user originated them, do you know the user's OS?
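One way to narrow this down is to log, on each server, which normalization form the string is in just before it is encoded. A rough sketch follows; NormalizationProbe and describe are hypothetical names chosen for this example, and only the JDK's Normalizer is used.

```java
import java.text.Normalizer;

public class NormalizationProbe {
    // Hypothetical helper: report whether s is already in NFC / NFD and list its code points.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        sb.append("NFC=").append(Normalizer.isNormalized(s, Normalizer.Form.NFC));
        sb.append(" NFD=").append(Normalizer.isNormalized(s, Normalizer.Form.NFD));
        sb.append(" codePoints=");
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(describe("e\u0301")); // NFC=false NFD=true codePoints=U+0065 U+0301
        System.out.println(describe("\u00e9"));  // NFC=true NFD=false codePoints=U+00E9
    }
}
```

If the two servers log different code points for the same file name, the strings already differed before URLEncoder was called, and the place to look is wherever each server reads the name from.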
