Why java.net.URLEncoder gives different result for same string?

Question

On the webapp server, when I encode "médicaux_Jérôme.txt" using java.net.URLEncoder, it gives the following string:

me%CC%81dicaux_Je%CC%81ro%CC%82me.txt

On my backend server, encoding the same string gives the following:

m%C3%A9dicaux_J%C3%A9r%C3%B4me.txt

Can someone help me understand why the output differs for the same input? Also, how can I get standardized output each time I encode the same string?

Community · Accepted Answer · 2020-06-20 09:12:55Z

4

If you don't specify a character encoding, the result depends on the platform's default encoding.

See the java.net.URLEncoder javadocs:

encode(String s)

Deprecated.

The resulting string may vary depending on the platform's default encoding. Instead, use the encode(String,String) method to specify the encoding.

So, use the suggested method and specify the encoding:

String urlEncodedString = URLEncoder.encode(stringToBeUrlEncoded, "UTF-8");
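As a minimal sketch of the call above in a complete program: the class name ExplicitCharsetEncoding and the helper encodeUtf8 are made up for illustration, and the input is written with \u escapes so that it is unambiguously the precomposed form of the file name.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class ExplicitCharsetEncoding {

    // Hypothetical helper: encodes with an explicit charset so the result
    // does not depend on the platform's default encoding.
    static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "médicaux_Jérôme.txt" with precomposed é (U+00E9) and ô (U+00F4)
        String name = "m\u00E9dicaux_J\u00E9r\u00F4me.txt";
        System.out.println(encodeUtf8(name)); // m%C3%A9dicaux_J%C3%A9r%C3%B4me.txt
    }
}
```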

About different representations for the same string, if you specified "UTF-8":

The two URL-encoded strings you gave in the question, although encoded differently, decode to the same visible text, so there is nothing inherently wrong there. Strictly speaking, the decoded strings are canonically equivalent in Unicode rather than identical sequences of code points. Pasting both into a URL decode tool shows the same file name.
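If you would rather check this in code than in an online tool, something like the following should work. It is only a sketch: DecodeAndCompare, decodeUtf8 and sameAfterNfc are illustrative names, and the comparison uses java.text.Normalizer to bring both decoded strings to the same normalization form before comparing.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.text.Normalizer;

public class DecodeAndCompare {

    // Hypothetical helper: decodes a URL-encoded string as UTF-8.
    static String decodeUtf8(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical helper: compares two strings after normalizing both to NFC.
    static boolean sameAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String first = decodeUtf8("me%CC%81dicaux_Je%CC%81ro%CC%82me.txt");
        String second = decodeUtf8("m%C3%A9dicaux_J%C3%A9r%C3%B4me.txt");

        // The raw strings differ: the first carries three extra combining marks.
        System.out.println(first.equals(second));                    // false
        System.out.println(first.length() + " vs " + second.length()); // 22 vs 19

        // After normalization they compare equal.
        System.out.println(sameAfterNfc(first, second));             // true
    }
}
```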

This happens because the same visible text can be represented by more than one sequence of Unicode code points, especially when it contains accented characters (because of combining marks such as the combining acute accent, which is precisely what happens in your case). Each sequence URL-encodes to different bytes.

In your case specifically, the first string encoded é as e + ´ (Latin small letter e followed by the combining acute accent, U+0301), resulting in e%CC%81. The second encoded é directly as %C3%A9 (Latin small letter e with acute, U+00E9; two % escapes because it takes two bytes in UTF-8).
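To see both outputs from the question come out of the same JVM, you can feed URLEncoder the two forms of the name explicitly. This is a sketch with made-up names (TwoFormsEncoded, encodeUtf8); the point is that with "UTF-8" specified on both sides, the difference comes from the input string, not from the encoder.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class TwoFormsEncoded {

    // Hypothetical helper: URL-encodes as UTF-8 without a checked exception.
    static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Decomposed: base letter followed by a combining mark (U+0301, U+0302)
        String decomposed = "me\u0301dicaux_Je\u0301ro\u0302me.txt";
        // Precomposed: single code points U+00E9 and U+00F4
        String composed = "m\u00E9dicaux_J\u00E9r\u00F4me.txt";

        System.out.println(encodeUtf8(decomposed)); // me%CC%81dicaux_Je%CC%81ro%CC%82me.txt
        System.out.println(encodeUtf8(composed));   // m%C3%A9dicaux_J%C3%A9r%C3%B4me.txt
    }
}
```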

Again, there is no problem with either representation. They correspond to two Unicode normalization forms: the first is decomposed (NFD) and the second is composed (NFC). Mac OS X is known to produce file names in the decomposed form, with the combining acute accent; in the end, it depends on whatever produced the string. In your case, there may be different JREs or, if that file name was user generated, the user may have used a different OS (or tool) that produced that form.
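For the "standardized output" part of the question, one option is to normalize the string to a single form before encoding it. The sketch below picks NFC, which is an assumption on my part (NFD would work equally well as long as both servers use the same form), and NormalizedUrlEncoder and encodeNormalized are illustrative names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.text.Normalizer;

public class NormalizedUrlEncoder {

    // Hypothetical helper: normalizes to NFC, then URL-encodes as UTF-8,
    // so canonically equivalent inputs give the same encoded output.
    static String encodeNormalized(String value) {
        String nfc = Normalizer.normalize(value, Normalizer.Form.NFC);
        try {
            return URLEncoder.encode(nfc, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String decomposed = "me\u0301dicaux_Je\u0301ro\u0302me.txt";
        String composed = "m\u00E9dicaux_J\u00E9r\u00F4me.txt";

        // Both lines print m%C3%A9dicaux_J%C3%A9r%C3%B4me.txt
        System.out.println(encodeNormalized(decomposed));
        System.out.println(encodeNormalized(composed));
    }
}
```

Applying the same normalization wherever the string enters the system (file upload, database read, user input) keeps the two servers consistent regardless of where the name originated.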

edited Jun 20, 2020 at 9:12

CommunityBot

answered Apr 9, 2015 at 22:30

acdcjunior

5 Comments

dev2d Over a year ago

Thanks for your answer, but I have already specified UTF-8 encoding at both places!

acdcjunior Over a year ago

Their VMs must be using different implementations, then. You see, the two URL encoded strings you gave, although differently encoded, represent the same unencoded value, so there is nothing inherently wrong there. Try writing both in this online tool. You'll see they are the same.

dev2d Over a year ago

I understand that, and totally agree with what you are saying, but I am unable to understand what may be causing the issue. Is there any way I can find that out?

acdcjunior Over a year ago

Can you tell the VMs used in both environments?

acdcjunior Over a year ago

By VM I mean JRE: are they the same version on both OSs? Are there any differences between the OSs (French locale vs. English locale)? Where did the original (pre-encoding) strings come from? A database? A file? User input? If a user produced them, do you know the user's OS?
