
I'm parsing some image links on Wikipedia. I came across this one on http://en.wikipedia.org/wiki/Special:Export/Diego_Forl%C3%A1n

When I use the deprecated URLEncoder.encode, accented characters are encoded correctly, but when I specify the "UTF-8" argument, it fails. The text on Wikipedia is UTF-8, AFAIK.

Diego+Forl%C3%A1n+vs+the+Netherlands.jpg is correct whereas Diego+Forl%E2%88%9A%C2%B0n+vs+the+Netherlands.jpg is incorrect.

scala> first
res24: String = Diego Forlán vs the Netherlands.jpg

scala> java.net.URLEncoder.encode(first, "UTF-8")
res25: java.lang.String = Diego+Forl%E2%88%9A%C2%B0n+vs+the+Netherlands.jpg

scala> java.net.URLEncoder.encode(first)
<console>:33: warning: method encode in object URLEncoder is deprecated: see corresponding Javadoc for more information.
              java.net.URLEncoder.encode(first)
                                  ^
res26: java.lang.String = Diego+Forl%C3%A1n+vs+the+Netherlands.jpg
  • Works fine in Java 1.6.0_27-b07 Commented Nov 19, 2011 at 0:56
  • Using OS X Lion (build 1.6.0_26-b03-383-11A511c) Commented Nov 19, 2011 at 1:01
  • What's not working about the result? You haven't indicated how it is incorrect. Accented characters in UTF-8 are often multi-byte. URL Encoding those multiple bytes would end up with something like you have in the second case. Commented Nov 19, 2011 at 1:43

1 Answer


I would guess that first is already corrupt and is only rendering correctly due to a transcoding bug hidden by your console configuration.

You can confirm this by emitting the UTF-16 code units in the string:

for(c<-first.toCharArray()){print("\\u%04x".format(c.toInt))}

There is probably a more elegant way to write that.
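
One possible tidier version of the same check is sketched below. The object and method names (CodeUnits, codeUnits) are made up for illustration, and the output uses U+XXXX notation rather than \uXXXX escapes to sidestep backslash handling in string literals.

```scala
object CodeUnits {
  // Hypothetical helper: render each UTF-16 code unit of s as U+XXXX.
  def codeUnits(s: String): String =
    s.map(c => "U+%04X".format(c.toInt)).mkString(" ")

  def main(args: Array[String]): Unit = {
    // A correctly decoded á is a single code unit.
    println(codeUnits("\u00e1"))
    // The two code units that would indicate the corruption discussed here.
    println(codeUnits("\u221a\u00b0"))
  }
}
```

In the REPL, calling CodeUnits.codeUnits(first) should show at a glance whether the string holds one code unit for á or two unrelated symbols.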

If the code point is encoded correctly, it will be:

U+00e1      á       \u00e1

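I expect that somewhere UTF-8 encoded data is being decoded using a MacRoman decoder. The UTF-8 encoding of á is the two bytes c3 a1, and MacRoman maps those bytes to the two characters below, whose own UTF-8 encodings are the %E2%88%9A and %C2%B0 seen in your output.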

codepoint   glyph   escaped    x-MacRoman byte   info
=======================================================================
U+221a      √       \u221a     c3                MATHEMATICAL_OPERATORS, MATH_SYMBOL
U+00b0      °       \u00b0     a1                LATIN_1_SUPPLEMENT, OTHER_SYMBOL
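
The suspected failure can be reproduced with a small sketch. This is only an illustration of the hypothesis, not a diagnosis of where the bad decode actually happens in your code. It assumes the JVM provides the extended charset named x-MacRoman, and the names MojibakeDemo, corrupt and repair are invented for the example.

```scala
import java.net.URLEncoder
import java.nio.charset.{Charset, StandardCharsets}

object MojibakeDemo {
  // Assumption: this JVM ships the extended charset "x-MacRoman".
  val macRoman: Charset = Charset.forName("x-MacRoman")

  // Simulate the suspected bug: UTF-8 bytes decoded with a MacRoman decoder.
  def corrupt(s: String): String =
    new String(s.getBytes(StandardCharsets.UTF_8), macRoman)

  // Reverse it: recover the original bytes via MacRoman, then decode as UTF-8.
  def repair(s: String): String =
    new String(s.getBytes(macRoman), StandardCharsets.UTF_8)

  def main(args: Array[String]): Unit = {
    val good = "Diego Forl\u00e1n vs the Netherlands.jpg"
    val bad  = corrupt(good)
    // Encoding the corrupted string reproduces the unwanted output.
    println(URLEncoder.encode(bad, "UTF-8"))
    // Encoding the repaired string gives the expected output.
    println(URLEncoder.encode(repair(bad), "UTF-8"))
  }
}
```

If this is what is happening, it would also explain why the deprecated one-argument overload appears to work: it uses the platform default encoding, and if that default is MacRoman, the two wrong characters are turned back into the bytes c3 a1 by accident. The better fix is to decode the input as UTF-8 where it is first read, rather than repairing the string afterwards.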