1

Hi! I have web page content encoded in ISO-8859-2. How to convert a stream encoded in this charset to Java's UTF-8? I'm trying the code below, but it does not work: it messes up some characters. Is there some other way to do this?

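    // Decoding chunk by chunk is only safe here because ISO-8859-2 is a
    // single-byte charset; a multi-byte charset would need a Reader instead.
    BufferedInputStream inp = new BufferedInputStream(in);
    byte[] buffer = new byte[8192];
    int len1 = 0;
    try {
        while ((len1 = inp.read(buffer)) != -1) {
            String buff = new String(buffer, 0, len1, "ISO-8859-2");
            stranica.append(buff);
        }
    } catch (IOException e) {
        // Placeholder handling, added only so the snippet compiles.
        e.printStackTrace();
    }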
1
  • You should re-tag this "Java", not "Android". (Commented Jun 23, 2010 at 0:48)

2 Answers

4

Try it with an InputStreamReader and Charset:

    InputStreamReader inp = new InputStreamReader(in, Charset.forName("ISO-8859-2"));
    BufferedReader rd = new BufferedReader(inp);
    String l;
    while ((l = rd.readLine()) != null) {
        ...
    }

If you get an UnsupportedCharsetException, you know what your problem is. You can also call inp.getEncoding() to check which encoding is actually being used.
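For reference, a self-contained sketch of the Reader-based approach might look like the following. The class name Latin2Reader, the method readAll, and the two-byte sample are made up for illustration; the ByteArrayInputStream only stands in for the real network stream. It reads into a char buffer rather than using readLine() so that line breaks are preserved.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical helper class, not part of any library.
final class Latin2Reader {
    // Reads the whole stream as text, decoding bytes with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder page = new StringBuilder();
        try (BufferedReader rd = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[8192];
            int n;
            while ((n = rd.read(buf)) != -1) {
                page.append(buf, 0, n);
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real stream: 0xC8 and 0xE6 are "Č" and "ć" in ISO-8859-2.
        byte[] sample = {(byte) 0xC8, (byte) 0xE6};
        String text = readAll(new ByteArrayInputStream(sample), Charset.forName("ISO-8859-2"));
        // Compare against Unicode escapes so the check does not depend on the source file encoding.
        System.out.println(text.equals("\u010C\u0107"));
    }
}
```

If the decoded string compares equal here but still shows up wrong on screen, the decoding step is probably not the culprit.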


3 Comments

It seems that the problem was that the encoding parameter should be "ISO8859-2" and not "ISO-8859-2"...
I doubt that. ISO-8859-2 and ISO8859-2 are both valid names for that encoding, and Java recognizes both of them.
I have some Croatian text at a URL and tried to download the contents, but some of the text shows up as rectangles. I posted my question at stackoverflow.com/questions/17574928/… Can you help me?
3

How to convert a stream encoded in this charset to java's UTF-8

Wrong assumption: Java uses UTF-16 internally, not UTF-8.

But your code actually looks correct and should work. Are you absolutely sure the webpage is in fact encoded in ISO-8859-2? Maybe its encoding is declared incorrectly.

Or perhaps the real problem is not with the reading code that you've shown, but with whatever code you use to work with the result. How and where do these "messed up characters" manifest?
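To illustrate both points, here is a minimal sketch; the class name RecodeDemo and its methods are made up for this example. Decoding produces a String (UTF-16 internally), and UTF-8 only enters the picture when you explicitly encode that String back into bytes. The second method shows one way a '?' can appear after reading: encoding with a charset that cannot represent the character. That is only one possible cause, not a diagnosis of your code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class RecodeDemo {
    // Decodes ISO-8859-2 bytes to a String, then encodes that String as UTF-8.
    static byte[] latin2ToUtf8(byte[] latin2Bytes) {
        String text = new String(latin2Bytes, Charset.forName("ISO-8859-2"));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // A lossy step: ISO-8859-1 has no mapping for characters such as U+010C,
    // so getBytes substitutes the charset's replacement byte.
    static byte[] lossyLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // 0xC8 is "Č" (U+010C) in ISO-8859-2.
        byte[] utf8 = latin2ToUtf8(new byte[] {(byte) 0xC8});
        System.out.printf("%02X %02X%n", utf8[0] & 0xFF, utf8[1] & 0xFF); // prints "C4 8C"
        System.out.println((char) lossyLatin1("\u010C")[0]);              // prints "?"
    }
}
```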

5 Comments

I know that about UTF-16, but when a web page declares UTF-8 in its head (or whatever it's called), everything works perfectly. When ISO-8859-2 is declared, certain Croatian characters like (Č,ć,š,ć,đ,ž) end up being displayed as ?.
@Levara: Do those webpages look correct when you open them in a browser? If that displays '?' too, then it looks as though the webpage contents were corrupted by whatever program produced them. Nothing you do at this point can fix that.
Yes, they are correctly displayed in the browser. That's why I'm sure it's possible, I just don't know how to do it. :)
@Levara: Then, as I wrote, the problem is with whatever you do with the data after you have read it. Where are the characters displayed as '?'?
I'm displaying it in a TextView in Android. It works now; it seems that the problem was that the encoding parameter should be "ISO8859-2" and not "ISO-8859-2"... Thanks anyway.
