
I want to detect the encoding of a stream.

Method 1: use InputStreamReader.

But it always returns the OS default encoding.

    InputStreamReader reader = new InputStreamReader(new FileInputStream("aa.rar"));
    System.out.println(reader.getEncoding());

Output: GBK

Method 2: use UniversalDetector.

But it always returns null.

    FileInputStream input = new FileInputStream("aa.rar");

    UniversalDetector detector = new UniversalDetector(null);
    byte[] buf = new byte[4096];

    // Feed the detector until it is confident or the stream ends
    int nread;
    while ((nread = input.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
    }
    input.close();

    // Tell the detector that no more data will follow
    detector.dataEnd();

    // Returns null when no charset could be determined
    String encoding = detector.getDetectedCharset();

    if (encoding != null) {
        System.out.println("Detected encoding = " + encoding);
    } else {
        System.out.println("No encoding detected.");
    }

    // Reset the detector so it can be reused
    detector.reset();

Output: null

How can I get the correct encoding? :(

  • InputStreamReader will always use platform encoding. It does not attempt to detect encoding in files. What type of files are you running through UniversalDetector? In your example you used a RAR file, which is a compressed binary format. Try with a simple ASCII text file first. Commented Nov 29, 2011 at 4:37
  • Hi, I changed the file to a text file, 'Fortunes.txt'. Output: No encoding detected. Commented Nov 29, 2011 at 5:03
  • It doesn't seem to detect 'standard' UTF-8 or UTF-16 without a BOM, but it worked for UTF-16 with a BOM for me. Maybe consider using a different library for charset detection? This link might help. Commented Nov 29, 2011 at 6:36
  • Detecting encodings by inspecting text data is unreliable guesswork. You really need to have the encoding specified as metadata somewhere to be sure. Commented Nov 29, 2011 at 9:33
  • @Michael Borwardt: but in many cases you do not have any metadata specifying the encoding and you do not have any specs telling you in which encoding the txt files you need to parse will be encoded. In these cases the "guesswork" done by things like: www-archive.mozilla.org/projects/intl/… (using letters frequency in addition to a lot of other heuristics) seems to be quite "scientific" a guesswork. All is not always black and white. When you do not have metadata, you do not say: "I need metadata" but you work hard and you write (or reuse) a detector. Commented Nov 29, 2011 at 12:59

2 Answers


Let's summarize the situation:

  • InputStream delivers bytes
  • *Reader classes deliver chars, decoded from bytes using some encoding
  • new InputStreamReader(inputStream) uses the platform default encoding
  • new InputStreamReader(inputStream, "UTF-8") uses the given encoding (here UTF-8)

So one needs to know the encoding before reading. You did the right thing by first using a charset-detecting class.
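
To illustrate why the encoding has to be known up front, here is a minimal sketch using only the JDK. The class name ExplicitCharsetDemo and the helper readAll are made up for this example. It decodes the same two bytes once with the right charset and once with a wrong one:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
public class ExplicitCharsetDemo {

    // Reads the whole stream as text, decoding it with the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is two bytes in UTF-8: 0xC3 0xA9.
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 1: one character, as intended
        System.out.println(wrong.length()); // 2: each byte became its own character
    }
}
```

The wrong charset does not raise an error here; it silently produces different text, which is why a guess from a detector is still better than relying on the platform default.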

According to http://code.google.com/p/juniversalchardet/, it should handle UTF-8 and UTF-16. You could use the editor jEdit to verify the file's encoding and see whether there is some problem.
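
If you would rather check a guess in code than in an editor, one possible approach (a sketch with the JDK only, not something juniversalchardet does for you) is to decode the bytes strictly and see whether the decoder reports an error. The names CharsetCheck and decodesCleanly are invented for this example:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
public class CharsetCheck {

    // Returns true if the bytes decode under the charset without any
    // malformed or unmappable sequence being reported.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "caf" followed by U+00E9, encoded two different ways.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(decodesCleanly(utf8, StandardCharsets.UTF_8));      // true
        System.out.println(decodesCleanly(latin1, StandardCharsets.UTF_8));    // false
        System.out.println(decodesCleanly(utf8, StandardCharsets.ISO_8859_1)); // true
    }
}
```

Note the last line: ISO-8859-1 maps every byte to a character, so a clean decode can only rule a charset out, never confirm it. This works as a sanity check for UTF-8, not as a general detector.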

3 Comments

We can achieve this with other tools, but I don't understand how they actually handle the detection. :(
Juniversalchardet doesn't support ISO-8859-1, which is a very common charset
@Thomas universalchardet originates from the browser area, where ISO-8859-1 is reinterpreted as Windows-1252 (officially since HTML 5), so maybe Windows-1252 aka Cp1252 works. YES, checked
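    // Needs: java.io.IOException, java.io.InputStream,
    //        org.mozilla.universalchardet.UniversalDetector
    // Returns the detected charset name, or null if detection or reading fails.
    public String getDecoder(InputStream inputStream) {
        String encoding = null;

        // try-with-resources closes the stream even if read() throws
        try (InputStream in = inputStream) {
            byte[] buf = new byte[4096];
            UniversalDetector detector = new UniversalDetector(null);
            int nread;

            // Feed the detector until it is confident or the stream ends
            while ((nread = in.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }

            detector.dataEnd();
            encoding = detector.getDetectedCharset();
            detector.reset();
        } catch (IOException e) {
            // Reading failed; fall through and return null
        }

        return encoding;
    }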
