8

In a Java program, I spawn a new Process via ProcessBuilder.

args[0] = directory.getAbsolutePath() + File.separator + program;
ProcessBuilder pb = new ProcessBuilder(args);
pb.directory(directory);
final Process process = pb.start();

Then I read the process's standard output in a new thread:

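new Thread() {
    public void run() {
        // try-with-resources closes the reader once the process output ends
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}.start();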

However, when the process outputs non-ASCII characters (such as 'é'), the line has character '\uFFFD' instead.

What is the encoding in the InputStream returned by getInputStream (my platform is Windows in Europe)?

How can I change things so that line contains the expected data (i.e. '\u00E9' for 'é')?

Edit: I tried new InputStreamReader(...,"UTF-8"): é becomes \uFFFD

2
  • BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8")); Commented Dec 6, 2011 at 10:30
  • @Cris please write an answer rather than a comment, if you want to answer Commented Dec 6, 2011 at 10:43

8 Answers

9

An InputStream is a binary stream, so there is no encoding. When you create the Reader, you need to know what character encoding to use, and that would depend on what the program you called produces (Java will not convert it in any way).

If you do not specify anything for InputStreamReader, it will use the platform default encoding, which may not be appropriate. There is another constructor that allows you to specify the encoding.

If you know what encoding to use (and you really have to know):

new InputStreamReader(process.getInputStream(), "UTF-8") // for example
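
To see why a mismatched charset produces '\uFFFD', here is a minimal sketch that decodes the single byte 0xE9 (how ISO-8859-1 and Cp1252 encode 'é') in two ways. The class name DecodeDemo and its decode helper are illustrative only and do not come from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Decodes raw bytes through an InputStreamReader, the same way the
    // reader thread decodes the process output.
    static String decode(byte[] bytes, Charset charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0xE9 }; // 'é' as encoded by ISO-8859-1 / Cp1252
        // Matching charset: the byte decodes to U+00E9.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1).equals("\u00E9"));
        // Mismatched charset: 0xE9 alone is malformed UTF-8, so it becomes U+FFFD.
        System.out.println(decode(raw, StandardCharsets.UTF_8).equals("\uFFFD"));
    }
}
```

This matches the symptom in the question's edit: a program that emits 0xE9 for 'é' is not producing UTF-8, so asking for "UTF-8" cannot work.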

4 Comments

And as @AlexR points out, the same reasoning applies to writing data, too.
UTF-8 is the default encoding in Java, so "UTF-8" cannot help. The solution is close, it just needs "Cp1252" or "ISO-8859-1" (depending on what getInputStream() returns)
UTF-8 is not the default encoding in Java. There is no default at all, it always uses something platform dependent (which can be controlled by environment variables and system properties). Not something an application developer should usually rely on. Better to always be explicit in what encoding you want.
UTF-16 is java's standard internal representation of characters. Hence the unsigned 16-bit 'char' primitive. The InputStreamReader will ALWAYS convert to UTF-16. Although the InputStream is a binary stream, if it represents characters the bytes will follow whatever encoding was used to create the resource. The InputStreamReader constructor mentioned by Thilo includes an argument to specify the encoding of that resource - how the stream should be treated.
9

Interestingly enough, when running on Windows:

ProcessBuilder pb = new ProcessBuilder("cmd", "/c", "dir");
Process process = pb.start();

Then the CP437 code page works quite well for decoding the output:

new InputStreamReader(process.getInputStream(), "CP437");
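
A fuller sketch of the same idea, with the charset passed in explicitly so it can be swapped per program. The class ProcessOutputReader and the readLines helper are hypothetical names, and the choice of "IBM850" (CP850) is only an assumption about the console code page; running chcp shows the real one.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class ProcessOutputReader {
    // Reads all lines from a stream, decoding the bytes with the given charset.
    static List<String> readLines(InputStream in, Charset charset) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        boolean windows = System.getProperty("os.name").toLowerCase().startsWith("windows");
        // Assumption: a Western European Windows console uses CP850.
        // Fall back to the JVM default if that charset is unavailable.
        Charset console = windows && Charset.isSupported("IBM850")
                ? Charset.forName("IBM850")
                : Charset.defaultCharset();
        ProcessBuilder pb = windows
                ? new ProcessBuilder("cmd", "/c", "dir")
                : new ProcessBuilder("ls");
        Process process = pb.start();
        for (String line : readLines(process.getInputStream(), console)) {
            System.out.println(line);
        }
    }
}
```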

5 Comments

As others say, the InputStream contains characters in the platform encoding. Since I have a modern operating system, I have UTF-8; since you have Windows, you have CP437.
Thanks, CP437 was the only charset name that worked for me (Windows + Spanish characters)
Actually, nowadays, that should be CP850. The odd thing is that the whole Windows system seems to be set to windows-1252/cp1252 (at least in western Europe), but the console specifically uses CP850 instead. CP437 is the ancestor of CP850. Opening the command prompt and running "chcp" should tell you exactly which encoding it is using to print character data.
Also, the encoding to use for parsing the InputStream depends on what program the ProcessBuilder is built around. Let's say for example : CP850 for cmd, windows-1252 for some other windows tools you might invoke directly (without wrapping them in cmd), and possibly UTF-8 if the program you're calling outputs UTF-8. This is program-specific and should be looked up in the program's documentation.
Nice! I have checked some Windows 10 settings. For various European settings, it's CP850, but for the default (US) settings, it still remains CP437.
4

As I understand it, operating system streams are byte streams; there are no characters here. The InputStreamReader constructor without a charset argument uses the JVM default character set, java.nio.charset.Charset#defaultCharset(). You can use another constructor to specify a character set explicitly.
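
A small sketch to check this on a given machine: it prints the JVM default charset next to the one an InputStreamReader picks when none is given. The class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    // Returns the charset an InputStreamReader uses when none is specified.
    static Charset implicitReaderCharset() {
        InputStreamReader reader =
                new InputStreamReader(new ByteArrayInputStream(new byte[0]));
        // getEncoding() may return a historical alias (for example "Cp1252"),
        // so resolve it to a Charset before comparing.
        return Charset.forName(reader.getEncoding());
    }

    public static void main(String[] args) {
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        System.out.println("Reader without charset uses: " + implicitReaderCharset());
    }
}
```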

1 Comment

Yes, I had to use new InputStreamReader(...,"ISO-8859-1")
2

According to http://www.fileformat.info/info/unicode/char/e9/index.htm '\uFFFD' is a unicode code for character 'é'. It actually means that you are reading the stream correctly. Your problem is in writing.

The Windows console does not support Unicode by default. So, if you want to test your code, open a file and write your stream there. But do not forget to set the encoding to UTF-8.
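
One way to run that test, as a rough sketch: write the string to a temporary file as UTF-8 and inspect the bytes. If 'é' arrives as C3 A9, the Java string was correct and only the console display is at fault; if the file contains EF BF BD, the string already held U+FFFD and the decoding failed earlier. The class name Utf8FileDump and the helper are illustrative.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8FileDump {
    // Writes text to a temporary file as UTF-8 and returns the raw bytes written.
    static byte[] writeAndReadBack(String text) {
        try {
            Path file = Files.createTempFile("process-output", ".txt");
            try (Writer writer = new OutputStreamWriter(
                    Files.newOutputStream(file), StandardCharsets.UTF_8)) {
                writer.write(text);
            }
            byte[] bytes = Files.readAllBytes(file);
            Files.delete(file);
            return bytes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hex dump of 'é' (U+00E9) as it lands in the file.
        for (byte b : writeAndReadBack("\u00E9")) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```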

2 Comments

Correct. new PrintWriter(new OutputStreamWriter(..., "Cp1252")) where Cp1252 is Latin-1 with the Windows extensions, as used in a small part of western Europe (France, Germany and some others).
Why do you point to the character I want (0xE9) when the character I get is 0xFFFD, aka 'REPLACEMENT CHARACTER'? fileformat.info/info/unicode/char/fffd/index.htm
2

Scientific

On Windows this works perfectly:

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

private static final Charset CONSOLE_ENCODING;
static {
    Charset enc = Charset.defaultCharset();
    try {
        String example = "äöüßДŹす";
        String command = File.separatorChar == '/' ? "echo " + example : "cmd.exe /c echo " + example;
        Process exec = Runtime.getRuntime().exec(command);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        // Drain stdout until EOF instead of relying on available(),
        // which may under-report the number of bytes still to come.
        try (InputStream inputStream = exec.getInputStream()) {
            byte[] buff = new byte[1024];
            int count;
            while ((count = inputStream.read(buff)) != -1) {
                baos.write(buff, 0, count);
            }
        }
        exec.waitFor();

        byte[] array = baos.toByteArray();
        // Try every installed charset until one round-trips the example string.
        for (Charset charset : Charset.availableCharsets().values()) {
            // echo appends a line terminator, so trim before comparing
            String s = new String(array, charset).trim();
            if (s.equals(example)) {
                enc = charset;
                break;
            }
        }
    } catch (InterruptedException | IOException e) {
        throw new Error("Could not determine console charset.", e);
    }
    CONSOLE_ENCODING = enc;
}

The specification gives no indication of whether the JVM's encoding can change at runtime. So we cannot be sure that the encoding does NOT change while the program is running, or that the detected charset is still correct after such a change.

1 Comment

Hmmm... nice idea, but it actually doesn't work on my system (Windows 7 SP1, 64-bit, Java 8 build 71) -- none of the available encodings produces the original string. The problem seems to be that the given example string is not even correctly transferred to the system, producing "?" characters instead. Apart from that, I also get an additional "\r\n" line ending in the output.
1

If you, like me, know which encoding you want to use for all input/output, you can specify it in the Java API calls to some (not all) of the methods that create readers, as some other answers have pointed out.

But this hard-codes it in the source, which may or may not be OK.

I found a better way after reading this answer which reveals that you can set the encoding before the JVM starts up to what you need.

java -Dfile.encoding=ISO-8859-1 ...


0

I put this as a comment, but I see there was an answer afterwards, so it might be redundant now :)

BufferedReader br = new BufferedReader(
    new InputStreamReader(conn.getInputStream(), "UTF-8"));

1 Comment

UTF-8 is the default encoding. So, this does not help.
0

Use StringEscapeUtils.escapeHtml from the commons-lang jar:

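BufferedReader br = new BufferedReader(
    new InputStreamReader(conn.getInputStream()));
// escapeHtml takes a String, so escape each line after it has been read
String escaped = StringEscapeUtils.escapeHtml(br.readLine());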
