Why does UTF-8 encoding not work for special characters in a process input stream?

Question

My previous question was closed as a duplicate of "Which encoding does Process.getInputStream() use?", but that is not what I am asking. In my second example, UTF-8 decodes the special character correctly. When the same character is read from the process input stream, however, UTF-8 no longer decodes it correctly. Why does this happen, and does it mean ISO_8859_1 is the only option I can choose?

I'm working on a plugin that retrieves an Azure Key Vault secret at runtime, and I have run into an encoding issue. I stored a string containing the special character ç: HrIaMFBc78!?%$timodagetwiçç99. With the following program, the ç is not decoded correctly:

package com.buildingblocks.azure.cli;

import java.io.*;
import java.nio.charset.StandardCharsets;

public class Test {
    static String decodeText(String command) throws IOException, InterruptedException {
        Process p;
        StringBuilder output = new StringBuilder();
        p = Runtime.getRuntime().exec("cmd.exe /c \"" + command + "\"");
        p.waitFor();
        InputStream stream;
        if (p.exitValue() != 0) {
            stream = p.getErrorStream();
        } else {
            stream = p.getInputStream();
        }
        BufferedReader reader = new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        String line = "";
        while ((line = reader.readLine()) != null) {
            output.append(line + "\n");
        }
        return output.toString();
    }

    public static void main(String[] arg) throws IOException, InterruptedException {
        System.out.println(decodeText("az keyvault secret show --name \"test-password\" --vault-name \"test-keyvault\""));
    }
}

The output is: "value": "HrIaMFBc78!?%$timodagetwi��99"

If I use the following program to decode the string, the special character ç is decoded successfully.

package com.buildingblocks.azure.cli;

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Test {
    static String decodeText(String input, String encoding) throws IOException {
        return
                new BufferedReader(
                        new InputStreamReader(
                                new ByteArrayInputStream(input.getBytes()),
                                Charset.forName(encoding)))
                        .readLine();
    }

    public static void main(String[] arg) throws IOException {
        System.out.println(decodeText("HrIaMFBc78!?%$timodagetwiçç99", StandardCharsets.UTF_8.toString()));
    }
}

Both programs use a BufferedReader with the same setup, but the one reading the process output fails. Does anybody know the reason for this?

cmd.exe suggests a Windows server is used. Are you sure it outputs UTF-8? You should use the character set the platform is using. Actually, Java should default to the platform-native encoding.
– eis, Commented Sep 8, 2021 at 15:44
I think you should use Charset.defaultCharset() to get the encoding of the system. Don't just use ISO_8859_1 on all Windows platforms.
– the Hutt, Commented Sep 8, 2021 at 16:35

de-jcup · Accepted Answer · 2021-09-08 16:10:53Z

1

You are reading with UTF-8:

 BufferedReader reader = new BufferedReader(
        new InputStreamReader(stream, StandardCharsets.UTF_8));

Your second example encodes the string with input.getBytes(), which in your environment evidently produces UTF-8, so it can be read back with the code above and works well.

Your first example, however, executes cmd.exe (so the OS is Windows) and reads the bytes produced by that OS-level process. On Windows the default charset is normally CP1252, which is not UTF-8.

You could either set the default character encoding for Windows to UTF-8 (see "Save text file in UTF-8 encoding using cmd.exe" for a how-to), or use the system encoding of your OS (on Windows normally CP1252) instead of StandardCharsets.UTF_8 when creating the InputStreamReader.


2 Comments

the Hutt Over a year ago

Better to make the system encoding an application setting, or use Charset.defaultCharset(). Assuming CP1252 is not good.

de-jcup Over a year ago

Yes, an assumption is not very good. Either determine the OS charset programmatically and use it for reading, or use an application setting as you suggest. Good point.
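One way to avoid hard-coding CP1252 is to resolve the charset in steps. The sketch below is only an illustration: the setting name cli.output.charset and the class OutputCharset are hypothetical, and it assumes the native.encoding system property (available since JDK 17) is a reasonable fallback. Note that on JDK 18 and later Charset.defaultCharset() returns UTF-8 unless file.encoding is overridden, so it no longer reflects the OS code page by itself.

```java
import java.nio.charset.Charset;

final class OutputCharset {
    // Hypothetical application setting; pick any name that suits your plugin.
    static final String SETTING = "cli.output.charset";

    // Resolution order: explicit setting, then the OS-native encoding, then the JVM default.
    static Charset resolve() {
        String configured = System.getProperty(SETTING);
        if (configured != null && !configured.isEmpty()) {
            return Charset.forName(configured);
        }
        // "native.encoding" exists on JDK 17+; on older JDKs this returns null.
        String nativeName = System.getProperty("native.encoding");
        if (nativeName != null && !nativeName.isEmpty() && Charset.isSupported(nativeName)) {
            return Charset.forName(nativeName);
        }
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        System.out.println("Decoding process output as: " + resolve());
    }
}
```

The result would then replace StandardCharsets.UTF_8 in the InputStreamReader constructor, and can be overridden per machine with -Dcli.output.charset=windows-1252.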

hobgoblin · Accepted Answer · 2021-09-08 15:36:56Z

0

A ç takes two bytes in UTF-8, so two of them would be four bytes. The two placeholder characters � suggest that only two bytes were there. In ISO 8859-1, a ç is a single byte, so this suggests that the encoding was not UTF-8 but may have been ISO 8859-1.
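The byte counts can be checked directly. This is a small sketch, assuming the tool really emitted ISO 8859-1 (where ç is 0xE7); the class name CedillaBytes is made up for the illustration.

```java
import java.nio.charset.StandardCharsets;

final class CedillaBytes {
    // "çç99" as it would arrive if the tool wrote ISO 8859-1 (an assumption).
    static final byte[] LATIN1_BYTES = {(byte) 0xE7, (byte) 0xE7, '9', '9'};

    public static void main(String[] args) {
        String cedilla = "\u00E7"; // ç, written as an escape so the source file encoding does not matter

        // ç is two bytes in UTF-8 and one byte in ISO 8859-1.
        System.out.println(cedilla.getBytes(StandardCharsets.UTF_8).length);      // prints 2
        System.out.println(cedilla.getBytes(StandardCharsets.ISO_8859_1).length); // prints 1

        // Decoding the single-byte form as UTF-8 yields one U+FFFD per invalid byte.
        String wrong = new String(LATIN1_BYTES, StandardCharsets.UTF_8);
        String right = new String(LATIN1_BYTES, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals("\uFFFD\uFFFD99")); // prints true
        System.out.println(right.equals("\u00E7\u00E799")); // prints true
    }
}
```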

The InputStream does not use any encoding, it just transfers the bytes. The encoding is used in the InputStreamReader.

A hex-dump of the input might be useful. Alternatively, you can try to interpose a script between the Java program and the program you want to call, and analyse the situation there. Or just try with ISO 8859-1 instead.
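A hex dump needs only a few lines. The following is a minimal sketch with a hypothetical helper class HexDump; it reads the raw bytes before any charset is applied, so the output shows what the process actually sent.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class HexDump {
    // Reads the stream to its end and returns the bytes as space-separated hex pairs.
    static String toHex(InputStream in) throws IOException {
        StringBuilder hex = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = "\u00E7".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "\u00E7".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(toHex(new ByteArrayInputStream(utf8)));   // prints C3 A7
        System.out.println(toHex(new ByteArrayInputStream(latin1))); // prints E7
    }
}
```

In the question's program, passing p.getInputStream() to toHex instead of wrapping it in a reader would show whether each ç arrives as C3 A7 (UTF-8) or as a single byte such as E7.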


erickson · Accepted Answer · 2021-09-08 16:02:28Z

The charset you select in Java should match the encoding used by the command you execute. It's not UTF-8, and is probably ISO-8859-1. Because the encoding used by the command is likely to default to something different on different machines, you might try setting it explicitly to a known value before executing your command:

chcp 65001 && <command>

Or, in your context:

Runtime.getRuntime().exec("cmd.exe /c \"chcp 65001 && " + command + "\"");

Windows code page 65001 is UTF-8.

Note that failing to consume the output of the subprocess can cause it to block, and never terminate, so your waitFor() may block because you consume the output afterward. The standard output of the process may have a large enough buffer to complete, but if there is output to standard error, it is more likely to block. An alternative is to direct standard error to the stderr of the parent Java process.
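The ordering described above (read first, wait afterward, send stderr to the parent) could look roughly like this. It is a sketch, not a drop-in replacement: the class ProcessOutputReader and its method names are invented here, and the charset is passed in because the right value depends on the command being run.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;

final class ProcessOutputReader {
    // Decodes the whole stream with the given charset, one line at a time.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        return out.toString();
    }

    // Runs the command, drains stdout BEFORE waiting, and lets stderr go to the parent's stderr.
    static String run(List<String> command, Charset charset) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();
        String output = readAll(process.getInputStream(), charset);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command exited with code " + exitCode);
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java ProcessOutputReader.java <command> [args...]");
            return;
        }
        // Decodes with the JVM default charset here; pass a specific one if the tool's encoding is known.
        System.out.print(run(Arrays.asList(args), Charset.defaultCharset()));
    }
}
```

If the command is prefixed with chcp 65001, keep in mind that chcp prints its own status line to standard output, so the first line of the captured text will not belong to the real command.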
