
I'm trying to read a (Japanese) file that is encoded as UTF-16.

When I read it using an InputStreamReader with a charset of "UTF-16", the file is read correctly:

try {
    InputStreamReader read = new InputStreamReader(new FileInputStream("JapanTest.txt"), "UTF-16");
    BufferedReader in = new BufferedReader(read);
    String str;
    while ((str = in.readLine()) != null) {
        System.out.println(str);
    }
    in.close();
} catch (Exception e) {
    System.out.println(e);
}

However, when I use file channels and read from a byte array, the Strings aren't always converted correctly:

    File f = new File("JapanTest.txt");
    FileInputStream fis = new FileInputStream(f);
    FileChannel channel = fis.getChannel();
    MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0L, channel.size());
    buffer.position(0);
    int get = Math.min(buffer.remaining(), 1024);
    byte[] barray = new byte[1024];
    buffer.get(barray, 0, get);
    Charset charSet = Charset.forName("UTF-16");
    // endOfLinePos is a calculated value and defines the number of bytes to read
    String rowString = new String(barray, 0, endOfLinePos, charSet);
    System.out.println(rowString);

The problem I've found is that I can only read characters correctly when the MappedByteBuffer is at position 0. If I advance the position of the MappedByteBuffer and then read a number of bytes into a byte array, which is then converted to a String using the charset UTF-16, the bytes are not converted correctly. I haven't run into this issue when a file is encoded in UTF-8, so is this only an issue with UTF-16?

More Details: I need to be able to read any line from the file channel, so I build a list of line-ending byte positions and then use those positions to get the bytes for any given line and convert them to a string.
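Roughly, the per-line lookup looks like this (a simplified sketch; lineEndPositions and lineIndex stand in for my actual bookkeeping, and the offsets are byte positions within the mapped buffer):

    // Simplified sketch: lineEndPositions (a List<Long>) holds the byte offset just past the end of each line
    long start = (lineIndex == 0) ? 0 : lineEndPositions.get(lineIndex - 1);
    long end = lineEndPositions.get(lineIndex);
    byte[] lineBytes = new byte[(int) (end - start)];
    buffer.position((int) start);
    buffer.get(lineBytes);
    String line = new String(lineBytes, Charset.forName("UTF-16"));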

Comments

  • Probably endOfLinePos is calculated wrong; if it's an odd number, that's trouble, since UTF-16 requires an even number of bytes. Commented Dec 18, 2012 at 6:31
  • I suspect that the 1024-byte chunks are splitting a UTF-16 character in the middle. Commented Dec 18, 2012 at 7:18

2 Answers


Possibly the InputStreamReader does some transformations that a plain new String(...) does not. As a workaround (and to verify this assumption), you could try to wrap the data read from the channel, e.g. new InputStreamReader(new ByteArrayInputStream(barray)).

Edit: Forget that :) - Channels.newReader() would be the way to go.
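For example, something along these lines (an untested sketch, reusing the file name from the question):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.nio.channels.Channels;
    import java.nio.channels.FileChannel;

    public class ChannelReaderDemo {
        public static void main(String[] args) throws Exception {
            try (FileChannel channel = new FileInputStream("JapanTest.txt").getChannel();
                 BufferedReader in = new BufferedReader(Channels.newReader(channel, "UTF-16"))) {
                String str;
                while ((str = in.readLine()) != null) {
                    System.out.println(str);
                }
            }
        }
    }

The reader decodes and buffers incrementally, so the whole file never has to be held in memory.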


Comments

  • I need to be able to read any line from the file channel, so to do this I build a list of line-ending byte positions and then use those positions to get the bytes for any given line and convert them to a string.
  • For line-wise processing, a LineNumberReader can be used.
  • True, but I want to use file channels because the files can be large, so it will save a lot of memory.
  • So why not use new LineNumberReader(Channels.newReader(channel, ...))? It will read the file sequentially, buffering only a small amount of data (kilobytes, maybe) at any instant.

The code unit of UTF-16 is 2 bytes, not a single byte as in UTF-8. UTF-8's byte patterns and single-byte code unit make it self-synchronizing: a decoder can start reading at any position, and if it lands on a continuation byte it can either backtrack to the start of the character or lose at most one character.

With UTF-16 you must always work with pairs of bytes: you cannot start or stop reading at an odd byte offset. You also must know the endianness and use either UTF-16LE or UTF-16BE when not reading from the start of the file, because there will be no BOM at that point.
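For illustration, a minimal sketch, assuming the file was written as little-endian UTF-16 with a leading BOM (bytes 0xFF 0xFE); swap in StandardCharsets.UTF_16BE if the BOM is 0xFE 0xFF:

    import java.io.FileInputStream;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;

    public class Utf16SliceDemo {
        public static void main(String[] args) throws Exception {
            try (FileChannel channel = new FileInputStream("JapanTest.txt").getChannel()) {
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0L, channel.size());
                int start = 2;                                         // skip the 2-byte BOM; must stay even
                buffer.position(start);
                int length = Math.min(buffer.remaining(), 1024) & ~1;  // keep the length even as well
                byte[] slice = new byte[length];
                buffer.get(slice, 0, length);
                // Plain "UTF-16" expects a BOM (and assumes big-endian without one),
                // so name the byte order explicitly when decoding from the middle of the file.
                System.out.println(new String(slice, StandardCharsets.UTF_16LE));
            }
        }
    }

The same constraint applies to the offsets in the question: both the start position and endOfLinePos have to be even, and the decode should use the explicit LE/BE charset rather than plain "UTF-16" when starting mid-file.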

You can also encode the file as UTF-8.

