
I'm trying to read a (Japanese) file that is encoded as UTF-16.

When I read it using an InputStreamReader with a charset of "UTF-16", the file is read correctly:

try {
    InputStreamReader read = new InputStreamReader(new FileInputStream("JapanTest.txt"), "UTF-16");
    BufferedReader in = new BufferedReader(read);
    String str;
    while ((str = in.readLine()) != null) {
        System.out.println(str);
    }
    in.close();
} catch (Exception e) {
    System.out.println(e);
}

However, when I use file channels and read from a byte array, the Strings aren't always converted correctly:

    File f = new File("JapanTest.txt");
    FileInputStream fis = new FileInputStream(f);
    FileChannel channel = fis.getChannel();
    MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0L, channel.size());
    buffer.position(0);
    int get = Math.min(buffer.remaining(), 1024);
    byte[] barray = new byte[1024];
    buffer.get(barray, 0, get);
    Charset charSet = Charset.forName("UTF-16");
    // endOfLinePos is a calculated value and defines the number of bytes to read
    String rowString = new String(barray, 0, endOfLinePos, charSet);
    System.out.println(rowString);

The problem I've found is that I can only read characters correctly when the MappedByteBuffer is at position 0. If I advance the position of the MappedByteBuffer and then read a number of bytes into a byte array, which is then converted to a String using the charset UTF-16, the bytes are not converted correctly. I haven't run into this issue when a file is encoded in UTF-8, so is this only an issue with UTF-16?

More Details: I need to be able to read any line from the file channel, so I build a list of line-ending byte positions and then use those positions to get the bytes for any given line and convert them to a string.
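Roughly, the per-line lookup looks like this (a simplified sketch; lineEndPositions and lineIndex stand in for my actual bookkeeping, and the offsets are byte positions within the mapped buffer):

    // Simplified sketch: lineEndPositions (a List<Long>) holds the byte offset just past the end of each line
    long start = (lineIndex == 0) ? 0 : lineEndPositions.get(lineIndex - 1);
    long end = lineEndPositions.get(lineIndex);
    byte[] lineBytes = new byte[(int) (end - start)];
    buffer.position((int) start);
    buffer.get(lineBytes);
    String line = new String(lineBytes, Charset.forName("UTF-16"));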

Comments

  • Probably endOfLinePos is calculated wrong; if it's an odd number, that's trouble, since UTF-16 requires an even number of bytes. Commented Dec 18, 2012 at 6:31
  • I suspect that the 1024-byte chunks are splitting a UTF-16 character in the middle. Commented Dec 18, 2012 at 7:18

2 Answers


Possibly the InputStreamReader does some transformations that a plain new String(...) does not. As a workaround (and to verify this assumption), you could try to wrap the data read from the channel, e.g. new InputStreamReader(new ByteArrayInputStream(barray)).

Edit: Forget that :) - Channels.newReader() would be the way to go.
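For example, something along these lines (an untested sketch, reusing the file name from the question):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.nio.channels.Channels;
    import java.nio.channels.FileChannel;

    public class ChannelReaderDemo {
        public static void main(String[] args) throws Exception {
            try (FileChannel channel = new FileInputStream("JapanTest.txt").getChannel();
                 BufferedReader in = new BufferedReader(Channels.newReader(channel, "UTF-16"))) {
                String str;
                while ((str = in.readLine()) != null) {
                    System.out.println(str);
                }
            }
        }
    }

The reader decodes and buffers incrementally, so the whole file never has to be held in memory.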


Comments

  • I need to be able to read any line from the file channel, so to do this I build a list of line-ending byte positions and then use those positions to get the bytes for any given line and convert them to a string.
  • For line-wise processing, a LineNumberReader can be used.
  • True, but I want to use file channels because the files can be large, so it will save a lot of memory.
  • So why not use new LineNumberReader(Channels.newReader(channel, ...))? It will read the file sequentially, buffering only a small amount of data (kilobytes, maybe) at any instant.

The code unit of UTF-16 is 2 bytes, not a single byte as in UTF-8. UTF-8's byte patterns and single-byte code unit make it self-synchronizing: a decoder can start reading at any position, and if it lands on a continuation byte it can either backtrack to the start of the character or lose at most one character.

With UTF-16 you must always work with pairs of bytes: you cannot start or stop reading at an odd byte offset. You also must know the endianness and use either UTF-16LE or UTF-16BE when not reading from the start of the file, because there will be no BOM at that point.
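For illustration, a minimal sketch, assuming the file was written as little-endian UTF-16 with a leading BOM (bytes 0xFF 0xFE); swap in StandardCharsets.UTF_16BE if the BOM is 0xFE 0xFF:

    import java.io.FileInputStream;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;

    public class Utf16SliceDemo {
        public static void main(String[] args) throws Exception {
            try (FileChannel channel = new FileInputStream("JapanTest.txt").getChannel()) {
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0L, channel.size());
                int start = 2;                                         // skip the 2-byte BOM; must stay even
                buffer.position(start);
                int length = Math.min(buffer.remaining(), 1024) & ~1;  // keep the length even as well
                byte[] slice = new byte[length];
                buffer.get(slice, 0, length);
                // Plain "UTF-16" expects a BOM (and assumes big-endian without one),
                // so name the byte order explicitly when decoding from the middle of the file.
                System.out.println(new String(slice, StandardCharsets.UTF_16LE));
            }
        }
    }

The same constraint applies to the offsets in the question: both the start position and endOfLinePos have to be even, and the decode should use the explicit LE/BE charset rather than plain "UTF-16" when starting mid-file.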

You can also encode the file as UTF-8.

