
Given a valid text file and its java.nio.charset.Charset, how can I efficiently (preferably using RandomAccessFile.seek() or InputStream.skip(), without reading the whole file) split it into two or more chunks while ensuring that no chunk contains a partial code point (it would be nice not to split a character/grapheme either, but that's probably too hard)?

For fixed-length encodings the answer is trivial - split on aligned positions. However, I'm not sure whether CharsetEncoder.averageBytesPerChar() == CharsetEncoder.maxBytesPerChar() is a proper indication of a fixed-length encoding, so it would be nice to find a reliable check.
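A minimal sketch of that check (hedged: the Charset API only documents these values as buffer-sizing hints, not as a guarantee of fixed-length encoding, and the methods live on CharsetEncoder; the class and method names here are made up for illustration):

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class FixedLengthCheck {
    // Heuristic: an encoding is plausibly fixed-length when its encoder
    // reports the same average and maximum bytes per char. This is only
    // a heuristic, not something the API promises.
    static boolean looksFixedLength(Charset cs) {
        if (!cs.canEncode()) return false; // no encoder available, can't tell
        CharsetEncoder enc = cs.newEncoder();
        return enc.averageBytesPerChar() == enc.maxBytesPerChar();
    }

    public static void main(String[] args) {
        System.out.println(looksFixedLength(Charset.forName("ISO-8859-1"))); // true
        System.out.println(looksFixedLength(Charset.forName("UTF-8")));      // false
    }
}
```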

Some variable-length encodings are not self-synchronizing; for those, how can I find a character or code point boundary in the byte stream after seeking to a split position?

The objective is to split the file for parallel processing. Classic IO fails to load all CPU cores while reading a file sequentially.

  • StackOverflow is not a code writing service. Please edit your question to show us what you've tried. Commented Sep 15 at 15:44
  • "how can I efficiently" - if you want to support all possible charsets (like UTF-16 or some JP/CN encodings), then you can't do this: you have to read the whole file sequentially first to find possible split positions, and only then send the file to parallel processing. Of course, you can hardcode which charsets are safe to split (like ASCII or UTF-8) and use a quicker method for them, but for the rest, rely on initial preprocessing. Commented Sep 15 at 15:49
  • @JannikS. my current implementation is not fast enough. The question is not about writing code - it is about the API and the nature of encodings in general. I can't find any means/API to synchronize an arbitrary encoding. Commented Sep 15 at 15:49
  • @IłyaBursov this would be a nice answer if there were some means to detect a self-synchronizing encoding. Commented Sep 15 at 15:51
  • @gthanop the obvious solution is to search for a match in all possible pairs of characters, but universe heat death does not let me. Commented Sep 15 at 16:28

1 Answer


What you want is completely impossible.

Charset is a broad abstraction. The intent is that as many imaginable charset encodings can be represented by it as possible. This is a general principle in programming: The more 'handcuffs' you stick on an interface, the easier it is to work with and the more you can do with it, but the fewer systems actually fit in the abstraction.

Perhaps instead you wanted to ask a slightly different question:

  • How do I split a text file written in UTF-8?

Or perhaps:

  • How do I split a text file for as many encodings as possible?

For the UTF family it is quite easy. But note that UTF-8 is generally praised as a magnificent invention because it has the property that the thing you want to do is in fact possible. Simple takes on charset design result in designs where the job you want is literally impossible.

Imagine the following charset encoding for unicode:

  1. A single byte indicating how many bytes are used to represent each character. So, 1, 2, 3, or 4. Let's call this charSize.
  2. A 32-bit value indicating how many characters follow, represented by that encoding. Let's call this runLen.
  3. runLen * charSize bytes.
  4. Go back to #1.

This encoding is plausible and could trivially be turned into a complete java Charset definition.

And yet, the thing you want is not possible. In a random access system you can still get through it faster than literally streaming through it (once you read #1 and #2 you can just skip over the runLen * charSize part if you don't need to split inside it), but you can't simply read some bytes in the middle and know where to chop. You must start from the first byte, and there is no method on Charset that would allow charset-agnostic code to efficiently chop such a file up. A 'chopper' for this hypothetical format CAN exist and can be fairly efficient, but it would have to be written specifically for this exact charset encoding.
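To make that concrete, here is a minimal sketch of what such a format-specific chopper could look like, assuming the hypothetical [charSize][runLen][payload] layout described above (everything here - the class, the method, the layout - is invented for illustration; it works on an in-memory byte[] but the same walk applies to a RandomAccessFile):

```java
import java.nio.ByteBuffer;

public class RunFormatChopper {
    // For the hypothetical [charSize][32-bit runLen][runLen*charSize bytes]...
    // format: walk the run headers only, skipping each run's payload, and
    // return the first run boundary at or after the requested split target.
    static long splitPointAtOrAfter(byte[] data, long target) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        long pos = 0;
        while (pos < data.length) {
            if (pos >= target) return pos;            // a run starts here: safe split
            buf.position((int) pos);
            int charSize = buf.get() & 0xFF;          // header: bytes per character
            long runLen = buf.getInt() & 0xFFFFFFFFL; // header: character count
            pos += 1 + 4 + runLen * charSize;         // skip header + payload
        }
        return data.length;                           // no boundary before EOF
    }
}
```

Note that this still has to hop run-to-run from the start of the file - faster than decoding every byte, but not the O(1) "seek anywhere and resynchronize" you get with UTF-8.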

QED: An algorithm that can efficiently chop any input given only 'random access stream of bytes' and '1 Charset impl' is not possible.

How to do it for UTF-8 specifically

A UTF-8 stream has the convenient property that any byte starting with the bits 10 is a continuation, and any byte that isn't defines the start of a character, possibly followed by continuation bytes (bytes starting with bits 10). You in fact know how many continuation bytes follow based only on that first byte:

  • First bit is 0: That byte is a whole character (ASCII).
  • First bits are 10: This is a continuation. Go back.
  • First bits are 110: This is a 2-byte thing.
  • First bits are 1110: 3-byte.
  • First bits are 11110: 4-byte.

Therefore, to split a file in the middle, simply read a byte, and keep reading until you hit a byte that does not start with the bits 10 (i.e. (b & 0xC0) != 0x80) - that is the start of a whole new character. Include all bytes before this point in 'the left chunk' and the not-10 byte plus all that remains in 'the right chunk'.
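That scan could be sketched like this (a hypothetical helper working on an in-memory byte[] for brevity; a RandomAccessFile version would seek and read bytes the same way):

```java
public class Utf8Splitter {
    // Scan forward from 'offset' to the first byte that is NOT a UTF-8
    // continuation byte. Continuations match 10xxxxxx, i.e. (b & 0xC0) == 0x80.
    // Everything before the returned index belongs in the left chunk.
    static int splitPointAtOrAfter(byte[] data, int offset) {
        int i = offset;
        while (i < data.length && (data[i] & 0xC0) == 0x80) {
            i++;                         // still inside a multi-byte code point
        }
        return i;                        // start of a fresh code point (or EOF)
    }

    public static void main(String[] args) {
        // "aé€": 'a' is 1 byte, 'é' is 2 bytes, '€' is 3 bytes -> 6 bytes total
        byte[] bytes = "a\u00e9\u20ac"
                .getBytes(java.nio.charset.StandardCharsets.UTF_8);
        // Offset 2 lands inside 'é'; the scan moves to index 3, where '€' starts.
        System.out.println(splitPointAtOrAfter(bytes, 2)); // 3
    }
}
```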

A warning about unicode

Unicode is a lot more complicated than this. UTF-8 can be trivially lopped into chunks with random-access-fast performance, but a sequence of complete code points may nevertheless belong together. For example, this sequence of unicode values:

  • U+0065
  • U+0301

Is the symbol é. 2 unicode values (not UTF-8 bytes; no, full unicode code points - in UTF-8 terms, 2 lead bytes that do not start with 0b10) - one symbol. That's the plain jane ascii e plus the combining character "put ´ on the previous symbol".

Similar shenanigans occur with emoji: you can easily hook together 7 or more emoji, which are themselves in java surrogate pairs, into strings whose .length() would return 14 or more, and yet they render as 1 single glyph. Flag emojis work like this, and there are modifiers: you can have 'hug' + 'man' + 'man' + 'brown' + 'olive' or whatever - to indicate the genders and colours of who is doing the hugging.

Another source of problems is directional indicators (there's a unicode character that means "... and now the text goes right-to-left").

If you chop a text file in twain, even if you do it right and use the 0b10 trick to ensure you don't chop right through a single unicode value, you can still end up with one file ending in "e" and the next file starting with "´", whereas the source simply had a "é".
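A small demonstration of the decomposed-é problem, using the JDK's BreakIterator and Normalizer (this only illustrates the concept - it is not a splitting solution):

```java
import java.text.BreakIterator;
import java.text.Normalizer;

public class GraphemeDemo {
    // Count user-perceived characters using the JDK's character break iterator,
    // which does not break between a base character and its combining marks.
    static int countGraphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) count++;
        return count;
    }

    public static void main(String[] args) {
        String composed   = "\u00E9";   // é as a single code point
        String decomposed = "e\u0301";  // e + combining acute accent

        System.out.println(decomposed.length());        // 2 chars...
        System.out.println(countGraphemes(decomposed)); // ...but 1 symbol
        // NFC normalization folds the pair back into the single code point:
        System.out.println(composed.equals(
                Normalizer.normalize(decomposed, Normalizer.Form.NFC))); // true
    }
}
```

Splitting the decomposed form between its two code points is exactly the "file ending in e, file starting with ´" failure described above.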

You should think about this. If it's important that you don't chop emoji modifiers and/or decomposed chars and/or directional modifiers, hoo boy. This boils down to extremely convoluted code. Think "man-month of work" levels of complicated.

Remember: Unicode; it's more complicated than you think it is.


7 Comments

I understand that O(1) split is impossible in general. But how do I find encodings where it is possible? I could then fall back to a slow method for bad encodings.
Hardcode a list of known encodings. It's not enough to know 'it is possible for UTF-8'; you must know how to do it. You must encode this 0xC0 stuff. Pragmatically speaking, a list of 'these encodings are fixed size' + 'this is how you efficiently split UTF-8' takes care of 99% of everything out there, no?
Actually, I do not need to know how to synchronize encodings, because CharsetDecoder API has an option to skip malcoded bytes. So I only need to know if it is at all possible for a given encoding.
Okay, it sounds like what you want is: "Given a charset, how do I determine if the answer is true or false to the question: Will CharsetDecoder's 'Malcoded' thing allow me to implement this algorithm by just feeding a sequence of bytes obtained by jumping to an arbitrary midpoint and looping 1 byte forward until decoding the sequence does not result in a malcoded error". There is no such API. You'd have to hardcode a list.
Could you edit this into answer somehow?
Well, for 8-bit encodings you can check whether they have a 1:1 mapping that would allow such an optimization. But you'd miss the important UTF-8 encoding with such a check, so hardcoding this special case would still be needed. Whereas the majority of charsets you'd cover with such a check are irrelevant in today's IT.
Keep in mind that since Java 9, Files.lines(…) already does this: “This implementation supports good parallel stream performance for the standard charsets UTF-8, US-ASCII and ISO-8859-1. Such line-optimal charsets have the property that the encoded bytes of a line feed ('\n') or a carriage return ('\r') are efficiently identifiable from other encoded characters when randomly accessing the bytes of the file.” Don’t waste your time reinventing the wheel.
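For reference, a minimal sketch of the Files.lines(...) approach from the last comment (the temp file is created here only so the example is self-contained; any UTF-8 text file would do):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class ParallelLineCount {
    public static void main(String[] args) throws IOException {
        // Self-contained demo input; in real use, point at your own file.
        Path file = Files.createTempFile("demo", ".txt");
        Files.writeString(file, "alpha\nbeta\ngamma\n", StandardCharsets.UTF_8);
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            // For the documented "line-optimal" charsets (UTF-8, US-ASCII,
            // ISO-8859-1), .parallel() can split the file across cores
            // without any custom chunking code.
            long count = lines.parallel().filter(l -> !l.isEmpty()).count();
            System.out.println(count); // 3
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```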
