Stream of Char to Stream of Byte/Byte Array

Question

The following code takes a String s, converts it into a char array, keeps only the digits, builds a String from them, and then converts that String into a byte array.

char charArray[] = s.toCharArray();
StringBuffer sb = new StringBuffer(charArray.length);
for(int i=0; i<=charArray.length-1; i++) {
    if (Character.isDigit(charArray[i]))
        sb.append(charArray[i]);
}
byte[] bytes = sb.toString().getBytes(Charset.forName("UTF-8"));

I'm trying to change the above code to a stream-based approach. The following works.

s.chars()
.sequential()
.mapToObj(ch -> (char) ch)
.filter(Character::isDigit)
.collect(StringBuilder::new,
        StringBuilder::append, StringBuilder::append)
.toString()
.getBytes(Charset.forName("UTF-8"));

I think there could be a better way to do it.

Can we directly convert Stream<Character> to byte[] & skip the conversion to String in between?

There is no Stream<Character> in your code. It’s an IntStream.
– Holger, Commented Oct 12, 2018 at 11:26
After .mapToObj(ch -> (char) ch) the IntStream is converted to Stream<Character>
– Pankaj Singhal, Commented Oct 12, 2018 at 11:29
Yes, but that step is entirely obsolete. Just remove it. Further, when you use appendCodePoint, you should also use codePoints() instead of chars().
– Holger, Commented Oct 12, 2018 at 11:30
OK. But still, the question is: can we skip the intermediate IntStream to String conversion?
– Pankaj Singhal, Commented Oct 12, 2018 at 11:33
Also, I've changed StringBuilder::appendCodePoint to StringBuilder::append
– Pankaj Singhal, Commented Oct 12, 2018 at 11:34

Holger · Accepted Answer · 2018-10-12 13:35:29Z


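First, both of your variants have the problem of not handling characters outside the BMP correctly. They work on UTF-16 code units, so a supplementary character arrives as two surrogate values, neither of which passes Character.isDigit, and such digits are silently dropped.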

To support these characters, there is codePoints() as an alternative to chars(). You can use appendCodePoint on the target StringBuilder to consistently use codepoints throughout the entire operation. For this, you have to remove the unnecessary .mapToObj(ch -> (char) ch) step, whose removal also eliminates the overhead of creating a Stream<Character>.
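The difference between the two views of a string can be seen in a small sketch. The class and method names here (CodePointDemo, digitsByChars, digitsByCodePoints) are made up for illustration and are not part of the question's code; the input uses U+1D7DD, the double-struck digit five from the test string further down.

```java
class CodePointDemo {
    // Filters by UTF-16 code unit: each half of a surrogate pair is tested on its own.
    static String digitsByChars(String s) {
        return s.chars()
                .filter(Character::isDigit)
                .collect(StringBuilder::new,
                         (sb, c) -> sb.append((char) c), StringBuilder::append)
                .toString();
    }

    // Filters by code point: a supplementary character is tested as one value.
    static String digitsByCodePoints(String s) {
        return s.codePoints()
                .filter(Character::isDigit)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint, StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        String s = "a1\uD835\uDFDD"; // 'a', '1', U+1D7DD (outside the BMP)
        String byChars = digitsByChars(s);
        String byCodePoints = digitsByCodePoints(s);
        System.out.println(byChars.codePointCount(0, byChars.length()));           // prints 1
        System.out.println(byCodePoints.codePointCount(0, byCodePoints.length())); // prints 2
    }
}
```

The chars() version keeps only the ASCII digit, while the codePoints() version also keeps the supplementary digit.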

Then, you can avoid the conversion to a String in both cases, by encoding the StringBuilder using the Charset directly. In the case of the stream variant:

StringBuilder sb = s.codePoints()
    .filter(Character::isDigit)
    .collect(StringBuilder::new,
             StringBuilder::appendCodePoint, StringBuilder::append);

ByteBuffer bb = StandardCharsets.UTF_8.encode(CharBuffer.wrap(sb));
byte[] utf8Bytes = new byte[bb.remaining()];
bb.get(utf8Bytes);

Performing the conversion directly with the stream of codepoints is not easy. Not only is there no such support in the Charset API, there is also no straightforward way to collect a stream into a byte[] array.

One possibility is

byte[] utf8Bytes = s.codePoints()
    .filter(Character::isDigit)
    .flatMap(c -> c<128? IntStream.of(c):
        c<0x800? IntStream.of((c>>>6)|0xC0, c&0x3f|0x80):
        c<0x10000? IntStream.of((c>>>12)|0xE0, (c>>>6)&0x3f|0x80, c&0x3f|0x80):
        IntStream.of((c>>>18)|0xF0, (c>>>12)&0x3f|0x80, (c>>>6)&0x3f|0x80, c&0x3f|0x80))
    .collect(
        () -> new Object() { byte[] array = new byte[8]; int size;
            byte[] result(){ return array.length==size? array: Arrays.copyOf(array,size); }
        },
        (b,i) -> {
            if(b.array.length == b.size) b.array=Arrays.copyOf(b.array, b.size*2);
            b.array[b.size++] = (byte)i;
        },
        (a,b) -> {
            if(a.array.length<a.size+b.size) a.array=Arrays.copyOf(a.array,a.size+b.size);
            System.arraycopy(b.array, 0, a.array, a.size, b.size);
            a.size+=b.size;
        }).result();

The flatMap step converts the stream of codepoints into a stream of UTF-8 code units (compare with the UTF-8 description on Wikipedia). The collect step then gathers these int values into a byte[] array.
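To make the bit arithmetic easier to follow, here is a sketch that applies the same four cases to a single codepoint and compares the result with the JDK's own encoder. Utf8Units, encode and reference are hypothetical names introduced only for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Units {
    // Same arithmetic as the flatMap step, for one codepoint at a time.
    static byte[] encode(int c) {
        if (c < 0x80) {            // 1 byte:  0xxxxxxx
            return new byte[] { (byte) c };
        }
        if (c < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
            return new byte[] { (byte) ((c >>> 6) | 0xC0), (byte) (c & 0x3F | 0x80) };
        }
        if (c < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] { (byte) ((c >>> 12) | 0xE0),
                                (byte) ((c >>> 6) & 0x3F | 0x80),
                                (byte) (c & 0x3F | 0x80) };
        }
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] { (byte) ((c >>> 18) | 0xF0),
                            (byte) ((c >>> 12) & 0x3F | 0x80),
                            (byte) ((c >>> 6) & 0x3F | 0x80),
                            (byte) (c & 0x3F | 0x80) };
    }

    // Reference result from the JDK encoder, for comparison only.
    static byte[] reference(int c) {
        return new String(Character.toChars(c)).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // One sample per length class: '1', U+00E9, U+FF13, U+1D7DD
        for (int c : new int[] { '1', 0xE9, 0xFF13, 0x1D7DD }) {
            System.out.println("U+" + Integer.toHexString(c).toUpperCase() + " "
                + Arrays.toString(encode(c))
                + " matches JDK: " + Arrays.equals(encode(c), reference(c)));
        }
    }
}
```

The sketch does not special-case lone surrogate values, which the JDK encoder treats differently. That does not matter here, since the filter only lets digits through.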

It’s possible to eliminate the flatMap step by creating a dedicated collector which collects a stream of codepoints directly into a byte[] array:

byte[] utf8Bytes = s.codePoints()
    .filter(Character::isDigit)
    .collect(
        () -> new Object() { byte[] array = new byte[8]; int size;
            byte[] result(){ return array.length==size? array: Arrays.copyOf(array,size); }
            void put(int c) {
                if(array.length == size) array=Arrays.copyOf(array, size*2);
                array[size++] = (byte)c;
            }
        },
        (b,c) -> {
            if(c < 128) b.put(c);
            else {
                if(c<0x800) b.put((c>>>6)|0xC0);
                else {
                    if(c<0x10000) b.put((c>>>12)|0xE0);
                    else {
                        b.put((c>>>18)|0xF0);
                        b.put((c>>>12)&0x3f|0x80);
                    }
                    b.put((c>>>6)&0x3f|0x80);
                }
                b.put(c&0x3f|0x80);
            }
        },
        (a,b) -> {
            if(a.array.length<a.size+b.size) a.array=Arrays.copyOf(a.array,a.size+b.size);
            System.arraycopy(b.array, 0, a.array, a.size, b.size);
            a.size+=b.size;
        }).result();

but it doesn’t add to readability.
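If readability is the main concern, one alternative is to use a java.io.ByteArrayOutputStream as the mutable container instead of the hand-rolled growable array. This is a swapped-in technique rather than one of the variants above, and DigitBytes, utf8Units and digitsToUtf8 are names invented for this sketch.

```java
import java.io.ByteArrayOutputStream;
import java.util.stream.IntStream;

class DigitBytes {
    // UTF-8 code units of one codepoint, using the same cases as the flatMap step.
    static IntStream utf8Units(int c) {
        if (c < 0x80)    return IntStream.of(c);
        if (c < 0x800)   return IntStream.of((c >>> 6) | 0xC0, c & 0x3F | 0x80);
        if (c < 0x10000) return IntStream.of((c >>> 12) | 0xE0,
                                             (c >>> 6) & 0x3F | 0x80, c & 0x3F | 0x80);
        return IntStream.of((c >>> 18) | 0xF0, (c >>> 12) & 0x3F | 0x80,
                            (c >>> 6) & 0x3F | 0x80, c & 0x3F | 0x80);
    }

    static byte[] digitsToUtf8(String s) {
        return s.codePoints()
                .filter(Character::isDigit)
                .flatMap(DigitBytes::utf8Units)
                .collect(ByteArrayOutputStream::new,
                         // write(int) stores only the low eight bits of the value
                         (out, b) -> out.write(b),
                         // combiner, only used when the stream runs in parallel
                         (a, b) -> a.write(b.toByteArray(), 0, b.size()))
                .toByteArray();
    }

    public static void main(String[] args) {
        byte[] utf8Bytes = digitsToUtf8("some test text 1234 \u2714 \uFF13 \uD835\uDFDD");
        System.out.println(java.util.Arrays.toString(utf8Bytes));
    }
}
```

ByteArrayOutputStream takes over the growing and copying, at the cost of synchronized write calls and one final copy in toByteArray().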

You can test the solutions using a String like

String s = "some test text 1234 ✔ ３ 𝟝";

and printing the result as

System.out.println(Arrays.toString(utf8Bytes));
System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8));

which should produce

[49, 50, 51, 52, -17, -68, -109, -16, -99, -97, -99]
1234３𝟝

It should be obvious that the first variant is the simplest, and it will have reasonable performance, even if it doesn’t create a byte[] array directly. Further, it’s the only variant which can be adapted for getting other result charsets.
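As a sketch of that adaptability, the first variant can be wrapped in a method that takes the target Charset as a parameter. DigitEncoder and digitsToBytes are hypothetical names. Note that Charset.encode replaces characters the charset cannot represent with its default replacement bytes.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DigitEncoder {
    // Keeps the digits of s and encodes them with the given charset,
    // without creating an intermediate String.
    static byte[] digitsToBytes(String s, Charset charset) {
        StringBuilder sb = s.codePoints()
            .filter(Character::isDigit)
            .collect(StringBuilder::new,
                     StringBuilder::appendCodePoint, StringBuilder::append);

        ByteBuffer bb = charset.encode(CharBuffer.wrap(sb));
        byte[] bytes = new byte[bb.remaining()];
        bb.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        String s = "some test text 1234 \u2714 \uFF13 \uD835\uDFDD";
        System.out.println(Arrays.toString(digitsToBytes(s, StandardCharsets.UTF_8)));
        System.out.println(Arrays.toString(digitsToBytes(s, StandardCharsets.UTF_16BE)));
    }
}
```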

But even the simple variant

byte[] utf8Bytes = s.codePoints()
    .filter(Character::isDigit)
    .collect(StringBuilder::new,
             StringBuilder::appendCodePoint, StringBuilder::append)
    .toString().getBytes(StandardCharsets.UTF_8);

is not so bad, regardless of whether the toString() operation involves copying the characters.

edited Oct 12, 2018 at 13:35

answered Oct 12, 2018 at 13:12


8 Comments

Eugene Over a year ago

I doubt that many people (myself included) will actually understand why and what you are doing in that big operation - I only have basic hints, but booooy... can you add at least a short explanation of where everyone should start looking?

Holger Over a year ago

@Eugene which big operation? There are two. For the conversion to UTF-8, a good start would be en.wikipedia.org/wiki/UTF-8#Description

Eugene Over a year ago

that is pretty darn impressive, how many people do you know that deal with UTF outside BMP in their apps? I am 100% not one

Holger Over a year ago

@Eugene well most people encounter characters outside the BMP due to Emojis…

Holger Over a year ago

@PankajSinghal then, you have no suitable font, but you managed to copy the character correctly, so your system can process it. It’s a 5 with double outline (U+1D7DD). I suppose, when you google for U+1D7DD you will find pages with example renderings. E.g. graphemica.com/%F0%9D%9F%9D
