
The following code takes a String s, converts it into a char array, keeps only the digits, converts the result to a String, and then converts that into a byte array.

char charArray[] = s.toCharArray();
StringBuffer sb = new StringBuffer(charArray.length);
for(int i=0; i<=charArray.length-1; i++) {
    if (Character.isDigit(charArray[i]))
        sb.append(charArray[i]);
}
byte[] bytes = sb.toString().getBytes(Charset.forName("UTF-8")); 

I'm trying to rewrite the above code using streams. The following works:

s.chars()
.sequential()
.mapToObj(ch -> (char) ch)
.filter(Character::isDigit)
.collect(StringBuilder::new,
        StringBuilder::append, StringBuilder::append)
.toString()
.getBytes(Charset.forName("UTF-8"));

I think there could be a better way to do it.

Can we directly convert Stream<Character> to byte[] & skip the conversion to String in between?

  • There is no Stream<Character> in your code. It’s an IntStream. Commented Oct 12, 2018 at 11:26
  • After .mapToObj(ch -> (char) ch) the IntStream is converted to Stream<Character> Commented Oct 12, 2018 at 11:29
  • Yes, but that step is entirely obsolete. Just remove it. Further, when you use appendCodePoint, you should also use codePoints() instead of chars(). Commented Oct 12, 2018 at 11:30
  • OK. But still, the question is: can we skip the intermediate IntStream to String conversion? Commented Oct 12, 2018 at 11:33
  • Also, I've changed StringBuilder::appendCodePoint to StringBuilder::append Commented Oct 12, 2018 at 11:34

1 Answer

First, both of your variants have the problem of not handling characters outside the BMP correctly.

To support these characters, there is codePoints() as an alternative to chars(). You can use appendCodePoint on the target StringBuilder to consistently use codepoints throughout the entire operation. For this, you have to remove the unnecessary .mapToObj(ch -> (char) ch) step, whose removal also eliminates the overhead of creating a Stream<Character>.
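The difference between chars() and codePoints() can be seen with a small sketch like the one below. The class and method names are made up for illustration; the input contains U+1D7DD, a digit outside the BMP that is stored as a surrogate pair.

```java
public class CodePointDemo {
    // Iterates UTF-16 code units: a supplementary character arrives as two
    // surrogates, and neither surrogate is classified as a digit.
    static String digitsViaChars(String s) {
        return s.chars()
                .filter(Character::isDigit)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint, StringBuilder::append)
                .toString();
    }

    // Iterates whole code points, so the supplementary digit is kept.
    static String digitsViaCodePoints(String s) {
        return s.codePoints()
                .filter(Character::isDigit)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint, StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // "a1", then U+1D7DD (MATHEMATICAL DOUBLE-STRUCK DIGIT FIVE), then "b2"
        String s = "a1" + new String(Character.toChars(0x1D7DD)) + "b2";
        String viaChars = digitsViaChars(s);
        String viaCodePoints = digitsViaCodePoints(s);
        System.out.println(viaChars.codePointCount(0, viaChars.length()));           // 2
        System.out.println(viaCodePoints.codePointCount(0, viaCodePoints.length())); // 3
    }
}
```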

Then, you can avoid the conversion to a String in both cases, by encoding the StringBuilder using the Charset directly. In the case of the stream variant:

StringBuilder sb = s.codePoints()
    .filter(Character::isDigit)
    .collect(StringBuilder::new,
             StringBuilder::appendCodePoint, StringBuilder::append);

ByteBuffer bb = StandardCharsets.UTF_8.encode(CharBuffer.wrap(sb));
byte[] utf8Bytes = new byte[bb.remaining()];
bb.get(utf8Bytes);
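For the loop variant, the same idea might look like the following sketch. The class and method names are hypothetical, and the loop steps by code point rather than by char (a change from the loop in the question) so that surrogate pairs stay together.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

public class LoopDigitsToUtf8 {
    static byte[] digitsToUtf8(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        // Advance by one code point (one or two chars) per iteration.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isDigit(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        // Encode the builder's characters directly, without creating a String.
        ByteBuffer bb = StandardCharsets.UTF_8.encode(CharBuffer.wrap(sb));
        byte[] bytes = new byte[bb.remaining()];
        bb.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(digitsToUtf8("ab12c3"))); // [49, 50, 51]
    }
}
```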

Performing the conversion directly on the stream of codepoints is not easy. Not only is there no such support in the Charset API, there is also no straightforward way to collect a Stream into a byte[] array.

One possibility is

byte[] utf8Bytes = s.codePoints()
    .filter(Character::isDigit)
    // encode each codepoint as one to four UTF-8 code units
    .flatMap(c -> c<128? IntStream.of(c):
        c<0x800? IntStream.of((c>>>6)|0xC0, c&0x3f|0x80):
        c<0x10000? IntStream.of((c>>>12)|0xE0, (c>>>6)&0x3f|0x80, c&0x3f|0x80):
        IntStream.of((c>>>18)|0xF0, (c>>>12)&0x3f|0x80, (c>>>6)&0x3f|0x80, c&0x3f|0x80))
    .collect(
        // supplier: a growable byte buffer held in an anonymous class
        () -> new Object() { byte[] array = new byte[8]; int size;
            byte[] result(){ return array.length==size? array: Arrays.copyOf(array,size); }
        },
        // accumulator: append one code unit, doubling the array when it is full
        (b,i) -> {
            if(b.array.length == b.size) b.array=Arrays.copyOf(b.array, b.size*2);
            b.array[b.size++] = (byte)i;
        },
        // combiner: append the content of b to a (only needed for parallel streams)
        (a,b) -> {
            if(a.array.length<a.size+b.size) a.array=Arrays.copyOf(a.array,a.size+b.size);
            System.arraycopy(b.array, 0, a.array, a.size, b.size);
            a.size+=b.size;
        }).result();

The flatMap step converts the stream of codepoints to a stream of UTF-8 code units (compare with the UTF-8 description on Wikipedia). The collect step collects the int values into a byte[] array.
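To check the bit manipulation in isolation, a sketch like the following compares it against the JDK's own encoder for one codepoint from each length class. The class and method names are hypothetical; the ternary chain mirrors the flatMap lambda, and, like that lambda, it makes no attempt to reject lone surrogates.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.stream.IntStream;

public class Utf8UnitsDemo {
    // One to four UTF-8 code units for a single codepoint.
    static IntStream utf8Units(int c) {
        return c < 0x80 ? IntStream.of(c)
             : c < 0x800 ? IntStream.of((c >>> 6) | 0xC0, c & 0x3F | 0x80)
             : c < 0x10000 ? IntStream.of((c >>> 12) | 0xE0,
                                          (c >>> 6) & 0x3F | 0x80, c & 0x3F | 0x80)
             : IntStream.of((c >>> 18) | 0xF0, (c >>> 12) & 0x3F | 0x80,
                            (c >>> 6) & 0x3F | 0x80, c & 0x3F | 0x80);
    }

    static byte[] encode(int c) {
        int[] units = utf8Units(c).toArray();
        byte[] out = new byte[units.length];
        for (int i = 0; i < units.length; i++) {
            out[i] = (byte) units[i]; // keep the low 8 bits
        }
        return out;
    }

    // True if the manual encoding agrees with String.getBytes for this codepoint.
    static boolean matchesJdk(int c) {
        byte[] expected = new String(Character.toChars(c)).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(expected, encode(c));
    }

    public static void main(String[] args) {
        // '1', U+0663 ARABIC-INDIC DIGIT THREE, U+FF13 FULLWIDTH DIGIT THREE, U+1D7DD
        for (int c : new int[] {'1', 0x663, 0xFF13, 0x1D7DD}) {
            System.out.printf("U+%04X %s %b%n", c, Arrays.toString(encode(c)), matchesJdk(c));
        }
    }
}
```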

It’s possible to eliminate the flatMap step by creating a dedicated collector which collects a stream of codepoints directly into a byte[] array:

byte[] utf8Bytes = s.codePoints()
    .filter(Character::isDigit)
    .collect(
        // supplier: a growable byte buffer with a put method for single bytes
        () -> new Object() { byte[] array = new byte[8]; int size;
            byte[] result(){ return array.length==size? array: Arrays.copyOf(array,size); }
            void put(int c) {
                if(array.length == size) array=Arrays.copyOf(array, size*2);
                array[size++] = (byte)c;
            }
        },
        // accumulator: write the lead byte for the codepoint's length class,
        // then the continuation bytes shared by the longer forms
        (b,c) -> {
            if(c < 128) b.put(c);
            else {
                if(c<0x800) b.put((c>>>6)|0xC0);
                else {
                    if(c<0x10000) b.put((c>>>12)|0xE0);
                    else {
                        b.put((c>>>18)|0xF0);
                        b.put((c>>>12)&0x3f|0x80);
                    }
                    b.put((c>>>6)&0x3f|0x80);
                }
                b.put(c&0x3f|0x80);
            }
        },
        // combiner: append the content of b to a (only needed for parallel streams)
        (a,b) -> {
            if(a.array.length<a.size+b.size) a.array=Arrays.copyOf(a.array,a.size+b.size);
            System.arraycopy(b.array, 0, a.array, a.size, b.size);
            a.size+=b.size;
        }).result();

but it doesn’t add to readability.
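If readability matters more than avoiding every intermediate object, one alternative that is not part of the answer above is to collect into a java.io.ByteArrayOutputStream instead of a hand-written buffer. The sketch below assumes that substitution; the class and method names are hypothetical, and toByteArray() copies the result once at the end.

```java
import java.io.ByteArrayOutputStream;
import java.util.stream.IntStream;

public class BaosCollectDemo {
    // One to four UTF-8 code units for a single codepoint.
    static IntStream utf8Units(int c) {
        return c < 0x80 ? IntStream.of(c)
             : c < 0x800 ? IntStream.of((c >>> 6) | 0xC0, c & 0x3F | 0x80)
             : c < 0x10000 ? IntStream.of((c >>> 12) | 0xE0,
                                          (c >>> 6) & 0x3F | 0x80, c & 0x3F | 0x80)
             : IntStream.of((c >>> 18) | 0xF0, (c >>> 12) & 0x3F | 0x80,
                            (c >>> 6) & 0x3F | 0x80, c & 0x3F | 0x80);
    }

    static byte[] digitsToUtf8(String s) {
        return s.codePoints()
                .filter(Character::isDigit)
                .flatMap(BaosCollectDemo::utf8Units)
                .collect(ByteArrayOutputStream::new,
                         (out, unit) -> out.write(unit),  // write(int) stores the low 8 bits
                         (a, b) -> a.write(b.toByteArray(), 0, b.size()))
                .toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(digitsToUtf8("a1b2"))); // [49, 50]
    }
}
```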

You can test the solutions using a String which contains a fullwidth digit (U+FF13) and a digit outside the BMP (U+1D7DD), like

String s = "some test text 1234 ✔ ３ 𝟝";

and printing the result as

System.out.println(Arrays.toString(utf8Bytes));
System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8));

which should produce

[49, 50, 51, 52, -17, -68, -109, -16, -99, -97, -99]
1234３𝟝

It should be obvious that the first variant is the simplest, and it will have reasonable performance, even if it doesn’t create a byte[] array directly. Further, it’s the only variant which can be adapted to produce other target charsets.
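As a rough illustration of that adaptability, the StringBuilder-based variant could be wrapped in a helper that takes the target Charset as a parameter. The class and method names here are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DigitsEncoder {
    // Keeps the digits of s and encodes them with the given charset.
    static byte[] digits(String s, Charset cs) {
        StringBuilder sb = s.codePoints()
                .filter(Character::isDigit)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint, StringBuilder::append);
        ByteBuffer bb = cs.encode(CharBuffer.wrap(sb));
        byte[] bytes = new byte[bb.remaining()];
        bb.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(digits("a1b2", StandardCharsets.UTF_8)));    // [49, 50]
        System.out.println(Arrays.toString(digits("a1b2", StandardCharsets.UTF_16BE))); // [0, 49, 0, 50]
    }
}
```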

But even the

byte[] utf8Bytes = s.codePoints()
    .filter(Character::isDigit)
    .collect(StringBuilder::new,
             StringBuilder::appendCodePoint, StringBuilder::append)
    .toString().getBytes(StandardCharsets.UTF_8);

is not so bad, regardless of whether the toString() operation involves copying the character data.


8 Comments

I doubt that many people (me including) will actually understand why and what you are doing in that big operation - I only have basic hints, but booooy... can you add at least one minor explanation like where should everyone start looking from?
@Eugene which big operation? There are two. For the conversion to UTF-8, a good start would be en.wikipedia.org/wiki/UTF-8#Description
that is pretty darn impressive, how many people do you know that deal with UTF outside BMP in their apps? I am 100% not one
@Eugene well most people encounter characters outside the BMP due to Emojis…
@PankajSinghal then, you have no suitable font, but you managed to copy the character correctly, so your system can process it. It’s a 5 with double outline (U+1D7DD). I suppose, when you google for U+1D7DD you will find pages with example renderings. E.g. graphemica.com/%F0%9D%9F%9D