First, both of your variants have the problem of not handling characters outside the BMP correctly.
To support these characters, there is codePoints() as an alternative to chars(). You can use appendCodePoint on the target StringBuilder to consistently use codepoints throughout the entire operation. For this, you have to remove the unnecessary .mapToObj(ch -> (char) ch) step, whose removal also eliminates the overhead of creating a Stream<Character>.
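To make the difference concrete, here is a minimal sketch that compares the two approaches on a string containing a digit outside the BMP. The class and method names (`CodePointDemo`, `digitsViaChars`, `digitsViaCodePoints`) are made up for this illustration and are not from your code.

```java
import java.util.stream.Collectors;

public class CodePointDemo {
    // Filters UTF-16 code units: each half of a surrogate pair is tested
    // on its own, and a lone surrogate is never a digit.
    static String digitsViaChars(String s) {
        return s.chars()
                .filter(Character::isDigit)
                .mapToObj(ch -> String.valueOf((char) ch))
                .collect(Collectors.joining());
    }

    // Filters whole code points, so supplementary digits survive.
    static String digitsViaCodePoints(String s) {
        return s.codePoints()
                .filter(Character::isDigit)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint, StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // 'a', '1', then U+1D7DD (a mathematical digit five) as a surrogate pair
        String s = "a1\uD835\uDFDD";
        System.out.println(digitsViaChars(s));
        System.out.println(digitsViaCodePoints(s));
    }
}
```

With this input, the `chars()` variant should keep only the ASCII digit, while the `codePoints()` variant keeps both digits.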
Then, you can avoid the conversion to a String in both cases, by encoding the StringBuilder using the Charset directly. In the case of the stream variant:
    StringBuilder sb = s.codePoints()
        .filter(Character::isDigit)
        .collect(StringBuilder::new,
                 StringBuilder::appendCodePoint, StringBuilder::append);
    ByteBuffer bb = StandardCharsets.UTF_8.encode(CharBuffer.wrap(sb));
    byte[] utf8Bytes = new byte[bb.remaining()];
    bb.get(utf8Bytes);
Performing the conversion directly on the stream of codepoints is not easy. Not only is there no such support in the Charset API, there is also no straightforward way to collect a stream into a byte[] array.
One possibility is
    byte[] utf8Bytes = s.codePoints()
        .filter(Character::isDigit)
        // expand each codepoint into its one to four UTF-8 code units
        .flatMap(c -> c<128? IntStream.of(c):
            c<0x800? IntStream.of((c>>>6)|0xC0, c&0x3f|0x80):
            c<0x10000? IntStream.of((c>>>12)|0xE0, (c>>>6)&0x3f|0x80, c&0x3f|0x80):
            IntStream.of((c>>>18)|0xF0, (c>>>12)&0x3f|0x80, (c>>>6)&0x3f|0x80, c&0x3f|0x80))
        .collect(
            // container: growable byte buffer; result() trims it to the used size
            () -> new Object() { byte[] array = new byte[8]; int size;
                byte[] result(){ return array.length==size? array: Arrays.copyOf(array,size); }
            },
            // accumulator: double the capacity when full, then append one byte
            (b,i) -> {
                if(b.array.length == b.size) b.array=Arrays.copyOf(b.array, b.size*2);
                b.array[b.size++] = (byte)i;
            },
            // combiner (only used by parallel streams): append b's bytes to a
            (a,b) -> {
                if(a.array.length<a.size+b.size) a.array=Arrays.copyOf(a.array,a.size+b.size);
                System.arraycopy(b.array, 0, a.array, a.size, b.size);
                a.size+=b.size;
            }).result();
The flatMap step converts the stream of codepoints to a stream of UTF-8 code units (compare with the UTF-8 description on Wikipedia). The collect step collects the int values into a byte[] array.
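As a sanity check of the bit arithmetic, the following sketch applies the same shifts and masks to a single codepoint and compares the result with what the JDK's own encoder produces. `Utf8Units`, `encode` and `reference` are hypothetical names for this illustration. Like the stream code, it assumes well-formed input and does no special handling for unpaired surrogates.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Units {
    // Hypothetical helper: same arithmetic as the flatMap step, one codepoint at a time.
    static byte[] encode(int c) {
        if (c < 0x80) {            // 1 unit: 0xxxxxxx
            return new byte[] { (byte) c };
        }
        if (c < 0x800) {           // 2 units: 110xxxxx 10xxxxxx
            return new byte[] { (byte) ((c >>> 6) | 0xC0),
                                (byte) (c & 0x3F | 0x80) };
        }
        if (c < 0x10000) {         // 3 units: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] { (byte) ((c >>> 12) | 0xE0),
                                (byte) ((c >>> 6) & 0x3F | 0x80),
                                (byte) (c & 0x3F | 0x80) };
        }
        // 4 units: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] { (byte) ((c >>> 18) | 0xF0),
                            (byte) ((c >>> 12) & 0x3F | 0x80),
                            (byte) ((c >>> 6) & 0x3F | 0x80),
                            (byte) (c & 0x3F | 0x80) };
    }

    // Reference result from the JDK encoder, for comparison only.
    static byte[] reference(int c) {
        return new String(Character.toChars(c)).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // one sample codepoint per encoded length
        for (int c : new int[] { 0x31, 0xE9, 0xFF13, 0x1D7DD }) {
            System.out.println(Integer.toHexString(c) + " -> "
                + Arrays.toString(encode(c))
                + " matches JDK: " + Arrays.equals(encode(c), reference(c)));
        }
    }
}
```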
It’s possible to eliminate the flatMap step by creating a dedicated collector which collects a stream of codepoints directly into a byte[] array:
    byte[] utf8Bytes = s.codePoints()
        .filter(Character::isDigit)
        .collect(
            // container: growable byte buffer with an append operation
            () -> new Object() { byte[] array = new byte[8]; int size;
                byte[] result(){ return array.length==size? array: Arrays.copyOf(array,size); }
                void put(int c) {
                    if(array.length == size) array=Arrays.copyOf(array, size*2);
                    array[size++] = (byte)c;
                }
            },
            // accumulator: write the lead byte, then the continuation bytes
            (b,c) -> {
                if(c < 128) b.put(c);
                else {
                    if(c<0x800) b.put((c>>>6)|0xC0);
                    else {
                        if(c<0x10000) b.put((c>>>12)|0xE0);
                        else {
                            b.put((c>>>18)|0xF0);
                            b.put((c>>>12)&0x3f|0x80);
                        }
                        b.put((c>>>6)&0x3f|0x80);
                    }
                    b.put(c&0x3f|0x80);
                }
            },
            // combiner (only used by parallel streams): append b's bytes to a
            (a,b) -> {
                if(a.array.length<a.size+b.size) a.array=Arrays.copyOf(a.array,a.size+b.size);
                System.arraycopy(b.array, 0, a.array, a.size, b.size);
                a.size+=b.size;
            }).result();
but it doesn’t add to readability.
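If the hand-written growing array is the part that hurts readability most, one alternative worth considering is to let a ByteArrayOutputStream act as the mutable container. This is only a sketch of that idea, not a claim that it performs better. `BaosCollect`, `utf8Units` and `digitsAsUtf8` are names invented here, and the encoding arithmetic is the same as in the flatMap variant.

```java
import java.io.ByteArrayOutputStream;
import java.util.stream.IntStream;

public class BaosCollect {
    // Same UTF-8 arithmetic as the flatMap variant, pulled into a helper.
    static IntStream utf8Units(int c) {
        return c < 0x80 ? IntStream.of(c)
             : c < 0x800 ? IntStream.of((c >>> 6) | 0xC0, c & 0x3F | 0x80)
             : c < 0x10000 ? IntStream.of((c >>> 12) | 0xE0,
                                          (c >>> 6) & 0x3F | 0x80, c & 0x3F | 0x80)
             : IntStream.of((c >>> 18) | 0xF0, (c >>> 12) & 0x3F | 0x80,
                            (c >>> 6) & 0x3F | 0x80, c & 0x3F | 0x80);
    }

    static byte[] digitsAsUtf8(String s) {
        ByteArrayOutputStream out = s.codePoints()
            .filter(Character::isDigit)
            .flatMap(BaosCollect::utf8Units)
            .collect(
                () -> new ByteArrayOutputStream(),
                // write(int) stores only the low eight bits of the value
                (buf, unit) -> buf.write(unit),
                // combiner for parallel streams: append the right buffer to the left
                (left, right) -> left.write(right.toByteArray(), 0, right.size()));
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = digitsAsUtf8("1234 \uFF13 \uD835\uDFDD");
        System.out.println(java.util.Arrays.toString(bytes));
    }
}
```

The trade-off is an extra copy in `toByteArray()` and synchronized writes, in exchange for not maintaining the resizing logic by hand.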
You can test the solutions using a String like
    String s = "some test text 1234 ✔ ３ 𝟝";
and printing the result as
    System.out.println(Arrays.toString(utf8Bytes));
    System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8));
which should produce
    [49, 50, 51, 52, -17, -68, -109, -16, -99, -97, -99]
    1234３𝟝
It should be obvious that the first variant is the simplest, and it will have reasonable performance, even if it doesn’t create a byte[] array directly. Further, it’s the only variant that can be adapted to produce other target charsets.
But even the
    byte[] utf8Bytes = s.codePoints()
        .filter(Character::isDigit)
        .collect(StringBuilder::new,
                 StringBuilder::appendCodePoint, StringBuilder::append)
        .toString().getBytes(StandardCharsets.UTF_8);
is not so bad, regardless of whether the toString() call involves copying the character data.
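To illustrate the point about other target charsets, here is a sketch of the StringBuilder-based variant with the charset passed in as a parameter. `DigitsInCharset` and `digitBytes` are hypothetical names. Note that `Charset.encode` silently substitutes a replacement for characters the target charset cannot represent, so a strict conversion would need a `CharsetEncoder` instead.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DigitsInCharset {
    // Hypothetical helper: collect the digits, then encode them with any charset.
    static byte[] digitBytes(String s, Charset cs) {
        StringBuilder sb = s.codePoints()
            .filter(Character::isDigit)
            .collect(StringBuilder::new,
                     StringBuilder::appendCodePoint, StringBuilder::append);
        // encode the StringBuilder contents directly, without an intermediate String
        ByteBuffer bb = cs.encode(CharBuffer.wrap(sb));
        byte[] bytes = new byte[bb.remaining()];
        bb.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        String s = "a1b2";
        System.out.println(Arrays.toString(digitBytes(s, StandardCharsets.UTF_8)));
        System.out.println(Arrays.toString(digitBytes(s, StandardCharsets.UTF_16BE)));
    }
}
```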