How to validate input for a database encoding that does not support characters beyond the BMP

Question

My database uses utf8 and does not support inserting characters beyond the BMP, such as emoji. I don't want to switch the database to utf8mb4. How can I validate on the backend that input contains no characters beyond the BMP, using a regular expression or some other method?

I know the BMP range is U+0000 to U+FFFF, but I don't know whether there is a regular expression to check whether a given value falls inside or outside this range.

Basil Bourque · Accepted Answer · 2024-05-07 13:49:21Z

tl;dr

boolean allInBmp =
    input
    .codePoints()
    .allMatch( Character :: isBmpCodePoint ) ;

Details

“My database is utf8” and “does not support inserting data beyond BMP character encoding” are contradictory statements. Supporting UTF-8 means supporting all of Unicode.

The MySQL developers chose a misleading name, “utf8”, for an encoding that stores at most three bytes per character and therefore covers only the BMP.

But, ignoring that contradiction, you can examine each of the characters within a string in Java.

“if there is a regular expression”

RegEx is overkill for this problem.
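If a regular expression is still preferred, a minimal sketch follows. It assumes Java's java.util.regex, where \x{h...h} denotes a code point in hexadecimal. The class name BmpRegex and the helper isAllBmp are made up for illustration.

```java
import java.util.regex.Pattern;

public class BmpRegex {
    // Matches any single supplementary code point (U+10000 through U+10FFFF).
    private static final Pattern SUPPLEMENTARY =
        Pattern.compile("[\\x{10000}-\\x{10FFFF}]");

    // Hypothetical helper: true when no supplementary code point is found.
    public static boolean isAllBmp(String input) {
        return !SUPPLEMENTARY.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(isAllBmp("plain text"));         // true
        System.out.println(isAllBmp("smile \uD83D\uDE00")); // false (U+1F600)
    }
}
```

Java's regex engine works on code points rather than on individual char values, so a surrogate pair is matched as one supplementary character.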

Get the code point of each character by calling String#codePoints. Then examine each code point in the IntStream, testing for isBmpCodePoint.

Something like this untested code:

boolean allInBmp =
    input
    .codePoints()
    .filter( codePoint -> ! Character.isBmpCodePoint( codePoint ) )
    .count()
    == 0 ;

We can make the test simpler by calling isSupplementaryCodePoint. Using a method reference makes the code brief yet clear.

boolean allInBmp =
    input
    .codePoints()
    .filter( Character :: isSupplementaryCodePoint )
    .count()
    == 0 ;

As commented, we can simplify this further. Instead of (a) running through all the characters and (b) maintaining a running count, we can simply stop after finding a single “supplementary” character (one outside the BMP). Because allMatch and anyMatch are “short-circuiting” operations, the stream processing ends at the first hit.

boolean allInBmp =
    input
    .codePoints()
    .allMatch( Character :: isBmpCodePoint ) ;
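To try the short-circuiting forms side by side, here is a small self-contained sketch. The class name BmpCheck, the method names, and the sample strings are illustrative assumptions, not part of the question.

```java
public class BmpCheck {
    // True when every code point lies in the BMP (U+0000 through U+FFFF).
    public static boolean allInBmp(String input) {
        return input.codePoints().allMatch(Character::isBmpCodePoint);
    }

    // Mirror-image test: true when at least one code point is supplementary.
    public static boolean hasSupplementary(String input) {
        return input.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        String bmpOnly = "caf\u00E9 \u4E2D\u6587"; // accented and CJK text, all BMP
        String withEmoji = "hi \uD83D\uDE00";      // ends with U+1F600, outside the BMP
        System.out.println(allInBmp(bmpOnly));          // true
        System.out.println(allInBmp(withEmoji));        // false
        System.out.println(hasSupplementary(withEmoji)); // true
    }
}
```

Note that an empty string passes allInBmp, since allMatch returns true on an empty stream.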

If you routinely ran this on very large strings, you could experiment with making the stream parallelized. There is an overhead cost with running parallelized. That cost will not be worth it for smaller strings.
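A parallel variant might look like the sketch below. parallel() is a standard method on IntStream. The class name and the size threshold are arbitrary placeholders, and whether parallelism helps at all should be decided by measurement.

```java
import java.util.stream.IntStream;

public class BmpCheckParallel {
    // Arbitrary placeholder threshold; measure before relying on it.
    private static final int PARALLEL_THRESHOLD = 1_000_000;

    public static boolean allInBmp(String input) {
        IntStream codePoints = input.codePoints();
        // Switch to parallel processing only for very large inputs.
        if (input.length() >= PARALLEL_THRESHOLD) {
            codePoints = codePoints.parallel();
        }
        return codePoints.allMatch(Character::isBmpCodePoint);
    }
}
```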

If doing this test in multiple places, you may want to define your own validator using the Jakarta Validation framework.

edited May 7, 2024 at 13:49

answered May 6, 2024 at 3:33

4 Comments

Holger Over a year ago

Why waste resources counting the values when you do not want the count? Use a straightforward input.codePoints().allMatch(Character::isBmpCodePoint)

Basil Bourque Over a year ago

@Holger I added coverage of your point. Thank you. I used anyMatch rather than allMatch to hopefully be even more efficient.

Holger Over a year ago

There’s no difference in efficiency between allMatch(Character::isBmpCodePoint) and anyMatch(Character::isSupplementaryCodePoint). In either case, it will stop when encountering the first non-BMP character.

Basil Bourque Over a year ago

@Holger Got it. Thanks. I revised the Answer.

Rick James · Accepted Answer · 2025-05-15 22:26:12Z

To check for "beyond-BMP" in a single column in a single table:

SELECT * FROM t WHERE HEX(col) REGEXP "^(..)*F" LIMIT 1;

This will work until Unicode goes beyond Fwxxyyzz.  (Which probably won't happen in my lifetime.)
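The pattern relies on how UTF-8 encodes characters beyond the BMP: each one takes four bytes, and the first of those bytes lies in the range 0xF0 through 0xF4, so its hex form starts with F. No byte of a shorter sequence has F as its high nibble. The same byte-level idea can be sketched in Java; the class and method names here are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LeadByteCheck {
    // True when the UTF-8 encoding contains a byte whose high nibble is F,
    // which is the lead byte of a four-byte (beyond-BMP) sequence.
    public static boolean hasFourByteSequence(String input) {
        for (byte b : input.getBytes(StandardCharsets.UTF_8)) {
            if ((b & 0xF0) == 0xF0) {
                return true;
            }
        }
        return false;
    }
}
```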

answered May 15 at 22:26
