
My database is utf8 and does not support inserting characters outside the BMP (Basic Multilingual Plane), such as emoji. I don't want to switch the database to utf8mb4. How can I validate on the backend that input contains no characters beyond the BMP, using a regular expression or some other method?

I know the BMP range is U+0000 to U+FFFF, but I don't know whether there is a regular expression that can verify whether a given value falls inside or outside this range.


2 Answers


tl;dr

boolean allInBmp =
    input
    .codePoints()
    .allMatch( Character :: isBmpCodePoint ) ;

Details

“My database is utf8” and “does not support inserting data beyond BMP character encoding” are contradictory statements. Supporting UTF-8 means supporting all of Unicode.

The MySQL developers chose the misleading name “utf8” for an encoding (now also called utf8mb3) that stores at most three bytes per character and therefore covers only the BMP.

But, ignoring that contradiction, you can examine each of the characters within a string in Java.

“if there is a regular expression”

RegEx is overkill for this problem.

Get the code point of each character by calling String#codePoints. Then examine each code point in the IntStream, testing for isBmpCodePoint.

Something like this untested code.

boolean allInBmp =
    input
    .codePoints()
    .filter( codePoint -> ! Character.isBmpCodePoint( codePoint ) )
    .count()
    == 0 ;

We can make the test simpler by calling isSupplementaryCodePoint. Using a method reference makes the code brief yet clear.

boolean allInBmp =
    input
    .codePoints()
    .filter( Character :: isSupplementaryCodePoint )
    .count()
    == 0 ;

As commented, we can further simplify this process. Instead of (a) running through all the characters and (b) maintaining a running count, we can simply stop after finding a single “supplementary” character (outside the BMP). Because allMatch and anyMatch are “short-circuiting” operations, the stream processing ends at the first hit.

boolean allInBmp =
    input
    .codePoints()
    .allMatch( Character :: isBmpCodePoint ) ;
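
As a minimal runnable sketch of the short-circuiting check above, the class below wraps it in a method and tries it on two sample strings. The class name BmpCheckDemo, the method name allInBmp, and the sample inputs are assumptions made for illustration only.

```java
public class BmpCheckDemo {

    // Returns true when every code point of the input lies in the BMP (U+0000 to U+FFFF).
    // allMatch short-circuits, so scanning stops at the first supplementary code point.
    public static boolean allInBmp(String input) {
        return input.codePoints().allMatch(Character::isBmpCodePoint);
    }

    public static void main(String[] args) {
        String plain = "caf\u00E9";            // U+00E9 is inside the BMP
        String withEmoji = "hi \uD83D\uDE00";  // U+1F600, written as a surrogate pair

        System.out.println(allInBmp(plain));
        System.out.println(allInBmp(withEmoji));
    }
}
```

Note that an unpaired surrogate is reported by codePoints() as its own value (U+D800 to U+DFFF), which counts as a BMP code point, so this check does not reject malformed input of that kind.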

If you routinely ran this on very large strings, you could experiment with making the stream parallelized. There is an overhead cost with running parallelized. That cost will not be worth it for smaller strings.
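
If you want to try the parallel variant, a sketch might look like the following. The class and method names are hypothetical, String.repeat assumes Java 11 or later, and whether this is any faster than the sequential version is something you would have to measure on your own data.

```java
public class ParallelBmpCheck {

    // Same test as the sequential version, but the IntStream is switched to parallel mode.
    // Only likely to pay off for very large inputs; measure before adopting.
    public static boolean allInBmpParallel(String input) {
        return input.codePoints()
                    .parallel()
                    .allMatch(Character::isBmpCodePoint);
    }

    public static void main(String[] args) {
        // A large BMP-only string with a single supplementary character (U+1F600) at the end.
        String large = "a".repeat(5_000_000) + "\uD83D\uDE00";
        System.out.println(allInBmpParallel(large));
    }
}
```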


If doing this test in multiple places, you may want to define your own validator using the Jakarta Validation framework.
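
Short of a full Jakarta Validation constraint, a plain-Java helper can already centralize the check. The sketch below is not the Jakarta API; BmpValidator, firstSupplementary, and requireBmp are hypothetical names. The same logic could be called from the isValid method of a custom constraint validator.

```java
import java.util.OptionalInt;

public final class BmpValidator {

    private BmpValidator() {}

    // Returns the first code point outside the BMP, or an empty OptionalInt if there is none.
    public static OptionalInt firstSupplementary(String input) {
        return input.codePoints()
                    .filter(Character::isSupplementaryCodePoint)
                    .findFirst();
    }

    // Returns the input unchanged if it is BMP-only; otherwise throws, naming the offending code point.
    public static String requireBmp(String input) {
        OptionalInt offender = firstSupplementary(input);
        if (offender.isPresent()) {
            throw new IllegalArgumentException(
                String.format("Character U+%X is outside the BMP", offender.getAsInt()));
        }
        return input;
    }
}
```

Calling BmpValidator.requireBmp(value) just before building the INSERT would reject the value in the backend rather than leaving it to the database.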


4 Comments

Why waste resources in counting the values when you do not want the count? Use a straight-forward input .codePoints() .allMatch(Character::isBmpCodePoint)
@Holger I added coverage of your point. Thank you. I used anyMatch rather than allMatch to hopefully be even more efficient.
There’s no difference in efficiency between allMatch(Character::isBmpCodePoint) or anyMatch(Character::isSupplementaryCodePoint). In either case, it will stop when encountering the first non BMP character.
@Holger Got it. Thanks. I revised the Answer.

To check for "beyond-BMP" in a single column in a single table:

SELECT * FROM t WHERE HEX(col) REGEXP "^(..)*F" LIMIT 1;

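The pattern looks for a byte whose first hex digit is F, which in UTF-8 occurs only as the lead byte of a four-byte sequence, that is, a character beyond the BMP. This will work until Unicode goes beyond Fwxxyyzz. (Which probably won't happen in my lifetime.)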
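
The same byte-level idea can be sketched in Java for use on the backend, assuming the string is encoded as UTF-8 before inspection. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LeadByteCheck {

    // In valid UTF-8, a byte with high nibble 0xF occurs only as the lead byte
    // of a four-byte sequence, which encodes a code point above U+FFFF.
    public static boolean hasFourByteSequence(String input) {
        for (byte b : input.getBytes(StandardCharsets.UTF_8)) {
            if ((b & 0xF0) == 0xF0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasFourByteSequence("caf\u00E9 \u20AC"));  // two- and three-byte characters only
        System.out.println(hasFourByteSequence("\uD83D\uDE00"));      // U+1F600 needs four bytes
    }
}
```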
