tl;dr
boolean allInBmp =
input
.codePoints()
.allMatch( Character :: isBmpCodePoint ) ;
Details
“My database is utf8” and “does not support inserting data beyond BMP character encoding” are contradictory statements. Supporting UTF-8 means supporting all of Unicode.
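The MySQL developers chose a misleading name, “utf8”, for an encoding that stores at most three bytes per character and therefore covers only the BMP. MySQL's full UTF-8 encoding is named “utf8mb4”.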
But, ignoring that contradiction, you can examine each of the characters within a string in Java.
if there is a regular expression
RegEx is overkill for this problem.
Get the code point of each character by calling String#codePoints. Then examine each code point in the IntStream, testing for isBmpCodePoint.
Something like this untested code.
boolean allInBmp =
input
.codePoints()
.filter( codePoint -> ! Character.isBmpCodePoint( codePoint ) )
.count()
== 0 ;
We can make the test simpler by calling isSupplementaryCodePoint. Using a method reference makes the code brief yet clear.
boolean allInBmp =
input
.codePoints()
.filter( Character :: isSupplementaryCodePoint )
.count()
== 0 ;
As commented, we can further simplify this process. Instead of (a) running through all the characters, and (b) maintaining a running count, we can simply stop after finding a single “supplementary” character (outside the BMP). As a “short-circuiting” operation, the stream processing ends after the first hit. See allMatch and anyMatch.
boolean allInBmp =
input
.codePoints()
.allMatch( Character :: isBmpCodePoint ) ;
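To try this out, here is a minimal runnable sketch that wraps the check in a method. The class name BmpCheck and the method name allInBmp are made up for illustration. The emoji is written as a surrogate-pair escape for U+1F600, so the source file encoding does not matter.

```java
public class BmpCheck {

    // Hypothetical helper: true when every code point lies in the BMP (U+0000 to U+FFFF).
    public static boolean allInBmp(String input) {
        return input
                .codePoints()
                .allMatch(Character::isBmpCodePoint); // Stops at the first supplementary code point.
    }

    public static void main(String[] args) {
        System.out.println(allInBmp("caf\u00E9"));         // true: all characters are in the BMP
        System.out.println(allInBmp("Hi \uD83D\uDE00"));   // false: U+1F600 is a supplementary character
    }
}
```

Note that an empty string yields true, because allMatch on an empty stream returns true. Whether that suits your database check is a judgment call for your application.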
If you routinely ran this on very large strings, you could experiment with making the stream parallelized. There is an overhead cost with running parallelized. That cost will not be worth it for smaller strings.
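A parallel variant might look like the sketch below. The names ParallelBmpCheck and allInBmpParallel are made up for illustration. This is untested for performance, so measure with your own data before adopting it.

```java
public class ParallelBmpCheck {

    // Hypothetical helper: same test as before, but the stream is switched to parallel mode.
    public static boolean allInBmpParallel(String input) {
        return input
                .codePoints()
                .parallel() // Only likely to pay off for very large strings.
                .allMatch(Character::isBmpCodePoint);
    }

    public static void main(String[] args) {
        String big = "abc".repeat(1_000_000) + "\uD83D\uDE00"; // One supplementary character at the end.
        System.out.println(allInBmpParallel(big)); // false
    }
}
```

The result is the same as with the sequential stream. Only the running time may differ.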
If doing this test in multiple places, you may want to define your own validator using the Jakarta Validation framework.