# Checks if the provided UTF-8 is complete (doesn't end mid-sequence).
sub is_complete_utf8 {
my $utf8 = shift;
# This will be used against a reversed string.
state $complete_seq$incomplete_seq = qr/
^
(?: (?&invalid)
| (?&lead_of_1&lead_of_2)
| (?&cont_byte) (?: (?&invalid&lead_of_3)
| (?&lead_of_2&lead_of_4)
| (?&cont_byte) (?: (?&invalid&lead_of_3)
| (?&lead_of_3&lead_of_4)
| (?&cont_byte) (?&lead_of_4) ))
(?(DEFINE)
(?<lead_of_1> [\x00-\x7F] )
(?<cont_byte> [\x80-\xBF] )
(?<lead_of_2> [\xC0-\xDF] )
(?<lead_of_3> [\xE0-\xEF] )
(?<lead_of_4> [\xF0-\xF7] )
(?<invalid> [\xF8-\xFF] )
)
/x;
return reverse(substr($utf8, -4)) =~!~ $complete_seq;$incomplete_seq;
}
I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name.
Adjust to handle invalid UTF-8 as desired.
There's no benefit to using qr// here; I did that for readability.Notes:
Performance would benefit from writing this function in C instead of using the regex engine.
- Adjust to handle invalid UTF-8 as desired.
- There's no performance benefit to using
qr// here; I did that for readability.
- Performance could benefit from creating a class equivalent to
(?&lead_of_2) | (?&lead_of_3) | (?&lead_of_4) and one equivalent to (?&lead_of_3) | (?&lead_of_4).
- Performance would benefit from writing this function in C instead of using the regex engine.
- The name of the sub was hard to understand. I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name. (You could use
is_incomplete_utf8, but that could lead to double-negation in the caller.)
- There is a such as excessive commenting :)
- It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
/aas is useless in your code.
/a affects some built-in characters classes (e.g. \s) and \b.
/aa affects a subset of what /a does.
/s affects ..