Revisions to Checking whether a string fragment could be part of a longer UTF-8 string

deleted 48 characters in body

Source Link

edited May 29, 2020 at 3:19

ikegami

276
1
9

This sub is expected to return a scalar (something true or false). And while your sub does that when it's invoked in scalar context, it produces surprising results in list context. Replace return with return 0;.

Is 1; really better than return 1;?

The name of the sub was hard to understand. I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name. (You could use is_incomplete_utf8, but that could lead to double-negation in the caller.)
There is a such as excessive commenting :)
It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
/aas is useless in your code.
- /aaa affects some built-in characters classes (e.g. \s) and \b.
- /aa affects a subset of what /a does.
- /s affects ..

The name of the sub was hard to understand. I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name. (You could use is_incomplete_utf8, but that could lead to double-negation in the caller.)
There is a such as excessive commenting :)
It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
/aas is useless in your code.
- /a affects some built-in characters classes (e.g. \s) and \b.
- /aa affects a subset of what /a does.
- /s affects ..

This sub is expected to return a scalar (something true or false). And while your sub does that when it's invoked in scalar context, it produces surprising results in list context. Replace return with return 0;.

Is 1; really better than return 1;?

The name of the sub was hard to understand. I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name. (You could use is_incomplete_utf8, but that could lead to double-negation in the caller.)
There is a such as excessive commenting :)
It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
/aas is useless in your code.
- /aa affects some built-in characters classes (e.g. \s) and \b.
- /s affects ..

added 234 characters in body

Source Link

edited May 28, 2020 at 21:41

ikegami

276
1
9

Performance improvements:

# Checks if the provided UTF-8 is complete (doesn't end mid-sequence).
sub is_complete_utf8 {
   my $utf8 = shift;

   # This will be used against a reversed string.
   state $incomplete_seq = qr/
      ^
      (?: (?&lead_of_2)
      |   (?&lead_of_3)
      |   (?&lead_of_4)
      |   (?&cont_byte) (?: (?&lead_of_3)
                        |   (?&lead_of_4)
                        |   (?&cont_byte) (?&lead_of_4) ))

      (?(DEFINE)
         (?<lead_of_1> [\x00-\x7F] )
         (?<cont_byte> [\x80-\xBF] )
         (?<lead_of_2> [\xC0-\xDF] )
         (?<lead_of_3> [\xE0-\xEF] )
         (?<lead_of_4> [\xF0-\xF7] )
         (?<invalid>   [\xF8-\xFF] )
      )
   /x;

   return reverse(substr($utf8, -4)) !~ $incomplete_seq;
}

Notes:

Adjust to handle invalid UTF-8 as desired.
I started by writing a version that used a pattern that looked for complete sequences, but it handled invalid UTF-8 poorly. If invalid seqences are to be considered complete, it's cleaner and simpler to match incomplete sequences.

There's no performance benefit to using qr// here; I did that for readability.
Performance could benefit from creating a class equivalent to (?&lead_of_2) | (?&lead_of_3) | (?&lead_of_4) and one equivalent to (?&lead_of_3) | (?&lead_of_4).
Performance would benefit from writing this function in C instead of using the regex engine.

Comments on your code:

The name of the sub was hard to understand. I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name. (You could use is_incomplete_utf8, but that could lead to double-negation in the caller.)
There is a such as excessive commenting :)
It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
/aas is useless in your code.
- /a affects some built-in characters classes (e.g. \s) and \b.
- /aa affects a subset of what /a does.
- /s affects ..

Performance improvements:

# Checks if the provided UTF-8 is complete (doesn't end mid-sequence).
sub is_complete_utf8 {
   my $utf8 = shift;

   # This will be used against a reversed string.
   state $incomplete_seq = qr/
      ^
      (?: (?&lead_of_2)
      |   (?&lead_of_3)
      |   (?&lead_of_4)
      |   (?&cont_byte) (?: (?&lead_of_3)
                        |   (?&lead_of_4)
                        |   (?&cont_byte) (?&lead_of_4) ))

      (?(DEFINE)
         (?<lead_of_1> [\x00-\x7F] )
         (?<cont_byte> [\x80-\xBF] )
         (?<lead_of_2> [\xC0-\xDF] )
         (?<lead_of_3> [\xE0-\xEF] )
         (?<lead_of_4> [\xF0-\xF7] )
         (?<invalid>   [\xF8-\xFF] )
      )
   /x;

   return reverse(substr($utf8, -4)) !~ $incomplete_seq;
}

Notes:

Adjust to handle invalid UTF-8 as desired.
There's no performance benefit to using qr// here; I did that for readability.
Performance could benefit from creating a class equivalent to (?&lead_of_2) | (?&lead_of_3) | (?&lead_of_4) and one equivalent to (?&lead_of_3) | (?&lead_of_4).
Performance would benefit from writing this function in C instead of using the regex engine.

Comments on your code:

The name of the sub was hard to understand. I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name. (You could use is_incomplete_utf8, but that could lead to double-negation in the caller.)
There is a such as excessive commenting :)
It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
/aas is useless in your code.
- /a affects some built-in characters classes (e.g. \s) and \b.
- /aa affects a subset of what /a does.
- /s affects ..

Performance improvements:

# Checks if the provided UTF-8 is complete (doesn't end mid-sequence).
sub is_complete_utf8 {
   my $utf8 = shift;

   # This will be used against a reversed string.
   state $incomplete_seq = qr/
      ^
      (?: (?&lead_of_2)
      |   (?&lead_of_3)
      |   (?&lead_of_4)
      |   (?&cont_byte) (?: (?&lead_of_3)
                        |   (?&lead_of_4)
                        |   (?&cont_byte) (?&lead_of_4) ))

      (?(DEFINE)
         (?<lead_of_1> [\x00-\x7F] )
         (?<cont_byte> [\x80-\xBF] )
         (?<lead_of_2> [\xC0-\xDF] )
         (?<lead_of_3> [\xE0-\xEF] )
         (?<lead_of_4> [\xF0-\xF7] )
         (?<invalid>   [\xF8-\xFF] )
      )
   /x;

   return reverse(substr($utf8, -4)) !~ $incomplete_seq;
}

Notes:

Adjust to handle invalid UTF-8 as desired.
I started by writing a version that used a pattern that looked for complete sequences, but it handled invalid UTF-8 poorly. If invalid seqences are to be considered complete, it's cleaner and simpler to match incomplete sequences.

There's no performance benefit to using qr// here; I did that for readability.
Performance could benefit from creating a class equivalent to (?&lead_of_2) | (?&lead_of_3) | (?&lead_of_4) and one equivalent to (?&lead_of_3) | (?&lead_of_4).
Performance would benefit from writing this function in C instead of using the regex engine.

Comments on your code:

The name of the sub was hard to understand. I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name. (You could use is_incomplete_utf8, but that could lead to double-negation in the caller.)
There is a such as excessive commenting :)
It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
/aas is useless in your code.
- /a affects some built-in characters classes (e.g. \s) and \b.
- /aa affects a subset of what /a does.
- /s affects ..

added 48 characters in body

Source Link

edited May 28, 2020 at 21:33

ikegami

276
1
9

# Checks if the provided UTF-8 is complete (doesn't end mid-sequence).
sub is_complete_utf8 {
   my $utf8 = shift;

   # This will be used against a reversed string.
   state $complete_seq$incomplete_seq = qr/
      ^
      (?: (?&invalid)
      |   (?&lead_of_1&lead_of_2)
      |   (?&cont_byte) (?: (?&invalid&lead_of_3)
                        |   (?&lead_of_2&lead_of_4)
                        |   (?&cont_byte) (?: (?&invalid&lead_of_3)
                                          |   (?&lead_of_3&lead_of_4)
                                          |   (?&cont_byte) (?&lead_of_4) ))

      (?(DEFINE)
         (?<lead_of_1> [\x00-\x7F] )
         (?<cont_byte> [\x80-\xBF] )
         (?<lead_of_2> [\xC0-\xDF] )
         (?<lead_of_3> [\xE0-\xEF] )
         (?<lead_of_4> [\xF0-\xF7] )
         (?<invalid>   [\xF8-\xFF] )
      )
   /x;

   return reverse(substr($utf8, -4)) =~!~ $complete_seq;$incomplete_seq;
}

I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name.

Adjust to handle invalid UTF-8 as desired.

There's no benefit to using qr// here; I did that for readability.Notes:

Performance would benefit from writing this function in C instead of using the regex engine.

Adjust to handle invalid UTF-8 as desired.

There's no performance benefit to using qr// here; I did that for readability.

Performance could benefit from creating a class equivalent to (?&lead_of_2) | (?&lead_of_3) | (?&lead_of_4) and one equivalent to (?&lead_of_3) | (?&lead_of_4).

Performance would benefit from writing this function in C instead of using the regex engine.

The name of the sub was hard to understand. I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name. (You could use is_incomplete_utf8, but that could lead to double-negation in the caller.)
There is a such as excessive commenting :)
It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
/aas is useless in your code.
- /a affects some built-in characters classes (e.g. \s) and \b.
- /aa affects a subset of what /a does.
- /s affects ..

# Checks if the provided UTF-8 is complete (doesn't end mid-sequence).
sub is_complete_utf8 {
   my $utf8 = shift;

   # This will be used against a reversed string.
   state $complete_seq = qr/
      ^
      (?: (?&invalid)
      |   (?&lead_of_1)
      |   (?&cont_byte) (?: (?&invalid)
                        |   (?&lead_of_2)
                        |   (?&cont_byte) (?: (?&invalid)
                                          |   (?&lead_of_3)
                                          |   (?&cont_byte) )))

      (?(DEFINE)
        (?<lead_of_1> [\x00-\x7F] )
        (?<cont_byte> [\x80-\xBF] )
        (?<lead_of_2> [\xC0-\xDF] )
        (?<lead_of_3> [\xE0-\xEF] )
        (?<lead_of_4> [\xF0-\xF7] )
        (?<invalid>   [\xF8-\xFF] )
      )
   /x;

   return reverse(substr($utf8, -4)) =~ $complete_seq;
}

I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name.

Adjust to handle invalid UTF-8 as desired.

There's no benefit to using qr// here; I did that for readability.

Performance would benefit from writing this function in C instead of using the regex engine.

There is a such as excessive commenting :)
It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
/aas is useless in your code.
- /a affects some built-in characters classes (e.g. \s) and \b.
- /aa affects a subset of what /a does.
- /s affects ..

# Checks if the provided UTF-8 is complete (doesn't end mid-sequence).
sub is_complete_utf8 {
   my $utf8 = shift;

   # This will be used against a reversed string.
   state $incomplete_seq = qr/
      ^
      (?: (?&lead_of_2)
      |   (?&lead_of_3)
      |   (?&lead_of_4)
      |   (?&cont_byte) (?: (?&lead_of_3)
                        |   (?&lead_of_4)
                        |   (?&cont_byte) (?&lead_of_4) ))

      (?(DEFINE)
         (?<lead_of_1> [\x00-\x7F] )
         (?<cont_byte> [\x80-\xBF] )
         (?<lead_of_2> [\xC0-\xDF] )
         (?<lead_of_3> [\xE0-\xEF] )
         (?<lead_of_4> [\xF0-\xF7] )
         (?<invalid>   [\xF8-\xFF] )
      )
   /x;

   return reverse(substr($utf8, -4)) !~ $incomplete_seq;
}

Notes:

Adjust to handle invalid UTF-8 as desired.

There's no performance benefit to using qr// here; I did that for readability.

Performance could benefit from creating a class equivalent to (?&lead_of_2) | (?&lead_of_3) | (?&lead_of_4) and one equivalent to (?&lead_of_3) | (?&lead_of_4).

Performance would benefit from writing this function in C instead of using the regex engine.

The name of the sub was hard to understand. I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name. (You could use is_incomplete_utf8, but that could lead to double-negation in the caller.)
There is a such as excessive commenting :)
It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
/aas is useless in your code.
- /a affects some built-in characters classes (e.g. \s) and \b.
- /aa affects a subset of what /a does.
- /s affects ..

added 48 characters in body

Source Link

edited May 28, 2020 at 21:24

ikegami

276
1
9

Loading

added 48 characters in body

Source Link

edited May 28, 2020 at 21:19

ikegami

276
1
9

Loading

added 50 characters in body

Source Link

edited May 28, 2020 at 21:11

ikegami

276
1
9

Loading

added 50 characters in body

Source Link

edited May 28, 2020 at 21:06

ikegami

276
1
9

Loading

added 50 characters in body

Source Link

edited May 28, 2020 at 21:01

ikegami

276
1
9

Loading

Source Link

answered May 28, 2020 at 20:53

ikegami

276
1
9

Loading

Stack Exchange Network

Return to Answer