Skip to main content
deleted 48 characters in body
Source Link
ikegami
  • 276
  • 1
  • 9
  • This sub is expected to return a scalar (something true or false). And while your sub does that when it's invoked in scalar context, it produces surprising results in list context. Replace return with return 0;.
  • Is 1; really better than return 1;?
  • The name of the sub was hard to understand. I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name. (You could use is_incomplete_utf8, but that could lead to double-negation in the caller.)
  • There is a such as excessive commenting :)
  • It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
  • /aas is useless in your code.
    • /aaa affects some built-in characters classes (e.g. \s) and \b.
    • /aa affects a subset of what /a does.
    • /s affects ..
  • The name of the sub was hard to understand. I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name. (You could use is_incomplete_utf8, but that could lead to double-negation in the caller.)
  • There is a such as excessive commenting :)
  • It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
  • /aas is useless in your code.
    • /a affects some built-in characters classes (e.g. \s) and \b.
    • /aa affects a subset of what /a does.
    • /s affects ..
  • This sub is expected to return a scalar (something true or false). And while your sub does that when it's invoked in scalar context, it produces surprising results in list context. Replace return with return 0;.
  • Is 1; really better than return 1;?
  • The name of the sub was hard to understand. I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name. (You could use is_incomplete_utf8, but that could lead to double-negation in the caller.)
  • There is a such as excessive commenting :)
  • It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
  • /aas is useless in your code.
    • /aa affects some built-in characters classes (e.g. \s) and \b.
    • /s affects ..
added 234 characters in body
Source Link
ikegami
  • 276
  • 1
  • 9

Performance improvements:

# Checks if the provided UTF-8 is complete (doesn't end mid-sequence).
sub is_complete_utf8 {
   my $utf8 = shift;

   # This will be used against a reversed string.
   state $incomplete_seq = qr/
      ^
      (?: (?&lead_of_2)
      |   (?&lead_of_3)
      |   (?&lead_of_4)
      |   (?&cont_byte) (?: (?&lead_of_3)
                        |   (?&lead_of_4)
                        |   (?&cont_byte) (?&lead_of_4) ))

      (?(DEFINE)
         (?<lead_of_1> [\x00-\x7F] )
         (?<cont_byte> [\x80-\xBF] )
         (?<lead_of_2> [\xC0-\xDF] )
         (?<lead_of_3> [\xE0-\xEF] )
         (?<lead_of_4> [\xF0-\xF7] )
         (?<invalid>   [\xF8-\xFF] )
      )
   /x;

   return reverse(substr($utf8, -4)) !~ $incomplete_seq;
}

Notes:

  • Adjust to handle invalid UTF-8 as desired.
  • I started by writing a version that used a pattern that looked for complete sequences, but it handled invalid UTF-8 poorly. If invalid seqences are to be considered complete, it's cleaner and simpler to match incomplete sequences.
  • There's no performance benefit to using qr// here; I did that for readability.
  • Performance could benefit from creating a class equivalent to (?&lead_of_2) | (?&lead_of_3) | (?&lead_of_4) and one equivalent to (?&lead_of_3) | (?&lead_of_4).
  • Performance would benefit from writing this function in C instead of using the regex engine.

Comments on your code:

  • The name of the sub was hard to understand. I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name. (You could use is_incomplete_utf8, but that could lead to double-negation in the caller.)
  • There is a such as excessive commenting :)
  • It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
  • /aas is useless in your code.
    • /a affects some built-in characters classes (e.g. \s) and \b.
    • /aa affects a subset of what /a does.
    • /s affects ..

Performance improvements:

# Checks if the provided UTF-8 is complete (doesn't end mid-sequence).
sub is_complete_utf8 {
   my $utf8 = shift;

   # This will be used against a reversed string.
   state $incomplete_seq = qr/
      ^
      (?: (?&lead_of_2)
      |   (?&lead_of_3)
      |   (?&lead_of_4)
      |   (?&cont_byte) (?: (?&lead_of_3)
                        |   (?&lead_of_4)
                        |   (?&cont_byte) (?&lead_of_4) ))

      (?(DEFINE)
         (?<lead_of_1> [\x00-\x7F] )
         (?<cont_byte> [\x80-\xBF] )
         (?<lead_of_2> [\xC0-\xDF] )
         (?<lead_of_3> [\xE0-\xEF] )
         (?<lead_of_4> [\xF0-\xF7] )
         (?<invalid>   [\xF8-\xFF] )
      )
   /x;

   return reverse(substr($utf8, -4)) !~ $incomplete_seq;
}

Notes:

  • Adjust to handle invalid UTF-8 as desired.
  • There's no performance benefit to using qr// here; I did that for readability.
  • Performance could benefit from creating a class equivalent to (?&lead_of_2) | (?&lead_of_3) | (?&lead_of_4) and one equivalent to (?&lead_of_3) | (?&lead_of_4).
  • Performance would benefit from writing this function in C instead of using the regex engine.

Comments on your code:

  • The name of the sub was hard to understand. I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name. (You could use is_incomplete_utf8, but that could lead to double-negation in the caller.)
  • There is a such as excessive commenting :)
  • It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
  • /aas is useless in your code.
    • /a affects some built-in characters classes (e.g. \s) and \b.
    • /aa affects a subset of what /a does.
    • /s affects ..

Performance improvements:

# Checks if the provided UTF-8 is complete (doesn't end mid-sequence).
sub is_complete_utf8 {
   my $utf8 = shift;

   # This will be used against a reversed string.
   state $incomplete_seq = qr/
      ^
      (?: (?&lead_of_2)
      |   (?&lead_of_3)
      |   (?&lead_of_4)
      |   (?&cont_byte) (?: (?&lead_of_3)
                        |   (?&lead_of_4)
                        |   (?&cont_byte) (?&lead_of_4) ))

      (?(DEFINE)
         (?<lead_of_1> [\x00-\x7F] )
         (?<cont_byte> [\x80-\xBF] )
         (?<lead_of_2> [\xC0-\xDF] )
         (?<lead_of_3> [\xE0-\xEF] )
         (?<lead_of_4> [\xF0-\xF7] )
         (?<invalid>   [\xF8-\xFF] )
      )
   /x;

   return reverse(substr($utf8, -4)) !~ $incomplete_seq;
}

Notes:

  • Adjust to handle invalid UTF-8 as desired.
  • I started by writing a version that used a pattern that looked for complete sequences, but it handled invalid UTF-8 poorly. If invalid seqences are to be considered complete, it's cleaner and simpler to match incomplete sequences.
  • There's no performance benefit to using qr// here; I did that for readability.
  • Performance could benefit from creating a class equivalent to (?&lead_of_2) | (?&lead_of_3) | (?&lead_of_4) and one equivalent to (?&lead_of_3) | (?&lead_of_4).
  • Performance would benefit from writing this function in C instead of using the regex engine.

Comments on your code:

  • The name of the sub was hard to understand. I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name. (You could use is_incomplete_utf8, but that could lead to double-negation in the caller.)
  • There is a such as excessive commenting :)
  • It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
  • /aas is useless in your code.
    • /a affects some built-in characters classes (e.g. \s) and \b.
    • /aa affects a subset of what /a does.
    • /s affects ..
added 48 characters in body
Source Link
ikegami
  • 276
  • 1
  • 9
# Checks if the provided UTF-8 is complete (doesn't end mid-sequence).
sub is_complete_utf8 {
   my $utf8 = shift;

   # This will be used against a reversed string.
   state $complete_seq$incomplete_seq = qr/
      ^
      (?: (?&invalid)
      |   (?&lead_of_1&lead_of_2)
      |   (?&cont_byte) (?: (?&invalid&lead_of_3)
                        |   (?&lead_of_2&lead_of_4)
                        |   (?&cont_byte) (?: (?&invalid&lead_of_3)
                                          |   (?&lead_of_3&lead_of_4)
                                          |   (?&cont_byte) (?&lead_of_4) ))

      (?(DEFINE)
         (?<lead_of_1> [\x00-\x7F] )
         (?<cont_byte> [\x80-\xBF] )
         (?<lead_of_2> [\xC0-\xDF] )
         (?<lead_of_3> [\xE0-\xEF] )
         (?<lead_of_4> [\xF0-\xF7] )
         (?<invalid>   [\xF8-\xFF] )
      )
   /x;

   return reverse(substr($utf8, -4)) =~!~ $complete_seq;$incomplete_seq;
}

I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name.

Adjust to handle invalid UTF-8 as desired.

There's no benefit to using qr// here; I did that for readability.Notes:

Performance would benefit from writing this function in C instead of using the regex engine.

  • Adjust to handle invalid UTF-8 as desired.
  • There's no performance benefit to using qr// here; I did that for readability.
  • Performance could benefit from creating a class equivalent to (?&lead_of_2) | (?&lead_of_3) | (?&lead_of_4) and one equivalent to (?&lead_of_3) | (?&lead_of_4).
  • Performance would benefit from writing this function in C instead of using the regex engine.
  • The name of the sub was hard to understand. I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name. (You could use is_incomplete_utf8, but that could lead to double-negation in the caller.)
  • There is a such as excessive commenting :)
  • It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
  • /aas is useless in your code.
    • /a affects some built-in characters classes (e.g. \s) and \b.
    • /aa affects a subset of what /a does.
    • /s affects ..
# Checks if the provided UTF-8 is complete (doesn't end mid-sequence).
sub is_complete_utf8 {
   my $utf8 = shift;

   # This will be used against a reversed string.
   state $complete_seq = qr/
      ^
      (?: (?&invalid)
      |   (?&lead_of_1)
      |   (?&cont_byte) (?: (?&invalid)
                        |   (?&lead_of_2)
                        |   (?&cont_byte) (?: (?&invalid)
                                          |   (?&lead_of_3)
                                          |   (?&cont_byte) )))

      (?(DEFINE)
        (?<lead_of_1> [\x00-\x7F] )
        (?<cont_byte> [\x80-\xBF] )
        (?<lead_of_2> [\xC0-\xDF] )
        (?<lead_of_3> [\xE0-\xEF] )
        (?<lead_of_4> [\xF0-\xF7] )
        (?<invalid>   [\xF8-\xFF] )
      )
   /x;

   return reverse(substr($utf8, -4)) =~ $complete_seq;
}

I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name.

Adjust to handle invalid UTF-8 as desired.

There's no benefit to using qr// here; I did that for readability.

Performance would benefit from writing this function in C instead of using the regex engine.

  • There is a such as excessive commenting :)
  • It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
  • /aas is useless in your code.
    • /a affects some built-in characters classes (e.g. \s) and \b.
    • /aa affects a subset of what /a does.
    • /s affects ..
# Checks if the provided UTF-8 is complete (doesn't end mid-sequence).
sub is_complete_utf8 {
   my $utf8 = shift;

   # This will be used against a reversed string.
   state $incomplete_seq = qr/
      ^
      (?: (?&lead_of_2)
      |   (?&lead_of_3)
      |   (?&lead_of_4)
      |   (?&cont_byte) (?: (?&lead_of_3)
                        |   (?&lead_of_4)
                        |   (?&cont_byte) (?&lead_of_4) ))

      (?(DEFINE)
         (?<lead_of_1> [\x00-\x7F] )
         (?<cont_byte> [\x80-\xBF] )
         (?<lead_of_2> [\xC0-\xDF] )
         (?<lead_of_3> [\xE0-\xEF] )
         (?<lead_of_4> [\xF0-\xF7] )
         (?<invalid>   [\xF8-\xFF] )
      )
   /x;

   return reverse(substr($utf8, -4)) !~ $incomplete_seq;
}

Notes:

  • Adjust to handle invalid UTF-8 as desired.
  • There's no performance benefit to using qr// here; I did that for readability.
  • Performance could benefit from creating a class equivalent to (?&lead_of_2) | (?&lead_of_3) | (?&lead_of_4) and one equivalent to (?&lead_of_3) | (?&lead_of_4).
  • Performance would benefit from writing this function in C instead of using the regex engine.
  • The name of the sub was hard to understand. I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name. (You could use is_incomplete_utf8, but that could lead to double-negation in the caller.)
  • There is a such as excessive commenting :)
  • It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
  • /aas is useless in your code.
    • /a affects some built-in characters classes (e.g. \s) and \b.
    • /aa affects a subset of what /a does.
    • /s affects ..
added 48 characters in body
Source Link
ikegami
  • 276
  • 1
  • 9
Loading
added 48 characters in body
Source Link
ikegami
  • 276
  • 1
  • 9
Loading
added 50 characters in body
Source Link
ikegami
  • 276
  • 1
  • 9
Loading
added 50 characters in body
Source Link
ikegami
  • 276
  • 1
  • 9
Loading
added 50 characters in body
Source Link
ikegami
  • 276
  • 1
  • 9
Loading
Source Link
ikegami
  • 276
  • 1
  • 9
Loading