I am having a hard time finding a way to get the Unicode class (general category) of a character.

List of Unicode classes: https://www.php.net/manual/en/regexp.reference.unicode.php

The desired function in Python: https://docs.python.org/3/library/unicodedata.html#unicodedata.category

I just want the PHP equivalent of this Python function.

For example, if I called a function x like this: x('-'), it would return Pd, because Pd is the class the hyphen belongs to.

Thanks.

4
  • Of the 1,335 Unicode 14 properties, the - matches 37. Which ones exactly are you interested in? You can't just say punctuation. Commented Feb 12, 2022 at 20:11
  • Available General Category : Close_Punctuation, Connector_Punctuation, Control, Currency_Symbol, Dash_Punctuation, Decimal_Number, Enclosing_Mark, Final_Punctuation, Format, Initial_Punctuation, Letter_Number, Line_Separator, Lowercase_Letter, Math_Symbol, Modifier_Letter, Modifier_Symbol, Nonspacing_Mark, Open_Punctuation, Other_Letter, Other_Number, Other_Punctuation, Other_Symbol, Paragraph_Separator, Private_Use, Space_Separator, Spacing_Mark, Surrogate, Titlecase_Letter, Uppercase_Letter Commented Feb 12, 2022 at 20:27
  • Available General Category Mask: Cased_Letter, Decimal_Number, Enclosing_Mark, Letter, Letter_Number, Lowercase_Letter, Mark, Modifier_Letter, Nonspacing_Mark, Other_Letter, Spacing_Mark, Titlecase_Letter, Unassigned, Uppercase_Letter Commented Feb 12, 2022 at 20:28
  • Be careful, there is a lot of overlap! What you see above are actual UCD properties. Any shortcut your language provides resolves to these actual Unicode 14 properties, or to those of whatever Unicode version your package uses. Commented Feb 12, 2022 at 20:28

3 Answers

3

One possible way is to use IntlChar::charType. Unfortunately, this method returns only an int, but that int is one of the constants defined in the IntlChar class. The constants for the 30 categories cover the range 0 to 29 with no gaps. So all you have to do is build an indexed array that follows the same order:

$shortCats = [
    'Cn', 'Lu', 'Ll', 'Lt', 'Lm', 'Lo',
    'Mn', 'Me', 'Mc', 'Nd', 'Nl', 'No',
    'Zs', 'Zl', 'Zp', 'Cc', 'Cf', 'Co',
    'Cs', 'Pd', 'Ps', 'Pe', 'Pc', 'Po',
    'Sm', 'Sc', 'Sk', 'So', 'Pi', 'Pf'
];

echo $shortCats[IntlChar::charType('-')]; //Pd

Note: If you are worried that the numeric values defined in the class might change in the future and want to be more rigorous, you can also write the array this way:

$shortCats = [
    IntlChar::CHAR_CATEGORY_UNASSIGNED => 'Cn',
    IntlChar::CHAR_CATEGORY_UPPERCASE_LETTER => 'Lu',
    IntlChar::CHAR_CATEGORY_LOWERCASE_LETTER => 'Ll',
    IntlChar::CHAR_CATEGORY_TITLECASE_LETTER => 'Lt',
    // etc.
];
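
For completeness, here is a sketch of what the fully spelled-out, constant-keyed version might look like, wrapped in a function. It assumes the intl extension is loaded; the name uni_category() is a hypothetical helper chosen to mirror Python's unicodedata.category(), not a PHP built-in.

```php
<?php
// Sketch only: needs the intl extension. uni_category() is a hypothetical
// helper name, not part of PHP.
function uni_category(string $char): ?string
{
    // Keyed by constant, so the order of the entries does not matter.
    static $shortCats = [
        IntlChar::CHAR_CATEGORY_UNASSIGNED             => 'Cn',
        IntlChar::CHAR_CATEGORY_UPPERCASE_LETTER       => 'Lu',
        IntlChar::CHAR_CATEGORY_LOWERCASE_LETTER       => 'Ll',
        IntlChar::CHAR_CATEGORY_TITLECASE_LETTER       => 'Lt',
        IntlChar::CHAR_CATEGORY_MODIFIER_LETTER        => 'Lm',
        IntlChar::CHAR_CATEGORY_OTHER_LETTER           => 'Lo',
        IntlChar::CHAR_CATEGORY_NON_SPACING_MARK       => 'Mn',
        IntlChar::CHAR_CATEGORY_ENCLOSING_MARK         => 'Me',
        IntlChar::CHAR_CATEGORY_COMBINING_SPACING_MARK => 'Mc',
        IntlChar::CHAR_CATEGORY_DECIMAL_DIGIT_NUMBER   => 'Nd',
        IntlChar::CHAR_CATEGORY_LETTER_NUMBER          => 'Nl',
        IntlChar::CHAR_CATEGORY_OTHER_NUMBER           => 'No',
        IntlChar::CHAR_CATEGORY_SPACE_SEPARATOR        => 'Zs',
        IntlChar::CHAR_CATEGORY_LINE_SEPARATOR         => 'Zl',
        IntlChar::CHAR_CATEGORY_PARAGRAPH_SEPARATOR    => 'Zp',
        IntlChar::CHAR_CATEGORY_CONTROL_CHAR           => 'Cc',
        IntlChar::CHAR_CATEGORY_FORMAT_CHAR            => 'Cf',
        IntlChar::CHAR_CATEGORY_PRIVATE_USE_CHAR       => 'Co',
        IntlChar::CHAR_CATEGORY_SURROGATE              => 'Cs',
        IntlChar::CHAR_CATEGORY_DASH_PUNCTUATION       => 'Pd',
        IntlChar::CHAR_CATEGORY_START_PUNCTUATION      => 'Ps',
        IntlChar::CHAR_CATEGORY_END_PUNCTUATION        => 'Pe',
        IntlChar::CHAR_CATEGORY_CONNECTOR_PUNCTUATION  => 'Pc',
        IntlChar::CHAR_CATEGORY_OTHER_PUNCTUATION      => 'Po',
        IntlChar::CHAR_CATEGORY_MATH_SYMBOL            => 'Sm',
        IntlChar::CHAR_CATEGORY_CURRENCY_SYMBOL        => 'Sc',
        IntlChar::CHAR_CATEGORY_MODIFIER_SYMBOL        => 'Sk',
        IntlChar::CHAR_CATEGORY_OTHER_SYMBOL           => 'So',
        IntlChar::CHAR_CATEGORY_INITIAL_PUNCTUATION    => 'Pi',
        IntlChar::CHAR_CATEGORY_FINAL_PUNCTUATION      => 'Pf',
    ];

    // charType() returns null when $char is not exactly one code point.
    $type = IntlChar::charType($char);
    return $type === null ? null : ($shortCats[$type] ?? null);
}

echo uni_category('-'), "\n";        // Pd
echo uni_category("\u{00E9}"), "\n"; // Ll (U+00E9, e with acute accent)
```

Returning null for anything that is not a single code point is a design choice of this sketch; throwing an exception would be just as reasonable.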

4 Comments

You're the best, it fixed the problem. It's more reliable than my own answer, so I'm accepting yours. Thank you!!
@Eissaweb: Many thanks. Your solution works too, and reliably, without needing the intl extension.
@Eissaweb: Thank you also for your original question, the kind of questions that are becoming rare.
I appreciate you!!
2

Apparently there is no built-in function that does this, so I wrote my own:

<?php
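// All 30 two-letter general categories, each listed once
// (the letter categories Lu, Ll, Lt, Lo and the number categories Nd, Nl were missing)
$UNICODE_CATEGORIES = [
    "Lu", "Ll", "Lt", "Lm", "Lo",
    "Mn", "Mc", "Me",
    "Nd", "Nl", "No",
    "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
    "Sm", "Sc", "Sk", "So",
    "Zs", "Zl", "Zp",
    "Cc", "Cf", "Cs", "Co", "Cn"
];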

function uni_category($char, $UNICODE_CATEGORIES) {
    foreach ($UNICODE_CATEGORIES as $category) {
        // ^ tests the first character only; the u modifier makes PCRE read $char as UTF-8
        if (preg_match('/^\p{'.$category.'}/u', $char))
            return $category;
    }
    return null;
}
// call the function 
print uni_category('-', $UNICODE_CATEGORIES); // it returns Pd

This code works for me. I hope it helps somebody in the future :).

1

I'm posting this as it might be useful. I have done this before on a very large scale.

Below is a condensed way to do it in PHP.

Notes:

A single regex is generated once at startup.
It contains a lookahead assertion with a capture group for each property.
Example: (?=(\p{Property1}))?(?=(\p{Property2}))? ... (?=(\p{PropertyN}))?
Each character in the target is checked against all the properties in the array.
Each capture group number is an index into the array $General_Cat_Props,
which is how a matched group is associated with its property name
when the result is built for printing.

This handles the fact that a single character can be matched by many properties.
To check other properties, just add them to $General_Cat_Props.
No other change is necessary.

There are 2 functions:

  1. Get_UniCategories_From_Char( $char ) analyzes one character at a time.
  2. Get_UniCategories_From_String( $str ) handles strings (it calls 1 on each character).

The array $General_Cat_Props below can be extended or trimmed as needed to act as a custom filter.
You can keep several such constant property arrays for special checks. The order of the properties in the array is irrelevant.
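
As a rough illustration of the custom-filter idea, the sketch below passes the property list in as a parameter instead of using a global, and mixes one-letter major classes with two-letter categories so the overlap becomes visible. The helper names build_property_regex() and matching_properties() and the $custom array are made up for this example.

```php
<?php
// Hypothetical helper: builds the optional-lookahead regex for any property list.
function build_property_regex(array $props): string
{
    $rx = '(?=.)'; // something must be ahead
    foreach ($props as $prop) {
        $rx .= '(?=(\p{' . $prop . '}))?';
    }
    return '/' . $rx . '/su';
}

// Hypothetical helper: returns every property in $props that matches
// the first character of $char.
function matching_properties(string $char, array $props): array
{
    $props = array_values($props); // make sure the keys are 0..n-1
    $found = [];
    if (preg_match(build_property_regex($props), $char, $m)) {
        foreach ($props as $i => $prop) {
            // Group numbers start at 1. Unmatched groups are '' and
            // trailing unmatched groups are absent from $m.
            if (isset($m[$i + 1]) && $m[$i + 1] !== '') {
                $found[] = $prop;
            }
        }
    }
    return $found;
}

// A custom filter mixing major classes (L, N, P) with specific categories.
$custom = ['L', 'Lu', 'N', 'Nd', 'P', 'Pd'];

print_r(matching_properties('A', $custom)); // L and Lu
print_r(matching_properties('-', $custom)); // P and Pd
```

Because every property gets its own capture group, a character that belongs to both a major class and one of its subcategories is reported under both names.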

Regex101 quick global test bed

/(?=.)(?=(\p{Cn}))?(?=(\p{Cc}))?(?=(\p{Cf}))?(?=(\p{Co}))?(?=(\p{Cs}))?(?=(\p{Lu}))?(?=(\p{Ll}))?(?=(\p{Lt}))?(?=(\p{Lm}))?(?=(\p{Lo}))?(?=(\p{Mn}))?(?=(\p{Me}))?(?=(\p{Mc}))?(?=(\p{Pd}))?(?=(\p{Ps}))?(?=(\p{Pe}))?(?=(\p{Pc}))?(?=(\p{Po}))?(?=(\p{Pi}))?(?=(\p{Pf}))?(?=(\p{Sm}))?(?=(\p{Sc}))?(?=(\p{Sk}))?(?=(\p{So}))?(?=(\p{Zs}))?(?=(\p{Zl}))?(?=(\p{Zp}))?/su

https://regex101.com/r/fvVZX0/1

PHP
Edit: After realizing that PHP only populates the $matches array up to the last optional group that matched, I added a check when building the result (see $last_grp_matched = sizeof($matches);).

Previously, the array was forced to full length by adding a capture group (.) at the end. The old code still works; see the previous version if needed.
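
The trimming behavior can be seen in isolation with a tiny pattern. This is only an illustrative sketch: the pattern and variable names are made up, and the second call assumes PHP 7.4 or later, where PREG_UNMATCHED_AS_NULL also keeps trailing unmatched groups.

```php
<?php
// Three optional groups; only the second one can match 'b'.
$pattern = '/(a)?(b)?(c)?/';

// Default: the unmatched group 1 is '', and the trailing group 3 is dropped.
preg_match($pattern, 'b', $default);
var_dump(count($default)); // int(3)

// With PREG_UNMATCHED_AS_NULL (assuming PHP 7.4+), unmatched groups are null
// and trailing ones are kept, so the array always has the same length.
preg_match($pattern, 'b', $padded, PREG_UNMATCHED_AS_NULL);
var_dump(count($padded)); // int(4)
```

Either approach works: bound the loop by the size of the matches array, as the code below does, or pass the flag and test each group against null.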

http://sandbox.onlinephpfunctions.com/code/f1aeca3d9a99d1b2d1bfc72c3dd004ad232bc29e

<?php

// The prop array
$General_Cat_Props = [
"",
"Cn", "Cc", "Cf", "Co", "Cs",
"Lu", "Ll", "Lt", "Lm", "Lo",
"Mn", "Me", "Mc", // "Nd", "Nl", "No",
"Pd", "Ps", "Pe", "Pc", "Po", "Pi", "Pf",
"Sm", "Sc", "Sk", "So",
"Zs", "Zl", "Zp"
];

// The Rx
$GCRx = "";   // filled in once by makeGCRx()

// One-time make function
function makeGCRx()
{
    global $General_Cat_Props, $GCRx ;
    $rxstr = "(?=.)";     // Start of regex, something must be ahead
    for ($i = 1; $i < sizeof( $General_Cat_Props ); $i++) {
        $rxstr .= "(?=(\\p{" . $General_Cat_Props[ $i ] . "}))?";
    }
    $GCRx = "/$rxstr/su";
}

makeGCRx();
// print_r($GCRx . "\n");

function Get_UniCategories_From_Char( $char )
{
    global $General_Cat_Props, $GCRx;
    $ret = "";
    if ( preg_match( $GCRx, $char, $matches )) {
        $last_grp_matched = sizeof($matches);
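        // Stop at whichever is smaller: the property count or the groups actually populated
        for ($i = 1; $i < sizeof( $General_Cat_Props ) && $i < $last_grp_matched; $i++) {
            // Unmatched groups before the last matched one are empty strings
            if ( $matches[ $i ] !== "" ) {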
                $ret .= $General_Cat_Props[ $i ] . " ";
            }
        }
    }
    return $ret;
}

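function Get_UniCategories_From_String( $str )
{
    $ret = "";
    // Split on code points, not bytes, so multibyte UTF-8 characters stay intact
    $chars = preg_split( '//u', $str, -1, PREG_SPLIT_NO_EMPTY ) ?: [];
    foreach ($chars as $char) {
        $ret .= $char . "  " . Get_UniCategories_From_Char( $char ) . "\n";
    }
    return $ret;
}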

print_r( "-  " . Get_UniCategories_From_Char( "-" ) . "\n--------\n" );
// or 
print_r( Get_UniCategories_From_String( "Hello 270 -,+?" ) . "\n" );

Output:

-  Pd 
--------
H  Lu 
e  Ll 
l  Ll 
l  Ll 
o  Ll 
   Zs 
2  
7  
0  
   Zs 
-  Pd 
,  Po 
+  Sm 
?  Po 
