I am trying to filter a variable so that it keeps only alphanumeric characters, spaces, accented characters, and single quotes, and replaces everything else with a space. So a string like:

substitué à une otage % ? vendredi 23 mars lors de l’attaque

should output :

substitué à une otage vendredi 23 mars lors de l’attaque

but the output I actually get is:

substitué à une otage vendredi 23 mars lors de l

Could you please help? This is my code:

$whitelist = "/[^a-zA-Z0-9а-àâáçéèèêëìîíïôòóùûüÂÊÎÔúÛÄËÏÖÜÀÆæÇÉÈŒœÙñý',. ]/";

$descreption =  preg_replace($whitelist, ' ', $ds);
}else{
    $errors = self::DESCREPTION_ERROR;
    return false;
}
    Depends on the encoding. You usually need the /u Unicode flag on the regex. Also, ’ and ' are different characters. Commented Jan 3, 2019 at 8:53

3 Answers

Your regex is faulty. The part а-à gives the error "Character range is out of order"; I guess the - was added there by mistake.

Then a small hint: ’ is not '

[^a-zA-Z0-9àâáçéèèêëìîíïôòóùûüÂÊÎÔúÛÄËÏÖÜÀÆæÇÉÈŒœÙñý'’,. ] 

should work fine.
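
As a minimal sketch of how that corrected class could be applied to the sample string from the question: the u modifier is an addition taken from the comment under the question, and it assumes both the script file and the input are UTF-8. The variable names are only illustrative.

```php
<?php
// Corrected whitelist: the stray "а-" range is gone and the typographic
// apostrophe (’) is kept alongside the typewriter one (').
// Assumption: the u modifier is added so the pattern is matched as UTF-8.
$whitelist = "/[^a-zA-Z0-9àâáçéèèêëìîíïôòóùûüÂÊÎÔúÛÄËÏÖÜÀÆæÇÉÈŒœÙñý'’,. ]/u";

$ds = 'substitué à une otage % ? vendredi 23 mars lors de l’attaque';

// Every character outside the whitelist is replaced by a single space.
$description = preg_replace($whitelist, ' ', $ds);

echo $description, PHP_EOL;
```

Note that % and ? are each replaced by a space rather than deleted, so the result has a run of spaces between "otage" and "vendredi".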

Also, if you're working with regular expressions, tools like RegExr or regex101 are really helpful.

2 Comments

Thanks @maio290 for your response, but with a string like: masculin de deux $%#$^%$&^dbd657657657*&()*)_()+!#!@#±?/ du paysage cultuel actuel. it removes "du paysage cultuel actuel".
It doesn't here. And nor does it match on RegExr or regex101. It's not removing the ± - but that was it.
One way to deal with the range of accented characters is to use the POSIX [:alnum:] class, which in PHP, in conjunction with the u modifier, will match all of them. It can then be put into a negated character class together with the other characters you want to keep, so that everything else is replaced:

$string = 'substitué à une otage % ? vendredi 23 mars lors de l’attaque';
echo preg_replace("/[^[:alnum:]'’,.]/u", ' ', $string);

Output:

substitué à une otage vendredi 23 mars lors de l’attaque

As has been pointed out in the comments, ’ is not the same as ' and so it also needs to be added to the set of characters you want to keep.

Demo on 3v4l.org
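
To see that the two apostrophes really are different characters, a quick check along these lines may help. It is only an illustrative sketch and assumes the script file is saved as UTF-8; the variable names are made up for the example.

```php
<?php
$typewriter  = "'"; // U+0027 APOSTROPHE, one byte in UTF-8
$typographic = '’'; // U+2019 RIGHT SINGLE QUOTATION MARK, three bytes in UTF-8

echo bin2hex($typewriter), PHP_EOL;  // 27
echo bin2hex($typographic), PHP_EOL; // e28099

// The strings differ, so a character class listing only ' will not keep ’.
var_dump($typewriter === $typographic); // bool(false)
```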

You may have a look at Unicode character properties.

Summary of my changes:

  • use \p{L} to match all letters
  • escape the hyphen (\-)
  • support typewriter (') and typographic (’) apostrophes

Here is the result:

$whitelist = '/[^\p{L}0-9\-\'’,. ]/u';

There is probably room for even further improvement. Finally, don't forget to add the u modifier!
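
A possible way to use this whitelist, sketched under the assumption that the input is valid UTF-8. The function name sanitizeDescription is hypothetical and not part of the original code; the test string is the one from the comment on the first answer.

```php
<?php
// Hypothetical helper: replaces every character that is not a letter, digit,
// hyphen, apostrophe (' or ’), comma, period, or space with a single space.
// preg_replace() returns null on failure (for example on invalid UTF-8 input),
// hence the nullable return type.
function sanitizeDescription(string $ds): ?string
{
    $whitelist = '/[^\p{L}0-9\-\'’,. ]/u';
    return preg_replace($whitelist, ' ', $ds);
}

$input = 'masculin de deux $%#$^%$&^dbd657657657*&()*)_()+!#!@#±?/ du paysage cultuel actuel.';
$clean = sanitizeDescription($input);

echo $clean, PHP_EOL;
```

With the u modifier in place, the text after the run of symbols should be preserved, and the ± is replaced as well because it is not a letter.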
