122

I'm looking for a PHP function that takes an input string and returns a sanitized version of it, stripping away all special characters and leaving only alphanumeric characters.

I need a second function that does the same but only returns alphabetic characters A-Z.

Any help much appreciated.

10
  • Which Unicode Normalization Form are these in, and whyever would you want to do this? Commented Mar 4, 2011 at 21:03
  • 1
    When you say A-Z and 'alphanumeric', do you really mean only A-Z or do you want to match all letters from all languages, including foreign languages and obsolete scripts? Commented Mar 4, 2011 at 21:04
  • If you’re doing this so you can do an accent-insensitive string comparison, you’re doing the wrong thing. Commented Mar 4, 2011 at 21:08
  • 3
    It’s not just “from all languages”. It’s English. English uses the Latin script. There are unichars '\p{Latin}' '\p{Alphabetic}' '[^A-Za-z]' | wc -l == 1192 code points that are Latin alphabetics but which are not A-Z. It is a commonly held myth that ASCII is enough for English. It’s not, and that’s why writing A-Z has a code smell to it. Commented Mar 4, 2011 at 21:10
  • 1
    @Scott B: English doesn't just use the 26 letters from A-Z. For example the word résumé includes é. Perhaps you could explain what you are trying to do as this might help get you better answers. Commented Mar 4, 2011 at 21:17

4 Answers

265

Warning: Note that English is not restricted to just A-Z.

Try this to remove everything except a-z, A-Z and 0-9:

$result = preg_replace("/[^a-zA-Z0-9]+/", "", $s);
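Since the question asks for a function, the one-liner above could be wrapped along these lines. This is only a sketch: the name keep_alphanumeric is made up for illustration and is not a PHP built-in.

```php
<?php
// Hypothetical helper name, not part of PHP itself.
// Removes every character that is not an ASCII letter or digit.
function keep_alphanumeric(string $s): string
{
    // preg_replace() returns null on error, so fall back to an empty string.
    return preg_replace('/[^a-zA-Z0-9]+/', '', $s) ?? '';
}

echo keep_alphanumeric("Hello, World! 123"), "\n"; // HelloWorld123
```

Note that this also drops accented letters, so "résumé" comes back as "rsum".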

If your definition of alphanumeric includes letters in foreign languages and obsolete scripts then you will need to use the Unicode character classes.

Try this to leave only A-Z:

$result = preg_replace("/[^A-Z]+/", "", $s);
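Be aware that the pattern above keeps uppercase letters only. If "A-Z" in the question was meant to cover both cases, a function could take a flag for that. A minimal sketch, where the name keep_letters and the $uppercaseOnly parameter are assumptions made for illustration:

```php
<?php
// Hypothetical helper; keeps ASCII letters only.
// With $uppercaseOnly = true it uses the same /[^A-Z]+/ pattern as above.
function keep_letters(string $s, bool $uppercaseOnly = false): string
{
    $pattern = $uppercaseOnly ? '/[^A-Z]+/' : '/[^A-Za-z]+/';
    // preg_replace() returns null on error, so fall back to an empty string.
    return preg_replace($pattern, '', $s) ?? '';
}

echo keep_letters("Route 66, Arizona"), "\n";       // RouteArizona
echo keep_letters("Route 66, Arizona", true), "\n"; // RA
```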

The reason for the warning is that words like résumé contain the letter é, which won't be matched by this. If you want to match a specific list of letters, adjust the regular expression to include those letters. If you want to match all letters, use the appropriate character classes as mentioned in the comments.
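For the "all letters" case, PCRE's Unicode property classes can be used together with the u modifier. The following is a sketch rather than a definitive implementation: the function names are invented for illustration, and it assumes the input is valid UTF-8 (with the u modifier, preg_replace() returns null on invalid UTF-8, which the sketch turns into an empty string).

```php
<?php
// Hypothetical helper names. Both assume $s is valid UTF-8.
function keep_unicode_alphanumeric(string $s): string
{
    // \p{L} = letters, \p{M} = combining marks, \p{N} = numbers
    return preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '', $s) ?? '';
}

function keep_unicode_letters(string $s): string
{
    // Same idea without \p{N}, so digits are removed too.
    return preg_replace('/[^\p{L}\p{M}]+/u', '', $s) ?? '';
}

echo keep_unicode_alphanumeric("résumé #2!"), "\n"; // résumé2
echo keep_unicode_letters("résumé #2!"), "\n";      // résumé
```

Including \p{M} keeps combining accents attached to their base letters when the string is in a decomposed form.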


11 Comments

No, an alphanumeric is [\p{Alphabetic}\p{Numeric}]. I forget the PCRE alphabetic property, but you can approximate it with [\pL\pM\pN].
@tchrist: I assume that because he specifically mentioned A-Z that he only wants to match that, though I admit that the question could be a lot more clear on this point. I'll ask for a clarification.
@Mark, I wasn’t arguing with the second part of your answer, although if he hasn’t canonically decomposed the string first, it won’t work right. I was arguing with the first part. Also, I try to always write regexes that work on any data, not just on moldy old ASCII. :) Hence the mantra that this side of the Millennium, [A-Z] is always wrong, sometimes.
@Mark Byers, I see.. and Yes I prefer the i but I have only ever had to worry about an English demographic .. I forget many people have to think about other languages. BTW I just noticed you are the highest rep'd user who has never asked 1 question. Even Jon Skeet has asked questions before!
why is there a + at the end of the regexp? Wouldn't it be ... same if you remove it?
|
3

Try this to keep accented characters (the u modifier assumes the string is UTF-8):

$result = preg_replace("/[^A-Za-zÀ-ú0-9]+/u", "", $s);


0

Rather than preg_replace, you could always use PHP's filter functions using the filter_var() function with FILTER_SANITIZE_STRING.

7 Comments

Does PHP have access to the ISO Stringprep algorithm? I know Perl and Java do.
I believe the string filter function works predominantly with 7-bit ASCII, but don't quote me on that.
Please, can you tell us an explicit way of doing what the user is asking for using FILTER_SANITIZE_STRING? To my knowledge, the closest that can be achieved this way is with FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH, but that won't leave just letters and numbers but also dots, slashes, percent signs and the like.
It looks more like a comment rather than an answer. Give a proper explanation while writing an answer.
I don't believe there is an actual FILTER_SANITIZE to alphanumeric on there, unfortunately. Pretty major omission.
|
0

If you want to keep only alphanumeric characters, including language-specific accented letters, I would rather use:

$clean_string = preg_replace("/[^\w\s]+/u", "", $string);

This will do the job and keep special characters like: ą, ó, ę, ö, ñ.
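One thing to keep in mind with this approach is that \w also matches the underscore, and having \s in the negated class means whitespace is kept as well. The sketch below compares it with a stricter variant built on Unicode properties. The variable names are illustrative, and both patterns assume valid UTF-8 input.

```php
<?php
$input = "año_nuevo 2024!";

// \w with the u modifier matches letters and digits, but also "_";
// \s in the negated class keeps whitespace.
$loose = preg_replace('/[^\w\s]/u', '', $input);

// Stricter sketch: letters and numbers only, no underscore, no whitespace.
$strict = preg_replace('/[^\p{L}\p{N}]/u', '', $input);

echo $loose, "\n";  // año_nuevo 2024
echo $strict, "\n"; // añonuevo2024
```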
