Function to return only alpha-numeric characters from string?

Question

I'm looking for a PHP function that will take an input string and return a sanitized version of it by stripping away all special characters, leaving only alphanumeric characters.

I need a second function that does the same but only returns alphabetic characters A-Z.

Any help much appreciated.

Which Unicode Normalization Form are these in, and whyever would you want to do this?
– tchrist, Commented Mar 4, 2011 at 21:03
When you say A-Z and 'alphanumeric', do you really mean only A-Z or do you want to match all letters from all languages, including foreign languages and obsolete scripts?
– Mark Byers, Commented Mar 4, 2011 at 21:04
If you’re doing this so you can do an accent-insensitive string comparison, you’re doing the wrong thing.
– tchrist, Commented Mar 4, 2011 at 21:08
It’s not just “from all languages”. It’s English. English uses the Latin script. There are unichars '\p{Latin}' '\p{Alphabetic}' '[^A-Za-z]' | wc -l == 1192 code points that are Latin alphabetics but which are not A-Z. It is a commonly held myth that ASCII is enough for English. It’s not, and that’s why writing A-Z has a code smell to it.
– tchrist, Commented Mar 4, 2011 at 21:10
@Scott B: English doesn't just use the 26 letters from A-Z. For example, the word résumé includes é. Perhaps you could explain what you are trying to do, as this might help get you better answers.
– Mark Byers, Commented Mar 4, 2011 at 21:17

Mark Byers · Accepted Answer · 2011-03-04 21:22:26Z

265

Warning: Note that English is not restricted to just A-Z.

Try this to remove everything except a-z, A-Z and 0-9:

$result = preg_replace("/[^a-zA-Z0-9]+/", "", $s);

If your definition of alphanumeric includes letters in foreign languages and obsolete scripts then you will need to use the Unicode character classes.

Try this to leave only A-Z:

$result = preg_replace("/[^A-Z]+/", "", $s);

The reason for the warning is that words like résumé contain the letter é, which won't be matched by this. If you want to match a specific list of letters, adjust the regular expression to include those letters. If you want to match all letters, use the appropriate character classes as mentioned in the comments.
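As a rough sketch of how the patterns above could be wrapped into the functions the question asks for: the function names below are made up for illustration, and the Unicode variant assumes the input is valid UTF-8. Note that /[^A-Z]+/ as written keeps uppercase letters only, so the alphabetic helper here lists both cases.

```php
<?php
// Hypothetical helper names; the patterns follow the ones in the answer.

// ASCII only: keep a-z, A-Z and 0-9.
function alnum_ascii(string $s): string
{
    return preg_replace('/[^a-zA-Z0-9]+/', '', $s);
}

// ASCII only: keep letters in both cases (a plain [^A-Z] would drop lowercase).
function alpha_ascii(string $s): string
{
    return preg_replace('/[^a-zA-Z]+/', '', $s);
}

// Unicode-aware: keep any letter (\p{L}) or number (\p{N}).
// The u modifier makes PCRE treat pattern and subject as UTF-8.
// preg_replace() returns null on error (e.g. invalid UTF-8), hence the fallback.
function alnum_unicode(string $s): string
{
    return preg_replace('/[^\p{L}\p{N}]+/u', '', $s) ?? '';
}

echo alnum_ascii("résumé #42"), "\n";   // rsum42
echo alnum_unicode("résumé #42"), "\n"; // résumé42
```

Which variant is appropriate depends on what "alphanumeric" is supposed to mean for the data at hand.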

edited Mar 4, 2011 at 21:22

answered Mar 4, 2011 at 20:58

Mark Byers


11 Comments

tchrist Over a year ago

No, an alphanumeric is [\p{Alphabetic}\p{Numeric}]. I forget the PCRE alphabetic property, but you can approximate it with [\pL\pM\pN].

Mark Byers Over a year ago

@tchrist: I assume that because he specifically mentioned A-Z that he only wants to match that, though I admit that the question could be a lot more clear on this point. I'll ask for a clarification.

tchrist Over a year ago

@Mark, I wasn’t arguing with the second part of your answer, although if he hasn’t canonically decomposed the string first, it won’t work right. I was arguing with the first part. Also, I try to always write regexes that work on any data, not just on moldy old ASCII. :) Hence the mantra that this side of the Millennium, [A-Z] is always wrong, sometimes.
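To illustrate the point about decomposition and [\pL\pM\pN], here is a small sketch. It assumes PHP 7 or later for the "\u{...}" escapes and writes the two forms of é by hand; if the intl extension is available, Normalizer::normalize() could be used to bring input to one form first.

```php
<?php
// "é" as one precomposed code point, and as "e" + combining acute accent.
$nfc = "r\u{00E9}sum\u{00E9}";
$nfd = "re\u{0301}sume\u{0301}";

// \pL and \pN alone drop the combining marks from the decomposed form.
$lettersOnly = preg_replace('/[^\pL\pN]+/u', '', $nfd);

// Adding \pM keeps the marks attached to their base letters.
$withMarks = preg_replace('/[^\pL\pM\pN]+/u', '', $nfd);

var_dump($lettersOnly === 'resume'); // bool(true)
var_dump($withMarks === $nfd);       // bool(true)

// The precomposed form survives either way, since é is itself a letter.
var_dump(preg_replace('/[^\pL\pN]+/u', '', $nfc) === $nfc); // bool(true)
```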

JD Isaacks Over a year ago

@Mark Byers, I see.. and yes, I prefer the i, but I have only ever had to worry about an English demographic .. I forget many people have to think about other languages. BTW I just noticed you are the highest rep'd user who has never asked 1 question. Even Jon Skeet has asked questions before!

Dennis Over a year ago

why is there a + at the end of the regexp? Wouldn't it be ... same if you remove it?
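On the + question, a quick sketch suggests the output is the same when the replacement is an empty string; what changes is how many matches are made. The sample string is arbitrary, and the fifth argument of preg_replace() is used only to read back the replacement count.

```php
<?php
$s = "a--b!!c";

// Without +, each unwanted character is a separate match.
$without = preg_replace('/[^a-zA-Z0-9]/', '', $s, -1, $countWithout);

// With +, each run of unwanted characters is a single match.
$with = preg_replace('/[^a-zA-Z0-9]+/', '', $s, -1, $countWith);

var_dump($without === $with);                 // bool(true), both are "abc"
echo $countWithout, " vs ", $countWith, "\n"; // 4 vs 2
```

The difference would become visible with a non-empty replacement, for example replacing each run with a single space.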


Oli · Accepted Answer · 2022-11-30 13:44:11Z

3

Try this to keep accented characters:

$result = preg_replace("/[^A-zÀ-ú0-9]+/", "", $s);
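Two caveats with ranges of this kind: in ASCII, A-z also spans the punctuation between Z and a ([ \ ] ^ _ `), and without the u modifier PCRE matches byte by byte rather than by code point. A hedged variant with explicit ranges might look like the sketch below; the function name is made up, and the ranges cover only the Latin-1 letters, which is an assumption about what "accented characters" should include.

```php
<?php
// A sketch, not the answer's exact pattern: explicit ranges plus the u modifier.
// \x{C0}-\x{D6}, \x{D8}-\x{F6} and \x{F8}-\x{FF} are the Latin-1 letters
// (À to ÿ), skipping the × (U+00D7) and ÷ (U+00F7) signs.
function keep_latin1_alnum(string $s): string
{
    $pattern = '/[^A-Za-z0-9\x{C0}-\x{D6}\x{D8}-\x{F6}\x{F8}-\x{FF}]+/u';
    // preg_replace() returns null on error (e.g. invalid UTF-8).
    return preg_replace($pattern, '', $s) ?? '';
}

echo keep_latin1_alnum("Crème_brûlée [x2]"), "\n"; // Crèmebrûléex2
```

Letters outside Latin-1, such as ą or ę, would still be stripped by this version.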

answered Nov 30, 2022 at 13:44

Oli


samayo · Accepted Answer · 2016-07-05 15:02:36Z

0

Rather than preg_replace, you could use PHP's filter extension: filter_var() with FILTER_SANITIZE_STRING.

edited Jul 5, 2016 at 15:02

samayo


answered Mar 4, 2011 at 21:16

Mark Baker


7 Comments

tchrist Over a year ago

Does PHP have access to the ISO Stringprep algorithm? I know Perl and Java do.

Mark Baker Over a year ago

I believe the string filter function works predominantly with 7-bit ASCII, but don't quote me on that.

Pere Over a year ago

Please, can you tell us an explicit way of doing what the user is asking for using FILTER_SANITIZE_STRING? To my knowledge, the closest that can be achieved this way is with FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH, but that won't leave just letters and numbers but also dots, slashes, percent signs and the like.

Siraj Alam Over a year ago

This looks more like a comment than an answer. Please give a proper explanation when writing an answer.

Kzqai Over a year ago

I don't believe there is an actual FILTER_SANITIZE to alphanumeric on there, unfortunately. Pretty major omission.
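Since there is no built-in alphanumeric sanitize filter, and FILTER_SANITIZE_STRING has been deprecated as of PHP 8.1, one way to stay within the filter API is FILTER_CALLBACK with a custom callable. The following is only a sketch; the callback name is hypothetical and the pattern is the ASCII one from the accepted answer.

```php
<?php
// Hypothetical callback; FILTER_CALLBACK passes the value to any callable.
function strip_non_alnum(string $value): string
{
    return preg_replace('/[^a-zA-Z0-9]+/', '', $value);
}

$clean = filter_var(
    "user@example.com/../42",
    FILTER_CALLBACK,
    ['options' => 'strip_non_alnum']
);

echo $clean, "\n"; // userexamplecom42
```

This is still preg_replace underneath, so the main benefit is fitting into code that already routes input through filter_var() or filter_input().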


Eryk Wróbel · Accepted Answer · 2024-08-22 10:50:26Z

0

If you want to keep only alphanumeric characters, including language-specific accented letters, I would rather use:

$clean_string = preg_replace("/[^\w\s+$]/u", "", $string);

This will do the job and keep special characters like: ą, ó, ę, ö, ñ.
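One thing worth checking with this pattern is what else it keeps: \w also matches the underscore, and the class explicitly allows whitespace, + and $. A small comparison, using made-up sample input and assuming valid UTF-8, could look like this:

```php
<?php
$input = "Zażółć_gęślą jaźń +$5 (100%)!";

// The pattern above: letters, digits and "_" via \w, plus whitespace, "+" and "$".
$loose = preg_replace('/[^\w\s+$]/u', '', $input);

// Letters and numbers only, using Unicode property classes.
$strict = preg_replace('/[^\p{L}\p{N}]/u', '', $input);

echo $loose, "\n";  // Zażółć_gęślą jaźń +$5 100
echo $strict, "\n"; // Zażółćgęśląjaźń5100
```

If spaces should survive but symbols should not, adding \s to the second class is one option.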

answered Aug 22, 2024 at 10:50

Eryk Wróbel
