Regex to remove single characters from string

Question

Consider the following strings:

breaking out a of a simple prison
this is b moving up
following me is x times better

All strings are lowercased already. I would like to remove any "loose" a-z characters, resulting in:

breaking out of simple prison
this is moving up
following me is times better

Is this possible with a single regex in php?

To think people want to match HTML with regex! This post is a good example of why matching HTML with regex is such a bad idea, if there are so many pitfalls in removing a single character from a text.
– rid, Commented Apr 24, 2012 at 22:39
@Radu: Fortunately, whitespace is not as significant in HTML as it is in normal language :) (But in this case, it's a problem of unclearly defined specifications. If Pr0no (very mature nick, by the way, kid) had taken the time to think about his problem, he could have written a good question.)
– Tim Pietzcker, Commented Apr 25, 2012 at 4:57

Salman Arshad · Accepted Answer · 2012-04-24 22:55:54Z

3

$str = "breaking out a of a simple prison
this is b moving up
following me is x times better";
$res = preg_replace("@\\b[a-z]\\b ?@i", "", $str);
echo $res;

edited Apr 24, 2012 at 22:55

answered Apr 24, 2012 at 21:49

Salman Arshad

5 Comments

Tim Pietzcker Over a year ago

+1 for a clean solution although he was asking how to do it in a single regex

rid Over a year ago

Also, this removes spaces from other parts of the text as well, not just those around the "loose" a-z characters.

Salman Arshad Over a year ago

@Tim, just noticed the word single. I tried a single regex but I was unable to figure out if I should eat the leading whitespace or the trailing one (it is 3AM :)

Tim Pietzcker Over a year ago

Right, it's just past midnight here, too, and I should be going to bed. But I think my solution covers most cases, removing at most one space, preferably the preceding one.

Salman Arshad Over a year ago

@Radu: just revised my answer.

Cameron · Accepted Answer · 2012-04-24 21:47:07Z

2

How about:

preg_replace('/(^|\s)[a-z](\s|$)/', '$1', $string);

Note this also catches single characters that are at the beginning or end of the string, but not single characters that are adjacent to punctuation (they must be surrounded by whitespace).

If you also want to remove characters immediately before punctuation (e.g. 'the x.'), then this should work properly in most (English) cases:

preg_replace('/(^|\s)[a-z]\b/', '$1', $string);
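
To see how the two patterns above differ, here is a minimal sketch. The `$punct` sample and the variable names are made up for illustration; `$plain` is the first string from the question.

```php
<?php
// Sketch only: $punct is a made-up sample; $plain comes from the question.
$strict = '/(^|\s)[a-z](\s|$)/'; // letter bounded by whitespace or the string ends
$loose  = '/(^|\s)[a-z]\b/';     // letter may also be followed by punctuation

$plain = 'breaking out a of a simple prison';
$punct = 'look at the x. then stop';

$plainStrict = preg_replace($strict, '$1', $plain);
$punctStrict = preg_replace($strict, '$1', $punct);
$plainLoose  = preg_replace($loose, '$1', $plain);
$punctLoose  = preg_replace($loose, '$1', $punct);

echo $plainStrict, PHP_EOL; // breaking out of simple prison
echo $punctStrict, PHP_EOL; // look at the x. then stop  (unchanged)
echo $plainLoose, PHP_EOL;  // breaking out  of  simple prison  (two spaces remain)
echo $punctLoose, PHP_EOL;  // look at the . then stop
```

Because the second pattern no longer consumes the trailing whitespace, the space kept by `$1` and the space after the letter both survive, so it may need a follow-up step to collapse repeated spaces.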

edited Apr 24, 2012 at 21:47

answered Apr 24, 2012 at 21:38

Cameron

4 Comments

Salman Arshad Over a year ago

^ and $ match the beginning and end of string, not after and before newlines. Consider adding m modifier.

Cameron Over a year ago

@Salman: Yes, that's what I want. Why would you want to match line-wise? The \s takes care of newlines...

Cameron Over a year ago

@Tim: It keeps the leading one (as captured by $1)

Salman Arshad Over a year ago

@Cameron: sorry, just realized that you're replacing with $1 instead of "" so it should work without m.
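
The point settled in this exchange can be checked with a small sketch. The two-line input is a made-up sample, not taken from the question.

```php
<?php
// Hypothetical input: one loose letter at the string start, one right after a newline.
$text = "a first line\nb second line";

// No m modifier: ^ only matches at the very start of the string, but \s matches
// the newline in front of "b", and replacing with $1 puts that newline back.
$result = preg_replace('/(^|\s)[a-z](\s|$)/', '$1', $text);

var_dump($result === "first line\nsecond line"); // bool(true)
```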

Tim Pietzcker · Accepted Answer · 2012-04-24 22:11:46Z

1

As a one-liner:

$result = preg_replace('/\s\p{Ll}\b|\b\p{Ll}\s/u', '', $subject);

This matches a single lowercase letter (\p{Ll}) which is preceded or followed by whitespace (\s), removing both. The word boundaries (\b) ensure that only single letters are indeed matched. The /u modifier makes the regex Unicode-aware.

The result: A single letter surrounded by spaces on both sides is reduced to a single space. A single letter preceded by whitespace but not followed by whitespace is removed completely, as is a single letter only followed but not preceded by whitespace.

So

This a is my test sentence a. o How funny (what a coincidence a) this is!

is changed to

This is my test sentence. How funny (what coincidence) this is!
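
The transformation above can be reproduced with a short sketch; the variable names are arbitrary, and the second input is the first string from the question.

```php
<?php
$pattern = '/\s\p{Ll}\b|\b\p{Ll}\s/u';

$subject = 'This a is my test sentence a. o How funny (what a coincidence a) this is!';
$result  = preg_replace($pattern, '', $subject);
echo $result, PHP_EOL;
// This is my test sentence. How funny (what coincidence) this is!

$fromQuestion = preg_replace($pattern, '', 'breaking out a of a simple prison');
echo $fromQuestion, PHP_EOL;
// breaking out of simple prison
```

The first alternative is tried first at each position, which is why the preceding space is the one that gets removed when a letter has whitespace on both sides.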

edited Apr 24, 2012 at 22:11

answered Apr 24, 2012 at 21:59

Tim Pietzcker

1 Comment

Salman Arshad Over a year ago

What if the leading whitespace is a new line? The regex would eat it. (I had the same problem before I gave up). (edit: got it, change the first \s to [ \t]).
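
A minimal sketch of the problem raised in this comment and of the suggested change. The two-line input is a made-up sample, and only the leading-whitespace case is covered here.

```php
<?php
$original = '/\s\p{Ll}\b|\b\p{Ll}\s/u';
$revised  = '/[ \t]\p{Ll}\b|\b\p{Ll}\s/u'; // first \s narrowed to space or tab

// Hypothetical input: the loose letter is the first character of the second line.
$subject = "first line\na second line";

$eaten = preg_replace($original, '', $subject);
$kept  = preg_replace($revised, '', $subject);

var_dump($eaten === 'first line second line');  // bool(true): the newline is gone
var_dump($kept === "first line\nsecond line");  // bool(true): the newline survives
```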

Community · Accepted Answer · 2017-05-23 10:34:49Z

1

You could try something like this:

preg_replace('/\b\S\s\b/', "", $subject);

This is what it means:

\b    # Assert position at a word boundary
\S    # Match a single character that is a “non-whitespace character”
\s    # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
\b    # Assert position at a word boundary

Update

As raised by Radu, because I've used the \S this will match more than just a-zA-Z. It will also match 0-9_. Normally, it would match a lot more than that, but because it's preceded by \b, it can only match word characters.

As Tim Pietzcker mentioned in the comments, be aware that this won't work if the single characters to remove are followed by non-word characters, as in test a (hello). It will also fall over if there are extra spaces after the single character, like this:

test a  hello

but you could fix that by changing the expression to \b\S\s*\b
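
The caveats in this update can be illustrated with a short sketch. The variable names and the 'room 7 is free' sample are made up; the other inputs are the ones mentioned above.

```php
<?php
$single = '/\b\S\s\b/';  // exactly one whitespace character after the single character
$multi  = '/\b\S\s*\b/'; // any run of whitespace, as suggested above

$a = preg_replace($single, '', 'test a hello');    // one space: removed
$b = preg_replace($single, '', 'test a  hello');   // two spaces: no match
$c = preg_replace($multi, '', 'test a  hello');    // two spaces: removed
$d = preg_replace($single, '', 'room 7 is free');  // digits count as word characters
$e = preg_replace($single, '', 'test a (hello)');  // "(" is not a word character: no match

echo $a, PHP_EOL; // test hello
echo $b, PHP_EOL; // test a  hello
echo $c, PHP_EOL; // test hello
echo $d, PHP_EOL; // room is free
echo $e, PHP_EOL; // test a (hello)
```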

edited May 23, 2017 at 10:34

Community Bot

answered Apr 24, 2012 at 21:40

Robbie

9 Comments

Cameron Over a year ago

Does \b match at ^ and $ in PHP? I know it doesn't in some regex engines...

rid Over a year ago

You really should put this in single quotes. Also, if you don't use delimiters, it's invalid syntax.

Robbie Over a year ago

@Cameron, yes PCRE matches start and end of lines for word boundaries

rid Over a year ago

@Robbie, it's still invalid syntax. Also, \S matches a whole lot more than a-zA-Z.

rid Over a year ago

@Robbie, even in C#, \S catches many other things aside from a-zA-Z.

Wouter Dorgelo · Accepted Answer · 2012-07-16 13:50:11Z

0

Try this one:

$sString = preg_replace("@\b[a-z]{1}\b@m", ' ', $sString);
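
As a rough check of what this pattern produces, here is a sketch run against the second string from the question. The `$withQuantifier` and `$withoutQuantifier` names are made up for illustration.

```php
<?php
$sString = 'this is b moving up';

// The pattern from the answer, and the same pattern without {1} and the m modifier.
$withQuantifier    = preg_replace('@\b[a-z]{1}\b@m', ' ', $sString);
$withoutQuantifier = preg_replace('@\b[a-z]\b@', ' ', $sString);

var_dump($withQuantifier === $withoutQuantifier);    // bool(true)
var_dump($withQuantifier === 'this is   moving up'); // bool(true): three spaces
```

Since the letter is replaced by a space and the spaces on either side are left alone, the result contains a run of three spaces rather than the single space asked for in the question.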

edited Jul 16, 2012 at 13:50

Wouter Dorgelo

answered Apr 24, 2012 at 21:42

Anton

4 Comments

Anton Over a year ago

Yeah, you are right; in this case it is. We need it if more symbols are needed, e.g. {1,3} will remove words of 1 to 3 symbols.

David Thomas Over a year ago

What do the @ characters mean in php regex?

rid Over a year ago

@DavidThomas, they're just delimiters.

Salman Arshad Over a year ago

@DavidThomas: as far as preg_* functions are concerned, most punctuation characters and symbols can be used as delimiters.
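
A small sketch of the delimiter point: the same pattern written with three different delimiters gives the same result. The variable names are made up; the input is the second string from the question.

```php
<?php
$subject = 'this is b moving up';

// Same regex, different delimiters; the delimiter is not part of the pattern itself.
$withSlash = preg_replace('/\b[a-z]\b ?/', '', $subject);
$withAt    = preg_replace('@\b[a-z]\b ?@', '', $subject);
$withHash  = preg_replace('#\b[a-z]\b ?#', '', $subject);

echo $withSlash, PHP_EOL; // this is moving up
var_dump($withSlash === $withAt && $withAt === $withHash); // bool(true)
```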
