Consider the following strings:

breaking out a of a simple prison
this is b moving up
following me is x times better

All strings are lowercased already. I would like to remove any "loose" a-z characters, resulting in:

breaking out of simple prison
this is moving up
following me is times better

Is this possible with a single regex in PHP?

  • Yes, what do you have so far? Commented Apr 24, 2012 at 21:34
  • To think people want to match HTML with regex! This post is a good example of why matching HTML with regex is such a bad idea, if there are so many pitfalls in removing a single character from a text. Commented Apr 24, 2012 at 22:39
  • @Radu: Fortunately, whitespace is not as significant in HTML as it is in normal language :) (But in this case, it's a problem of unclearly defined specifications. If Pr0no (very mature nick, by the way, kid) had taken the time to think about his problem, he could have written a good question.) Commented Apr 25, 2012 at 4:57

5 Answers

3
$str = "breaking out a of a simple prison
this is b moving up
following me is x times better";
$res = preg_replace("@\\b[a-z]\\b ?@i", "", $str);
echo $res;

5 Comments

+1 for a clean solution although he was asking how to do it in a single regex
Also, this removes spaces from other parts of the text as well, not just those around the "loose" a-z characters.
@Tim, just noticed the word single. I tried a single regex but I was unable to figure out if I should eat the leading whitespace or the trailing one (it is 3AM :)
Right, it's just past midnight here, too, and I should be going to bed. But I think my solution covers most cases, removing at most one space, preferably the preceding one.
@Radu: just revised my answer.
2

How about:

preg_replace('/(^|\s)[a-z](\s|$)/', '$1', $string);

Note this also catches single characters that are at the beginning or end of the string, but not single characters that are adjacent to punctuation (they must be surrounded by whitespace).
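To make those cases concrete, here is a small sketch that runs the pattern above on a few inputs. The helper name stripLooseLetters and the sample strings other than the one from the question are made up for illustration.

```php
<?php
// Hypothetical helper (not part of the answer): wraps the pattern shown above.
function stripLooseLetters(string $string): string
{
    // Group 1 (the leading whitespace or string start) is put back;
    // the letter and the whitespace after it are dropped.
    return preg_replace('/(^|\s)[a-z](\s|$)/', '$1', $string);
}

echo stripLooseLetters('breaking out a of a simple prison'), "\n"; // breaking out of simple prison
echo stripLooseLetters('a lone letter at the start'), "\n";        // lone letter at the start
echo stripLooseLetters('the x.'), "\n";                            // the x.  (unchanged: "." is not whitespace)
```

The last call shows the limitation mentioned above: the letter sits next to punctuation, so nothing is removed.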

If you also want to remove characters immediately before punctuation (e.g. 'the x.'), then this should work properly in most (English) cases:

preg_replace('/(^|\s)[a-z]\b/', '$1', $string);
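A quick check of this second variant, as a sketch only: because just the leading whitespace is part of the match and it is put back via $1, the whitespace on both sides of the removed letter stays in the result. The helper name is hypothetical.

```php
<?php
// Hypothetical helper wrapping the \b variant shown above.
function stripLooseLettersBeforePunctuation(string $string): string
{
    return preg_replace('/(^|\s)[a-z]\b/', '$1', $string);
}

// The letter before the full stop is removed; the space before it is kept.
echo stripLooseLettersBeforePunctuation('the x.'), "\n";               // the .
// Both spaces around the removed "b" survive, giving a double space.
echo stripLooseLettersBeforePunctuation('this is b moving up'), "\n";  // this is  moving up
```

If the doubled whitespace matters, it would have to be collapsed separately or handled by a pattern that also consumes one of the neighbouring spaces.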

4 Comments

^ and $ match the beginning and end of string, not after and before newlines. Consider adding m modifier.
@Salman: Yes, that's what I want. Why would you want to match line-wise? The \s takes care of newlines...
@Tim: It keeps the leading one (as captured by $1)
@Cameron: sorry, just realized that you're replacing with $1 instead of "" so it should work without m.
1

As a one-liner:

$result = preg_replace('/\s\p{Ll}\b|\b\p{Ll}\s/u', '', $subject);

This matches a single lowercase letter (\p{Ll}) which is preceded or followed by whitespace (\s), removing both. The word boundaries (\b) ensure that only single letters are indeed matched. The /u modifier makes the regex Unicode-aware.

The result: A single letter surrounded by spaces on both sides is reduced to a single space. A single letter preceded by whitespace but not followed by whitespace is removed completely, as is a single letter only followed but not preceded by whitespace.

So

This a is my test sentence a. o How funny (what a coincidence a) this is!

is changed to

This is my test sentence. How funny (what coincidence) this is!
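The example above can be reproduced with a short script. This is only a sketch; the variable names are arbitrary.

```php
<?php
$subject = 'This a is my test sentence a. o How funny (what a coincidence a) this is!';

// First alternative: whitespace + single lowercase letter + word boundary.
// Second alternative: word boundary + single lowercase letter + whitespace.
$result = preg_replace('/\s\p{Ll}\b|\b\p{Ll}\s/u', '', $subject);

echo $result, "\n"; // This is my test sentence. How funny (what coincidence) this is!
```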

1 Comment

What if the leading whitespace is a new line? The regex would eat it. (I had the same problem before I gave up). (edit: got it, change the first \s to [ \t]).
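A minimal sketch of the newline issue raised in this comment, comparing the original pattern with the suggested [ \t] change. The two-line sample string is invented for illustration.

```php
<?php
$subject = "first line\nb second line";

// Original pattern: \s in the first alternative also matches "\n",
// so the line break is removed together with the letter.
$original = preg_replace('/\s\p{Ll}\b|\b\p{Ll}\s/u', '', $subject);

// Variant from the comment: only a space or a tab may precede the letter,
// so the second alternative removes "b " and the line break survives.
$adjusted = preg_replace('/[ \t]\p{Ll}\b|\b\p{Ll}\s/u', '', $subject);

echo json_encode($original), "\n"; // "first line second line"
echo json_encode($adjusted), "\n"; // "first line\nsecond line"
```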
1

You could try something like this:

preg_replace('/\b\S\s\b/', "", $subject);

This is what it means:

\b    # Assert position at a word boundary
\S    # Match a single character that is a “non-whitespace character”
\s    # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
\b    # Assert position at a word boundary

Update

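As Radu pointed out, because I used \S, this matches more than just a-zA-Z: it also matches 0-9 and _. On its own \S would match far more, but the leading \b narrows it down: the character must either start a word or be a non-word character that directly follows a word (so the comma and space in "hello, world" would be removed as well).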

As Tim Pietzcker mentioned in the comments, be aware that this won't work if the single character to remove is followed by a non-word character, as in "test a (hello)". It will also fail if there are extra spaces after the single character, like this:

test a  hello 

but you could fix that by changing the expression to \b\S\s*\b
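The behaviour described in this answer can be checked with a short sketch. The variable names and the sample strings other than "test a  hello" are made up for illustration.

```php
<?php
$strict  = '/\b\S\s\b/';   // exactly one whitespace character after the single character
$relaxed = '/\b\S\s*\b/';  // any amount of whitespace after it (including none)

echo preg_replace($strict, '', 'test a hello'), "\n";    // test hello
echo preg_replace($strict, '', 'test a  hello'), "\n";   // test a  hello  (unchanged: two spaces)
echo preg_replace($relaxed, '', 'test a  hello'), "\n";  // test hello
echo preg_replace($strict, '', 'route 7 north'), "\n";   // route north   (digits match \S too)
echo preg_replace($strict, '', 'hello, world'), "\n";    // helloworld    (so does punctuation after a word)
```

The last two lines show why a character class such as [a-z] is the safer choice when only letters should be removed.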

9 Comments

Does \b match at ^ and $ in PHP? I know it doesn't in some regex engines...
You really should put this in single quotes. Also, if you don't use delimiters, it's invalid syntax.
@Cameron, yes, in PCRE \b also matches at the start and end of the string, provided the adjacent character is a word character.
@Robbie, it's still invalid syntax. Also, \S matches a whole lot more than a-zA-Z.
@Robbie, even in C#, \S catches many other things aside from a-zA-Z.
0

Try this one:

$sString = preg_replace("@\b[a-z]{1}\b@m", ' ', $sString);

4 Comments

Yeah, you are right, in this case it is. We need it if more symbols are needed, e.g. {1,3} will remove words of 1 to 3 symbols.
What do the @ characters mean in php regex?
@DavidThomas, they're just delimiters.
@DavidThomas: as far as preg_* functions are concerned, most punctuation characters and symbols can be used as delimiters.
