I'm trying to find instances of hyphens and replace them with en and em dashes.

So, in the example "10-100", the hyphen would be replaced by an en dash. Likewise, in "It is - without doubt - the worst" or "It is -- without doubt -- the worst", each hyphen (or pair of hyphens) would be replaced by an em dash.

However, I just can't figure out the proper pattern for preg_replace() in PHP.

"/[0-9]+(\-)[0-9]+/"

... appears to do the replacement, but it also removes the numbers.

How do I get preg_replace() to match on the characters either side of the hyphen without replacing them?

  • With assertions. Commented May 3, 2014 at 16:05
  • I suppose it goes without saying that Regular Expressions are weird, but I got it working, so thanks! Commented May 3, 2014 at 16:30

2 Answers

You can use lookbehinds and lookaheads:

function prettyDashes($string) {
    static $regex = array(
        '/(?<=\d)-(?=\d)/' => '&ndash;',  // EN-dash
        '/(?<=\s)-(?=\s)/' => '&mdash;',  // EM-dash
        '/(?<=\w)--(?=\w)/' => '&mdash;', // EM-dash
    );
    return preg_replace(array_keys($regex), array_values($regex), $string);
}
$tests = array(
    'There are 10-20 dogs in the kennel.',
    'My day was - without a doubt - the worst!',
    'My day was--without a doubt--the worst!',
);
foreach ($tests as $test) {
    echo prettyDashes($test), '<br>';
}
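
To see why the lookarounds in the function above keep the digits while the pattern from the question drops them, here is a small comparison. It is only a sketch: the input string and the variable names are made up for illustration. The point is that preg_replace() replaces everything the pattern matches, so the digits must either be captured and put back, or asserted without being matched.

```php
<?php
$input = 'Pages 10-100';

// The pattern from the question: the digits are part of the match,
// so they are replaced along with the hyphen.
$lost = preg_replace('/[0-9]+(\-)[0-9]+/', '&ndash;', $input);

// Option 1: capture the digits and put them back with $1 and $2.
$captured = preg_replace('/([0-9]+)-([0-9]+)/', '$1&ndash;$2', $input);

// Option 2: lookbehind and lookahead check for the digits
// without including them in the match.
$asserted = preg_replace('/(?<=[0-9])-(?=[0-9])/', '&ndash;', $input);

echo $lost, "\n";     // Pages &ndash;
echo $captured, "\n"; // Pages 10&ndash;100
echo $asserted, "\n"; // Pages 10&ndash;100
```

Both options give the same result here; the lookaround form just keeps the replacement string simpler.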

The problem is that it is difficult to detect and avoid false-positives when replacing stuff like this. Normal hyphenated words, like "to-do", are not parenthetical asides (em dash), and dates, like 18-12-2014, are not ranges (en dash), yet the first pattern above would still convert both hyphens in that date. You have to be quite conservative in what you replace, and you should not be surprised if something is changed erroneously.
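
One way to be more conservative might look like the sketch below. It is not a complete solution, and everything in it is an assumption for illustration: the function name prettyDashesConservative() is hypothetical, URLs are assumed to start with http:// or https:// and run to the next whitespace, and a "date" is assumed to be any chain of three or more hyphen-separated digit groups. Other false positives (phone numbers, ISBNs, and so on) would still slip through.

```php
<?php
// Hypothetical helper: same lookaround idea as prettyDashes(), but
// URLs are left untouched and date-like digit chains are skipped.
function prettyDashesConservative(string $text): string
{
    // Split on URLs and keep them in the result; with one capture group,
    // the odd-numbered elements are the URLs themselves.
    $parts = preg_split('~(https?://\S+)~', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            continue; // a URL: leave its hyphens alone
        }
        $parts[$i] = preg_replace(
            array(
                // digits-digits, but not inside a longer chain like 18-12-2014
                '/(?<![\d-])(\d+)-(\d+)(?![\d-])/',
                // a single or double hyphen surrounded by whitespace
                '/(?<=\s)--?(?=\s)/',
                // a double hyphen directly between two word characters
                '/(?<=\w)--(?=\w)/',
            ),
            array('$1&ndash;$2', '&mdash;', '&mdash;'),
            $part
        );
    }

    return implode('', $parts);
}
```

With this version, 'There are 10-20 dogs.' still gets an en dash, while 'Due 18-12-2014.' and the hyphens inside a URL come back unchanged.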


2 Comments

"The problem is that it is difficult to detect and avoid false-positives when replacing stuff like this." Which is exactly the problem I'm having at present: for example, detecting (and ignoring) hyphens within a URL.
@WayneSmallman Yeah, that is just a problem that comes with doing this. You have to know every false-positive (or at least know how to identify them somehow) in order to avoid replacing correctly used hyphens. Implementing it would also be quite nontrivial.

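So, with thanks to @mario, the answers are:

"/(?<=[0-9])-(?=[0-9])/"

"/(?<=\w) - (?=\w)/"

"/(?<=\w) -- (?=\w)/"

Each uses a lookbehind, (?<=...), for the character before the dash and a lookahead, (?=...), for the character after it, so only the dash itself (with its surrounding spaces, in the last two) is replaced.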
