1

I'm building a task (in PHP) that reads all the files of my project in search of i18n messages. I want to detect messages like these:

// Basic example
__('Show in English')  => Show in English
// Get the message and the name of the i18n file 
__("Show in English", array(), 'page') => Show in English, page
// Be careful of quotes
__("View Mary's Car", array()) => View Mary's Car
// Be careful of strings after the __() expression
__('at').' '.function($param) => at

The regex that works for those cases (some other cases are also taken into account) is:

__\(.*?['|\"](.*?)(?:['|\"][\.|,|\)])(?: *?array\(.*?\),.*?['|\"](.*?)['|\"]\)[^\)])?

However, if the expression spans multiple lines, it doesn't work. I have to add the dotall modifier /s, but that breaks the previous regex because it no longer controls well when to stop matching:

// Detect with multiple lines
echo __('title_in_place', array(
    '%title%' => $place['title']
  ), 'welcome-user'); ?>    

One thing would solve the problem and simplify the regex: matching open and close parentheses. That way, no matter what's inside __() or how many parentheses there are, it "counts" the number of openings and expects the same number of closings.

Is it possible? How? Thanks a lot!

4 Answers

1

Yes. First, here is the classic example for simple nested brackets (parentheses):

\(([^()]|(?R))*\)

or faster versions which use a possessive quantifier:

\(([^()]++|(?R))*\)

or (equivalent) atomic grouping:

\((?>[^()]+|(?R))*\)
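
To see what these patterns do in practice, here is a minimal sketch using the atomic-group version with preg_match_all. The sample string is made up purely for illustration; only the pattern comes from the text above.

```php
<?php
// Classic recursive pattern: (?R) re-enters the whole expression
// each time a nested "(" is met.
$re = '/\((?>[^()]+|(?R))*\)/';

// Hypothetical sample input, chosen only for illustration.
$subject = 'f(a, (b + c), g(d)) and (e)';

preg_match_all($re, $subject, $matches);

// $matches[0] holds each outermost balanced group found in $subject.
print_r($matches[0]);
```

Each match starts at an outermost "(" and ends at its balancing ")", so nested groups are consumed as part of the enclosing match rather than being reported separately.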

But you can't use the (?R) "match whole expression" construct here, because the outermost brackets are special (they have two leading underscores). Here is a tested script which matches what (I think) you want...

Solution: Use group $1 (recursive) subroutine call: (?1)

<?php // test.php Rev:20120625_2200
$re_message = '/
    # match __(...(...)...) message lines (having arbitrary nesting depth).
    __\(                     # Outermost opening bracket (with leading __().
    (                        # Group $1: Bracket contents (subroutine).
      (?:                    # Group of bracket contents alternatives.
        [^()"\']++           # Either one or more non-brackets, non-quotes,
      | "[^"\\\\]*(?:\\\\[\S\s][^"\\\\]*)*"      # or a double quoted string,
      | \'[^\'\\\\]*(?:\\\\[\S\s][^\'\\\\]*)*\'  # or a single quoted string,
      | \( (?1) \)          # or a nested bracket (repeat group 1 here!).
      )*                    # Zero or more bracket contents alternatives.
    )                       # End $1: recursed subroutine.
    \)                      # Outermost closing bracket.
    .*                      # Match remainder of line following __()
    /mx';
$data = file_get_contents('testdata.txt');
$count = preg_match_all($re_message, $data, $matches);
printf("There were %d __(...) messages found.\n", $count);
for ($i = 0; $i < $count; ++$i) {
    printf("  message[%d]: %s\n", $i + 1, $matches[0][$i]);
}
?>

Note that this solution handles balanced parentheses (inside the "__(...)" construct) to any arbitrary depth (limited only by host memory). It also correctly handles quoted strings inside the "__(...)" and ignores any parentheses that may appear inside these quoted strings. Good luck.
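
If you want to try the pattern without creating testdata.txt, the following sketch feeds it some of the examples from the question through a nowdoc string. It assumes the same group $1 subroutine pattern as above, minus the trailing .* so that only the __(...) call itself is matched; the variable names are just for illustration.

```php
<?php
$re = '/__\(
    (                                            # group 1: bracket contents
      (?: [^()"\']++                             # non-brackets, non-quotes
        | "[^"\\\\]*(?:\\\\[\S\s][^"\\\\]*)*"      # double quoted string
        | \'[^\'\\\\]*(?:\\\\[\S\s][^\'\\\\]*)*\'  # single quoted string
        | \( (?1) \)                             # nested bracket
      )*
    )
    \)/x';

// Sample input taken from the question (nowdoc: no variable interpolation).
$data = <<<'TXT'
__('Show in English')
__("View Mary's Car", array())
echo __('title_in_place', array(
    '%title%' => $place['title']
  ), 'welcome-user');
TXT;

$count = preg_match_all($re, $data, $matches);

// $matches[1] holds everything between the outermost brackets of each call.
for ($i = 0; $i < $count; ++$i) {
    printf("call[%d] contents: %s\n", $i + 1, $matches[1][$i]);
}
```

Because the negated character class also matches newlines, the multi-line call is matched without needing the /s modifier.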

3 Comments

Really great answer. Thanks a lot! One last question: how do I capture the two strings I was capturing with the repeat groups? __("Show in English", array(), 'page') => [1]Show in English, [2]page
@fesja - You're welcome. Your follow-up question represents an entirely new question. When dealing with regex, you need to be very specific with your questions and provide example input (both matching and non-matching) as well as the desired output.
Thanks, although in fact I added that example input and output to the question (see the second example). Any reference you could give would be great!
1

Matching balanced parentheses is not possible with regular expressions (unless you use an engine with non-standard non-regular extensions, but even then it's still a bad idea and will be hard to maintain).

You could use a regular expression to find lines containing potential matches, then iterate over the string character by character counting the number of open and close parentheses until you find the index of the matching closing parenthesis.
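
A rough sketch of that two-step approach might look like the following. The function name findMatchingParen and the sample string are my own, for illustration only. As a simplifying assumption, this version does not skip parentheses that appear inside quoted strings.

```php
<?php
/**
 * Return the index of the ")" that closes the "(" at $openPos,
 * or -1 if the parentheses never balance.
 * Assumption: parentheses inside quoted strings are NOT skipped.
 */
function findMatchingParen(string $text, int $openPos): int
{
    $depth = 0;
    $length = strlen($text);
    for ($i = $openPos; $i < $length; $i++) {
        if ($text[$i] === '(') {
            $depth++;
        } elseif ($text[$i] === ')') {
            $depth--;
            if ($depth === 0) {
                return $i;
            }
        }
    }
    return -1;
}

// Multi-line example from the question.
$source = "echo __('title_in_place', array(\n    '%title%' => \$place['title']\n  ), 'welcome-user');";

// Step 1: a simple regex locates each "__(" candidate and its offset.
preg_match_all('/__\(/', $source, $found, PREG_OFFSET_CAPTURE);

// Step 2: count parentheses from each candidate to find where it ends.
$calls = [];
foreach ($found[0] as [$token, $offset]) {
    $open = $offset + 2;   // position of the "(" right after "__"
    $close = findMatchingParen($source, $open);
    if ($close !== -1) {
        $calls[] = substr($source, $offset, $close - $offset + 1);
    }
}

print_r($calls);
```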

0

The only way I'm aware of pulling this off is with balanced group definitions. That's a feature in the .NET flavor of regular expressions, and is explained very well in this article.

And as Qtax noted, this can be done in PCRE with (?R) as described in their documentation.

Or this could also be accomplished by writing a custom parser. Basically the idea would be to maintain a variable called ParenthesesCount as you're parsing from left to right. You'd increment ParenthesesCount every time you see ( and decrement for every ). I've written a parser recently that handles nested parentheses this way.
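
As an illustration of the counter idea (this is not the parser mentioned above), here is a small sketch that uses a $parenthesesCount variable to split the contents of a __() call into its top-level arguments. The helper name splitTopLevelArgs is hypothetical. It assumes only round parentheses and quoted strings matter, so short array syntax with square brackets is not handled.

```php
<?php
/**
 * Split the text found between "__(" and its closing ")" into
 * top-level arguments. Hypothetical helper, for illustration only.
 */
function splitTopLevelArgs(string $inside): array
{
    $args = [];
    $current = '';
    $parenthesesCount = 0;   // current nesting depth
    $quote = null;           // the quote character we are inside, if any
    $length = strlen($inside);

    for ($i = 0; $i < $length; $i++) {
        $ch = $inside[$i];
        if ($quote !== null) {
            // Inside a string: copy characters, honouring backslash escapes.
            $current .= $ch;
            if ($ch === '\\' && $i + 1 < $length) {
                $current .= $inside[++$i];
            } elseif ($ch === $quote) {
                $quote = null;
            }
            continue;
        }
        if ($ch === '"' || $ch === "'") {
            $quote = $ch;
        } elseif ($ch === '(') {
            $parenthesesCount++;
        } elseif ($ch === ')') {
            $parenthesesCount--;
        } elseif ($ch === ',' && $parenthesesCount === 0) {
            // A comma at depth 0 separates two arguments.
            $args[] = trim($current);
            $current = '';
            continue;
        }
        $current .= $ch;
    }
    if (trim($current) !== '') {
        $args[] = trim($current);
    }
    return $args;
}

$args = splitTopLevelArgs('"Show in English", array(), \'page\'');
print_r($args);
```

With the arguments separated this way, the first element would be the message and the third (when present) the name of the i18n file, still wrapped in their quotes.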

2 Comments

It's even easier in PCRE than .NET regex, see (?R).
@Qtax - Interesting. The regular expression can become pretty hard to understand once you dive into some of these more advanced features. But it's neat the way PCRE solves the problem. For the OP, this page describes the (?R) recursive pattern... pcre.org/pcre.txt
0

For me, an expression like this works:

(\(([^()]+)\))

I tried it on the following inputs:

 * 1) (1+2)
 * 2) (1+2)+(3+2)
 * 3) (IF 1 THEN 1 ELSE 0) > (IF 2 THEN 1 ELSE 1)
 * 4) (1+2) -(4+ (3+2))
 * 5) (1+2) -((4+ (3+2)-(6-7)))
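
For reference, here is a quick sketch of what this expression returns for two of the inputs above. As far as I can tell, because [^()]+ excludes both kinds of bracket, only the innermost groups are matched and an enclosing group such as (4+ (3+2)) is not reported as a whole.

```php
<?php
$re = '/(\(([^()]+)\))/';

// Inputs 1) and 4) from the list above.
$inputs = ['(1+2)', '(1+2) -(4+ (3+2))'];

$results = [];
foreach ($inputs as $input) {
    preg_match_all($re, $input, $m);
    $results[$input] = $m[1];   // group 1: the bracketed text, brackets included
}

print_r($results);
```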
