(PHP) Parsing RegEx string - balancing brackets

Question

I'm trying to parse string in the following format (EBNF, I hope this is right) in PHP:

<exp>      ::= <base>[{<modifier>["!"]"("<exp>")"}]
<base>     ::= <role>[{<modifier><role>}]
<modifier> ::= "&" | "|"
<role>     ::= ["!"]<str>[","<str>]

Where <str> is any string that would pass [a-zA-Z0-9\-]+

The following are example of patterns that would have to be parsed:

token1
token1&token2
token1|(token2&!token3)
(token1&token2)|(token3&(token4|(!token5,12&token6)))
!(token1&token2|(token3&!token4))|token5,12

I am trying to write a RegEx pattern that would always give me four groups:

The left-most <expression>. From the above example this would be:
- token1
- token1
- token1
- token1&token2
- token1&token2|(token3&!token4)
If ["!"] was present. I.e.
- null
- null
- null
- null
- !
The <modifier> for the next <expression> (if any). This would be:
- null
- &
- |
- |
- |
The remaining of the pattern.
- null
- token2
- token2&!token3
- token3&(token4|(!token5,12&token6))
- token5,12

I can parse this provided that the first expression doesn't contain any <modifier>s.

^\(?(!?)([a-zA-Z0-9\-]+)\)?([&|]?)(.*)$

I am stuck at this point. I have tried using lookarounds, however I can't figure out how to ensure that the group is captured when all brackets are balanced. Is this achievable with RegEx or do I need to write code using loops etc. to do this?

You can almost literally translate your description into a PCRE ?(DEFINE) block. However, PCRE only allows matching, not parsing. You won't get the split up result list as you'd want. (Unless you also use a recursive preg_replace_callback to collect all tokens.) — mario
– mario, Commented Jul 7, 2012 at 21:05

scriptin · Accepted Answer · 2012-07-07 20:54:21Z

1

As far as I know, it is impossible.

You have a context-free grammar (EBNF is for this type of grammars - Type-2 grammars), which cannot be parsed with regular expressions (which are for regular grammars - Type-3 grammars).

http://en.wikipedia.org/wiki/Chomsky_hierarchy

As an example of the thing you cannot handle here: number of opening parantheses - you can only write one regexp for each number of these (but there can be infinite, right?), otherwise there is no way to tell if the number of matching closing parantheses is the same. There is no way to count how many chars mathed by the specific part of regexp with quantifiers (+, *, etc.)

answered Jul 7, 2012 at 20:54

scriptin

3,1701 gold badge28 silver badges36 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Evert Over a year ago

PHP's regexes do allow recursion; but I would recommend against it.

scriptin Over a year ago

@Evert Yes, but this doesn't help. Example: "$m=array(); preg_match('/ < ( (?>[a-z]+) | (?R) )* > /x', '<abc<<>', $m); var_dump($m);" - this works because regexp doesn't match the whole string - no ^ and $. BUT, if we add ^ and $, it would NOT work even for '<abc<>>' as input string, because (?R) now contains ^ and $. I hope this explanation is comprehensible.

Bart Platak Over a year ago

I have looked into the recursion but it seems like a really messy way to do it. Instead I am now writing my own parser to deal with this.. Thanks :)

mario Over a year ago

Not correct. Just because they are called regular expressions, does not mean they are limited to regular languages. And FYI you can also invoke a particular subexpression recursively (?1) thus excluding ^ and $.

Collectives™ on Stack Overflow

(PHP) Parsing RegEx string - balancing brackets

1 Answer 1

4 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related