8

I'm trying to parse string in the following format (EBNF, I hope this is right) in PHP:

<exp>      ::= <base>[{<modifier>["!"]"("<exp>")"}]
<base>     ::= <role>[{<modifier><role>}]
<modifier> ::= "&" | "|"
<role>     ::= ["!"]<str>[","<str>]

Where <str> is any string that would pass [a-zA-Z0-9\-]+

The following are example of patterns that would have to be parsed:

token1
token1&token2
token1|(token2&!token3)
(token1&token2)|(token3&(token4|(!token5,12&token6)))
!(token1&token2|(token3&!token4))|token5,12

I am trying to write a RegEx pattern that would always give me four groups:

  1. The left-most <expression>. From the above example this would be:
    • token1
    • token1
    • token1
    • token1&token2
    • token1&token2|(token3&!token4)
  2. If ["!"] was present. I.e.
    • null
    • null
    • null
    • null
    • !
  3. The <modifier> for the next <expression> (if any). This would be:
    • null
    • &
    • |
    • |
    • |
  4. The remaining of the pattern.
    • null
    • token2
    • token2&!token3
    • token3&(token4|(!token5,12&token6))
    • token5,12

I can parse this provided that the first expression doesn't contain any <modifier>s.

^\(?(!?)([a-zA-Z0-9\-]+)\)?([&|]?)(.*)$

I am stuck at this point. I have tried using lookarounds, however I can't figure out how to ensure that the group is captured when all brackets are balanced. Is this achievable with RegEx or do I need to write code using loops etc. to do this?

2
  • +1 also, nice job, will be following this for responses Commented Jul 7, 2012 at 20:49
  • You can almost literally translate your description into a PCRE ?(DEFINE) block. However, PCRE only allows matching, not parsing. You won't get the split up result list as you'd want. (Unless you also use a recursive preg_replace_callback to collect all tokens.) Commented Jul 7, 2012 at 21:05

1 Answer 1

1

As far as I know, it is impossible.

You have a context-free grammar (EBNF is for this type of grammars - Type-2 grammars), which cannot be parsed with regular expressions (which are for regular grammars - Type-3 grammars).

http://en.wikipedia.org/wiki/Chomsky_hierarchy

As an example of the thing you cannot handle here: number of opening parantheses - you can only write one regexp for each number of these (but there can be infinite, right?), otherwise there is no way to tell if the number of matching closing parantheses is the same. There is no way to count how many chars mathed by the specific part of regexp with quantifiers (+, *, etc.)

Sign up to request clarification or add additional context in comments.

4 Comments

PHP's regexes do allow recursion; but I would recommend against it.
@Evert Yes, but this doesn't help. Example: "$m=array(); preg_match('/ < ( (?>[a-z]+) | (?R) )* > /x', '<abc<<>', $m); var_dump($m);" - this works because regexp doesn't match the whole string - no ^ and $. BUT, if we add ^ and $, it would NOT work even for '<abc<>>' as input string, because (?R) now contains ^ and $. I hope this explanation is comprehensible.
I have looked into the recursion but it seems like a really messy way to do it. Instead I am now writing my own parser to deal with this.. Thanks :)
Not correct. Just because they are called regular expressions, does not mean they are limited to regular languages. And FYI you can also invoke a particular subexpression recursively (?1) thus excluding ^ and $.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.