
I have some Fortran source code (the language is almost irrelevant) and I want to parse out the function names and their arguments.

For example, using

(\w+)\([^\(\)]+\)

with

a(b(1 + 2 * 2), c(3,4))

I get the following (as expected):

b, 1 + 2 * 2
c, 3,4

whereas I need

a, b(1 + 2 * 2), c(3,4)
b, 1 + 2 * 2
c, 3,4

Any suggestions?

Thanks for your time...

1
  • I don't think you will be able to do this using only regular expressions. As their name suggests, they can only handle regular languages; I'm not up enough on language theory to know where Fortran falls, but my guess is that you'll need a skeleton parser to do this right. Commented Mar 12, 2009 at 8:55

5 Answers

2

It can be done with regular expressions: use them to tokenize the string, then work with the tokens. For example, see re.Scanner. Alternatively, just use pyparsing.
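A rough sketch of the tokenize-then-process idea, in case it helps. The token kinds (NAME, OPEN, CLOSE, OTHER) and the find_calls helper are names made up for this illustration, and since re.Scanner is undocumented, treat its exact behaviour as an assumption. The sketch also assumes a call is a name immediately followed by "(" and that no parentheses appear inside string literals or comments.

```python
import re

# Lexicon for re.Scanner: (pattern, action) pairs. Each action returns a
# (kind, text) token. The kinds are illustrative, not part of the re module.
# Together the four patterns cover every character, so nothing is skipped
# and joining token texts reproduces the original source.
_scanner = re.Scanner([
    (r"[A-Za-z_]\w*",  lambda s, t: ("NAME", t)),
    (r"\(",            lambda s, t: ("OPEN", t)),
    (r"\)",            lambda s, t: ("CLOSE", t)),
    (r"[^A-Za-z_()]+", lambda s, t: ("OTHER", t)),
])


def find_calls(source):
    """Return (name, argument_text) pairs, outermost call first."""
    tokens, _remainder = _scanner.scan(source)
    calls = []
    stack = []  # one entry per open "(": (name or None, first arg token, slot)
    for i, (kind, _text) in enumerate(tokens):
        if kind == "OPEN":
            # A "(" belongs to a call only if a NAME token comes right before it.
            is_call = i > 0 and tokens[i - 1][0] == "NAME"
            name = tokens[i - 1][1] if is_call else None
            stack.append((name, i + 1, len(calls)))
            if is_call:
                calls.append(None)  # reserve a slot so outer calls stay first
        elif kind == "CLOSE":
            if not stack:
                raise ValueError("unbalanced ')'")
            name, start, slot = stack.pop()
            if name is not None:
                args = "".join(text for _, text in tokens[start:i])
                calls[slot] = (name, args)
    if stack:
        raise ValueError("unbalanced '('")
    return calls


for name, args in find_calls("a(b(1 + 2 * 2), c(3,4))"):
    print(f"{name}, {args}")
```

The regexes only do the tokenizing here; the nesting is tracked by the explicit stack, which is the part a single regular expression cannot do.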


3 Comments

Right, you can tokenize and then drive your own state machine with the tokens, but that's technically not just using regular expressions.
Thing is, I didn't see "just regex" in the question, only "regex". The re module includes a Scanner, and has for ages, not that it's documented (bleh).
Oooh cool, undocumented features! +1 from me for telling me to look. :-)
2

This is a nonlinear grammar: you need to be able to recurse on a set of allowed rules. Look at pyparsing to do simple CFG (Context Free Grammar) parsing via readable specifications.

It's been a while since I've written out CFGs, and I'm probably rusty, so I'll refer you to the Python EBNF to get an idea of how you can construct one for a subset of a language syntax.

Edit: If the example will always be simple, you can code a small state machine class/function that iterates over the tokenized input string, as @Devin Jeanpierre suggests.


2

You can take a look at PLY (Python Lex-Yacc). In my opinion it's very simple to use and well documented, and it comes with a calculator example which could be a good starting point.


2

I don't think this is a job for regular expressions... they can't really handle nested patterns.

This is because regexes are compiled into FSMs (Finite State Machines). To parse arbitrarily nested expressions, you can't use an FSM, because you would need infinitely many states to keep track of the arbitrary nesting. Also see this SO thread.


1

You can't do this with regular expressions alone, because the problem is recursive. You should first match the outermost function and its arguments, print the name of the function, then do the same (match the function name, then its arguments) within its arguments. Regexes alone are not enough.
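The outermost-first recursion described above could look roughly like this. It is only a sketch: extract_calls, matching_paren and CALL_START are hypothetical names, a call is assumed to be an identifier immediately followed by "(", and parentheses inside string literals or comments are not handled.

```python
import re

# A regex finds where a call starts; a depth counter finds where it ends.
CALL_START = re.compile(r"([A-Za-z_]\w*)\(")


def matching_paren(text, open_pos):
    """Return the index of the ')' that closes the '(' at open_pos."""
    depth = 0
    for i in range(open_pos, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced parentheses")


def extract_calls(text):
    """Return (name, argument_text) pairs, outermost call first."""
    found = []
    pos = 0
    while True:
        m = CALL_START.search(text, pos)
        if m is None:
            break
        close = matching_paren(text, m.end() - 1)
        args = text[m.end():close]
        found.append((m.group(1), args))
        # Do the same with the arguments of the call just matched.
        found.extend(extract_calls(args))
        # Continue after this call to pick up sibling calls.
        pos = close + 1
    return found


for name, args in extract_calls("a(b(1 + 2 * 2), c(3,4))"):
    print(f"{name}, {args}")
# a, b(1 + 2 * 2), c(3,4)
# b, 1 + 2 * 2
# c, 3,4
```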
