Find function calls with Python using regular expressions

Question

I am working with a language where the modules are defined as

<module_name> <inst_name>(.<port_name> (<net_name>)….);

or

module1 inst1 ( .input a,
.output b;
port b=a;);

I want to find all such modules while ignoring function calls.

I'm having difficulty with regex. I am looking for this

 text1 text2 ( .text3; text4 );

Note that all the spaces except the one between text1 and text2 are optional and might be newlines instead of spaces. text3 and text4 can span multiple lines, but all are of the form

text3 ->
.blah1 (blah2),
.blah3 (blah4)

text4 ->
blah1 blah2=xyz;
blah3 blah4=qwe;

I am trying to do

 re.split(r"^[a-zA-Z]*\s[a-zA-Z]*\s?\n?\([a-zA-Z]*\s?\n?;[a-zA-Z]*\);", data)

It doesn't work, though. It just grabs everything. How do I fix it? Thanks! I do need to grab everything individually eventually (modules/instances/ports/nets). I think I can split it once the regex is working.

Python's regex engine can't match nested structures (and you have nested parentheses in your data). You should probably implement a parser for this anyway.
– Lucas Trzesniewski, Commented Feb 17, 2015 at 20:48

larsks · Accepted Answer · 2015-02-17 21:47:39Z

I think you need to write a parser that understands enough of the language to at least canonicalize it before you try extracting information. You could write a simple parser by hand, or you could use a parsing framework such as PLY or others of that ilk.

To give you a more concrete idea about what I'm suggesting, consider the following code, which defines a parse_data function that, given the contents of a file, will yield a series of tokens recognized in that file:

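import re

# Token classes: name -> regular expression (raw strings keep the backslashes literal).
tokens = {
    'lparen': r'\(',
    'rparen': r'\)',
    'comma': r',',
    'semicolon': r';',
    'whitespace': r'\s+',
    'equals': r'=',
    'identifier': r'[.\w]+',
}

tokens = {k: re.compile(v) for k, v in tokens.items()}

def parse_data(data):
    """Yield (token_name, matched_text) pairs for the contents of a file."""
    while data:
        for tn, tv in tokens.items():
            mo = tv.match(data)
            if mo:
                data = data[mo.end():]
                yield tn, mo.group()
                break
        else:
            # No token matched: fail loudly instead of looping forever.
            raise ValueError('unrecognised input: %r' % data[:20])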

Using this, you could write something that would put your sample input into canonical form:

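with open('inputfile') as fd:
    data = fd.read()

last_token = (None, None)
for tn, tv in parse_data(data):
    if tn == 'whitespace' and last_token[0] != 'semicolon':
        # Collapse any run of whitespace into a fixed separator.
        print(' ', end=' ')
    elif tn == 'whitespace':
        # Drop whitespace (including newlines) that follows a semicolon.
        pass
    elif tn == 'semicolon' and last_token[0] == 'rparen':
        # ");" closes a statement, so end the output line here.
        print(tv)
    else:
        print(tv, end=' ')

    last_token = (tn, tv)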

Given input like this:

module1 inst1 ( .input a,
.output b;
port b=a;);
module2 inst2 ( .input a, .output b; port b=a;);

module3 inst3 ( .input a, .output b;


port b=a;);

The above code would yield:

module1   inst1   (   .input   a ,   .output   b ; port   b = a ; ) ;
module2   inst2   (   .input   a ,   .output   b ; port   b = a ; ) ;
module3   inst3   (   .input   a ,   .output   b ; port   b = a ; ) ;

Which, because it is in standard form, would be much more amenable to extracting information via simple pattern matching.
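
As a rough illustration of that kind of pattern matching, here is a minimal sketch that pulls the module name, instance name, and port/net pairs out of one canonical line. The function name extract_instance and both patterns are hypothetical, not part of the code above, and they assume one statement per line ending in ") ;" with ports written as ".port net":

```python
import re

# Hypothetical patterns for one canonical line:
# module name, instance name, then everything between the outer parentheses.
STATEMENT_RE = re.compile(r"^(\w+)\s+(\w+)\s*\((.*)\)\s*;\s*$")
# A port entry: a dot, the port name, whitespace, then the net name.
PORT_RE = re.compile(r"\.(\w+)\s+(\w+)")

def extract_instance(line):
    """Return (module, instance, [(port, net), ...]) or None if the line does not match."""
    mo = STATEMENT_RE.match(line)
    if mo is None:
        return None
    module, instance, body = mo.groups()
    return module, instance, PORT_RE.findall(body)

line = "module1   inst1   (   .input   a ,   .output   b ; port   b = a ; ) ;"
print(extract_instance(line))
```

A line that does not have the "name name ( ... ) ;" shape returns None, which is one way to skip anything that is not a module instance.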

Note that while this code relies on reading the entire source file into memory first, you could fairly easily write code that parses a file in fragments if you were concerned about memory utilization.
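
One way to do that, sketched here under the assumption that no token other than whitespace can span a line break, is to tokenize one line at a time. The names TOKEN_RE and tokenize_lines are hypothetical; the token classes are the ones from the tokens table, combined into a single pattern with named groups:

```python
import io
import re

# The same token classes as above, as one pattern with named groups.
TOKEN_RE = re.compile(r"""
    (?P<lparen>\()
  | (?P<rparen>\))
  | (?P<comma>,)
  | (?P<semicolon>;)
  | (?P<whitespace>\s+)
  | (?P<equals>=)
  | (?P<identifier>[.\w]+)
""", re.VERBOSE)

def tokenize_lines(fd):
    """Yield (token_name, text) pairs, reading the file object one line at a time."""
    for lineno, line in enumerate(fd, 1):
        pos = 0
        while pos < len(line):
            mo = TOKEN_RE.match(line, pos)
            if mo is None:
                # Stop on input that no token class accounts for.
                raise ValueError("unrecognised input at line %d: %r"
                                 % (lineno, line[pos:pos + 20]))
            yield mo.lastgroup, mo.group()
            pos = mo.end()

# Usage with an in-memory file; an open() file object works the same way.
for name, text in tokenize_lines(io.StringIO("module1 inst1 ( .input a,\n.output b;\n")):
    if name != "whitespace":
        print(name, text)
```

A run of whitespace that crosses a line break arrives as several whitespace tokens rather than one, so any code that inspects the previous token should skip whitespace first.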

3 Comments

Illusionist Over a year ago

The code seems to get stuck and never returns; the file is about 1000 lines.

larsks Over a year ago

Well, (a) this was meant as a suggestion of a direction in which you should investigate, not as a complete solution, and (b) it was only tested against the sample input you provided. So it's not too surprising that something isn't working; your input file undoubtedly has content not accounted for in your sample input.

Illusionist Over a year ago

Thanks larsks, the problem comes from having multiple bracketed lines in the function call: module inst4( .input a (output b), .input2 (outputc), .input3 (outputd)); Trying to resolve it now, thanks for your help!
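
For the nested-parentheses case raised in the last comment, one possible direction (a sketch, not part of the original answer) is to count parenthesis depth while tokenizing and to treat a semicolon as the end of a statement only at depth 0. The names TOKEN_RE and statements are hypothetical, and the token classes are assumed to be the same as in the answer:

```python
import re

# Whitespace, single-character punctuation, or an identifier (same classes as the answer).
TOKEN_RE = re.compile(r"\s+|[()=,;]|[.\w]+")

def statements(data):
    """Yield each top-level statement as a list of non-whitespace tokens."""
    depth = 0
    current = []
    pos = 0
    while pos < len(data):
        mo = TOKEN_RE.match(data, pos)
        if mo is None:
            raise ValueError("unrecognised input: %r" % data[pos:pos + 20])
        pos = mo.end()
        tok = mo.group()
        if tok.isspace():
            continue
        current.append(tok)
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
        elif tok == ";" and depth == 0:
            # Only a semicolon outside all parentheses ends a statement.
            yield current
            current = []

src = "module inst4( .input a (output b), .input2 (outputc), .input3 (outputd));"
for stmt in statements(src):
    print(" ".join(stmt))
```

Because depth is tracked explicitly, semicolons and parenthesized nets inside the port list stay within the same statement.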
