I think you need to write a parser that understands enough of the language to at least canonicalize it before you try extracting information. You could write a simple parser by hand, or you could use a parsing framework such as PLY or others of that ilk.
To give you a more concrete idea about what I'm suggesting, consider
the following code, which defines a parse_data function that, given
the contents of a file, will yield a series of tokens recognized in
that file:
import re
tokens = {
'lparen': '\(',
'rparen': '\)',
'comma': ',',
'semicolon': ';',
'whitespace': '\s+',
'equals': '=',
'identifier': '[.\d\w]+',
}
tokens = dict((k, re.compile(v)) for k,v in tokens.items())
def parse_data(data):
while data:
for tn, tv in tokens.items():
mo = tv.match(data)
if mo:
matched = data[mo.start():mo.end()]
data = data[mo.end():]
yield tn, matched
Using this, you could write something that would put your sample input
into canonical form:
with open('inputfile') as fd:
data = fd.read()
last_token = (None, None)
for tn, tv in parse(data):
if tn == 'whitespace' and last_token[0] != 'semicolon':
print ' ',
elif tn == 'whitespace':
pass
elif tn == 'semicolon' and last_token[0] == 'rparen':
print tv
else:
print tv,
last_token = (tn, tv)
Given input like this:
module1 inst1 ( .input a,
.output b;
port b=a;);
module2 inst2 ( .input a, .output b; port b=a;);
module3 inst3 ( .input a, .output b;
port b=a;);
The above code would yield:
module1 inst1 ( .input a , .output b ; port b = a ; ) ;
module2 inst2 ( .input a , .output b ; port b = a ; ) ;
module3 inst3 ( .input a , .output b ; port b = a ; ) ;
Which, because it is in standard form, would be much more amendable to
extracting information via simple pattern matching.
Note that while this code relies on reading the entire source file
into memory first, you could fairly easily write code that you parse a
file in fragments if you were concerned about memory utilization.