1

I have to parse an input string in Python and extract certain parts from it.

The format of the string is:

(xx,yyy,(aa,bb,...)) // the inner parentheses can hold one or more values

I want a function that returns xx, yyy, and a list containing aa, bb, ... etc.

I can of course do it by splitting on the parentheses and so on, but I want to know if there is a proper Pythonic way of extracting such information from a string.

I have this code, which works, but is there a better way to do it (without regex)?

def processInput(inputStr):
    value = inputStr.strip()[1:-1]
    parts = value.split(',', 2)
    return parts[0], parts[1], (parts[2].strip()[1:-1]).split(',')

1 Comment

If the inner values were quoted you could actually just eval() it, although I certainly wouldn't recommend it :) Commented Jul 1, 2010 at 2:33

6 Answers

3

If you're allergic to REs, you could use pyparsing:

>>> import pyparsing as p
>>> ope, clo, com = map(p.Suppress, '(),')
>>> w = p.Word(p.alphas)
>>> s = ope + w + com + w + com + ope + p.delimitedList(w) + clo + clo
>>> x = '(xx,yyy,(aa,bb,cc))'
>>> list(s.parseString(x))
['xx', 'yyy', 'aa', 'bb', 'cc']

pyparsing also makes it easy to control the exact form of the results (e.g. by grouping the last 3 items into their own sublist), if you want. But I think the nicest aspect is how natural you can make the "grammar specification" read (depending on how much space you want to devote to it): an open paren, a word, a comma, a word, a comma, an open paren, a delimited list of words, two close parens. (If you find the assignment to s above not so easy to read, I guess it's my fault for not choosing longer identifiers ;-)


2 Comments

Alex, you silver-tongued devil! I think we probably posted within a minute of each other!
@Paul, yep - your post wasn't there as I started writing mine and I'm pretty sure the reverse also holds, so we must have been writing them pretty much at the same time!
3

If your parenthesis nesting can be arbitrarily deep, then regexen won't do; you'll need a state machine or a parser. Pyparsing supports recursive grammars using the forward-declaration class Forward:

from pyparsing import Forward, Group, Suppress, Word, alphas, delimitedList

LPAR, RPAR, COMMA = map(Suppress, "(),")
nestedParens = Forward()
# an item is either a plain word or the literal '...'
listword = Word(alphas) | '...'
# '<<=' fills in the forward declaration, which lets the grammar refer to itself
nestedParens <<= Group(LPAR + delimitedList(listword | nestedParens) + RPAR)

text = "(xx,yyy,(aa,bb,...))"
results = nestedParens.parseString(text).asList()
print(results)

text = "(xx,yyy,(aa,bb,(dd,ee),ff,...))"
results = nestedParens.parseString(text).asList()
print(results)

Prints:

[['xx', 'yyy', ['aa', 'bb', '...']]]
[['xx', 'yyy', ['aa', 'bb', ['dd', 'ee'], 'ff', '...']]]
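
If you would rather not pull in a third-party package, the "state machine or a parser" idea can also be sketched by hand with the standard library only. This is a minimal recursive-descent sketch, not part of pyparsing; the name parse_nested and the error messages are made up for illustration, and it assumes items never contain parentheses or commas.

```python
def parse_nested(text):
    """Parse '(a,b,(c,d))' into nested lists: ['a', 'b', ['c', 'd']]."""
    pos = 0

    def parse_list():
        nonlocal pos
        if pos >= len(text) or text[pos] != "(":
            raise ValueError(f"expected '(' at position {pos}")
        pos += 1  # consume '('
        items = []
        while True:
            if pos < len(text) and text[pos] == "(":
                # a nested list: recurse
                items.append(parse_list())
            else:
                # a plain item: read up to the next delimiter
                start = pos
                while pos < len(text) and text[pos] not in "(),":
                    pos += 1
                if pos == start:
                    raise ValueError(f"empty item at position {pos}")
                items.append(text[start:pos])
            if pos >= len(text):
                raise ValueError("unbalanced parentheses")
            if text[pos] == ",":
                pos += 1  # another item follows
            elif text[pos] == ")":
                pos += 1  # this list is complete
                return items
            else:
                raise ValueError(f"unexpected {text[pos]!r} at position {pos}")

    result = parse_list()
    if pos != len(text):
        raise ValueError(f"trailing text at position {pos}")
    return result


print(parse_nested("(xx,yyy,(aa,bb,(dd,ee),ff,...))"))
```

Unlike the Group-based grammar above, this returns the outer list directly rather than wrapping it in one more list.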

1 Comment

+1 because it shows off a couple more advanced features of pyparsing while I was sticking with the very basics;-)
2

Let's use regular expressions!

/\(([^,]+),([^,]+),\(([^)]+)\)\)/

Match against that: the first capturing group contains xx, the second contains yyy, and if you split the third on "," you have your list.
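
In Python, that could look roughly like the sketch below. The function name process_input and the choice to raise ValueError on a mismatch are my own assumptions, not something the question asked for.

```python
import re

# Same pattern as above, written as a Python raw string
PATTERN = re.compile(r"\(([^,]+),([^,]+),\(([^)]+)\)\)")


def process_input(text):
    # fullmatch requires the whole string to fit the pattern
    m = PATTERN.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"unexpected format: {text!r}")
    first, second, inner = m.groups()
    return first, second, inner.split(",")


print(process_input("(xx,yyy,(aa,bb,cc))"))
```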

2 Comments

Using a regex is definitely one good way. Is there any way to create an expression like a sort of reverse printf and use that to extract the required parts?
There's a sscanf function in C, but I don't know whether Python has an equivalent in its standard library. Maybe somebody's implemented it in a third-party library.
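
As far as I know the standard library has no direct sscanf equivalent, although the re documentation has a section on simulating scanf() with regular expressions. A "reverse printf" can be approximated by turning a template into a regex with named groups. The helper compile_template and the {name} placeholder syntax below are hypothetical, just a sketch of the idea, and it assumes the captured values never contain parentheses.

```python
import re


def compile_template(template):
    """Turn a template such as '({first},{rest})' into a compiled regex."""
    pattern = ""
    # Splitting on a capturing group keeps the field names in the result:
    # even indexes are literal text, odd indexes are field names.
    for i, piece in enumerate(re.split(r"\{(\w+)\}", template)):
        if i % 2:
            # a field: capture as little as possible, never a parenthesis
            pattern += f"(?P<{piece}>[^()]+?)"
        else:
            # literal text: escape it so '(' and ')' match themselves
            pattern += re.escape(piece)
    return re.compile(pattern)


scanner = compile_template("({first},{second},({rest}))")
m = scanner.fullmatch("(xx,yyy,(aa,bb,cc))")
if m:
    print(m.group("first"), m.group("second"), m.group("rest").split(","))
```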
2

How about like this?

>>> import ast
>>> import re
>>>
>>> s="(xx,yyy,(aa,bb,ccc))"
>>> x=re.sub(r"(\w+)",r'"\1"',s)
# '("xx","yyy",("aa","bb","ccc"))'
>>> ast.literal_eval(x)
('xx', 'yyy', ('aa', 'bb', 'ccc'))
>>>

1

I don't know that this is better, but it's a different way to do it, using the regex approach previously suggested:

import re

def processInput(inputStr):
    # remove every parenthesis from each comma-separated piece
    value = [re.sub(r'[()]', '', i) for i in inputStr.split(',')]
    return value[0], value[1], value[2:]

Alternatively, you could use two chained replace functions in lieu of the regex.
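
A minimal sketch of that chained-replace variant follows; the name process_input is just for illustration. Like the regex version, it throws away the nesting information, so it only suits the flat format from the question.

```python
def process_input(input_str):
    # Drop both kinds of parenthesis, then split on commas
    cleaned = input_str.replace("(", "").replace(")", "")
    values = cleaned.split(",")
    return values[0], values[1], values[2:]


print(process_input("(xx,yyy,(aa,bb,cc))"))
```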

0

Your solution is decent (simple, efficient). You could use regular expressions to restrict the syntax if you don't trust your data source.

import re
parser_re = re.compile(r'\(([^,)]+),([^,)]+),\(([^)]+)\)')
def parse(input):
    m = parser_re.match(input)
    if m:
        first = m.group(1)
        second = m.group(2)
        rest = m.group(3).split(",")
        return (first, second, rest)
    else:
        return None

print(parse('(xx,yy,(aa,bb,cc,dd))'))
print(parse('xx,yy,(aa,bb,cc,dd)'))  # doesn't parse, returns None

# can use this to unpack the various parts.
# first,second,rest = parse(...)

Prints:

('xx', 'yy', ['aa', 'bb', 'cc', 'dd'])
None
