Extract parts of a string in Python

Question

I have to parse an input string in Python and extract certain parts from it.

The format of the string is

(xx,yyy,(aa,bb,...)) // Inner parenthesis can hold one or more characters in it

I want a function to return xx, yyy and a list containing aa, bb, ... etc.

I can of course do it by splitting on the parentheses and so on, but I want to know if there is a proper Pythonic way of extracting such info from a string.

I have this code, which works, but is there a better way to do it (without regex)?

def processInput(inputStr):
    value = inputStr.strip()[1:-1]
    parts = value.split(',', 2)
    return parts[0], parts[1], (parts[2].strip()[1:-1]).split(',')

If the inner values were quoted you could actually just eval() it, although I certainly wouldn't recommend it :)
– Michael Mrozek, Commented Jul 1, 2010 at 2:33

Alex Martelli · Accepted Answer · 2010-07-01 04:17:35Z

3

If you're allergic to REs, you could use pyparsing:

>>> import pyparsing as p
>>> ope, clo, com = map(p.Suppress, '(),')
>>> w = p.Word(p.alphas)
>>> s = ope + w + com + w + com + ope + p.delimitedList(w) + clo + clo
>>> x = '(xx,yyy,(aa,bb,cc))'
>>> list(s.parseString(x))
['xx', 'yyy', 'aa', 'bb', 'cc']

pyparsing also makes it easy to control the exact form of results (e.g. by grouping the last 3 items into their own sublist), if you want. But I think the nicest aspect is how natural (depending on how much space you want to devote to it) you can make the "grammar specification" read: an open paren, a word, a comma, a word, a comma, an open paren, a delimited list of words, two closed parentheses (if you find the assignment to s above not so easy to read, I guess it's my fault for not choosing longer identifiers;-).


2 Comments

PaulMcG Over a year ago

Alex, you silver-tongued devil! I think we probably posted within a minute of each other!

Alex Martelli Over a year ago

@Paul, yep - your post wasn't there as I started writing mine and I'm pretty sure the reverse also holds, so we must have been writing them pretty much at the same time!

PaulMcG · Accepted Answer · 2010-07-01 03:34:41Z

3

If your parenthesis nesting can be arbitrarily deep, then regexes won't do; you'll need a state machine or a parser. Pyparsing supports recursive grammars through its forward-declaration class, Forward:

from pyparsing import *

LPAR,RPAR,COMMA = map(Suppress,"(),")
nestedParens = Forward()
listword = Word(alphas) | '...'
nestedParens << Group(LPAR + delimitedList(listword | nestedParens) + RPAR)

text = "(xx,yyy,(aa,bb,...))"
results = nestedParens.parseString(text).asList()
print(results)

text = "(xx,yyy,(aa,bb,(dd,ee),ff,...))"
results = nestedParens.parseString(text).asList()
print(results)

Prints:

[['xx', 'yyy', ['aa', 'bb', '...']]]
[['xx', 'yyy', ['aa', 'bb', ['dd', 'ee'], 'ff', '...']]]
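
If you would rather not pull in a library, the "state machine or a parser" option can also be written by hand. The following is only a minimal sketch of a recursive-descent parser using the standard library; the name parse_nested and its error messages are made up for illustration, and it assumes items never contain commas or parentheses themselves:

```python
def parse_nested(text):
    """Parse a string such as '(xx,yyy,(aa,bb))' into nested lists of strings."""
    text = text.strip()
    pos = 0  # index of the next unread character

    def skip_spaces():
        nonlocal pos
        while pos < len(text) and text[pos].isspace():
            pos += 1

    def parse_group():
        nonlocal pos
        if pos >= len(text) or text[pos] != "(":
            raise ValueError(f"expected '(' at position {pos}")
        pos += 1  # consume the opening parenthesis
        items = []
        while True:
            skip_spaces()
            if pos < len(text) and text[pos] == "(":
                items.append(parse_group())  # recurse into a nested group
                skip_spaces()
            else:
                start = pos
                while pos < len(text) and text[pos] not in ",()":
                    pos += 1
                token = text[start:pos].strip()
                if not token:
                    raise ValueError(f"empty item at position {start}")
                items.append(token)
            if pos >= len(text):
                raise ValueError("unbalanced parentheses")
            if text[pos] == ",":
                pos += 1  # consume the comma and read the next item
            elif text[pos] == ")":
                pos += 1  # consume the closing parenthesis
                return items
            else:
                raise ValueError(f"unexpected {text[pos]!r} at position {pos}")

    result = parse_group()
    if pos != len(text):
        raise ValueError("unexpected text after the closing parenthesis")
    return result


print(parse_nested("(xx,yyy,(aa,bb,(dd,ee),ff,...))"))
```

Unlike the pyparsing version, this returns the outer group directly rather than wrapped in an extra list, and it raises ValueError on malformed input.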


1 Comment

Alex Martelli Over a year ago

+1 because it shows off a couple more advanced features of pyparsing while I was sticking with the very basics;-)

Anon. · Accepted Answer · 2010-07-01 02:32:56Z

2

Let's use regular expressions!

/\(([^,]+),([^,]+),\(([^)]+)\)\)/

Match against that: the first capturing group contains xx, the second contains yyy, and splitting the third on "," gives you your list.
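
In Python, that could look roughly like the sketch below. The pattern is the one above written as a raw string; the names PATTERN and extract_parts are just placeholders, and using fullmatch (so that extra text before or after the string is rejected) is an assumption about what you want:

```python
import re

# The pattern from the answer, written as a Python raw string.
PATTERN = re.compile(r"\(([^,]+),([^,]+),\(([^)]+)\)\)")


def extract_parts(text):
    """Return (first, second, list_of_rest), or None if the text does not match."""
    match = PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    first, second, rest = match.groups()
    return first, second, rest.split(",")


print(extract_parts("(xx,yyy,(aa,bb,cc))"))
```

Note that this only handles one level of inner parentheses, as in the question.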


2 Comments

randomThought Over a year ago

Using regex is definitely one good way. Is there any way to create an expression like a sort of reverse printf and use that to extract the required parts?

David Z Over a year ago

There's a sscanf function in C, but I don't know whether Python has an equivalent in its standard library. Maybe somebody's implemented it in a third-party library.

YOU · Accepted Answer · 2010-07-01 02:47:50Z

2

How about like this?

>>> import ast
>>> import re
>>> s = "(xx,yyy,(aa,bb,ccc))"
>>> x = re.sub(r"(\w+)", r'"\1"', s)
>>> x
'("xx","yyy",("aa","bb","ccc"))'
>>> ast.literal_eval(x)
('xx', 'yyy', ('aa', 'bb', 'ccc'))
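
The idea is that quoting every word turns the input into a valid Python tuple literal, which ast.literal_eval can evaluate without running arbitrary code. Wrapped up as a function it might look like the sketch below; the name parse_with_literal_eval is made up here, and it assumes the items consist only of word characters. One thing to watch: an inner group with a single item, such as (aa), evaluates to a plain string rather than a tuple, so that case is handled separately.

```python
import ast
import re


def parse_with_literal_eval(text):
    """Quote every word, then let ast.literal_eval build the nested tuple."""
    quoted = re.sub(r"(\w+)", r'"\1"', text)
    first, second, rest = ast.literal_eval(quoted)
    if isinstance(rest, str):
        # '("aa")' is just a parenthesized string, not a one-element tuple.
        rest = (rest,)
    return first, second, list(rest)


print(parse_with_literal_eval("(xx,yyy,(aa,bb,ccc))"))
```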


dave · Accepted Answer · 2010-07-01 02:43:05Z

1

I don't know that this is better, but it's a different way to do it, using a regex:

import re

def processInput(inputStr):
    # Split on commas, then strip any parentheses from each piece.
    value = [re.sub(r'\(*\)*', '', i) for i in inputStr.split(',')]
    return value[0], value[1], value[2:]

Alternatively, you could use two chained replace functions in lieu of the regex.
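
A rough sketch of that chained-replace variant, assuming the same flat input format as in the question (the name process_input is only illustrative):

```python
def process_input(input_str):
    """Strip every parenthesis, then split the remaining text on commas."""
    cleaned = input_str.replace("(", "").replace(")", "")
    values = [item.strip() for item in cleaned.split(",")]
    return values[0], values[1], values[2:]


print(process_input("(xx,yyy,(aa,bb,cc))"))
```

Like the regex version, this throws away the nesting, so it is only suitable when there is exactly one inner group at the end.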

edited Jul 1, 2010 at 2:43

answered Jul 1, 2010 at 2:36


James Harr · Accepted Answer · 2010-07-01 05:12:16Z

0

Your solution is decent (simple, efficient). You could use regular expressions to restrict the syntax if you don't trust your data source.

import re
parser_re = re.compile(r'\(([^,)]+),([^,)]+),\(([^)]+)\)')
def parse(input):
    m = parser_re.match(input)
    if m:
        first = m.group(1)
        second = m.group(2)
        rest = m.group(3).split(",")
        return (first, second, rest)
    else:
        return None

print(parse('(xx,yy,(aa,bb,cc,dd))'))
print(parse('xx,yy,(aa,bb,cc,dd)'))  # doesn't parse, returns None

# can use this to unpack the various parts.
# first,second,rest = parse(...)

Prints:

('xx', 'yy', ['aa', 'bb', 'cc', 'dd'])
None
