Inconsistency in string parsing in Python

Question

I'm trying to parse strings in Python. I have posted a couple of questions on Stack Overflow, and I am now trying to combine the different ways of parsing the strings I am working with into a single function.

Here's a code snippet that works fine in isolation and parses the following two string formats.

from __future__ import generators
from pprint import pprint
s2="<one><two><three> an.attribute ::"
s1="< one > < two > < three > here's one attribute < six : 10.3 > < seven : 8.5 > <   eight :   90.1 > < nine : 8.7 >"
def parse(s):
    for t in s.split('<'):
        for u in t.strip().split('>',1):
            if u.strip(): yield u.strip()
pprint(list(parse(s1)))
pprint(list(parse(s2)))

Here's the output that I get. It's in the format I need, with each attribute stored at its own index.

['one',
 'two',
 'three',
 "here's one attribute",
 'six : 10.3',
 'seven : 8.5',
 'eight : 90.1',
 'nine : 8.7']
['one', 'two', 'three', 'an.attribute ::']

After that, I tried to incorporate the same code into a function that can parse four string formats, but for some reason it doesn't work there and I can't figure out why.

Here's the incorporated code in its entirety.

from __future__ import generators
import re
import string
from pprint import pprint
temp=[]
y=[]
s2="< one > < two > < three > an.attribute ::"
s1="< one > < two > < three > here's an attribute < four : 6.5 > < five : 7.5 > < six : 8.5 > < seven : 9.5 >"
t2="< one > < two > < three > < four : 220.0 > < five : 6.5 > < six : 7.5 > < seven : 8.5 > < eight : 9.5 > < nine : 6 -  7 >"
t3="One : two :  three : four  Value  : five  Value  : six  Value : seven  Value :  eight  Value :"
def parse(s):
    c=s.count('<')
    print c
    if c==9:
        res = re.findall('< (.*?) >', s)
        return res
    elif (c==7|c==3):
        temp=parsing(s)
        pprint(list(temp))
        #pprint(list(parsing(s)))
    else: 
        res=s.split(' : ')
        res = [item.strip() for item in s.split(':')]
        return res
def parsing(s):
    for t in s.split(' < '):
        for u in t.strip().split('>',1):
            if u.strip(): yield u.strip()
    pprint(list((s)))

Now when I run the code and call parse(s1), I get the following output:

7
["< one > < two > < three > here's an attribute < four",
 '6.5 > < five',
 '7.5 > < six',
 '8.5 > < seven',
 '9.5 >']

Similarly, on calling parse(s2), I get:

3
['< one > < two > < three > an.attribute', '', '']

Why is there an inconsistency in splitting the string while it is being parsed? I'm using the same code in both places.

Could someone help me figure out why this is happening? :)

First thing that strikes me - what version of Python are you using!? from __future__ import generators implies ancient
– Jon Clements, Commented Mar 14, 2013 at 10:00
I'm using PyScripter 2.7. What do you suggest that I use instead? :) @JonClements
– Anon, Commented Mar 14, 2013 at 10:05
You do not need to use the from __future__ import generators line at all then.
– Martijn Pieters, Commented Mar 14, 2013 at 10:07

Martijn Pieters · Accepted Answer · 2013-03-14 10:08:42Z

2

You are using the bitwise OR operator | where you should be using the boolean operator or:

elif (c==7|c==3):

should be

elif c==7 or c==3:

or perhaps:

elif c in (3, 7):

which is faster to boot.
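
For illustration, here is a minimal sketch of how the dispatch could look with the membership test in place. It is written for Python 3 (the question uses Python 2.7) and makes two assumptions that go beyond the one-line fix: the branch returns the parsed list instead of only printing it, and parsing splits on '<' as in the standalone snippet from the question rather than on ' < '.

```python
import re


def parsing(s):
    """Yield the non-empty pieces found inside and after each <...> group."""
    for chunk in s.split('<'):
        for piece in chunk.strip().split('>', 1):
            piece = piece.strip()
            if piece:
                yield piece


def parse(s):
    c = s.count('<')
    if c == 9:
        return re.findall(r'< (.*?) >', s)
    elif c in (3, 7):  # was: (c==7|c==3), which is never true
        # Assumption: return the list rather than only pretty-printing it.
        return list(parsing(s))
    else:
        return [item.strip() for item in s.split(':')]


s1 = ("< one > < two > < three > here's an attribute "
      "< four : 6.5 > < five : 7.5 > < six : 8.5 > < seven : 9.5 >")
s2 = "< one > < two > < three > an.attribute ::"

print(parse(s1))
print(parse(s2))  # ['one', 'two', 'three', 'an.attribute ::']
```

With the condition fixed, both s1 (seven '<' characters) and s2 (three) reach the generator-based branch instead of falling through to the split on ':'.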

Because | binds more tightly than the comparison operators (while or binds more loosely), the original condition is parsed as the chained comparison c == (7 | c) == 3, that is, c == (7 | c) and (7 | c) == 3. The bitwise OR 7 | c can never be equal to both c and 3, so the test always returns False:

>>> c = 7
>>> (c==7|c==3)
False
>>> c = 3
>>> (c==7|c==3)
False
>>> c==7 or c==3
True
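
The parse described above can be checked directly. The following is a small sketch for Python 3 with hypothetical helper names (buggy, spelled_out, fixed); it compares the original condition with its spelled-out chained form and asks the ast module how the expression is grouped.

```python
import ast


def buggy(c):
    # The condition exactly as written in the question.
    return c == 7 | c == 3


def spelled_out(c):
    # What the chained comparison expands to.
    return c == (7 | c) and (7 | c) == 3


def fixed(c):
    return c == 7 or c == 3


for c in (3, 7, 9):
    print(c, 7 | c, buggy(c), spelled_out(c), fixed(c))

# The syntax tree shows one chained comparison whose middle operand is 7 | c.
tree = ast.parse("c == 7 | c == 3", mode="eval")
print(type(tree.body).__name__)                 # Compare
print(type(tree.body.comparators[0]).__name__)  # BinOp
```

For c = 3 the middle operand 7 | c is 7, so c == 7 fails; for c = 7 the first comparison holds but 7 == 3 fails. Either way the buggy form is False, while the or version is True for both values.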

edited Mar 14, 2013 at 10:08

answered Mar 14, 2013 at 10:01


2 Comments

Anon Over a year ago

Yes that's what was wrong. Thanks so much, I'm new to Python and I'm still finding my way around it =) @Martijn Pieters

Jon Clements Over a year ago

@Paulie It's also worth noting that a chain such as c==7 or c==3 or c==9 or c==2 can more Pythonically be written as c in (7, 3, 9, 2), which is clearer and makes adding or removing conditions easier
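
As a rough illustration of that last point, the sketch below (Python 3, with hypothetical names HANDLED_COUNTS, matches_with_or and matches_with_in) checks that the chained or form and the membership test agree, and keeps the accepted values in one tuple so that changing them is a one-line edit.

```python
# Hypothetical tuple of '<' counts that should take the same branch.
HANDLED_COUNTS = (7, 3, 9, 2)


def matches_with_or(c):
    # Every value has to be repeated in the condition itself.
    return c == 7 or c == 3 or c == 9 or c == 2


def matches_with_in(c):
    # The accepted values live in one place.
    return c in HANDLED_COUNTS


# Both forms agree for every count tried here.
for c in range(12):
    assert matches_with_or(c) == matches_with_in(c)

print([c for c in range(12) if matches_with_in(c)])  # [2, 3, 7, 9]
```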
