How To Optimize This Regular Expression Pattern

Question

I have a bunch of patterns that I need to find in a string, and they are as follows:

<dyn type="dataFrame" name="Main Map" property="reference scale"/>
<dyn type="dataFrame" name="Main Map" property="time"/>
<dyn type="page" property="name"/>
<dyn type="page" property="number"/>
<dyn type="page" property="index"/>
<dyn type="page" property="count"/>
<dyn type="page" property="attribute" field="<Field Name>" domainlookup="true"/>
<dyn type="page" property="attribute" field="<Field Name>" />

Example Usage:

Page <dyn type="page" property="index"/> of <dyn type="page" property="count"/>

which would result in

Page 1 of 15

I planned on using the regex of:

<dyn[^>]*/>

This would give:

import re

regex = re.compile("<dyn[^>]*/>")
string = """Page <dyn type="page" property="index"/> of <dyn type="page" property="count"/>"""
print(regex.findall(string))
['<dyn type="page" property="index"/>', '<dyn type="page" property="count"/>']

but I don't know if it is the best pattern to use (I'm convinced there is a better way). This finds every tag that matches the pattern, but not the properties inside the tags. Is there a way to write the regex so that I can push each tag's contents into a dictionary, with the attribute names inside the <> as keys and the text after the = sign as values?

I just think there is a better way to do this, and since I'm not a whiz-bang at regex, I figured I'd ask the community.

Thank you

@AvinashRaj - I'd like a key/value pair of the values inside the brackets. Example: {'type':'page', 'property':'index'}, etc.
– code base 5000, Commented Dec 10, 2014 at 14:51

alecxe · Accepted Answer · 2014-12-10 14:37:52Z

2

Use an XML parser, like the built-in xml.etree.ElementTree.

Example:

import xml.etree.ElementTree as ET

data = """
<root>
    <dyn type="dataFrame" name="Main Map" property="reference scale"/>
    <dyn type="dataFrame" name="Main Map" property="time"/>
    <dyn type="page" property="name"/>
    <dyn type="page" property="number"/>
    <dyn type="page" property="index">1</dyn>
    <dyn type="page" property="count">15</dyn>
    <dyn type="page" property="attribute" field="Field Name" domainlookup="true"/>
    <dyn type="page" property="attribute" field="Field Name" />
</root>
"""

root = ET.fromstring(data)
index = root.findtext('.//dyn[@property="index"]')
count = root.findtext('.//dyn[@property="count"]')

print("%s of %s" % (index, count))

Prints 1 of 15.

Note that the example is artificial since I'm not sure what your real XML input is. The idea, though, stays the same - an XML parser.

answered Dec 10, 2014 at 14:37

alecxe


2 Comments

code base 5000 Over a year ago

Thank you for this, but I don't think this gets to my point. The tags appear in this format inside any given string. So something like "I walked to my house to make <dyn type="page" property="attribute" field="TEAFIELD"/> tea. It was tasty". Since it is not pure XML, I don't think this will work. Am I wrong here?

alecxe Over a year ago

@josh1234 you are right, xml.etree.ElementTree would not extract the XML part from the text.
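
One way the two ideas could be combined, offered as a rough sketch rather than a tested solution: use the regex from the question to locate each tag in the surrounding text, then hand each tag on its own to ElementTree to read its attributes. The helper name dyn_attribs is made up for illustration. This assumes every tag is well-formed XML by itself, so a literal < inside an attribute value (as in field="<Field Name>") would not be handled.

```python
import re
import xml.etree.ElementTree as ET

# Pattern from the question: locates each self-closing <dyn .../> tag.
TAG_RE = re.compile(r"<dyn[^>]*/>")


def dyn_attribs(text):
    """Parse each embedded <dyn .../> tag on its own and return its attributes.

    Hypothetical helper; assumes each matched tag is well-formed XML.
    """
    return [dict(ET.fromstring(tag).attrib) for tag in TAG_RE.findall(text)]


sentence = ('I walked to my house to make '
            '<dyn type="page" property="attribute" field="TEAFIELD"/> tea. It was tasty')
print(dyn_attribs(sentence))
```

The regex only finds where each tag starts and ends; the attribute handling (quoting, entities) is left to the XML parser.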

vks · Accepted Answer · 2014-12-10 15:25:51Z

0

(\S+)="([^"]+)"

Try this. Grab the captures. See the demo.

https://regex101.com/r/nL5yL3/42

Make group 1 the key and group 2 the value.
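
A minimal sketch of how this pattern might be turned into one dictionary per tag, assuming double-quoted attribute values. The outer pattern that isolates each <dyn .../> tag and the helper name dyn_attributes are illustrative additions, not part of the answer; running the key/value pattern only on the text of each tag keeps words from the surrounding sentence out of the keys.

```python
import re

# Outer pattern (an assumption for this sketch): one self-closing <dyn .../> tag.
# Quoted values are consumed as a unit, so a ">" inside a value does not end
# the tag early.
TAG_RE = re.compile(r'<dyn\b(?:[^>"]|"[^"]*")*/>')

# Pattern from the answer: group 1 is the key, group 2 is the value.
ATTR_RE = re.compile(r'(\S+)="([^"]+)"')


def dyn_attributes(text):
    """Return one {attribute: value} dict per <dyn .../> tag found in text."""
    return [dict(ATTR_RE.findall(tag)) for tag in TAG_RE.findall(text)]


text = 'Page <dyn type="page" property="index"/> of <dyn type="page" property="count"/>'
for attrs in dyn_attributes(text):
    print(attrs)
```

Note that ([^"]+) requires at least one character, so an attribute written as field="" would be skipped.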

edited Dec 10, 2014 at 15:25

answered Dec 10, 2014 at 14:37

vks


2 Comments

code base 5000 Over a year ago

This is very close to what I need, but I find that if the expression is used inside a sentence, I get odd results. See: regex101.com/r/qB0jV1/1. The results show extra text on the left side of the key; is there a way to eliminate that?

code base 5000 Over a year ago

Awesome job! This does exactly what I'm looking for, I knew there was a better way! Thank you for all your help! I shall use this pattern: (\S+)="([^"]+)"

PaulMcG · Accepted Answer · 2014-12-11 03:15:45Z

Cracking XML or HTML with regular expressions can be an exercise in futility. Pyparsing includes an expression-builder helper method, makeHTMLTags, that will make very real-world tolerant parsers and will generate dict-like return values.

from pyparsing import *

dynTag,endDyn = makeHTMLTags("dyn")


sample = """
<dyn type="dataFrame" name="Main Map" property="reference scale"/>
<dyn type="dataFrame" name="Main Map" property="time"/>
<dyn type="page" property="name"/>
<dyn type="page" property="number"/>
<dyn type="page" property="index"/>
<dyn type="page" property="count"/>
<dyn type="page" property="attribute" field="<Field Name>" domainlookup="true"/>
<dyn type="page" property="attribute" field="<Field Name>" />
"""

import pprint
for dyn in dynTag.searchString(sample):
    pprint.pprint(dyn.asDict())
    if "domainlookup" in dyn:
        print("domainlookup =", dyn.domainlookup)
    print()

Parsing your sample gives:

{'empty': True,
 'name': 'Main Map',
 'property': 'reference scale',
 'startDyn': (['dyn', (['type', 'dataFrame'], {}), (['name', 'Main Map'], {}), (['property', 'reference scale'], {}), True], {'type': [('dataFrame', 1)], 'property': [('reference scale', 3)], 'tag': [('dyn', 0)], 'name': [('Main Map', 2)], 'empty': [(True, 4)]}),
 'tag': 'dyn',
 'type': 'dataFrame'}

{'empty': True,
 'name': 'Main Map',
 'property': 'time',
 'startDyn': (['dyn', (['type', 'dataFrame'], {}), (['name', 'Main Map'], {}), (['property', 'time'], {}), True], {'type': [('dataFrame', 1)], 'property': [('time', 3)], 'tag': [('dyn', 0)], 'name': [('Main Map', 2)], 'empty': [(True, 4)]}),
 'tag': 'dyn',
 'type': 'dataFrame'}

{'empty': True,
 'property': 'name',
 'startDyn': (['dyn', (['type', 'page'], {}), (['property', 'name'], {}), True], {'type': [('page', 1)], 'property': [('name', 2)], 'tag': [('dyn', 0)], 'empty': [(True, 3)]}),
 'tag': 'dyn',
 'type': 'page'}

{'empty': True,
 'property': 'number',
 'startDyn': (['dyn', (['type', 'page'], {}), (['property', 'number'], {}), True], {'type': [('page', 1)], 'property': [('number', 2)], 'tag': [('dyn', 0)], 'empty': [(True, 3)]}),
 'tag': 'dyn',
 'type': 'page'}

{'empty': True,
 'property': 'index',
 'startDyn': (['dyn', (['type', 'page'], {}), (['property', 'index'], {}), True], {'type': [('page', 1)], 'property': [('index', 2)], 'tag': [('dyn', 0)], 'empty': [(True, 3)]}),
 'tag': 'dyn',
 'type': 'page'}

{'empty': True,
 'property': 'count',
 'startDyn': (['dyn', (['type', 'page'], {}), (['property', 'count'], {}), True], {'type': [('page', 1)], 'property': [('count', 2)], 'tag': [('dyn', 0)], 'empty': [(True, 3)]}),
 'tag': 'dyn',
 'type': 'page'}

{'domainlookup': 'true',
 'empty': True,
 'field': '<Field Name>',
 'property': 'attribute',
 'startDyn': (['dyn', (['type', 'page'], {}), (['property', 'attribute'], {}), (['field', '<Field Name>'], {}), (['domainlookup', 'true'], {}), True], {'field': [('<Field Name>', 3)], 'tag': [('dyn', 0)], 'domainlookup': [('true', 4)], 'property': [('attribute', 2)], 'type': [('page', 1)], 'empty': [(True, 5)]}),
 'tag': 'dyn',
 'type': 'page'}
domainlookup = true

{'empty': True,
 'field': '<Field Name>',
 'property': 'attribute',
 'startDyn': (['dyn', (['type', 'page'], {}), (['property', 'attribute'], {}), (['field', '<Field Name>'], {}), True], {'field': [('<Field Name>', 3)], 'property': [('attribute', 2)], 'tag': [('dyn', 0)], 'empty': [(True, 4)], 'type': [('page', 1)]}),
 'tag': 'dyn',
 'type': 'page'}

Note that the resulting ParseResults structures will let you access the parsed attributes like object attributes (dyn.domainlookup) or dict keys (dyn["domainlookup"]).

The <dyn.../> tags are all the valid patterns I'm looking for now, not the way the data is presented.
@josh1234 - what does that mean? Looking at your answer to alecxe, I think pyparsing may still be a good choice. Are you saying that you just want to pick out the <dyn.../> tags for now? You can use pyparsing's originalTextFor method to give you that: just use originalTextFor(dynTag) and you will get the matched XML text. But the next thing you will do is process the XML to get the fields and properties; the answer I've posted already takes that additional step. Also, pyparsing includes a transformString method that will do in-place substitution of these XML strings with replacement values.
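
For comparison, the same in-place substitution idea can be sketched with only the standard library, using re.sub with a callback rather than pyparsing's transformString. The lookup table VALUES, the helper name fill_dyn_tags, and the tag pattern are assumptions made up for illustration; the real values would come from wherever the page data lives.

```python
import re

# Assumed tag pattern: one self-closing <dyn .../> tag, with quoted values
# consumed as a unit so a ">" inside a value does not end the tag early.
TAG_RE = re.compile(r'<dyn\b(?:[^>"]|"[^"]*")*/>')
# Key/value pattern: group 1 is the attribute name, group 2 its value.
ATTR_RE = re.compile(r'(\S+)="([^"]+)"')

# Hypothetical lookup table: (type, property) -> replacement text.
VALUES = {("page", "index"): "1", ("page", "count"): "15"}


def fill_dyn_tags(text, values):
    """Replace each <dyn .../> tag with its value; leave unknown tags untouched."""
    def replace(match):
        attrs = dict(ATTR_RE.findall(match.group(0)))
        key = (attrs.get("type"), attrs.get("property"))
        return values.get(key, match.group(0))
    return TAG_RE.sub(replace, text)


template = 'Page <dyn type="page" property="index"/> of <dyn type="page" property="count"/>'
print(fill_dyn_tags(template, VALUES))  # Page 1 of 15
```

Tags with no entry in the table are left in place, which makes missing values easy to spot in the output.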
