I have a set of tag patterns that I need to find in a string. They are as follows:

<dyn type="dataFrame" name="Main Map" property="reference scale"/>
<dyn type="dataFrame" name="Main Map" property="time"/>
<dyn type="page" property="name"/>
<dyn type="page" property="number"/>
<dyn type="page" property="index"/>
<dyn type="page" property="count"/>
<dyn type="page" property="attribute" field="<Field Name>" domainlookup="true"/>
<dyn type="page" property="attribute" field="<Field Name>" />

Example Usage:

Page <dyn type="page" property="index"/> of <dyn type="page" property="count"/>

which would result in

Page 1 of 15

I planned on using the regex of:

<dyn[^>]*/>

This would give:

import re

regex = re.compile("<dyn[^>]*/>")
string = """Page <dyn type="page" property="index"/> of <dyn type="page" property="count"/>"""
# findall returns every non-overlapping match as a list of strings
print(regex.findall(string))
[u'<dyn type="page" property="index"/>', u'<dyn type="page" property="count"/>']

but I don't know if it is the best pattern to use (I'm convinced there is a better way). It finds every tag, but not the properties inside the tags. Is there a way to write the regex so that I can push the contents of each tag into a dictionary, with the names before each = sign as keys and the quoted text after it as values?

I think there is a better way to do this, and since I'm no whiz at regex, I figured I'd ask the community.

Thank you

2 Comments
  • What's your expected output? Commented Dec 10, 2014 at 14:36
  • @AvinashRaj - I'd like a Key/Value pair of the values inside the brackets. Example: {'type':'page', 'property':'index'}, etc... Commented Dec 10, 2014 at 14:51

3 Answers

Use an XML parser, such as the built-in xml.etree.ElementTree.

Example:

import xml.etree.ElementTree as ET

data = """
<root>
    <dyn type="dataFrame" name="Main Map" property="reference scale"/>
    <dyn type="dataFrame" name="Main Map" property="time"/>
    <dyn type="page" property="name"/>
    <dyn type="page" property="number"/>
    <dyn type="page" property="index">1</dyn>
    <dyn type="page" property="count">15</dyn>
    <dyn type="page" property="attribute" field="Field Name" domainlookup="true"/>
    <dyn type="page" property="attribute" field="Field Name" />
</root>
"""

root = ET.fromstring(data)
index = root.findtext('.//dyn[@property="index"]')
count = root.findtext('.//dyn[@property="count"]')

print("%s of %s" % (index, count))

Prints 1 of 15.

Note that the example is artificial since I'm not sure what your real XML input is. The idea, though, stays the same - an XML parser.
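
When the tags are embedded in ordinary text rather than in a well-formed XML document, one possible hybrid is to locate each tag with a regex and hand only that fragment to ElementTree. The following is a minimal sketch, not a definitive implementation: the helper name `dyn_attributes` is hypothetical, and it assumes attribute values never contain `<` or `>` (so the `field="<Field Name>"` placeholder form would need escaping first).

```python
import re
import xml.etree.ElementTree as ET

# Assumption: no '<' or '>' appears inside attribute values.
TAG_RE = re.compile(r'<dyn\b[^<>]*/>')

def dyn_attributes(text):
    """Return one attribute dict per self-closing <dyn .../> tag in text."""
    # Each matched tag is a tiny well-formed XML document on its own,
    # so ElementTree can parse it and expose the attributes via .attrib.
    return [dict(ET.fromstring(tag).attrib) for tag in TAG_RE.findall(text)]

sentence = ('I walked to my house to make '
            '<dyn type="page" property="attribute" field="TEAFIELD"/> tea.')
attrs = dyn_attributes(sentence)
print(attrs)
```

The regex only finds the tag boundaries; the parser does the attribute handling, including quoting and entity decoding.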

2 Comments

Thank you for this, but I don't think it gets to my point. The tags can appear inside any given string, for example "I walked to my house to make <dyn type="page" property="attribute" field="TEAFIELD"/> tea. It was tasty". Since the input is not pure XML, I don't think this will work. Am I wrong here?
@josh1234 you are right, xml.etree.ElementTree would not extract the XML part from the text.
0
(\S+)="([^"]+)"

Try this. Grab the captures. See the demo:

https://regex101.com/r/nL5yL3/42

Use group 1 as the key and group 2 as the value.
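
As a rough sketch of how the captures could be turned into the dictionaries the question asks for, the code below first matches each whole tag and then pulls the name="value" pairs out of it. It is a variation on the pattern above rather than the exact expression: it uses `\w+` instead of `\S+` for the key so that neighbouring non-space text cannot leak into it, and it allows empty values. The names `TAG_RE`, `ATTR_RE` and `parse_dyn_tags` are made up for illustration.

```python
import re

# One whole <dyn .../> tag; quoted values may contain '<' or '>'.
TAG_RE = re.compile(r'<dyn\b((?:\s+\w+="[^"]*")*)\s*/>')
# A single name="value" pair inside a tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_dyn_tags(text):
    """Return one {attribute: value} dict per <dyn .../> tag found in text."""
    # findall with two groups yields (key, value) tuples, which dict() accepts.
    return [dict(ATTR_RE.findall(m.group(1))) for m in TAG_RE.finditer(text)]

text = 'Page <dyn type="page" property="index"/> of <dyn type="page" property="count"/>'
tags = parse_dyn_tags(text)
print(tags)
```

Restricting the attribute search to text that is already known to be inside a tag means quoted strings elsewhere in the sentence are ignored.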

2 Comments

This is very close to what I need, but if I use the expression on a tag embedded in a sentence, I get odd results. See: regex101.com/r/qB0jV1/1 . The results show extra text on the left side of the key. Is there a way to eliminate that?
Awesome job! This does exactly what I'm looking for, I knew there was a better way! Thank you for all your help! I shall use this pattern: (\S+)="([^"]+)"
0

Cracking XML or HTML with regular expressions can be an exercise in futility. Pyparsing includes an expression-builder helper method, makeHTMLTags, that will make very real-world tolerant parsers and will generate dict-like return values.

from pyparsing import makeHTMLTags

# makeHTMLTags returns a pair of expressions: the opening tag and the closing tag
dynTag, endDyn = makeHTMLTags("dyn")


sample = """
<dyn type="dataFrame" name="Main Map" property="reference scale"/>
<dyn type="dataFrame" name="Main Map" property="time"/>
<dyn type="page" property="name"/>
<dyn type="page" property="number"/>
<dyn type="page" property="index"/>
<dyn type="page" property="count"/>
<dyn type="page" property="attribute" field="<Field Name>" domainlookup="true"/>
<dyn type="page" property="attribute" field="<Field Name>" />
"""

import pprint

# searchString scans the text and yields a ParseResults for every match
for dyn in dynTag.searchString(sample):
    pprint.pprint(dyn.asDict())
    if "domainlookup" in dyn:
        print("domainlookup =", dyn.domainlookup)
    print()

Parsing your sample gives:

{'empty': True,
 'name': 'Main Map',
 'property': 'reference scale',
 'startDyn': (['dyn', (['type', 'dataFrame'], {}), (['name', 'Main Map'], {}), (['property', 'reference scale'], {}), True], {'type': [('dataFrame', 1)], 'property': [('reference scale', 3)], 'tag': [('dyn', 0)], 'name': [('Main Map', 2)], 'empty': [(True, 4)]}),
 'tag': 'dyn',
 'type': 'dataFrame'}

{'empty': True,
 'name': 'Main Map',
 'property': 'time',
 'startDyn': (['dyn', (['type', 'dataFrame'], {}), (['name', 'Main Map'], {}), (['property', 'time'], {}), True], {'type': [('dataFrame', 1)], 'property': [('time', 3)], 'tag': [('dyn', 0)], 'name': [('Main Map', 2)], 'empty': [(True, 4)]}),
 'tag': 'dyn',
 'type': 'dataFrame'}

{'empty': True,
 'property': 'name',
 'startDyn': (['dyn', (['type', 'page'], {}), (['property', 'name'], {}), True], {'type': [('page', 1)], 'property': [('name', 2)], 'tag': [('dyn', 0)], 'empty': [(True, 3)]}),
 'tag': 'dyn',
 'type': 'page'}

{'empty': True,
 'property': 'number',
 'startDyn': (['dyn', (['type', 'page'], {}), (['property', 'number'], {}), True], {'type': [('page', 1)], 'property': [('number', 2)], 'tag': [('dyn', 0)], 'empty': [(True, 3)]}),
 'tag': 'dyn',
 'type': 'page'}

{'empty': True,
 'property': 'index',
 'startDyn': (['dyn', (['type', 'page'], {}), (['property', 'index'], {}), True], {'type': [('page', 1)], 'property': [('index', 2)], 'tag': [('dyn', 0)], 'empty': [(True, 3)]}),
 'tag': 'dyn',
 'type': 'page'}

{'empty': True,
 'property': 'count',
 'startDyn': (['dyn', (['type', 'page'], {}), (['property', 'count'], {}), True], {'type': [('page', 1)], 'property': [('count', 2)], 'tag': [('dyn', 0)], 'empty': [(True, 3)]}),
 'tag': 'dyn',
 'type': 'page'}

{'domainlookup': 'true',
 'empty': True,
 'field': '<Field Name>',
 'property': 'attribute',
 'startDyn': (['dyn', (['type', 'page'], {}), (['property', 'attribute'], {}), (['field', '<Field Name>'], {}), (['domainlookup', 'true'], {}), True], {'field': [('<Field Name>', 3)], 'tag': [('dyn', 0)], 'domainlookup': [('true', 4)], 'property': [('attribute', 2)], 'type': [('page', 1)], 'empty': [(True, 5)]}),
 'tag': 'dyn',
 'type': 'page'}
domainlookup = true

{'empty': True,
 'field': '<Field Name>',
 'property': 'attribute',
 'startDyn': (['dyn', (['type', 'page'], {}), (['property', 'attribute'], {}), (['field', '<Field Name>'], {}), True], {'field': [('<Field Name>', 3)], 'property': [('attribute', 2)], 'tag': [('dyn', 0)], 'empty': [(True, 4)], 'type': [('page', 1)]}),
 'tag': 'dyn',
 'type': 'page'}

Note that the resulting ParseResults structures will let you access the parsed attributes like object attributes (dyn.domainlookup) or dict keys (dyn["domainlookup"]).

2 Comments

The <dyn.../> lines are all the valid patterns I'm looking for now, not the way the data is presented.
@josh1234 - what does that mean? Looking at your answer to alecxe, I think pyparsing may still be a good choice. Are you saying that you just want to pick out the <dyn.../> tags for now? You can use pyparsing's originalTextFor method to give you that, just use originalTextFor(dynTag) and you will get the matched XML text. But the next thing you will do is process the XML to get the fields and properties; the answer I've posted already takes that additional step. Also, pyparsing includes a transformString method that will do inplace sub of these XML strings with replacement values.
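
For comparison, the in-place substitution described in the last comment can be approximated with the standard library alone. The sketch below uses `re.sub` with a replacement function, not pyparsing's transformString; the lookup table `PAGE_VALUES` and the function `render` are hypothetical, and the values 1 and 15 are taken from the example in the question.

```python
import re

TAG_RE = re.compile(r'<dyn\b((?:\s+\w+="[^"]*")*)\s*/>')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

# Hypothetical lookup table: (type, property) -> replacement text.
PAGE_VALUES = {("page", "index"): "1", ("page", "count"): "15"}

def render(text, values):
    """Replace each known <dyn .../> tag in text with its value."""
    def replace(match):
        attrs = dict(ATTR_RE.findall(match.group(1)))
        key = (attrs.get("type"), attrs.get("property"))
        # Unknown tags are left untouched rather than guessed at.
        return values.get(key, match.group(0))
    return TAG_RE.sub(replace, text)

rendered = render(
    'Page <dyn type="page" property="index"/> of <dyn type="page" property="count"/>',
    PAGE_VALUES,
)
print(rendered)  # Page 1 of 15
```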
