
I need to parse a file containing XML comments. Specifically, it's a C# file using the MS /// convention.

From this I'd need to pull out foobar (or /// foobar would be acceptable, too). (Note - this still doesn't work if you put the XML all on one line...)

testStr = """
    ///<summary>
    /// foobar
    ///</summary>
    """

Here is what I have:

import pyparsing as pp

_eol = pp.Literal("\n").suppress()
_cPoundOpenXmlComment = pp.Suppress('///<summary>') + pp.SkipTo(_eol)
_cPoundCloseXmlComment = pp.Suppress('///</summary>') + pp.SkipTo(_eol)
_xmlCommentTxt = ~_cPoundCloseXmlComment + pp.SkipTo(_eol)
xmlComment = _cPoundOpenXmlComment + pp.OneOrMore(_xmlCommentTxt) + _cPoundCloseXmlComment

match = xmlComment.scanString(testStr)

and to output:

for item,start,stop in match:
    for entry in item:
        print(entry)

But I haven't had much success getting the grammar to work across multiple lines.

(Note - I tested the above sample in Python 3.2; it runs, but (per my issue) does not print any values.)

Thanks!

3 Answers


I think Literal('\n') is your problem. You don't want to build a Literal with whitespace characters (since Literals by default skip over whitespace before trying to match). Try using LineEnd() instead.

EDIT 1: Just because you get an infinite loop with LineEnd doesn't mean that Literal('\n') is any better. Try adding .setDebug() on the end of your _eol definition, and you'll see that it never matches anything.
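To see the difference concretely, here is a small sketch (assuming pyparsing is installed) contrasting the two:

```python
# A minimal sketch of why Literal("\n") never matches: pyparsing skips
# leading whitespace (newlines included) before a Literal tries to match,
# so the newline is consumed as whitespace before the match is attempted.
import pyparsing as pp

sample = "abc\ndef"

lit_eol = pp.Literal("\n")
matches = list(lit_eol.scanString(sample))
print(matches)  # [] -- the literal newline is never seen

# LineEnd() excludes "\n" from its skippable whitespace, so it can match it
line_end = pp.LineEnd()
print(len(list(line_end.scanString(sample))) > 0)  # True
```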

Instead of trying to define the body of your comment as "one or more lines that are not a closing line, but get everything up to the end-of-line", what if you just do:

xmlComment = _cPoundOpenXmlComment + pp.SkipTo(_cPoundCloseXmlComment) + _cPoundCloseXmlComment 

(The reason you were getting an infinite loop with LineEnd() was that you were essentially doing OneOrMore(SkipTo(LineEnd())), but never consuming the LineEnd(), so the OneOrMore just kept matching and matching and matching, parsing and returning an empty string since the parsing position was at the end of line.)
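Putting the suggestion together, a minimal end-to-end sketch (reusing the testStr from the question) might look like this:

```python
import pyparsing as pp

# the sample input from the question
testStr = """\
    ///<summary>
    /// foobar
    ///</summary>
"""

_cPoundOpenXmlComment = pp.Suppress('///<summary>')
_cPoundCloseXmlComment = pp.Suppress('///</summary>')

# SkipTo grabs everything between the markers in a single token;
# the close marker is then consumed explicitly, so nothing loops forever
xmlComment = (_cPoundOpenXmlComment
              + pp.SkipTo(_cPoundCloseXmlComment)("body")
              + _cPoundCloseXmlComment)

bodies = [toks.body.strip() for toks, start, end in xmlComment.scanString(testStr)]
print(bodies)  # ['/// foobar']
```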


2 Comments

thanks for the suggestion; however, changing to _eol = pp.LineEnd().suppress() results in a hang/infinite loop. Could you be a little more specific? (Note - paste the 3 sections together in one .py file and the code runs as-is.) Thanks, Mike
vote up for the explanation of what is wrong. Duh! I should have seen that I never consumed the end of line :)

How about using nestedExpr:

import pyparsing as pp

text = '''\
///<summary>
/// foobar
///</summary>
blah blah
///<summary> /// bar ///</summary>
///<summary>  ///<summary> /// baz  ///</summary> ///</summary>    
'''

comment=pp.nestedExpr("///<summary>","///</summary>")
for match in comment.searchString(text):
    print(match)
    # [['///', 'foobar']]
    # [['///', 'bar']]
    # [[['///', 'baz']]]
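If only the comment text is wanted, one option (a sketch; flatten is a hypothetical helper here, not part of pyparsing) is to walk the nested result and drop the '///' markers:

```python
import pyparsing as pp

text = "///<summary>\n/// foobar\n///</summary>\n"

comment = pp.nestedExpr("///<summary>", "///</summary>")

def flatten(tokens):
    # recursively walk the nested ParseResults, skipping '///' markers
    for tok in tokens:
        if isinstance(tok, str):
            if tok != "///":
                yield tok
        else:
            yield from flatten(tok)

extracted = [" ".join(flatten(m)) for m in comment.searchString(text)]
print(extracted)  # ['foobar']
```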

1 Comment

@PaulMcGuire's solution would work, too, but this is exactly what I should be using (it's the simplest...). Thanks!

You could use an XML parser to parse the XML. It should be easy to extract the relevant comment lines:

import re
from xml.etree import ElementTree as etree

# the C# comment text from the question
text = '''\
///<summary>
/// foobar
///</summary>
'''

# extract all /// lines
lines = re.findall(r'^\s*///(.*)', text, re.MULTILINE)

# parse the extracted fragment as xml
root = etree.fromstring('<root>%s</root>' % ''.join(lines))
print(root.findtext('summary'))
# -> foobar

3 Comments

I thought you were great in Blade Runner.
@JFSebastian Unfortunately this wouldn't work in the bigger picture I'm encountering this problem in. Yes, I could extract all the XML fragments as you suggest, but I also need to parse source code after the comment, and a grammar is ~necessary for that; doing the regex search line by line would add an additional loop through the file.
@mike: the regex is just an example of how to extract comment lines. In the bigger picture, you use your parser to extract the relevant comments (a much simpler task than parsing XML), and that doesn't prevent you from using an XML parser on the XML itself if you find it necessary.
