1

How do I extract all HTML-style comments from a document, using Python?

I've tried using a regex:

text = 'hello, world <!-- comment -->'
re.match('<!--(.*?)-->', text)

But it produces nothing. I don't understand this since the same regex works fine on the same string at https://regex101.com/

UPDATE: My document is actually an XML file, and I'm parsing the document with pyquery (based on lxml), but I don't think lxml can extract comments that aren't inside a node. This is what the document looks like:

<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="220398">
  <intervention_browse>
    <!-- CAUTION:  The following MeSH terms are assigned with an imperfect algorithm  -->
    <mesh_term>Freund's Adjuvant</mesh_term>
    <mesh_term>Keyhole-limpet hemocyanin</mesh_term>
  </intervention_browse>
  <!-- Results have not yet been posted for this study                                -->
</clinical_study>

UPDATE 2: Thanks for suggesting the other answer, but I'm already parsing the document extensively with lxml and don't want to rewrite everything with BeautifulSoup. Have updated title accordingly.

5
  • 1
    This would be trivial and more reliable using lxml or beautifulsoup Commented Jul 27, 2016 at 14:59
  • @MaxU I'm already using lxml (pyquery) so I don't really want to switch to BeautifulSoup, but thanks. I've updated the question to be clear that I'm happy to use regex or lxml. Commented Jul 27, 2016 at 15:02
  • @Padraic I'm not sure it is actually possible in lxml, see the update. Commented Jul 27, 2016 at 15:02
  • @Richard dox you linked to suggest you can determine whether the tag is an etree.comment -- have you tried that? And then if True could just print the tag property value? Commented Jul 27, 2016 at 15:10
  • @DavidZemens problem is that there is no tag, the comment is just floating. Commented Jul 27, 2016 at 15:11

5 Answers 5

2

This seems to print the comment for me:

from lxml import etree

# Bytes literal: in Python 3, lxml rejects a str that carries an encoding declaration.
txt = b"""<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="220398">
  <intervention_browse>
    <!-- CAUTION:  The following MeSH terms are assigned with an imperfect algorithm  -->
    <mesh_term>Freund's Adjuvant</mesh_term>
    <mesh_term>Keyhole-limpet hemocyanin</mesh_term>
  </intervention_browse>
  <!-- Results have not yet been posted for this study                                -->
</clinical_study>"""
root = etree.XML(txt)
# The first child of <intervention_browse> is the CAUTION comment.
print(root[0][0])


To get the last comment:

# Keep only the direct children of the root that are comment nodes.
comments = [itm for itm in root if itm.tag is etree.Comment]
if comments:
    print(comments[-1])

4 Comments

Thanks! It's actually the last comment I care about (in a document with an arbitrary number of comments, though the last comment is always just before the closing clinical_study tag) any idea how you'd get that?
Ah root[0] seems to do it. Thanks!
root[1] prints the <!-- Results have not yet been posted for this study --> for me.
Cheers, if it's working @Richard do consider marking this answer "Accepted".
1

Change match to search and then:

import re

text = 'hello, world <!-- comment -->'
# search() scans the whole string; match() only tries the very start.
comment = re.search('<!--(.*?)-->', text)
comment.group(1)

Output:

' comment '

1

You have to use the re.findall() method to extract all substrings that match a certain pattern.

re.match() will only check whether the pattern fits at the beginning of the string, while re.search() will only get you the first match within the string. For your purpose, re.findall() is definitely the right method and should be preferred.
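As a minimal sketch of the difference between the three functions (the sample string is made up for illustration, and re.DOTALL is an added assumption so that comments spanning several lines are also caught):

```python
import re

# Hypothetical sample text with two comments, one of them spanning two lines.
text = "hello <!-- first --> world <!-- second\nspans lines -->"

# Non-greedy group; DOTALL lets "." match newlines inside a comment.
pattern = re.compile(r"<!--(.*?)-->", re.DOTALL)

# match() only tries the start of the string, so it finds nothing here.
at_start = pattern.match(text)

# search() returns the first match anywhere in the string.
first = pattern.search(text).group(1)

# findall() returns the captured group of every match.
comments = pattern.findall(text)

print(at_start)
print(repr(first))
print(comments)
```

Note that a regex like this treats the document as plain text, so it would also pick up comment-like sequences inside CDATA sections or attribute values.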

1

XPath works just fine here: tree.xpath('//comment()'). For example, to remove all scripts, styles, and comments from the DOM you could do:

import lxml.html

# html is assumed to hold the document source as a string.
tree = lxml.html.fromstring(html)
for el in tree.xpath('//script | //style | //comment()'):
    el.getparent().remove(el)

No BeautifulSoup.

0

You could use Beautiful Soup to extract the comments in a for loop like this:

from bs4 import BeautifulSoup, Comment

text = 'hello, world <!-- comment -->'

soup = BeautifulSoup(text, 'lxml')

# find_all(string=...) is the current spelling of findAll(text=...).
for x in soup.find_all(string=lambda s: isinstance(s, Comment)):
    print(x)
