Extract HTML comments in Python, using regex or lxml?

Question

How do I extract all HTML-style comments from a document, using Python?

I've tried using a regex:

text = 'hello, world <!-- comment -->'
re.match('<!--(.*?)-->', text)

But it produces nothing. I don't understand this since the same regex works fine on the same string at https://regex101.com/

UPDATE: My document is actually an XML file, and I'm parsing the document with pyquery (based on lxml), but I don't think lxml can extract comments that aren't inside a node. This is what the document looks like:

<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="220398">
  <intervention_browse>
    <!-- CAUTION:  The following MeSH terms are assigned with an imperfect algorithm  -->
    <mesh_term>Freund's Adjuvant</mesh_term>
    <mesh_term>Keyhole-limpet hemocyanin</mesh_term>
  </intervention_browse>
  <!-- Results have not yet been posted for this study                                -->
</clinical_study>

UPDATE 2: Thanks for suggesting the other answer, but I'm already parsing the document extensively with lxml and don't want to rewrite everything with BeautifulSoup. Have updated title accordingly.

This would be trivial and more reliable using lxml or BeautifulSoup.
– Padraic Cunningham, Commented Jul 27, 2016 at 14:59
@MaxU I'm already using lxml (pyquery) so I don't really want to switch to BeautifulSoup, but thanks. I've updated the question to be clear that I'm happy to use regex or lxml.
– Richard, Commented Jul 27, 2016 at 15:02
@Padraic I'm not sure it is actually possible in lxml, see the update.
– Richard, Commented Jul 27, 2016 at 15:02
@Richard the docs you linked to suggest you can check whether the tag is etree.Comment; have you tried that? If it is, you could then just print the tag property value?
– David Zemens, Commented Jul 27, 2016 at 15:10
@DavidZemens problem is that there is no tag, the comment is just floating.
– Richard, Commented Jul 27, 2016 at 15:11

David Zemens · Accepted Answer · 2016-07-27 15:21:07Z

2

This seems to print the comment for me:

from lxml import etree
# Bytes literal: on Python 3, lxml rejects a str that carries an encoding declaration.
txt = b"""<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="220398">
  <intervention_browse>
    <!-- CAUTION:  The following MeSH terms are assigned with an imperfect algorithm  -->
    <mesh_term>Freund's Adjuvant</mesh_term>
    <mesh_term>Keyhole-limpet hemocyanin</mesh_term>
  </intervention_browse>
  <!-- Results have not yet been posted for this study                                -->
</clinical_study>"""
root = etree.XML(txt)
# root[0] is <intervention_browse>; its first child is the CAUTION comment.
print(root[0][0])

To get the last comment:

# Keep only the direct children of root that are comment nodes.
comments = [itm for itm in root if itm.tag is etree.Comment]
if comments:
    print(comments[-1])
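The list comprehension above only looks at direct children of the root. If comments can sit at any depth, a variation is to walk the whole tree with iter() and filter on etree.Comment. The following is a minimal sketch under that assumption; the shortened XML and the variable names are made up for illustration.

```python
from lxml import etree

# Shortened stand-in for the document in the question (illustrative only).
xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="220398">
  <intervention_browse>
    <!-- CAUTION: imperfect algorithm -->
    <mesh_term>Freund's Adjuvant</mesh_term>
  </intervention_browse>
  <!-- Results have not yet been posted for this study -->
</clinical_study>"""

root = etree.XML(xml_bytes)

# iter() with etree.Comment as the filter yields comment nodes only,
# at any depth, in document order.
comment_texts = [node.text.strip() for node in root.iter(etree.Comment)]

# The last comment in document order, or None if there are no comments.
last_comment = comment_texts[-1] if comment_texts else None
print(last_comment)
```

Because iter() follows document order, the final list entry is the comment closest to the closing clinical_study tag.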

edited Jul 27, 2016 at 15:21

answered Jul 27, 2016 at 15:11

David Zemens



4 Comments

Richard Over a year ago

Thanks! It's actually the last comment I care about (in a document with an arbitrary number of comments, though the last comment is always just before the closing clinical_study tag). Any idea how you'd get that?

Richard Over a year ago

Ah root[0] seems to do it. Thanks!

David Zemens Over a year ago

root[1] prints the <!-- Results have not yet been posted for this study --> comment for me.

David Zemens Over a year ago

Cheers, if it's working @Richard do consider marking this answer "Accepted".

pawelty · Accepted Answer · 2016-07-27 15:04:45Z

1

Change match to search, and then:

import re

text = 'hello, world <!-- comment -->'
# search() scans the whole string; match() only tries at position 0.
comment = re.search('<!--(.*?)-->', text)
comment.group(1)

Output:

' comment '

answered Jul 27, 2016 at 15:04

pawelty


Rafael Albert · Accepted Answer · 2016-07-27 15:10:26Z

1

You have to use the re.findall() method to extract all substrings that match a certain pattern.

re.match() will only check whether the pattern fits at the beginning of the string, while re.search() will only get you the first match within the string. For your purpose, re.findall() is definitely the right method and should be preferred.
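As a rough sketch of what that looks like for the pattern in the question: the sample text and the COMMENT_RE name below are made up for illustration, and the re.DOTALL flag is an addition of mine so that comments spanning several lines are also captured.

```python
import re

# Made-up sample text with one single-line and one multi-line comment.
text = """hello <!-- first --> world
<!-- second,
spanning two lines -->"""

# Non-greedy group so each comment ends at its own "-->";
# re.DOTALL lets "." match newlines as well.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)

# With a single capture group, findall() returns the captured text of every match.
comments = [c.strip() for c in COMMENT_RE.findall(text)]
print(comments)
```

A regex like this treats the document as plain text, so it will also pick up anything that merely looks like a comment, for example inside a CDATA section. For XML that is already parsed with lxml, reading the comment nodes from the tree is probably the safer route.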

answered Jul 27, 2016 at 15:10

Rafael Albert


Dharman · Accepted Answer · 2021-04-11 10:08:54Z

1

XPath works just fine here: tree.xpath('//comment()'). For example, to remove all scripts, styles, and comments from the DOM, you could do:

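import lxml.html

tree = lxml.html.fromstring(html)  # html: the markup string to clean
for el in tree.xpath('//script | //style | //comment()'):
    el.getparent().remove(el)  # getparent is a method, so it must be called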

No BeautifulSoup.
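The same XPath expression can be used to read the comments rather than remove them, which is what the question asks for. Here is a minimal sketch against a shortened version of the question's XML; the sample document and variable names are illustrative assumptions, not part of the original answer.

```python
from lxml import etree

# Shortened stand-in for the XML in the question (illustrative only).
xml_bytes = b"""<clinical_study rank="220398">
  <intervention_browse>
    <!-- CAUTION: terms assigned by an imperfect algorithm -->
    <mesh_term>Keyhole-limpet hemocyanin</mesh_term>
  </intervention_browse>
  <!-- Results have not yet been posted for this study -->
</clinical_study>"""

root = etree.fromstring(xml_bytes)

# //comment() selects comment nodes at any depth, in document order.
comment_nodes = root.xpath("//comment()")

# Pair each comment's text with the tag of the element that contains it.
found = [(node.getparent().tag, node.text.strip()) for node in comment_nodes]
for parent_tag, comment_text in found:
    print(parent_tag, "->", comment_text)
```

Each comment node exposes its content through .text, so no regex is needed once the document is parsed.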

edited Apr 11, 2021 at 10:08

Dharman♦


answered Apr 11, 2021 at 10:03

Pero


Andrew Feather · Accepted Answer · 2016-07-27 15:04:44Z

0

You could use Beautiful Soup to extract the comments in a for loop like this:

from bs4 import BeautifulSoup, Comment

text = 'hello, world <!-- comment -->'

soup = BeautifulSoup(text, 'lxml')

for x in soup.find_all(string=lambda text: isinstance(text, Comment)):
    print(x)

answered Jul 27, 2016 at 15:04

Andrew Feather
