This is my string:

content = '<tr class="cart-subtotal"><th>RTO / Registration office :</th><td><span class="amount"><h5>Yadgiri</h5></span></td></tr>'

I have tried the regular expression below to extract the text between the h5 tags:

   import re
   import string

   reg = re.search(r'<tr class="cart-subtotal"><th>RTO / Registration office :</th><td><span class="amount"><h5>([A-Za-z0-9%s]+)</h5></span></td></tr>' % string.punctuation, content)

It returns exactly what I want.

Is there a more Pythonic way to do this?

4 Comments
  • Yes. Look at Beautiful Soup 4. Commented Jan 18, 2018 at 12:28
  • I want to do this with a regular expression instead of BeautifulSoup or Scrapy. Commented Jan 18, 2018 at 12:30
  • Do NOT use regex for parsing HTML/XML/tag-style data. See here Commented Jan 18, 2018 at 12:33
  • @James Thanks for the info. Commented Jan 18, 2018 at 12:59

1 Answer

Dunno whether this qualifies as more pythonic or not, but it handles it as HTML data.

from lxml import html
content = '<tr class="cart-subtotal"><th>RTO / Registration office :</th><td><span class="amount"><h5>Yadgiri</h5></span></td></tr>'
HtmlData = html.fromstring(content)
ListData = HtmlData.xpath('//text()')

And to get the last element:

ListData[-1]
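
If installing lxml is not an option, the same idea (treat the input as HTML and collect the text inside the h5 element) can be sketched with the standard library's html.parser module. This is only a minimal sketch: the class name H5TextExtractor and the variable h5_texts are made up for illustration, and it assumes the h5 elements are not nested inside one another.

```python
from html.parser import HTMLParser


class H5TextExtractor(HTMLParser):
    """Collects the text found inside each <h5> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_h5 = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # Tag names are passed in lowercase by HTMLParser.
        if tag == "h5":
            self._in_h5 = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "h5":
            self._in_h5 = False

    def handle_data(self, data):
        # Only keep text that appears while an <h5> is open.
        if self._in_h5:
            self.texts[-1] += data


content = '<tr class="cart-subtotal"><th>RTO / Registration office :</th><td><span class="amount"><h5>Yadgiri</h5></span></td></tr>'

parser = H5TextExtractor()
parser.feed(content)
parser.close()

h5_texts = parser.texts
print(h5_texts[0])
```

Unlike ListData[-1], this does not depend on the h5 text being the last text node in the markup.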

1 Comment

To install on a Debian-based system, use the python3-lxml package.
