Extracting a string using a regular expression

Question

I have the following input string:

input= """href="http://www.sciencedirect.com/science/article/pii/S0167923609002097" onmousedown="return scife_clk(this.href,'','res','2')">Using <b>text mining </b>and sentiment analysis for online forums hotspot detection and forecast</a></h3><div class="gs_a">N Li, <a href="/citations?
    href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3309177/" onmousedown="return scife_clk(this.href,'ggp','res','1')">How to link ontologies and protein–protein interactions to literature: <b>text</b>-<b>mining </b>approaches and the BioCreative experience</a></h3><div class="gs_a"><a href="/citations?
    href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3309177/" onmousedown="return scife_clk(this.href,'gga','gga','1')"><span class="gs_ggsL"><span class=gs_ctg2>[HTML]</span> from nih.gov</span><span class="gs_ggsS">nih.gov <span """

I want to extract the following output from this:

>> Using <b>text mining </b>and sentiment analysis for online forums hotspot detection and forecast
>> How to link ontologies and protein–protein interactions to literature: <b>text</b>-<b>mining </b>approaches and the BioCreative experience

I am trying to use the re package in Python, but I am not sure which regular expression to use, since the text I want is preceded by several different patterns, such as:

(this.href,'','res','2')"> or (this.href,'ggp','res','2')"> or (this.href,'gga','gga','2')">

I am using this regular expression:

=re.search(r"(this.href,'ggp.?','res','.?/D')"

But it is not working for me. Can anyone tell me which regular expression to use?

Why not use an HTML parser instead? Regular expressions are not the right tool here.
– Martijn Pieters, Commented Apr 15, 2013 at 15:06

dawg · Accepted Answer · 2013-04-15 16:02:38Z

1

This works with your example:

input= """\
href="http://www.sciencedirect.com/science/article/pii/S0167923609002097" onmousedown="return scife_clk(this.href,'','res','2')">Using <b>text mining </b>and sentiment analysis for online forums hotspot detection and forecast</a></h3><div class="gs_a">N Li, <a href="/citations?
href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3309177/" onmousedown="return scife_clk(this.href,'ggp','res','1')">How to link ontologies and protein–protein interactions to literature: <b>text</b>-<b>mining </b>approaches and the BioCreative experience</a></h3><div class="gs_a"><a href="/citations?
href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3309177/" onmousedown="return scife_clk(this.href,'gga','gga','1')"><span class="gs_ggsL"><span class=gs_ctg2>[HTML]</span> from nih.gov</span><span class="gs_ggsS">nih.gov <span """

import re

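for line in input.splitlines():
    # Skip lazily to the first '">' after onmousedown=, then capture
    # everything up to the last </a> on the line.
    m = re.search(r'onmousedown=.*?">(.*)</a>', line)
    if m:
        print(m.group(1))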

Prints:

Using <b>text mining </b>and sentiment analysis for online forums hotspot detection and forecast
How to link ontologies and protein–protein interactions to literature: <b>text</b>-<b>mining </b>approaches and the BioCreative experience

Bear in mind that using a regex with HTML is potentially a minefield (or mind field!), and it is usually recommended to use a parser. But with snippets, you can make it work...


1 Comment

Martijn Pieters Over a year ago

This won't work if there are 2 or more <a> tags on the same line.
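One way to address the case raised in this comment is to make the capture group non-greedy and collect every match with re.findall. The following is only a sketch: the sample string is hypothetical (modeled on the question's input, with two result links on one line), and pinning the third scife_clk argument to 'res' is an assumption about which links are wanted.

```python
import re

# Hypothetical sample, modeled on the question's input: two result links on
# one line, followed by a sidebar link ('gga') that should be skipped.
sample = """<a href="/a" onmousedown="return scife_clk(this.href,'','res','2')">First <b>title</b></a><a href="/b" onmousedown="return scife_clk(this.href,'ggp','res','1')">Second title</a>
<a href="/c" onmousedown="return scife_clk(this.href,'gga','gga','1')"><span>[HTML]</span> from nih.gov</a>"""

# [^']* accepts any second argument ('' or 'ggp'); 'res' pins the third
# (an assumption). (.*?) is non-greedy, so each match stops at the nearest </a>.
TITLE_RE = re.compile(
    r"""onmousedown="return scife_clk\(this\.href,'[^']*','res','\d+'\)">(.*?)</a>""",
    re.DOTALL,
)

titles = TITLE_RE.findall(sample)
for title in titles:
    print(title)
```

This is still a regex applied to HTML, so it remains fragile if the attribute order or quoting changes.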

Martijn Pieters · Accepted Answer · 2013-04-15 16:03:44Z

1

You'd be much better off using a decent HTML parser. Use BeautifulSoup, for example:

from bs4 import BeautifulSoup

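# Name the parser explicitly; 'html.parser' ships with Python.
soup = BeautifulSoup(input, 'html.parser')

# Match every <a> element that carries an onmousedown attribute.
for link in soup.find_all('a', onmousedown=True):
    print(link.text)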

which finds all <a> elements with an onmousedown attribute.
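Note that a link's .text holds only its text, without the <b> tags that appear in the desired output. BeautifulSoup tags also have a decode_contents() method that returns the inner markup. As a dependency-free alternative, here is a rough sketch using the standard library's html.parser instead of BeautifulSoup. The class name LinkContentParser and the sample markup are hypothetical, and the sample assumes well-formed <a> tags, which the truncated snippet in the question does not have.

```python
from html.parser import HTMLParser


class LinkContentParser(HTMLParser):
    """Collect the inner markup of each <a> that has an onmousedown attribute."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._parts = None  # fragments collected while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a' and self._parts is None:
            if 'onmousedown' in dict(attrs):
                self._parts = []
        elif self._parts is not None:
            # Keep nested tags such as <b> exactly as written.
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._parts is None:
            return
        if tag == 'a':
            self.results.append(''.join(self._parts))
            self._parts = None
        else:
            self._parts.append('</%s>' % tag)

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)


# Hypothetical, well-formed sample: one result link and one citation link.
sample = (
    '<h3><a href="/a" onmousedown="return scife_clk(this.href)">'
    'Using <b>text mining </b>and sentiment analysis</a></h3>'
    '<div class="gs_a">N Li, <a href="/citations?user=x">profile</a></div>'
)

parser = LinkContentParser()
parser.feed(sample)
parser.close()
for content in parser.results:
    print(content)
```

The citation link is skipped because it has no onmousedown attribute, mirroring the find_all filter above.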
