
I have the following input string:

input= """href="http://www.sciencedirect.com/science/article/pii/S0167923609002097" onmousedown="return scife_clk(this.href,'','res','2')">Using <b>text mining </b>and sentiment analysis for online forums hotspot detection and forecast</a></h3><div class="gs_a">N Li, <a href="/citations?
    href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3309177/" onmousedown="return scife_clk(this.href,'ggp','res','1')">How to link ontologies and protein–protein interactions to literature: <b>text</b>-<b>mining </b>approaches and the BioCreative experience</a></h3><div class="gs_a"><a href="/citations?
    href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3309177/" onmousedown="return scife_clk(this.href,'gga','gga','1')"><span class="gs_ggsL"><span class=gs_ctg2>[HTML]</span> from nih.gov</span><span class="gs_ggsS">nih.gov <span """

I want to extract the following output from this:

>> Using <b>text mining </b>and sentiment analysis for online forums hotspot detection and forecast
>> How to link ontologies and protein–protein interactions to literature: <b>text</b>-<b>mining </b>approaches and the BioCreative experience

I am trying to use Python's re package, but I am not sure which regular expression to use, since the text I want is preceded by several different patterns, such as:

(this.href,'','res','2')"> or (this.href,'ggp','res','2')"> or (this.href,'gga','gga','2')">

I am using this regular expression:

=re.search(r"(this.href,'ggp.?','res','.?/D')"

But it is not working for me. Can anyone tell me which regular expression to use?

  • Why not use an HTML parser instead? Regular expressions are not the right tool here. Commented Apr 15, 2013 at 15:06
  • Which HTML parser should I use? Commented Apr 15, 2013 at 15:58

2 Answers


This works with your example:

input= """\
href="http://www.sciencedirect.com/science/article/pii/S0167923609002097" onmousedown="return scife_clk(this.href,'','res','2')">Using <b>text mining </b>and sentiment analysis for online forums hotspot detection and forecast</a></h3><div class="gs_a">N Li, <a href="/citations?
href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3309177/" onmousedown="return scife_clk(this.href,'ggp','res','1')">How to link ontologies and protein–protein interactions to literature: <b>text</b>-<b>mining </b>approaches and the BioCreative experience</a></h3><div class="gs_a"><a href="/citations?
href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3309177/" onmousedown="return scife_clk(this.href,'gga','gga','1')"><span class="gs_ggsL"><span class=gs_ctg2>[HTML]</span> from nih.gov</span><span class="gs_ggsS">nih.gov <span """

import re

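for line in input.splitlines():
    # Skip lazily past the onmousedown attribute to the end of the opening tag,
    # then capture everything up to the last </a> on the line.
    m=re.search(r'onmousedown=.*?">(.*)</a>',line)
    if m:
        print(m.group(1))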

Prints:

Using <b>text mining </b>and sentiment analysis for online forums hotspot detection and forecast
How to link ontologies and protein–protein interactions to literature: <b>text</b>-<b>mining </b>approaches and the BioCreative experience

Bear in mind that using a regex on HTML is potentially a minefield (or mind field!), and it is usually recommended to use a parser instead. But with snippets, you can make it work...
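
If a single line can hold more than one link, one possible variation is a non-greedy capture combined with re.findall, so that each match stops at the first closing </a>. This is only a sketch: the sample string, the example.com URLs and the clk(...) handler are made up for illustration and are not taken from the question.

```python
import re

# Hypothetical one-line sample with two links (not from the question)
html = (
    '<a href="http://example.com/1" onmousedown="return clk(this.href)">'
    'First <b>title</b></a> '
    '<a href="http://example.com/2" onmousedown="return clk(this.href)">'
    'Second title</a>'
)

# [^"]* stays inside the onmousedown attribute value;
# (.*?) is lazy, so it stops at the first </a> after the opening tag
pattern = re.compile(r'onmousedown="[^"]*">(.*?)</a>')

# findall returns the captured group of every non-overlapping match
titles = pattern.findall(html)
for title in titles:
    print(title)
```

The inner <b> tags are kept because the pattern captures raw markup. Links nested in unusual ways, or attribute values that contain a double quote, would still defeat it.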


1 Comment

This won't work if there are 2 or more <a> tags on the same line.

You'd be much better off using a decent HTML parser. Use BeautifulSoup, for example:

from bs4 import BeautifulSoup

# Name the parser explicitly; 'html.parser' ships with Python
soup = BeautifulSoup(input, 'html.parser')

for link in soup.find_all('a', onmousedown=True):
    print(link.text)

which finds all <a> elements with an onmousedown attribute.
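
If installing a third-party package is not an option, roughly the same idea can be sketched with the standard library's html.parser module. The class name LinkTitleParser and the sample markup below are hypothetical. The sample is deliberately well-formed, whereas the fragment in the question starts mid-tag, without the opening "<a".

```python
from html.parser import HTMLParser


class LinkTitleParser(HTMLParser):
    """Collect the text inside every <a> element that has an onmousedown attribute."""

    def __init__(self):
        super().__init__()
        self.titles = []    # finished link texts
        self._parts = None  # text pieces of the link being read, or None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a" and "onmousedown" in dict(attrs):
            self._parts = []

    def handle_data(self, data):
        # Only keep text while inside a matching <a> element
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._parts is not None:
            self.titles.append("".join(self._parts))
            self._parts = None


# Hypothetical, well-formed sample (not the fragment from the question)
sample = (
    '<h3><a href="http://example.com/1" onmousedown="return clk(this.href)">'
    'Using <b>text mining </b>and sentiment analysis</a></h3>'
    '<a href="/citations?x=1">Cited by</a>'
)

parser = LinkTitleParser()
parser.feed(sample)
parser.close()
print(parser.titles)
```

Like link.text, this sketch keeps only the text and drops the inner <b> tags. Keeping them would need extra handling in handle_starttag and handle_endtag.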
