Extract portions of text with Regex in Python

Question

I have a previously matched pattern such as:

<a href="somelink here something">

Now I wish to extract only the value of one or more specific attributes in the tag, but the attribute may be anything and may occur anywhere in the tag.

regex_pattern=re.compile('href=\"(.*?)\"')

Now I can use the above to match the attribute and its value, but I need to extract only the (.*?) part (the value).

I can of course strip href=" and " afterwards, but I'm sure I can use regex properly to extract only the required part.

In simple words I want to match

abcdef=\"______________________\"

in the pattern but want only the

____________________

Part

How do I do this?

I'll post the obligatory link to this StackOverflow classic. Are you sure you don't want to use a proper XML parser?
– Lauritz V. Thaulow, Commented Jul 27, 2012 at 9:09
Though that link is fun, I see no harm in matching specific attributes with a regular expression. Do be aware of the limitations of regular expressions when it comes to dealing with XML and HTML (which the linked post so entertainingly expands upon); use lxml or BeautifulSoup instead.
– Martijn Pieters, Commented Jul 27, 2012 at 9:22
I'll agree HTML is a bit of a hassle if you want to parse any kind of (abstract) HTML in general, i.e. if you try to create an all-purpose parser or something. But for fixed-format pages (scraping purposes :P) it works, though I still need a little help from the outside.
– ffledgling, Commented Jul 27, 2012 at 23:24

Ben Ruijl · Accepted Answer · 2012-07-27 09:21:37Z

2

Just use re.search('href=\"(.*?)\"', yourtext).group(1) on the matched string yourtext and it will yield the matched group.
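One caveat worth sketching: re.search() returns None when nothing matches, so chaining .group(1) directly raises AttributeError on input without an href. The helper name extract_href below is hypothetical and only illustrates one way to guard against that; the pattern is the one from the question.

```python
import re

# Pattern from the question; the raw string avoids the unneeded \" escapes.
HREF_PATTERN = re.compile(r'href="(.*?)"')

def extract_href(text):
    """Hypothetical helper: return the href value in text, or None if absent."""
    match = HREF_PATTERN.search(text)
    if match is None:
        # No href attribute found, so there is no group to read.
        return None
    # group(1) is the text captured by the first (...) in the pattern.
    return match.group(1)

print(extract_href('<a href="somelink here something">'))
print(extract_href('<a name="anchor">'))
```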

edited Jul 27, 2012 at 9:21

answered Jul 27, 2012 at 8:57


3 Comments

Martijn Pieters Over a year ago

.search() returns a MatchObject instance, not a list of matched groups.

Martijn Pieters Over a year ago

And group(1) doesn't return a list of matched groups either, but your answer is now only slightly confusing, not just plain wrong. :-P

Ben Ruijl Over a year ago

haha, true. I initially had findall instead of search and then it all makes sense (if you have 1 group)

Martijn Pieters · Accepted Answer · 2012-07-27 09:11:58Z

Take a look at the .group() method on regular expression MatchObject results.

Your regular expression has an explicit capturing group (the part in parentheses), and the .group() method gives you direct access to the string that was matched within that group. MatchObject instances are returned by several re functions and methods: .search() and .match() return one (or None if there is no match), and .finditer() yields one per match.

Demonstration:

>>> import re
>>> example = '<a href="somelink here something">'
>>> regex_pattern=re.compile('href=\"(.*?)\"') 
>>> regex_pattern.search(example)
<_sre.SRE_Match object at 0x1098a2b70>
>>> regex_pattern.search(example).group(1)
'somelink here something'
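If the text contains several tags, the same group can be read from every match. The following is a minimal sketch, assuming a made-up input string with two links; it uses .finditer(), and also .findall(), which returns the group strings directly when the pattern has exactly one group.

```python
import re

regex_pattern = re.compile(r'href="(.*?)"')

# Made-up sample input with two anchors, for illustration only.
example = '<a href="first.html">one</a> <a href="second.html">two</a>'

# finditer() yields one match object per non-overlapping match.
values = [match.group(1) for match in regex_pattern.finditer(example)]

# With a single group in the pattern, findall() returns a list of the
# captured strings rather than match objects.
same_values = regex_pattern.findall(example)

print(values)
print(same_values)
```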

From the Regular Expression syntax documentation on the (...) parenthesis syntax:

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].
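The \number backreference mentioned in that passage can be put to use here. As a sketch, assuming the attribute value may be wrapped in either single or double quotes (something the question does not state), group 1 can capture the opening quote and \1 can require the same character to close it; the value is then in group 2. The name href_value is hypothetical.

```python
import re

# Group 1 captures the opening quote character; \1 must match that same
# character again, so a value opened with " is only closed by ".
quoted_href = re.compile(r'''href=(["'])(.*?)\1''')

def href_value(tag):
    """Hypothetical helper: return the quoted href value, or None."""
    match = quoted_href.search(tag)
    # The value is now the second group, because the quote is the first.
    return match.group(2) if match else None

print(href_value('<a href="somelink here something">'))
print(href_value("<a href='single quoted'>"))
```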
