I have a previously matched pattern such as:

<a href="somelink here something">

Now I wish to extract only the value of one or more specific attributes in the tag, but the attribute may be anything and may occur anywhere in the tag.

regex_pattern=re.compile('href=\"(.*?)\"') 

Now I can use the above to match the attribute and the value part, but I need to extract only the (.*?) part (the value).

I can of course strip href=" and " afterwards, but I'm sure I can use regex properly to extract only the required part.

In simple words I want to match

abcdef=\"______________________\"

in the pattern but want only the

____________________

Part

How do I do this?

  • I'll post the obligatory link to this StackOverflow classic. Are you sure you don't want to use a proper XML parser? Commented Jul 27, 2012 at 9:09
  • Though that link is fun, I see no harm in matching specific attributes with a regular expression. Do be aware of the limitations of regular expressions when it comes to dealing with XML and HTML (which the linked post so entertainingly expands upon); use lxml or BeautifulSoup instead. Commented Jul 27, 2012 at 9:22
  • I'll agree HTML is a bit of a hassle if you want to parse any kind of (abstract) HTML in general, i.e. if you try to create an all-purpose parser or something. But for fixed-format pages (scraping purposes :P) it works, though I still need a little help from the outside. Commented Jul 27, 2012 at 23:24
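
For readers who want to follow the parser suggestion in the comments above without installing anything, here is a minimal sketch using the standard library's html.parser module rather than the lxml or BeautifulSoup libraries that were named. The class name AttributeCollector and the helper collect_attribute are hypothetical names chosen for this illustration.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect the values of one attribute on one kind of tag (illustrative helper)."""

    def __init__(self, tag, attribute):
        super().__init__()
        self.tag = tag
        self.attribute = attribute
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased
        if tag == self.tag:
            for name, value in attrs:
                if name == self.attribute:
                    self.values.append(value)


def collect_attribute(html, tag, attribute):
    """Return every value of `attribute` found on `tag` elements in `html`."""
    collector = AttributeCollector(tag, attribute)
    collector.feed(html)
    collector.close()
    return collector.values


print(collect_attribute('<a href="somelink here something">', 'a', 'href'))
```

Unlike the regular expression, this also copes with single-quoted attribute values, at the cost of a little more code.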

2 Answers

Just use re.search('href=\"(.*?)\"', yourtext).group(1) on the matched string yourtext; it returns the text captured by the first group.
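
One caveat with that one-liner is that re.search returns None when nothing matches, so calling .group(1) directly raises AttributeError. A minimal sketch that guards against this might look as follows; the names HREF_RE and extract_href are made up for this example.

```python
import re

# Same pattern as in the question, written as a raw string
HREF_RE = re.compile(r'href="(.*?)"')


def extract_href(tag):
    """Return the href value from a tag string, or None if there is no href."""
    match = HREF_RE.search(tag)
    # search() returns None on failure, so check before calling group()
    return match.group(1) if match else None


print(extract_href('<a href="somelink here something">'))
print(extract_href('<a name="top">'))
```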

3 Comments

.search() returns a MatchObject instance, not a list of matched groups.
And group(1) doesn't return a list of matched groups either, but your answer is now only slightly confusing, not just plain wrong. :-P
haha, true. I initially had findall instead of search and then it all makes sense (if you have 1 group)

Take a look at the .group() method on regular expression MatchObject results.

Your regular expression has an explicit match group (the part in parentheses), and the .group() method gives you direct access to the string that was matched within that group. MatchObject instances are produced by several re functions and methods, including .search() and .finditer().

Demonstration:

>>> import re
>>> example = '<a href="somelink here something">'
>>> regex_pattern=re.compile('href=\"(.*?)\"') 
>>> regex_pattern.search(example)
<_sre.SRE_Match object at 0x1098a2b70>
>>> regex_pattern.search(example).group(1)
'somelink here something'

From the Regular Expression syntax documentation on the (...) parenthesis syntax:

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].
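
Since the question asks about arbitrary attributes that may occur anywhere in the tag, the same capturing-group idea can be generalised. The following is only a sketch, assuming attribute values are always double-quoted as in the example tag; the function names attribute_value and all_attributes are hypothetical.

```python
import re


def attribute_value(tag, name):
    """Return the value of attribute `name` in `tag`, or None if it is absent."""
    # re.escape protects against regex metacharacters in the attribute name;
    # the lookbehind stops 'href' from matching inside e.g. 'data-href'
    pattern = re.compile(r'(?<![\w-])' + re.escape(name) + r'="(.*?)"')
    match = pattern.search(tag)
    return match.group(1) if match else None


def all_attributes(tag):
    """Return a dict of every double-quoted attribute in `tag`."""
    # With two groups, findall returns (name, value) tuples
    return dict(re.findall(r'([\w-]+)="(.*?)"', tag))


tag = '<a class="nav" href="somelink here something" id="x1">'
print(attribute_value(tag, 'href'))
print(all_attributes(tag))
```

With more than one group in the pattern, .findall() returns tuples of the captured groups, which is why the result can be passed straight to dict().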
