I am trying to match the whole word pid in each link, but somehow this code also seems to match links that only contain id:

    for a in self.soup.find_all(href=True):

        if 'pid' in a['href']:
            href = a['href']
            if not href or len(href) <= 1:
                continue
            elif 'javascript:' in href.lower():
                continue
            else:
                href = href.strip()
            if href[0] == '/':
                href = (domain_link + href).strip()
            elif href[:4] == 'http':
                href = href.strip()
            elif href[0] != '/' and href[:4] != 'http':
                href = ( domain_link + '/' + href ).strip()
            if '#' in href:
                indx = href.index('#')
                href = href[:indx].strip()
            if href in links:
                continue

            links.append(self.re_encode(href))
  • Sorry, I mean a regular expression. Commented Sep 5, 2015 at 0:40
  • I'm not clear what's wrong here. Can you make it clear which part of the code you're having problems with, and specifically how it is behaving now and how you want it to behave? Commented Sep 5, 2015 at 1:53
  • I think this might be a duplicate of test string for a substring Commented Sep 5, 2015 at 1:55
  • What is some sample input that doesn't work? How do you know that that sample input doesn't work? What would it output if it was working properly? Commented Sep 5, 2015 at 2:01
  • With if 'pid' it picks up every pid, and also sid and id. I just want the search to match the whole word 'pid'. Commented Sep 5, 2015 at 2:06

1 Answer

If you mean that you want it to match a string like /pid/0002 but not /rapid.html, then you need to exclude word characters on either side. Something like:

    >>> import re
    >>> re.search(r'\Wpid\W', '/pid/0002')
    <_sre.SRE_Match object; span=(0, 5), match='/pid/'>
    >>> print(re.search(r'\Wpid\W', '/rapid/123'))
    None

If 'pid' might be at the start or end of the string, you'll need to add extra conditions: check for either the start/end of the string or a non-word character:

    >>> re.search(r'(^|\W)pid($|\W)', 'pid/123')
    <_sre.SRE_Match object; span=(0, 4), match='pid/'>

See the re module documentation for more information on the special characters.

You could use it like this:

    pattern = re.compile(r'(^|\W)pid($|\W)')
    if pattern.search(a['href']) is not None:
        ...
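
To see how the pattern behaves on links like the ones mentioned in the question, here is a minimal sketch. The helper name `has_whole_word_pid` and the sample hrefs are made up for illustration; only the regular expression comes from the answer above.

```python
import re

# Same pattern as above: 'pid' bounded on each side by the start/end of the
# string or by a non-word character.
PID_PATTERN = re.compile(r'(^|\W)pid($|\W)')


def has_whole_word_pid(href):
    """Return True if href contains 'pid' as a standalone word (hypothetical helper)."""
    return PID_PATTERN.search(href) is not None


# Hypothetical sample hrefs, chosen only to illustrate the behaviour.
sample_hrefs = [
    '/item?pid=42',                # 'pid' between '?' and '='
    '/rapid.html',                 # 'pid' embedded in 'rapid'
    '/page?sid=tyy,4mr&icmpid=7',  # 'pid' embedded in 'icmpid'
    'pid/123',                     # 'pid' at the very start
]

matching = [h for h in sample_hrefs if has_whole_word_pid(h)]
print(matching)
```

Note that `\W` treats the underscore as a word character, so a link such as `/my_pid/1` would not match with this pattern.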

3 Comments

Actually there are three situations: one is ?pid=, one is where it takes sid=tyy,4mr&icmpid, and another has only id, like Widget etc. I just want to match the first one, with only ?pid.
Thanks, I used this expression and it worked: pattern = re.compile(r'(\?pid\=)')
Cool. But in that case you might like to do proper URL parsing. Python has libraries to help: see urllib.parse (py3) and urlparse (py2). Makes it easy to handle other cases like where the pid argument isn't first (&pid=...).
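
The URL-parsing approach suggested in the last comment could look roughly like this in Python 3. This is only a sketch: the function name `has_pid_param` and the example URLs in the usage note are assumptions, not part of the original code.

```python
from urllib.parse import urlparse, parse_qs


def has_pid_param(href):
    """Return True if the query string has a parameter named exactly 'pid' (hypothetical helper)."""
    query = urlparse(href).query
    # keep_blank_values=True so that an empty value ('?pid=') still counts.
    params = parse_qs(query, keep_blank_values=True)
    return 'pid' in params
```

With this, `has_pid_param('/item?x=1&pid=9')` is true even though pid is not the first argument, while `has_pid_param('/page?sid=tyy,4mr&icmpid=7')` is false because the parameter names are compared exactly.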
