I have a problem with a regex. I have four example URLs:

http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo 
http://auto.com/index.php/car-news/11654-battle-royale-2014
http://auto.com/index.php/tv-special-news/10480-new-film-4
http://auto.com/index.php/first/12234-new-volvo-xc60

I would like to exclude URLs that contain 'tv-special-news' or that end with 'photo'.

I've tried:

http://(www.)?auto.com/index.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]{1,}-(?!photo)

but it does not work quite the way I want.

  • I think you can do this without regex: check 'tv-special-news' in url and use .endswith. Commented Aug 1, 2017 at 15:52
  • Unfortunately I need regex :) Commented Aug 1, 2017 at 15:53

4 Answers

http://(www.)?auto.com/index.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]{1,}-(?!photo)

You were close with this. Remove the dash before (?!photo) so that the last path segment does not have to end in a dash, and add $ at the end so that the pattern has to match through to the end of the line.

You also need to change the negative lookahead into a negative lookbehind, (?<!photo), so that the match fails when the end of the line is preceded by photo:

http://(www.)?auto.com/index.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]{1,}(?<!photo)$

You should also escape the dots, since an unescaped . matches any character:

http://(www\.)?auto\.com/index\.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]+(?<!photo)$

Also, the quantifier {1,} is equivalent to +.
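
As a quick check, the final pattern can be run against the four URLs from the question. This is only a sketch in Python: the names PATTERN, URLS and kept are made up for illustration, and it assumes the trailing space after the first URL in the question has already been stripped.

```python
import re

# Final pattern from this answer: dots escaped, {1,} written as +,
# and a negative lookbehind that rejects a line ending in "photo".
PATTERN = re.compile(
    r"http://(www\.)?auto\.com/index\.php/"
    r"(?!(tv-special-news)).*/[a-zA-Z0-9\-]+(?<!photo)$"
)

# The four example URLs from the question (whitespace stripped).
URLS = [
    "http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo",
    "http://auto.com/index.php/car-news/11654-battle-royale-2014",
    "http://auto.com/index.php/tv-special-news/10480-new-film-4",
    "http://auto.com/index.php/first/12234-new-volvo-xc60",
]

# re.match anchors at the start of the string; the $ in the pattern
# takes care of the end.
kept = [url for url in URLS if PATTERN.match(url)]

for url in kept:
    print(url)
```

With these inputs only the battle-royale and volvo URLs should be printed.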

2 Comments

This is incorrect regex. You can see this demo
@anubhava My bad, there was a trailing space on OP’s input which made me miss this. Fixed it now, thanks!

You may use this regex:

^(?!.*-photo$)http://(?:www\.)?auto\.com/index\.php/(?!tv-special-news)[^/]+/[\w-]+-

RegEx Demo 1

  • (?!.*-photo$) is a negative lookahead that fails the match if the URL ends with -photo.
  • (?!tv-special-news) is a negative lookahead that fails the match when tv-special-news appears right after /index.php/.
  • It is better to anchor the regex at the start with ^.

Alternatively, with a lookbehind you can use:

^http://(www\.)?auto\.com/index\.php/(?!tv-special-news).*/[a-zA-Z0-9-]+$(?<!photo)

RegEx Demo 2
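
Both patterns in this answer can be compared side by side with a short Python sketch. The names LOOKAHEAD, LOOKBEHIND, URLS and keep are hypothetical, and stripping whitespace before matching is an assumption made here because the first URL in the question carries a trailing space.

```python
import re

# First pattern: lookahead at the start rejects URLs ending in "-photo".
LOOKAHEAD = re.compile(
    r"^(?!.*-photo$)http://(?:www\.)?auto\.com/index\.php/"
    r"(?!tv-special-news)[^/]+/[\w-]+-"
)

# Second pattern: lookbehind after $ rejects URLs ending in "photo".
LOOKBEHIND = re.compile(
    r"^http://(www\.)?auto\.com/index\.php/"
    r"(?!tv-special-news).*/[a-zA-Z0-9-]+$(?<!photo)"
)

# The four example URLs; the first keeps its trailing space on purpose.
URLS = [
    "http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo ",
    "http://auto.com/index.php/car-news/11654-battle-royale-2014",
    "http://auto.com/index.php/tv-special-news/10480-new-film-4",
    "http://auto.com/index.php/first/12234-new-volvo-xc60",
]


def keep(pattern, urls):
    """Return the URLs the pattern matches, ignoring surrounding whitespace."""
    return [u for u in urls if pattern.search(u.strip())]


kept_lookahead = keep(LOOKAHEAD, URLS)
kept_lookbehind = keep(LOOKBEHIND, URLS)

print(kept_lookahead)
print(kept_lookbehind)
```

The first pattern ends in a literal dash, so it matches only a prefix of each accepted URL (up to the last dash). That is enough for a yes/no filter with search or match, but it would not work with fullmatch.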

You can use this solution:

import re

list_of_urls = ["http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo",....]


# Keep a URL only if it neither ends with "photo" nor contains "tv-special-news".
new_list = [i for i in list_of_urls if not re.search(r"photo$", i) and not re.search(r"tv-special-news", i)]

1 Comment

Thank you, but I need regex

You can store your links in a list and test each one against a regex:

re_pattern = r'\b(?:tv-special-news|photo)\b'

re.findall(re_pattern,link)

(where link is an item from the list)

If the pattern matches, findall returns the matches in a list, so you only have to check whether that list is empty. If it is empty, include the link; otherwise exclude it.

Here is the sample code:

import re

links = ['http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo', 'http://auto.com/index.php/car-news/11654-battle-royale-2014', 'http://auto.com/index.php/tv-special-news/10480-new-film-4', 'http://auto.com/index.php/first/12234-new-volvo-xc60']

new_list = []

re_pattern = r'\b(?:tv-special-news|photo)\b'
for link in links:
    result = re.findall(re_pattern, link)
    if len(result) < 1:
        new_list.append(link)

print(new_list)
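
One caveat: \bphoto\b matches photo anywhere in the URL, not only at the end. If only a trailing photo should exclude a link, as the question asks, a possible variant anchors it with $. The sketch below is an illustration only: the name EXCLUDE and the fifth URL (photo-gallery-2017) are hypothetical and not part of the original question.

```python
import re

# Exclude links that contain "tv-special-news" or end with "photo".
EXCLUDE = re.compile(r"tv-special-news|photo$")

links = [
    "http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo",
    "http://auto.com/index.php/car-news/11654-battle-royale-2014",
    "http://auto.com/index.php/tv-special-news/10480-new-film-4",
    "http://auto.com/index.php/first/12234-new-volvo-xc60",
    # Hypothetical URL with "photo" in the middle; it should be kept.
    "http://auto.com/index.php/car-news/photo-gallery-2017",
]

# search() scans the whole string, so no leading anchor is needed.
new_list = [link for link in links if not EXCLUDE.search(link)]

print(new_list)
```

Here the photo-gallery link survives, whereas the word-boundary pattern would drop it.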
