Python Regex - exclude url containing a word

Question

I have a problem with regex - I have 4 examples of urls:

http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo 
http://auto.com/index.php/car-news/11654-battle-royale-2014
http://auto.com/index.php/tv-special-news/10480-new-film-4
http://auto.com/index.php/first/12234-new-volvo-xc60

I would like to exclude urls with 'tv-special-news' inside or 'photo' at the end.

I've tried:

http://(www.)?auto.com/index.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]{1,}-(?!photo)

but it does not work exactly as I want

I think you can do this without regex: use 'tv-special-news' in url and .endswith.
– Vinícius Figueiredo, Commented Aug 1, 2017 at 15:52
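
The comment's suggestion can be sketched without any regex. This is only a minimal illustration, assuming the four sample URLs from the question; the helper name `keep` is made up for the example and is not from the thread.

```python
urls = [
    "http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo",
    "http://auto.com/index.php/car-news/11654-battle-royale-2014",
    "http://auto.com/index.php/tv-special-news/10480-new-film-4",
    "http://auto.com/index.php/first/12234-new-volvo-xc60",
]


def keep(url):
    """Return True unless the URL contains 'tv-special-news' or ends with 'photo'."""
    url = url.strip()  # guard against stray whitespace around the URL
    return "tv-special-news" not in url and not url.endswith("photo")


kept = [u for u in urls if keep(u)]
print(kept)
```

The `strip()` call is a precaution: a trailing space after a URL would otherwise make `endswith("photo")` return False.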

poke · Accepted Answer · 2017-08-01 17:03:44Z

2

http://(www.)?auto.com/index.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]{1,}-(?!photo)

You were close with this. You just have to remove the dash before the (?!photo) to allow lines to end without a trailing dash and add a $ to the end to make sure that the whole line needs to be matched.

And then you will also have to change the negative lookahead into a negative lookbehind, (?<!photo), so that the line end is not matched when it is preceded by photo.

http://(www.)?auto.com/index.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]{1,}(?<!photo)$

Also, you should escape all dots properly:

http://(www\.)?auto\.com/index\.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]+(?<!photo)$

Also, the quantifier {1,} is equivalent to +.
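
As a quick check, the final pattern can be run against the four sample URLs from the question. This is a sketch that assumes `re.match` is used for filtering; the variable names are chosen for the example.

```python
import re

# Final pattern from this answer: dots escaped, {1,} written as +,
# and a negative lookbehind placed just before the end anchor.
pattern = re.compile(
    r"http://(www\.)?auto\.com/index\.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]+(?<!photo)$"
)

urls = [
    "http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo",
    "http://auto.com/index.php/car-news/11654-battle-royale-2014",
    "http://auto.com/index.php/tv-special-news/10480-new-film-4",
    "http://auto.com/index.php/first/12234-new-volvo-xc60",
]

# re.match anchors at the start of the string; the trailing $ anchors the end.
matched = [u for u in urls if pattern.match(u)]
print(matched)
```

Only the second and fourth URLs pass. A URL followed by a trailing space would not match either, because the space is not in the character class and `$` then cannot be reached.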

edited Aug 1, 2017 at 17:03

answered Aug 1, 2017 at 15:59


2 Comments

anubhava Over a year ago

This is incorrect regex. You can see this demo

poke Over a year ago

@anubhava My bad, there was a trailing space on OP’s input which made me miss this. Fixed it now, thanks!

anubhava · Accepted Answer · 2017-08-01 16:48:23Z

1

You may use this regex:

^(?!.*-photo$)http://(?:www\.)?auto\.com/index\.php/(?!tv-special-news)[^/]+/[\w-]+-

RegEx Demo 1

(?!.*-photo$) is a negative lookahead that fails the match if the URL ends with -photo.
(?!tv-special-news) is a negative lookahead that fails the match when tv-special-news appears right after /index.php/.
It is better to use a start anchor in your regex.
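
A small sketch of how this pattern behaves in Python, assuming the four sample URLs from the question. Note that the pattern ends in a literal dash and has no end anchor, so a successful match covers only a prefix of the URL; that is enough when the pattern is used purely as a filter.

```python
import re

pattern = re.compile(
    r"^(?!.*-photo$)http://(?:www\.)?auto\.com/index\.php/(?!tv-special-news)[^/]+/[\w-]+-"
)

urls = [
    "http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo",
    "http://auto.com/index.php/car-news/11654-battle-royale-2014",
    "http://auto.com/index.php/tv-special-news/10480-new-film-4",
    "http://auto.com/index.php/first/12234-new-volvo-xc60",
]

for url in urls:
    m = pattern.match(url)
    # m.group(0) stops at the last dash the pattern could reach, not at the URL end
    print(url, "->", m.group(0) if m else None)

kept = [u for u in urls if pattern.match(u)]
```

If the full URL is needed from the match object, appending something like `[\w-]*$` would be one option, though that is a variation and not part of the answer.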

Or with lookbehind regex, you can use:

^http://(www\.)?auto\.com/index\.php/(?!tv-special-news).*/[a-zA-Z0-9-]+$(?<!photo)

RegEx Demo 2
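
The lookbehind here sits after `$`, which can look unusual. The sketch below, using the sample URLs from the question and names invented for the example, compares that placement with putting the lookbehind before `$`; both assertions are zero-width and are checked at the same position.

```python
import re

prefix = r"^http://(www\.)?auto\.com/index\.php/(?!tv-special-news).*/[a-zA-Z0-9-]+"
after_anchor = re.compile(prefix + r"$(?<!photo)")   # as written in this answer
before_anchor = re.compile(prefix + r"(?<!photo)$")  # lookbehind first

urls = [
    "http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo",
    "http://auto.com/index.php/car-news/11654-battle-royale-2014",
    "http://auto.com/index.php/tv-special-news/10480-new-film-4",
    "http://auto.com/index.php/first/12234-new-volvo-xc60",
]

kept_after = [u for u in urls if after_anchor.match(u)]
kept_before = [u for u in urls if before_anchor.match(u)]
print(kept_after == kept_before)
```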

edited Aug 1, 2017 at 16:48

answered Aug 1, 2017 at 15:54


Ajax1234 · Accepted Answer · 2017-08-01 15:59:52Z

0

You can use this solution:

import re

list_of_urls = [
    "http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo",
    # ... remaining URLs elided in the original answer
]

# Keep a URL only if it does not end in "photo" and does not contain "tv-special-news".
new_list = [url for url in list_of_urls
            if not re.search(r"photo$", url) and not re.search(r"tv-special-news", url)]

edited Aug 1, 2017 at 15:59

answered Aug 1, 2017 at 15:58


1 Comment

Alek SZ Over a year ago

Thank you, but I need regex

Nikhil Yadav · Accepted Answer · 2017-08-01 16:18:46Z

You can simply store your links in a list and iterate over it, testing each link with a regex:

re_pattern = r'\b(?:tv-special-news|photo)\b'

re.findall(re_pattern,link)

(where link is each item from the list)

If the pattern matches, re.findall returns the matches in a list, so you only have to check whether that list is empty. If it is empty, include the link; otherwise exclude it.

Here is the sample code:

import re

links = ['http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo', 'http://auto.com/index.php/car-news/11654-battle-royale-2014', 'http://auto.com/index.php/tv-special-news/10480-new-film-4', 'http://auto.com/index.php/first/12234-new-volvo-xc60']

new_list = []

re_pattern = r'\b(?:tv-special-news|photo)\b'

for link in links:
    result = re.findall(re_pattern, link)
    # an empty result means neither word occurs in the link
    if len(result) < 1:
        new_list.append(link)

print(new_list)
