
I have a list of URLs (unicode strings) with a lot of repetition. For example, the URLs http://www.myurlnumber1.com and http://www.myurlnumber1.com/foo+%bar%baz%qux lead to the same place.

So I need to weed out all of those duplicates.

My first idea was to check if the element's substring is in the list, like so:

for url in list:
    if url[:30] not in list:
        print(url)

However, `in` tests for an exact match of url[:30] against whole list elements, so it obviously prints all of them, since no element is exactly equal to url[:30].

Is there an easy way to solve this problem?

EDIT:

Often the host and path in the URLs stay the same, but the parameters differ. For my purposes, URLs with the same hostname and path but different parameters are still the same URL and constitute duplicates.

  • Do all the urls have the same length? Commented Sep 27, 2016 at 13:05
  • Could you specify the filtering criteria more precisely? E.g. what output do you expect for the following URLs: "foo.com/bar", "foo.com/bar/boo" and "foo.com/baz"? Commented Sep 27, 2016 at 13:12

2 Answers


If you consider any URLs with the same netloc to be duplicates, you can parse them with urllib.parse:

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

u = "http://www.myurlnumber1.com/foo+%bar%baz%qux"

print(urlparse(u).netloc)

Which would give you:

www.myurlnumber1.com

So to get unique netlocs you could do something like:

unique = {urlparse(u).netloc for u in urls}

If you wanted to keep the url scheme:

urls = ["http://www.myurlnumber1.com/foo+%bar%baz%qux", "http://www.myurlnumber1.com"]

unique = {"{}://{}".format(u.scheme, u.netloc) for u in map(urlparse, urls)}
print(unique)

This presumes they all have schemes, and that you don't have both http and https for the same netloc while considering those to be the same.

If you also want to add the path:

unique = {(u.netloc, u.path) for u in map(urlparse, urls)}

The table of attributes is listed in the docs:

Attribute   Index   Value                                 Value if not present
scheme      0       URL scheme specifier                  scheme parameter
netloc      1       Network location part                 empty string
path        2       Hierarchical path                     empty string
params      3       Parameters for last path element      empty string
query       4       Query component                       empty string
fragment    5       Fragment identifier                   empty string
username            User name                             None
password            Password                              None
hostname            Host name (lower case)                None
port                Port number as integer, if present    None

You just need to use whatever you consider to be the unique parts.

In [1]: from urllib.parse import  urlparse

In [2]: urls = ["http://www.url.com/foo-bar", "http://www.url.com/foo-bar?t=baz", "www.url.com/baz-qux",  "www.url.com/foo-bar?t=baz"]


In [3]: unique = {"".join((u.netloc, u.path)) for u in map(urlparse, urls)}


In [4]: print(unique)
{'www.url.com/baz-qux', 'www.url.com/foo-bar'}
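If you need to keep a full original URL for each unique netloc/path combination (as the dict comprehension suggestion in the comments below implies), you can key a dict the same way; a sketch, assuming you want to keep the first URL seen per key:

```python
from urllib.parse import urlparse

urls = ["http://www.url.com/foo-bar", "http://www.url.com/foo-bar?t=baz",
        "www.url.com/baz-qux", "www.url.com/foo-bar?t=baz"]

unique = {}
for url in urls:
    u = urlparse(url)
    key = "".join((u.netloc, u.path))  # same key as the set example above
    unique.setdefault(key, url)        # keep the first full URL seen per key

print(list(unique.values()))
# ['http://www.url.com/foo-bar', 'www.url.com/baz-qux']
```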

7 Comments

Or make that dict comprehension if he actually needs full URLs.
Thanks, but this doesn't work for me, since sometimes the hostname is the same for multiple urls, it's always the path that is different. I've edited the question to reflect that.
@Zlo, I don't quite understand what you mean, what are you considering to be the same? This works on your input so you need to add more detail.
@PadraicCunningham, www.url.com/foo-bar and www.url.com/foo-bar&t=baz are the same for me, I want to retain only one of those. However, www.url.com/foo-bar and www.url.com/baz-qux are unique and I want to keep both in.
@Zlo, so you want netloc and path? So any query/params are ignored?

You can try adding another for loop, if you are fine with that. Something like:

for url in urls:
    for i in range(len(urls)):
        if url[:30] not in urls[i]:
            print(url)

That will compare every url with every other url to check for sameness. That's just an example, I'm sure you could make it more robust.
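For example, a corrected version of that idea might look like the sketch below. Note the 30-character prefix is an arbitrary cutoff taken from the question, and it can misfire on URLs that only diverge after 30 characters:

```python
urls = ["http://www.myurlnumber1.com",
        "http://www.myurlnumber1.com/foo+%bar%baz%qux",
        "http://www.other.com/page"]

kept = []
for url in urls:
    # Treat url as a duplicate if it shares a 30-character prefix
    # (or one is a prefix of the other) with an already-kept URL.
    if not any(url.startswith(k[:30]) or k.startswith(url[:30]) for k in kept):
        kept.append(url)

print(kept)
# ['http://www.myurlnumber1.com', 'http://www.other.com/page']
```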

2 Comments

Treating URLs as if they are simply strings is often not a good idea. For example, would we, for the purpose of this task, see http://google.com and https://google.com as different or the same?
@Daerdemandt you have a good point. I was pretty sure there was a more robust solution, I just threw out the first thing off the top of my head.
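Following up on that concern, one way to treat http and https as equivalent is to normalize each URL down to its host and path before comparing; a sketch (normalize is a hypothetical helper name):

```python
from urllib.parse import urlparse

def normalize(url):
    # Drop the scheme, query, params and fragment; keep only host + path.
    u = urlparse(url)
    return u.netloc + u.path

print(normalize("http://google.com") == normalize("https://google.com"))
# True
```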
