
I have a list of URLs (unicode strings) with a lot of repetition. For example, the URLs http://www.myurlnumber1.com and http://www.myurlnumber1.com/foo+%bar%baz%qux lead to the same place.

So I need to weed out all of those duplicates.

My first idea was to check if the element's substring is in the list, like so:

for url in list:
    if url[:30] not in list:
        print(url)

However, `in` tests for an exact match of url[:30] against whole list elements, so it obviously prints all of them, since no element is exactly equal to url[:30].

Is there an easy way to solve this problem?

EDIT:

Often the host and path in the URLs stay the same, but the parameters differ. For my purposes, URLs with the same hostname and path but different parameters are still the same URL and constitute duplicates.

  • Do all the urls have the same length? Commented Sep 27, 2016 at 13:05
  • Could you specify the filtering criteria more precisely? E.g. what output do you expect for the following URLs: "foo.com/bar", "foo.com/bar/boo" and "foo.com/baz"? Commented Sep 27, 2016 at 13:12

2 Answers


If you consider any URLs with the same netloc to be duplicates, you can parse them with urllib.parse:

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

u = "http://www.myurlnumber1.com/foo+%bar%baz%qux"

print(urlparse(u).netloc)

Which would give you:

www.myurlnumber1.com

So to get unique netlocs you could do something like:

unique = {urlparse(u).netloc for u in urls}

If you wanted to keep the url scheme:

urls = ["http://www.myurlnumber1.com/foo+%bar%baz%qux", "http://www.myurlnumber1.com"]

unique = {"{}://{}".format(u.scheme, u.netloc) for u in map(urlparse, urls)}
print(unique)

This presumes they all have schemes, and that you don't have both http and https for the same netloc while considering those to be the same.

If you also want to add the path:

unique = {(u.netloc, u.path) for u in map(urlparse, urls)}

The table of attributes is listed in the docs:

Attribute   Index   Value                                 Value if not present
scheme      0       URL scheme specifier                  scheme parameter
netloc      1       Network location part                 empty string
path        2       Hierarchical path                     empty string
params      3       Parameters for last path element      empty string
query       4       Query component                       empty string
fragment    5       Fragment identifier                   empty string
username            User name                             None
password            Password                              None
hostname            Host name (lower case)                None
port                Port number as integer, if present    None

You just need to use whatever you consider to be the unique parts.

In [1]: from urllib.parse import  urlparse

In [2]: urls = ["http://www.url.com/foo-bar", "http://www.url.com/foo-bar?t=baz", "www.url.com/baz-qux",  "www.url.com/foo-bar?t=baz"]


In [3]: unique = {"".join((u.netloc, u.path)) for u in map(urlparse, urls)}


In [4]: print(unique)
{'www.url.com/baz-qux', 'www.url.com/foo-bar'}
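If you need to keep a full original URL for each unique netloc/path combination (as the dict comprehension suggestion in the comments below implies), you can key a dict the same way; a sketch, assuming you want to keep the first URL seen per key:

```python
from urllib.parse import urlparse

urls = ["http://www.url.com/foo-bar", "http://www.url.com/foo-bar?t=baz",
        "www.url.com/baz-qux", "www.url.com/foo-bar?t=baz"]

unique = {}
for url in urls:
    u = urlparse(url)
    key = "".join((u.netloc, u.path))  # same key as the set example above
    unique.setdefault(key, url)        # keep the first full URL seen per key

print(list(unique.values()))
# ['http://www.url.com/foo-bar', 'www.url.com/baz-qux']
```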

7 Comments

Or make that dict comprehension if he actually needs full URLs.
Thanks, but this doesn't work for me, since sometimes the hostname is the same for multiple urls, it's always the path that is different. I've edited the question to reflect that.
@Zlo, I don't quite understand what you mean, what are you considering to be the same? This works on your input so you need to add more detail.
@PadraicCunningham, www.url.com/foo-bar and www.url.com/foo-bar&t=baz are the same for me, I want to retain only one of those. However, www.url.com/foo-bar and www.url.com/baz-qux are unique and I want to keep both in.
@Zlo, so you want netloc and path? So any query/params are ignored?

You can try adding another for loop, if you are fine with that. Something like:

for url in urls:
    for i in range(len(urls)):
        if url[:30] not in urls[i]:
            print(url)

That will compare every url with every other url to check for sameness. That's just an example, I'm sure you could make it more robust.
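For example, a corrected version of that idea might look like the sketch below. Note the 30-character prefix is an arbitrary cutoff taken from the question, and it can misfire on URLs that only diverge after 30 characters:

```python
urls = ["http://www.myurlnumber1.com",
        "http://www.myurlnumber1.com/foo+%bar%baz%qux",
        "http://www.other.com/page"]

kept = []
for url in urls:
    # Treat url as a duplicate if it shares a 30-character prefix
    # (or one is a prefix of the other) with an already-kept URL.
    if not any(url.startswith(k[:30]) or k.startswith(url[:30]) for k in kept):
        kept.append(url)

print(kept)
# ['http://www.myurlnumber1.com', 'http://www.other.com/page']
```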

2 Comments

Treating URLs as if they are simply strings is often not a good idea. For example, would we, for the purpose of this task, see http://google.com and https://google.com as different or the same?
@Daerdemandt you have a good point. I was pretty sure there was a more robust solution, I just threw out the first thing off the top of my head.
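Following up on that concern, one way to treat http and https as equivalent is to normalize each URL down to its host and path before comparing; a sketch (normalize is a hypothetical helper name):

```python
from urllib.parse import urlparse

def normalize(url):
    # Drop the scheme, query, params and fragment; keep only host + path.
    u = urlparse(url)
    return u.netloc + u.path

print(normalize("http://google.com") == normalize("https://google.com"))
# True
```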
