If you consider any netloc's to be the same you can parse with urllib.parse
from urllib.parse import urlparse # python2 from urlparse import urlparse
u = "http://www.myurlnumber1.com/foo+%bar%baz%qux"
print(urlparse(u).netloc)
Which would give you:
www.myurlnumber1.com
So to get unique netlocs you could do something like:
unique = {urlparse(u).netloc for u in urls}
If you wanted to keep the url scheme:
urls = ["http://www.myurlnumber1.com/foo+%bar%baz%qux", "http://www.myurlnumber1.com"]
unique = {"{}://{}".format(u.scheme, u.netloc) for u in map(urlparse, urls)}
print(unique)
Presuming they all have schemes and you don't have http and https for the same netloc and consider them to be the same.
If you also want to add the path:
unique = {u.netloc, u.path) for u in map(urlparse, urls)}
The table of attributes is listed in the docs:
Attribute Index Value Value if not present
scheme 0 URL scheme specifier scheme parameter
netloc 1 Network location part empty string
path 2 Hierarchical path empty string
params 3 Parameters for last path element empty string
query 4 Query component empty string
fragment 5 Fragment identifier empty string
username User name None
password Password None
hostname Host name (lower case) None
port Port number as integer, if present None
You just need to use whatever you consider to be the unique parts.
In [1]: from urllib.parse import urlparse
In [2]: urls = ["http://www.url.com/foo-bar", "http://www.url.com/foo-bar?t=baz", "www.url.com/baz-qux", "www.url.com/foo-bar?t=baz"]
In [3]: unique = {"".join((u.netloc, u.path)) for u in map(urlparse, urls)}
In [4]:
In [4]: print(unique)
{'www.url.com/baz-qux', 'www.url.com/foo-bar'}