I need to parse a large number of URLs to retrieve a GUID using urllib in Python 3. Some URLs contain a fragment, which causes problems when retrieving the parameters.

import urllib.parse

url = "https://zzz.com/index.html#viewer?guid=6a755e6d-4eae&Link=true&psession=true"
parse_response = urllib.parse.urlsplit(url)
print("The parsed url components = {}".format(parse_response))

The parsed url components = SplitResult(scheme='https', netloc='zzz.com', path='/index.html', query='', fragment='viewer?guid=6a755e6d-4eae&Link=true&psession=true')

So urllib rightly sees the "#" and stores the rest of the URL as the fragment, which leaves the query empty, so the parameters are not returned. What is the best way to process URLs with and without fragments?

1 Answer

from urllib.parse import urlparse, urldefrag, parse_qs

url = "https://abc.xyz.com/url/with/fragment/query#param1=val1&param2=val2&param3=val3"

# Option 1: urlparse exposes the fragment as an attribute.
print(urlparse(url))
print(urlparse(url).fragment)

# parse_qs also accepts a fragment that is formatted like a query string.
pq = parse_qs(urlparse(url).fragment)
print(pq)
print(type(pq))
print("Using urlparse {}".format(pq["param1"][0]))

# Option 2: urldefrag splits the URL into (url, fragment).
f = urldefrag(url).fragment
print(type(f))
print(f)

pq = parse_qs(f)
print(pq)
print(type(pq))
print("Using urldefrag {}".format(pq["param1"][0]))
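For the URL in the question, the fragment is not a bare parameter list but has the form viewer?guid=..., so the text before the "?" would need to be dropped first. The following is only a sketch of one way to handle URLs with and without fragments; the helper names extract_params and extract_guid are made up for illustration, and it assumes that anything after the first "?" in a fragment is formatted like a query string.

```python
from urllib.parse import urlsplit, parse_qs


def extract_params(url):
    """Collect parameters from the real query and from the fragment."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)

    # Assumption: a fragment like "viewer?guid=..." carries a pseudo-query
    # after "?", while a fragment like "a=1&b=2" is the parameter list itself.
    fragment = parts.fragment
    if "?" in fragment:
        fragment = fragment.split("?", 1)[1]
    for key, values in parse_qs(fragment).items():
        params.setdefault(key, []).extend(values)
    return params


def extract_guid(url):
    """Return the first guid value, or None if the URL has none."""
    return extract_params(url).get("guid", [None])[0]


with_fragment = "https://zzz.com/index.html#viewer?guid=6a755e6d-4eae&Link=true&psession=true"
without_fragment = "https://zzz.com/index.html?guid=6a755e6d-4eae&Link=true"
print(extract_guid(with_fragment))     # 6a755e6d-4eae
print(extract_guid(without_fragment))  # 6a755e6d-4eae
```

If the same key appears in both the query and the fragment, this sketch keeps both values, with the query value first.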
1 Comment

According to RFC 3986, the fragment starts after the # character, so in your example, param1=val1&param2=val2&param3=val3 is the fragment. parse_qs and parse_qsl were made to parse the query, not the fragment. Your code works for a URI like http://localhost#param1=value1&param2=value1, but fails with http://localhost#param1&value1, because parse_qs drops fields that have no "=" by default and the key lookup then raises a KeyError.
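A small sketch of that failure case, using only the standard parse_qs options keep_blank_values and strict_parsing; the variable names are just for illustration.

```python
from urllib.parse import urlsplit, parse_qs

fragment = urlsplit("http://localhost#param1&value1").fragment

# By default, fields without "=" are silently dropped.
default = parse_qs(fragment)
print(default)      # {}

# keep_blank_values=True keeps them as keys with empty string values.
with_blanks = parse_qs(fragment, keep_blank_values=True)
print(with_blanks)  # {'param1': [''], 'value1': ['']}

# strict_parsing=True raises ValueError instead of ignoring the fields.
try:
    parse_qs(fragment, strict_parsing=True)
    strict_failed = False
except ValueError as exc:
    strict_failed = True
    print("strict parsing rejected the fragment:", exc)
```

Using pq.get("param1") rather than pq["param1"] avoids the KeyError when a fragment is not in key=value form.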