I have a dataset with a URL field giving the location of each resource. Some URLs are persistent identifiers (e.g. handles and DOIs) and therefore need to be resolved to their original URL. I am working primarily in Python, and the solution that seems to work so far uses the Requests HTTP library.

import requests
var_output_url = requests.get("http://hdl.handle.net/10179/619")
var_output_url.url

While this solution works, it is quite slow as I have to loop through ~4,000 files, each with around 2,000 URLs. Is there a more efficient way of resolving the URL redirects?

I tested my current solution on one batch and it took almost 5 minutes; at this rate, processing all the batches will take about 13 days [...] I know it will not necessarily take that long, and I can run the batches in parallel.

  • If you need the content of each URL, then trying to optimize the network time spent on URL redirection in the request is the wrong approach. Commented Apr 21, 2019 at 23:17
  • I actually do not need the content. I need to resolve to the original URL because I want to determine the domain the content is from. Commented Apr 21, 2019 at 23:35
  • OK, then my answer is below. Commented Apr 21, 2019 at 23:45

1 Answer

Using HEAD instead of GET should return only the headers and not the resource body, which in your example is an HTML page. If you only need to resolve URL redirects, this should noticeably reduce the time spent transferring data over the network. Pass allow_redirects=True so that the redirects are followed; unlike requests.get, requests.head does not follow them by default.

var_output_url = requests.head("http://hdl.handle.net/10179/619", allow_redirects=True)
var_output_url.url
>>> 'https://mro.massey.ac.nz/handle/10179/619'
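
Since the goal stated in the comments is the domain rather than the full URL, here is a minimal sketch of how the HEAD call could be wrapped to return just the host. The function name final_domain and the timeout value are my own assumptions, not part of the original answer; session is assumed to be a requests.Session, which should also let repeated calls to the same resolver reuse one connection.

```python
from urllib.parse import urlparse


def final_domain(session, url, timeout=10):
    """Return the host of the URL that `url` finally redirects to.

    `session` is assumed to be a requests.Session (any object with a
    compatible .head() method works). `timeout` is an illustrative value.
    """
    # HEAD skips the response body. requests does not follow redirects
    # for HEAD by default, so allow_redirects=True is required.
    response = session.head(url, allow_redirects=True, timeout=timeout)
    # response.url is the final URL after all redirects were followed.
    return urlparse(response.url).netloc
```

It could be called as final_domain(requests.Session(), "http://hdl.handle.net/10179/619"), which for the redirect shown above would be expected to give the mro.massey.ac.nz host. Error handling is left out here; a dead link or a timeout will raise an exception.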

1 Comment

Thanks for this. I ran a very basic benchmark, and HEAD takes 40% less time than GET: with the same payload, GET takes 20 minutes while HEAD takes 12 minutes. This will certainly help.
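
Because the work is network-bound, running the lookups in a thread pool may cut the wall-clock time further, as the question already hints at with running in parallel. Below is a rough sketch under my own assumptions: the names resolve_many and head_resolve, the worker count, and the timeout are all hypothetical, and I have not benchmarked it against the numbers above.

```python
from concurrent.futures import ThreadPoolExecutor


def resolve_many(urls, resolve_one, max_workers=16):
    """Apply resolve_one to every URL concurrently, keeping input order.

    A failed lookup is recorded as None so that one bad URL does not
    stop the whole batch.
    """
    def safe(url):
        try:
            return resolve_one(url)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return list(pool.map(safe, urls))


def head_resolve(url, timeout=10):
    """Hypothetical resolver: follow redirects with HEAD, return the final URL."""
    import requests  # third-party; imported here so resolve_many stays stdlib-only

    return requests.head(url, allow_redirects=True, timeout=timeout).url
```

A batch would then be processed with resolve_many(batch_urls, head_resolve). Keep max_workers modest, since the resolver services may throttle or block clients that send too many requests at once.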
