I am new to Scrapy and I've come across a complicated case.

My problem is that I sometimes have links like https://sitename.com/path2/?param1=value1&param2=value2. The query string is not important to me, and I want to drop it from the requests.
I mean this part of the URL: ?param1=value1&param2=value2

After a day of research, I realized that this should be done in a downloader middleware in the middlewares.py file, because requests and responses in Scrapy pass through it.
I tried to write code so that the requests and responses have no query string, but I did not succeed.
My code does not drop requests that include a query string.
middlewares.py:

from w3lib.url import url_query_cleaner

class CleanUrlAgentDownloaderMiddleware:

    def process_response(self, request, response, spider):
        url_query_cleaner(response.url)
        return response

    def process_request(self, request, spider):
        url_query_cleaner(request.url)

How can I drop these requests, using the w3lib.url library or plain Python, so that they never enter Scrapy?
Just to let you know, I have enabled my class in settings.py.

1 Answer

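Strings are immutable, so url_query_cleaner returns a new string instead of modifying request.url, and your code discards that return value. As a result, nothing in the request changes. For your code to work, you have to return a copy of the request that uses the cleaned URL: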

from w3lib.url import url_query_cleaner

class CleanUrlAgentDownloaderMiddleware:
    # No need for process response since it will have the same 
    # url as the request

    def process_request(self, request, spider):
        if "?" in request.url:
            return request.replace(url=url_query_cleaner(request.url))
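
To see why the original middleware had no effect, here is a minimal sketch that uses only the standard library. The helper name strip_query is hypothetical and is a stand-in for url_query_cleaner, not the w3lib implementation. It shows that the cleaned URL only exists in the return value:

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return a new URL string without the query string and fragment.

    Hypothetical helper for illustration; it is not part of w3lib or Scrapy.
    """
    parts = urlsplit(url)
    # Rebuild the URL with an empty query and an empty fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


original = "https://sitename.com/path2/?param1=value1&param2=value2"

# Calling the function and ignoring the result changes nothing,
# which is what the middleware in the question does.
strip_query(original)
print(original)  # still contains the query string

# The cleaned URL has to be captured and used.
cleaned = strip_query(original)
print(cleaned)
```

In a middleware, the captured value is what you would pass to request.replace(url=...), as in the answer's code above.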

Alternatively, if you want to ignore requests that have a query string in their URL, you can raise IgnoreRequest:

from scrapy.exceptions import IgnoreRequest
from urllib.parse import urlparse

class IgnoreQueryRequestMiddleware:
    def process_request(self, request, spider):
        if urlparse(request.url).query:
            raise IgnoreRequest
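
Note that the two middlewares use different checks, and they disagree on a URL that ends with a bare "?". The sketch below compares them outside Scrapy; the function names and the example URLs are assumptions made for illustration:

```python
from urllib.parse import urlparse


def contains_question_mark(url: str) -> bool:
    """The check used by the first middleware: a plain substring test."""
    return "?" in url


def has_query(url: str) -> bool:
    """The check used by the second middleware: a non-empty parsed query."""
    return bool(urlparse(url).query)


urls = [
    "https://sitename.com/path2/?param1=value1&param2=value2",
    "https://sitename.com/path2/",
    # Trailing "?" with nothing after it: the parsed query is empty.
    "https://example.com/why-should-we-drink-water?",
]

for url in urls:
    print(url, contains_question_mark(url), has_query(url))
```

With these checks, a URL with a trailing "?" would be rewritten by the first middleware but would not be ignored by the second one.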

15 Comments

@Sardar Sorry, I forgot about that. I updated the answer.
If you want to ignore requests that have a query string, you can raise IgnoreRequest in the middleware: docs.scrapy.org/en/latest/topics/…
I added an example to the answer :)
Yes, you can use urllib.parse.urlparse, but I don't know if it's the best tool for this. I will edit the post to add it.
I think you can do something like this, but I wouldn't recommend it:
>>> url = "example.com/why-should-we-drink-water?".strip("?")
>>> len(url) != len(url_query_cleaner(url))