Drop requests that include query string in Scrapy

Question

I am new to Scrapy and have run into a tricky case.

My problem is that I sometimes get links like https://sitename.com/path2/?param1=value1&param2=value2. The query string does not matter to me, so I want to drop it from the requests.
I mean this part of the URL: ?param1=value1&param2=value2

After a day of research, I realized that this should be done in the middlewares.py file, in a downloader middleware (Source), because requests and responses in Scrapy pass through it.
I tried to write code that strips the query string from requests and responses, but it did not work.
My code does not drop requests that include a query string.
middlewares.py:

from w3lib.url import url_query_cleaner

class CleanUrlAgentDownloaderMiddleware:

    def process_response(self, request, response, spider):
        url_query_cleaner(response.url)
        return response

    def process_request(self, request, spider):
        url_query_cleaner(request.url)

How can I drop these requests, using the w3lib.url library or plain Python, so that they never enter Scrapy?
For reference, I have enabled my class in settings.py.

zaki98 · Accepted Answer · 2022-08-18 13:11:54Z

3

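Strings are immutable, so url_query_cleaner returns a new string instead of modifying the URL in place. Your code discards that return value, so nothing in the request changes. For your code to work, you have to return a new request built from the cleaned URL: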

from w3lib.url import url_query_cleaner

class CleanUrlAgentDownloaderMiddleware:
    # No need for process_response, since the response will have the same
    # URL as the request

    def process_request(self, request, spider):
        if "?" in request.url:
            return request.replace(url=url_query_cleaner(request.url))
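
To illustrate why the original middleware had no effect, here is a minimal sketch that uses only the standard library. The helper `strip_query` is hypothetical: it stands in for `url_query_cleaner` so the snippet runs without w3lib, and it is not part of Scrapy or w3lib.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Hypothetical stand-in for url_query_cleaner: drop query string and fragment."""
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path, "", ""))


url = "https://sitename.com/path2/?param1=value1&param2=value2"

# Calling the function and discarding the result leaves `url` untouched,
# which is what the original middleware did.
strip_query(url)
unchanged = url

# The cleaned URL only exists in the return value, so it has to be kept.
cleaned = strip_query(url)

print(unchanged)
print(cleaned)
```

In the middleware, the same idea applies one level up: the cleaned string has to be handed to `request.replace(url=...)`, and the resulting request has to be returned.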

Alternatively, if you want to ignore requests that have a query string in their URL, you can do:

from scrapy.exceptions import IgnoreRequest
from urllib.parse import urlparse

class IgnoreQueryRequestMiddleware:
    def process_request(self, request, spider):
        if urlparse(request.url).query:
            raise IgnoreRequest
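
The check used in that middleware can be tried outside Scrapy. The sketch below is only an illustration: `has_query` is a hypothetical helper name, and the sample URLs are made up from the one in the question.

```python
from urllib.parse import urlparse


def has_query(url: str) -> bool:
    """Return True when the URL carries a non-empty query string."""
    return bool(urlparse(url).query)


samples = [
    "https://sitename.com/path2/?param1=value1&param2=value2",
    "https://sitename.com/path2/",
    "https://sitename.com/path2/?",  # bare "?" with an empty query
]

# Map each sample URL to whether the middleware would ignore it.
results = {sample: has_query(sample) for sample in samples}

for sample, ignored in results.items():
    print(sample, "->", "ignored" if ignored else "kept")
```

Note that a URL ending in a bare `?` has an empty query, so this check keeps it, whereas a plain `"?" in url` test would treat it as having a query.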

edited Aug 18, 2022 at 13:11

answered Aug 17, 2022 at 16:42

zaki98


15 Comments

zaki98 Over a year ago

@Sardar Sorry, I forgot about that. I updated the answer.

zaki98 Over a year ago

If you want to ignore requests that have a query, you can raise IgnoreRequest in the middleware: docs.scrapy.org/en/latest/topics/…

zaki98 Over a year ago

I added an example to the answer :)

zaki98 Over a year ago

Yes, you can use urllib.parse.urlparse, but I don't know if it's the best tool for this. I will edit the post to add it.

zaki98 Over a year ago

I think you can do something like this, but I wouldn't recommend it: >>> url = "example.com/why-should-we-drink-water?".strip("?") >>> len(url) != len(url_query_cleaner(url))
