I am fairly new to scraping and have been trying to download .csv files directly from a website. I managed to fix my last issue with the edit below; however, I now get a new error when trying to download the files. The error is:

    raise ValueError(f'Missing scheme in request url: {self._url}')
ValueError: Missing scheme in request url: h

I am not sure what is triggering this error, because the links follow correctly to the next callback.

For example, here is what I have tried:

import scrapy
from nhs.items import DownfilesItem

class NhsScapeSpider(scrapy.Spider):
    name = 'nhs_scape'
    #allowed_domains = ['nh']
    start_urls = ['https://www.england.nhs.uk/statistics/statistical-work-areas/ae-waiting-times-and-activity/ae-attendances-and-emergency-admissions-2021-22/']

    custom_settings = {
        'USER_AGENT':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url = url,
                callback = self.parse
            )

    def parse(self, response):
        side_panel = response.xpath("//aside[@class='subnav group minimal_nav desktop-only']//ul[@class='children']//li")
        for years in side_panel:
            year_links = years.xpath('.//a/@href').get()
            yield response.follow(year_links, callback = self.download_files)


    def download_files(self, response):
        test_files = response.xpath("//article[@class='rich-text']//p")
        month_files = response.xpath("//article[@class='rich-text']//h3")
        
        for files, mn in zip(test_files, month_files):
            all_files = files.xpath('.//a//@href').getall()
            all_file_names = files.xpath('.//a//text()').getall()
            month_year = mn.xpath('.//text()').get()

            for ind_files,ind_text in zip(all_files, all_file_names):
                item = DownfilesItem()

                if '.xls' in ind_files and 'Monthly' in ind_text:
                    item['file_urls'] = ind_files
                    item['original_file_name'] = ind_text
                    yield item

                elif '.xls' in ind_files and 'Week' in ind_text:
                    item['file_urls'] = ind_files
                    item['original_file_name'] = ind_text
                    yield item

Items.py:

import scrapy
class DownfilesItem(scrapy.Item):
    
    # define the fields for your item here like:
    file_urls = scrapy.Field()
    original_file_name = scrapy.Field()

Pipelines.py:

from scrapy.pipelines.files import FilesPipeline
class DownfilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        file_name: str = request.url.split("/")[1]
        return file_name

Settings.py:

ITEM_PIPELINES = {'nhs.pipelines.DownfilesPipeline': 150}
FILES_STORE = "Files"

Updated error after @supersuers answer:

IsADirectoryError: [Errno 21] Is a directory: 'Files/'

It seems this is caused by FILES_STORE = "Files"; however, when I remove this I do not get an error, but no files are downloaded either.

1 Answer

item['file_urls'] should be a list. When it is a plain string, the FilesPipeline iterates over it character by character, which is why the error message shows only the first character, h:

if '.xls' in ind_files and 'Monthly' in ind_text:
    item['file_urls'] = [ind_files]
    item['original_file_name'] = ind_text
    yield item

elif '.xls' in ind_files and 'Week' in ind_text:
    item['file_urls'] = [ind_files]
    item['original_file_name'] = ind_text
    yield item
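A minimal sketch of why the error reports a URL of just h (this mimics the pipeline's iteration over file_urls; the URL below is a placeholder, not one of the NHS links):

```python
url = "https://www.example.com/report.xls"

# FilesPipeline effectively does: for u in item['file_urls']: build a Request(u)
# If file_urls is a bare string, iterating it yields single characters:
first = next(iter(url))
print(first)          # 'h' -> "Missing scheme in request url: h"

# Wrapping the string in a list yields the whole URL instead:
first = next(iter([url]))
print(first)          # 'https://www.example.com/report.xls'
```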

Edit:

The second error is caused by the pipeline: file_name is an empty string, because splitting an https:// URL on "/" puts an empty string at index 1 (between the two slashes), so the storage path becomes the bare directory 'Files/'. You can change it, for example, to:

file_name: str = request.url.split("/")[-1]
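To see why index 1 is empty while index -1 gives the file name (the URL below is an illustrative placeholder):

```python
url = "https://www.england.nhs.uk/statistics/report.xls"
parts = url.split("/")
# parts == ['https:', '', 'www.england.nhs.uk', 'statistics', 'report.xls']

print(parts[1])   # '' -- the empty segment between the two slashes of https://
print(parts[-1])  # 'report.xls' -- the last path segment, i.e. the file name
```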

Edit 2:

I think the problem is in the XPath selectors. Try this and tweak it to your needs:

import scrapy
from tempbuffer.items import DownfilesItem


class NhsScapeSpider(scrapy.Spider):
    name = 'nhs_scape'
    #allowed_domains = ['nh']
    start_urls = ['https://www.england.nhs.uk/statistics/statistical-work-areas/ae-waiting-times-and-activity/ae-attendances-and-emergency-admissions-2021-22/']

    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response):
        side_panel = response.xpath("//aside[@class='subnav group minimal_nav desktop-only']//ul[@class='children']//li")
        for years in side_panel:
            year_links = years.xpath('.//a/@href').get()
            yield response.follow(year_links, callback=self.download_files)

    def download_files(self, response):
        # test_files = response.xpath("//article[@class='rich-text']//p")
        test_files = response.xpath("//article[@class='rich-text']//p[a[contains(@href, '.xls')]]")
        # month_files = response.xpath("//article[@class='rich-text']//h3")
        # couldn't make a prettier xpath selector
        month_files = response.xpath("//article[@class='rich-text']//h3[starts-with(text(), 'January') or starts-with(text(), 'February') or starts-with(text(), 'March') or starts-with(text(), 'April') or starts-with(text(), 'May') or starts-with(text(), 'June') or starts-with(text(), 'July') or starts-with(text(), 'August') or starts-with(text(), 'September') or starts-with(text(), 'October') or starts-with(text(), 'November') or starts-with(text(), 'December')]")

        for files, mn in zip(test_files, month_files):
            all_files = files.xpath('.//a//@href').getall()
            all_file_names = files.xpath('.//a//text()').getall()
            month_year = mn.xpath('.//text()').get()

            for ind_files, ind_text in zip(all_files, all_file_names):
                item = DownfilesItem()

                if '.xls' in ind_files and 'Monthly' in ind_text:
                    item['file_urls'] = [ind_files]
                    item['original_file_name'] = ind_text
                    yield item

                elif '.xls' in ind_files and 'Week' in ind_text:
                    item['file_urls'] = [ind_files]
                    item['original_file_name'] = ind_text
                    yield item
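As a possible tidier alternative to the long h3 predicate above (a sketch only, not tested against the live page), the month conditions could be generated in Python before passing the expression to response.xpath:

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Build "starts-with(text(), 'January') or starts-with(text(), 'February') or ..."
predicate = " or ".join(f"starts-with(text(), '{m}')" for m in MONTHS)
month_xpath = f"//article[@class='rich-text']//h3[{predicate}]"

print(month_xpath)
```

This produces the same selector as the hand-written version, so the scraping behaviour is unchanged; it just keeps the spider source readable.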

4 Comments

Ah I see, this stops the error above; however, I now get a new error which prevents files from being downloaded. I have updated the post with the error.
@Emil11 see the edit
Ah splendid! This all works, although I am facing an issue with the files collected. I seem to only be collecting files up to 2015; anything later is not downloaded. I checked year_links, and it goes to each year, and the XPath //article[@class='rich-text']//p does collect all the info needed, but those pages seem to be missing from the download. Might you have a suggestion for this? It seems the Monthly files won't download but the Week files will.
@Emil11 see edit2
