I am fairly new to scraping and have been trying to download .csv files directly from a website. I managed to fix my last issue with the edit below; however, I now get a new error when trying to download the files. The error is:

    raise ValueError(f'Missing scheme in request url: {self._url}')
ValueError: Missing scheme in request url: h

I am not sure what is triggering this error, because the links follow correctly to the next callback.

For example, here is what I have tried:

import scrapy
from nhs.items import DownfilesItem

class NhsScapeSpider(scrapy.Spider):
    name = 'nhs_scape'
    #allowed_domains = ['nh']
    start_urls = ['https://www.england.nhs.uk/statistics/statistical-work-areas/ae-waiting-times-and-activity/ae-attendances-and-emergency-admissions-2021-22/']

    custom_settings = {
        'USER_AGENT':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url = url,
                callback = self.parse
            )

    def parse(self, response):
        side_panel = response.xpath("//aside[@class='subnav group minimal_nav desktop-only']//ul[@class='children']//li")
        for years in side_panel:
            year_links = years.xpath('.//a/@href').get()
            yield response.follow(year_links, callback = self.download_files)


    def download_files(self, response):
        test_files = response.xpath("//article[@class='rich-text']//p")
        month_files = response.xpath("//article[@class='rich-text']//h3")
        
        for files, mn in zip(test_files, month_files):
            all_files = files.xpath('.//a//@href').getall()
            all_file_names = files.xpath('.//a//text()').getall()
            month_year = mn.xpath('.//text()').get()

            for ind_files,ind_text in zip(all_files, all_file_names):
                item = DownfilesItem()

                if '.xls' in ind_files and 'Monthly' in ind_text:
                    item['file_urls'] = ind_files
                    item['original_file_name'] = ind_text
                    yield item

                elif '.xls' in ind_files and 'Week' in ind_text:
                    item['file_urls'] = ind_files
                    item['original_file_name'] = ind_text
                    yield item

Items.py:

import scrapy
class DownfilesItem(scrapy.Item):
    
    # define the fields for your item here like:
    file_urls = scrapy.Field()
    original_file_name = scrapy.Field()

Pipelines.py:

from scrapy.pipelines.files import FilesPipeline
class DownfilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        file_name: str = request.url.split("/")[1]
        return file_name

Settings.py:

ITEM_PIPELINES = {'nhs.pipelines.DownfilesPipeline': 150}
FILES_STORE = "Files"

Updated error after @supersuers answer:

IsADirectoryError: [Errno 21] Is a directory: 'Files/'

It seems this is caused by FILES_STORE = "Files"; however, when I remove this I do not get an error, but no files are downloaded either.

1 Answer

item['file_urls'] should be a list. When it is a plain string, the FilesPipeline iterates over it character by character, which is why the error message shows only the first character, h:

if '.xls' in ind_files and 'Monthly' in ind_text:
    item['file_urls'] = [ind_files]
    item['original_file_name'] = ind_text
    yield item

elif '.xls' in ind_files and 'Week' in ind_text:
    item['file_urls'] = [ind_files]
    item['original_file_name'] = ind_text
    yield item
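A minimal sketch of why the error reports a URL of just h (this mimics the pipeline's iteration over file_urls; the URL below is a placeholder, not one of the NHS links):

```python
url = "https://www.example.com/report.xls"

# FilesPipeline effectively does: for u in item['file_urls']: build a Request(u)
# If file_urls is a bare string, iterating it yields single characters:
first = next(iter(url))
print(first)          # 'h' -> "Missing scheme in request url: h"

# Wrapping the string in a list yields the whole URL instead:
first = next(iter([url]))
print(first)          # 'https://www.example.com/report.xls'
```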

Edit:

The second error is caused by the pipeline: file_name is an empty string, because splitting an https:// URL on "/" puts an empty string at index 1 (between the two slashes), so the storage path becomes the bare directory 'Files/'. You can change it, for example, to:

file_name: str = request.url.split("/")[-1]
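To see why index 1 is empty while index -1 gives the file name (the URL below is an illustrative placeholder):

```python
url = "https://www.england.nhs.uk/statistics/report.xls"
parts = url.split("/")
# parts == ['https:', '', 'www.england.nhs.uk', 'statistics', 'report.xls']

print(parts[1])   # '' -- the empty segment between the two slashes of https://
print(parts[-1])  # 'report.xls' -- the last path segment, i.e. the file name
```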

Edit 2:

I think the problem is in the XPath selectors. Try this and tweak it to your needs:

import scrapy
from tempbuffer.items import DownfilesItem


class NhsScapeSpider(scrapy.Spider):
    name = 'nhs_scape'
    #allowed_domains = ['nh']
    start_urls = ['https://www.england.nhs.uk/statistics/statistical-work-areas/ae-waiting-times-and-activity/ae-attendances-and-emergency-admissions-2021-22/']

    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response):
        side_panel = response.xpath("//aside[@class='subnav group minimal_nav desktop-only']//ul[@class='children']//li")
        for years in side_panel:
            year_links = years.xpath('.//a/@href').get()
            yield response.follow(year_links, callback=self.download_files)

    def download_files(self, response):
        # test_files = response.xpath("//article[@class='rich-text']//p")
        test_files = response.xpath("//article[@class='rich-text']//p[a[contains(@href, '.xls')]]")
        # month_files = response.xpath("//article[@class='rich-text']//h3")
        # couldn't make a prettier xpath selector
        month_files = response.xpath("//article[@class='rich-text']//h3[starts-with(text(), 'January') or starts-with(text(), 'February') or starts-with(text(), 'March') or starts-with(text(), 'April') or starts-with(text(), 'May') or starts-with(text(), 'June') or starts-with(text(), 'July') or starts-with(text(), 'August') or starts-with(text(), 'September') or starts-with(text(), 'October') or starts-with(text(), 'November') or starts-with(text(), 'December')]")

        for files, mn in zip(test_files, month_files):
            all_files = files.xpath('.//a//@href').getall()
            all_file_names = files.xpath('.//a//text()').getall()
            month_year = mn.xpath('.//text()').get()

            for ind_files, ind_text in zip(all_files, all_file_names):
                item = DownfilesItem()

                if '.xls' in ind_files and 'Monthly' in ind_text:
                    item['file_urls'] = [ind_files]
                    item['original_file_name'] = ind_text
                    yield item

                elif '.xls' in ind_files and 'Week' in ind_text:
                    item['file_urls'] = [ind_files]
                    item['original_file_name'] = ind_text
                    yield item
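As a possible tidier alternative to the long h3 predicate above (a sketch only, not tested against the live page), the month conditions could be generated in Python before passing the expression to response.xpath:

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Build "starts-with(text(), 'January') or starts-with(text(), 'February') or ..."
predicate = " or ".join(f"starts-with(text(), '{m}')" for m in MONTHS)
month_xpath = f"//article[@class='rich-text']//h3[{predicate}]"

print(month_xpath)
```

This produces the same selector as the hand-written version, so the scraping behaviour is unchanged; it just keeps the spider source readable.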

4 Comments

Ah I see, this stops the error above; however, I now get a new error which prevents files from being downloaded. I have updated the post with the error.
@Emil11 see the edit
Ah splendid! This all works, although I am facing an issue with the files collected. I seem to only be collecting files up to 2015; anything later is not downloaded. I checked year_links, and it goes to each year, and the XPath //article[@class='rich-text']//p does collect all the info needed, but those pages seem to be missing from the download. Might you have a suggestion for this? It seems the Monthly files won't download but the Week files will.
@Emil11 see edit2
