
I can't seem to figure out how to construct this XPath selector. I have even tried using following-sibling::text(), but to no avail. I have also browsed Stack Overflow questions on scraping listed values but could not implement them correctly. I keep getting blank results. Any and all help would be appreciated. Thank you.

The website is https://www.unegui.mn/adv/5737502_10-r-khoroolold-1-oroo/.

Expected Results:

Wood

2015

Current Results:

blank

Current XPath Scrapy code:

list_li = response.xpath(".//ul[contains(@class, 'chars-column')]/li/text()").extract()

list_li = response.xpath("./ul[contains(@class,'value-chars')]//text()").extract()

floor_type = list_li[0].strip()
commission_year = list_li[1].strip()

HTML Snippet:

<div class="announcement-characteristics clearfix">
  <ul class="chars-column">
    <li class="">
      <span class="key-chars">Flooring:</span>
      <span class="value-chars">Wood</span></li>
    <li class="">
      <span class="key-chars">Commission year:</span>
      <a href="https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/1-r/ashon_min---2011/"class="value-chars">2015</a>
    </li>
  </ul>
</div>

FURTHER CLARIFICATION: I previously did two selectors (one for the span list, one for the href list), but the problem is that some pages on the website don't follow the same span-list/href-list order (i.e. on one page a table value would be in the span list, but on another page it would be in the href list). That is why I have been trying to use only one selector to get all the values (a sketch of one such approach follows the selectors below).

This results in misaligned values. Instead of the number of windows (an integer) being scraped, the address is scraped, because on some pages that table value sits under the href list rather than the span list.

Previous 2 selectors:

list_span = response.xpath(".//span[contains(@class,'value-chars')]//text()").extract()

list_a = response.xpath(".//a[contains(@class,'value-chars')]//text()").extract()
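
For illustration, a minimal sketch of an order-independent approach: pair each key with whatever value-chars element (span or a) sits in the same li, so position no longer matters. This assumes the HTML parser recovers the li structure despite the invalid markup:

details = {}
for li in response.xpath("//ul[contains(@class,'chars-column')]/li"):
    key = li.xpath(".//*[contains(@class,'key-chars')]//text()").get(default='').strip()
    value = li.xpath(".//*[contains(@class,'value-chars')]//text()").get(default='').strip()
    if key:
        details[key] = value

# details['Flooring:'] -> 'Wood', details['Commission year:'] -> '2015'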

Whole code (in case someone needs it for testing):

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from datetime import datetime
from scrapy.crawler import CrawlerProcess
from selenium import webdriver


dt_today = datetime.now().strftime('%Y%m%d')
filename = dt_today + ' UB HPI Buying Data'

# create Spider class
class UneguiApartmentsSpider(scrapy.Spider):
    name = "unegui_apts"
    allowed_domains = ["www.unegui.mn"]
    custom_settings = {
        "FEEDS": {
            f'{filename}.csv': {
                'format': 'csv',
                'overwrite': True}}
    }

    # function used for start url
    def start_requests(self):
        urls = ['https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/ulan-bator/']
        for url in urls:
            yield Request(url, self.parse)

    def parse(self, response, **kwargs):
        cards = response.xpath("//li[contains(@class,'announcement-container')]")

        # parse details
        for card in cards:
            name = card.xpath(".//a[@itemprop='name']/@content").extract_first().strip()
            price = card.xpath(".//*[@itemprop='price']/@content").extract_first().strip()
            rooms = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__breadcrumbs')]/span[2]/text())").extract_first().strip()
            link = card.xpath(".//a[@itemprop='url']/@href").extract_first().strip()
            date_block = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/text())").extract_first().split(',')
            date = date_block[0].strip()
            city = date_block[1].strip()

            item = {'name': name,
                    'date': date,
                    'rooms': rooms,
                    'price': price,
                    'city': city,
                    }
            # follow absolute link to scrape deeper level
            yield response.follow(link, callback=self.parse_item, meta={'item': item})

        # handling pagination
        next_page = response.xpath("//a[contains(@class,'number-list-next js-page-filter number-list-line')]/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
            print(f'Scraped {next_page}')

    def parse_item(self, response):
        # retrieve previously scraped item between callbacks
        item = response.meta['item']

        # parse additional details
        list_li = response.xpath(".//*[contains(@class, 'value-chars')]/text()").extract()

        # get additional details from the combined list of value-chars elements (spans and links), by position
        floor_type = list_li[0].strip()
        num_balcony = list_li[1].strip()
        commission_year = list_li[2].strip()
        garage = list_li[3].strip()
        window_type = list_li[4].strip()
        num_floors = list_li[5].strip()
        door_type = list_li[6].strip()
        area_sqm = list_li[7].strip()
        floor = list_li[8].strip()
        leasing = list_li[9].strip()
        district = list_li[10].strip()
        num_window = list_li[11].strip()
        address = list_li[12].strip()

        #list_span = response.xpath(".//span[contains(@class,'value-chars')]//text()").extract()
        #list_a = response.xpath(".//a[contains(@class,'value-chars')]//text()").extract()

        # get additional details from list of <span> tags, element by element
        #floor_type = list_span[0].strip()
        #num_balcony = list_span[1].strip()
        #garage = list_span[2].strip()
        #window_type = list_span[3].strip()
        #door_type = list_span[4].strip()
        #num_window = list_span[5].strip()

        # get additional details from list of <a> tags, element by element
        #commission_year = list_a[0].strip()
        #num_floors = list_a[1].strip()
        #area_sqm = list_a[2].strip()
        #floor = list_a[3].strip()
        #leasing = list_a[4].strip()
        #district = list_a[5].strip()
        #address = list_a[6].strip()

        # update item with newly parsed data
        item.update({
            'district': district,
            'address': address,
            'area_sqm': area_sqm,
            'floor': floor,
            'commission_year': commission_year,
            'num_floors': num_floors,
            'num_windows': num_window,
            'num_balcony': num_balcony,
            'floor_type': floor_type,
            'window_type': window_type,
            'door_type': door_type,
            'garage': garage,
            'leasing': leasing
        })
        yield item

    def __init__(self, *args, **kwargs):
        # initialise the Selenium driver once for the spider
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse_item2(self, response):
        self.driver.get(response.url)

        while True:
            try:
                # find_element raises NoSuchElementException once the element is gone
                button = self.driver.find_element_by_xpath(".//span[contains(@class,'phone-author__title')]")
                button.click()
                # get the data and write it to scrapy items
            except Exception:
                break
        self.driver.close()


# main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(UneguiApartmentsSpider)
    process.start()

2 Comments

  • I suppose that because of the invalid HTML (some span elements are not closed), normal XPaths are not possible. This did give me results: ".//*[contains(@class,'value-chars')]" (Feb 28, 2022)
  • Thank you. I just used .//* as you suggested and it seems to be working; I'm trying it on a larger dataset. I want to select you as the answer, but you only added it as a comment. (Feb 28, 2022)

3 Answers

Answer 1 (score 1)

You need two selectors: one to parse keys and another to parse values. This will give you two lists that can be zipped together to produce the results you are looking for.

The CSS selectors could be:

Keys selector: .chars-column li .key-chars
Values selector: .chars-column li .value-chars

Once you extract both lists, you can zip them and consume them as key-value pairs.
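
A minimal sketch of that idea in Scrapy, assuming every key on the page has exactly one matching value so the two lists stay aligned:

keys = response.css('.chars-column li .key-chars ::text').getall()
values = response.css('.chars-column li .value-chars ::text').getall()

# zip the two lists into a dict; keys keep the trailing ':' from the page labels
details = {k.strip(): v.strip() for k, v in zip(keys, values)}
# details.get('Flooring:')        -> 'Wood'
# details.get('Commission year:') -> '2015'

Looking up values by label instead of by position also sidesteps the span-vs-a ordering problem described in the question.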


4 Comments

As per the FURTHER CLARIFICATION added to the question: I previously did two selectors (one for the span list, one for the href list), but some pages don't follow the same span-list/href-list order, which is why I've been trying to use only one selector.
I updated my post with pictures and my previous 2 selectors.
Can you add the links for the webpages as well?
Hi. This is the link. I think it's because some pages have different table/list headers, values, or ordering. Not sure, though. The link is: unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna
Answer 2 (score 0)

I suppose that because of the invalid HTML (some span elements are not closed), normal XPaths are not possible.

This did give me results:

".//*[contains(@class,'value-chars')]"

The * matches any element, so it will select both

<span class="value-chars">Wood</span>

and

<a href="https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/1-r/ashon_min---2011/"class="value-chars">2015</a>
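
In the spider, that wildcard selector could be used roughly like this (the strip/filter step is an assumption to drop whitespace-only text nodes):

values = response.xpath(".//*[contains(@class,'value-chars')]//text()").getall()
values = [v.strip() for v in values if v.strip()]
# both <span class="value-chars"> and <a class="value-chars"> are matched,
# in document order, so a single list covers pages that mix the two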

2 Comments

Okay. I tested it and it gave me the same results as the 2 separate selectors. I don't know why.
I added some explanation. Does this answer your question?
Answer 3 (score -1)

Use this XPath to get Wood

//*[@class="chars-column"]//span[2]//text()

Use this XPath to get 2015

//*[@class="chars-column"]//a[text()="2015"]
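
Used from Scrapy, these could look roughly as follows; note that the second expression selects the <a> element itself, so /text() is appended here to extract the string, and matching on the literal "2015" only works for this one listing:

flooring = response.xpath('//*[@class="chars-column"]//span[2]//text()').get()      # 'Wood'
year = response.xpath('//*[@class="chars-column"]//a[text()="2015"]/text()').get()  # '2015' (hardcoded value)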

2 Comments

As noted in the FURTHER CLARIFICATION in the question: two selectors don't work because some pages swap values between the span list and the href list, which is why I've been trying to use only one selector.
I updated my post with pictures and my previous 2 selectors.
