
I have this HTML (simplified):

<td class="pad10">
  <div class="button-left" style="margin-bottom: 4px">04.09.2013</div>
  <table width="100%" class="record generic schedule margin-4">...</table>
  <table width="100%" class="record generic schedule margin-4">...</table>
  <div class="button-left" style="margin-bottom: 4px">05.10.2013</div>
  <table width="100%" class="record generic schedule margin-4">...</table>
  <table width="100%" class="record generic schedule margin-4">...</table>
  <table width="100%" class="record generic schedule margin-4">...</table>
  <table width="100%" class="record generic schedule margin-4">...</table>
</td>

I want to get a dict structure which contains the following ("row" means the content of one table; the rows are grouped by the dates in the main table):

{'04.09.2013': [row 1, row 2],
 '05.10.2013': [row 1, row 2, row 3, row 4]}

I can extract all 'div' with:

dt = s.xpath('//div[contains(@class, "button-left")]')

I can extract all 'table' with:

tables = s.xpath('//table[contains(@class, "record generic schedule margin-4")]')

But I don't know how to link 'dt' with the corresponding 'tables' in the Scrapy parser. Is it possible to create a condition during scraping, like: if you find a 'div', extract all following 'table' elements until you find another 'div'?

With Chrome I get two XPath examples for these elements:

//*[@id="wrap"]/table/tbody/tr/td/table[3]/tbody/tr/td/div[2]
//*[@id="wrap"]/table/tbody/tr/td/table[3]/tbody/tr/td/table[1]

Maybe it will help to picture the full structure of the table.

Solution (thanks to @marven):

    s = Selector(response)

    table = {}
    current_key = None
    for e in s.xpath('//td[@class="pad10"]/*'):
        # Boolean XPath expressions extract as '1' or '0' here,
        # so this tests whether the current node is a date <div>.
        if bool(int(e.xpath('@class="button-left"').extract()[0])):
            current_key = e.xpath('text()').extract()[0]
        elif bool(int(e.xpath('@class="record generic schedule margin-4"').extract()[0])):
            t = e.extract()
            if current_key in table:
                table[current_key].append(t)
            else:
                table[current_key] = [t]
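Outside Scrapy, the same sibling-walk can be illustrated with the standard library's ElementTree. This is a minimal sketch, assuming the simplified HTML from the question; the table contents 'r1'...'r4' are placeholders, not real data:

```python
import xml.etree.ElementTree as ET

html = """
<td class="pad10">
  <div class="button-left">04.09.2013</div>
  <table class="record generic schedule margin-4">r1</table>
  <table class="record generic schedule margin-4">r2</table>
  <div class="button-left">05.10.2013</div>
  <table class="record generic schedule margin-4">r3</table>
  <table class="record generic schedule margin-4">r4</table>
</td>
"""

table = {}
current_key = None
# Walk the direct children of <td> in document order:
# a <div> starts a new date group, a <table> joins the current group.
for e in ET.fromstring(html):
    if e.tag == 'div' and e.get('class') == 'button-left':
        current_key = e.text
        table.setdefault(current_key, [])
    elif e.tag == 'table' and current_key is not None:
        table[current_key].append(e.text)

print(table)
# {'04.09.2013': ['r1', 'r2'], '05.10.2013': ['r3', 'r4']}
```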
  • Let me add my goal: I want to parse the whole schedule and save it to a database. Link: eurobasket2013.org/en/… Commented Aug 1, 2014 at 23:30

3 Answers


What you can do is select all of the nodes and loop through them while checking whether the current node is a div or a table.

Using this as my test case,

<div class="asdf">
  <div class="button-left" style="margin-bottom: 4px">04.09.2013</div>
  <table width="100%" class="record generic schedule margin-4">1</table>
  <table width="100%" class="record generic schedule margin-4">2</table>
  <div class="button-left" style="margin-bottom: 4px">05.10.2013</div>
  <table width="100%" class="record generic schedule margin-4">3</table>
  <table width="100%" class="record generic schedule margin-4">4</table>
  <table width="100%" class="record generic schedule margin-4">5</table>
  <table width="100%" class="record generic schedule margin-4">6</table>
</div>

I use the following to loop through the nodes, updating which div the current node is currently "under":

currdiv = None
mydict = {}
for e in sel.xpath('//div[@class="asdf"]/*'):
    if bool(int(e.xpath('@class="button-left"').extract()[0])):
        currdiv = e.xpath('text()').extract()[0]
        mydict[currdiv] = []
    elif currdiv is not None:
        mydict[currdiv] += e.xpath('text()').extract()

This results in:

{u'04.09.2013': [u'1', u'2'], u'05.10.2013': [u'3', u'4', u'5', u'6']}

1 Comment

Your answer helped a lot. Thank you. I reached my goal of grouping the tables by their corresponding dates.

With that particular format you could do this:

Get the parent: t = s.xpath('//div[contains(@class, "button-left")]/..')

Get the first div: t.xpath('div[1]') (note: a leading / would search from the document root, not relative to t)

Get the first two rows: t.xpath('table[position() < 3]')

Get the second div: t.xpath('div[2]')

Get the rest of the tables: t.xpath('table[position() > 2]')

This is very brittle; if this HTML changes, this code won't work. It was hard answering this with the simplified HTML that you supplied and without knowing whether this structure is static or will change in the future. I would've asked these things in a comment but I don't have enough rep :P
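The positional split above can be sketched with the standard library's ElementTree, assuming the simplified HTML from the question (plain list slicing stands in for the position() predicates, and the 'rN' contents are placeholders):

```python
import xml.etree.ElementTree as ET

html = """
<td class="pad10">
  <div class="button-left">04.09.2013</div>
  <table>r1</table>
  <table>r2</table>
  <div class="button-left">05.10.2013</div>
  <table>r3</table>
  <table>r4</table>
  <table>r5</table>
  <table>r6</table>
</td>
"""

td = ET.fromstring(html)
divs = td.findall('div')          # the two date headers
tables = td.findall('table')      # all rows, in document order

first_date = divs[0].text         # '04.09.2013'
first_rows = tables[:2]           # stands in for table[position() < 3]
second_date = divs[1].text        # '05.10.2013'
rest_rows = tables[2:]            # stands in for table[position() > 2]

result = {first_date: [t.text for t in first_rows],
          second_date: [t.text for t in rest_rows]}
print(result)
```

As the answer says, this hard-codes the number of groups and rows, so it only works while the page keeps exactly this shape.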

sources:

How to read attribute of a parent node from a child node in XSLT

What is the xpath to select a range of nodes?

https://stackoverflow.com/a/2407881/2368836

2 Comments

It won't change. I will use it only once, just to parse and save the parsed info to a database. The easiest way for me would be to just copy this HTML part and parse it with a simple iteration over the nodes, like an XML file. But I want to use Scrapy and also to find an elegant solution for this; I have in mind something like grouping all the siblings... but for now I need to get more information and practice with XPath...
I added a link to my question. You can see the table structure there.
0

See if this approach is applicable to your case: XPATH get all nodes between text_1 and text_2

Using the same approach as in the linked question above, we can basically filter the <table> elements to only those that have the specific <div> as both a preceding sibling and a following sibling. For example (using the XPath criteria you've posted for getting the <table>s and the <div>s):

//table
    [contains(@class, "record generic schedule margin-4")]
    [
        preceding-sibling::div[contains(@class, "button-left")] 
            and 
        following-sibling::div[contains(@class, "button-left")]
    ]

1 Comment

Good example, but it didn't work. There is an unequal number of <table>s between two <div class="button-left"> elements, and after the last <div class="button-left"> there are a few more <table> nodes but no <div class="button-left"> node at the end. So collecting <table>s between siblings will not help. I think what I need to do is first get all <div class="button-left"> elements and compute something like a coordinate for each, then in a second pass get all <table> nodes between two coordinates. Maybe the coordinate could be a specific XPath...
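One way to sketch the "coordinate" idea from this comment in XPath 1.0 is to key each <table> on the number of header <div>s that precede it. For example, for the second date group (the literal 2 is a per-group assumption, substituted for each date in turn):

```
//table[contains(@class, "record generic schedule margin-4")]
    [count(preceding-sibling::div[contains(@class, "button-left")]) = 2]
```

Because this only counts preceding headers and does not require a following one, it also picks up the trailing tables after the last <div>.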
