
I have this HTML (simplified):

<td class="pad10">
  <div class="button-left" style="margin-bottom: 4px">04.09.2013</div>
  <table width="100%" class="record generic schedule margin-4">...</table>
  <table width="100%" class="record generic schedule margin-4">...</table>
  <div class="button-left" style="margin-bottom: 4px">05.10.2013</div>
  <table width="100%" class="record generic schedule margin-4">...</table>
  <table width="100%" class="record generic schedule margin-4">...</table>
  <table width="100%" class="record generic schedule margin-4">...</table>
  <table width="100%" class="record generic schedule margin-4">...</table>
</td>

I want to get a dict structure which contains the following ("row" means the content of one table; the rows are grouped by the dates in the main table):

{'04.09.2013': [row 1, row 2],
 '05.10.2013': [row 1, row 2, row 3, row 4]}

I can extract all 'div' with:

dt = s.xpath('//div[contains(@class, "button-left")]')

I can extract all 'table' with:

tables = s.xpath('//table[contains(@class, "record generic schedule margin-4")]')

But I don't know how to link 'dt' with the corresponding 'tables' in the Scrapy parser. Is it possible to create a condition during scraping, like: if you find a 'div', extract all following 'table' elements until you find another 'div'?

With Chrome I get two XPath examples for these elements:

//*[@id="wrap"]/table/tbody/tr/td/table[3]/tbody/tr/td/div[2]
//*[@id="wrap"]/table/tbody/tr/td/table[3]/tbody/tr/td/table[1]

Maybe it will help to picture the full structure of the table.

Solution (thanks to @marven):

    s = Selector(response)

    table = {}
    current_key = None
    for e in s.xpath('//td[@class="pad10"]/*'):
        # Boolean XPath expressions extract as '1' or '0' here,
        # so this tests whether the current node is a date <div>.
        if bool(int(e.xpath('@class="button-left"').extract()[0])):
            current_key = e.xpath('text()').extract()[0]
        elif bool(int(e.xpath('@class="record generic schedule margin-4"').extract()[0])):
            t = e.extract()
            if current_key in table:
                table[current_key].append(t)
            else:
                table[current_key] = [t]
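Outside Scrapy, the same sibling-walk can be illustrated with the standard library's ElementTree. This is a minimal sketch, assuming the simplified HTML from the question; the table contents 'r1'...'r4' are placeholders, not real data:

```python
import xml.etree.ElementTree as ET

html = """
<td class="pad10">
  <div class="button-left">04.09.2013</div>
  <table class="record generic schedule margin-4">r1</table>
  <table class="record generic schedule margin-4">r2</table>
  <div class="button-left">05.10.2013</div>
  <table class="record generic schedule margin-4">r3</table>
  <table class="record generic schedule margin-4">r4</table>
</td>
"""

table = {}
current_key = None
# Walk the direct children of <td> in document order:
# a <div> starts a new date group, a <table> joins the current group.
for e in ET.fromstring(html):
    if e.tag == 'div' and e.get('class') == 'button-left':
        current_key = e.text
        table.setdefault(current_key, [])
    elif e.tag == 'table' and current_key is not None:
        table[current_key].append(e.text)

print(table)
# {'04.09.2013': ['r1', 'r2'], '05.10.2013': ['r3', 'r4']}
```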
  • Let me add my goal: I want to parse the whole schedule and save it to a database. Link: eurobasket2013.org/en/… Commented Aug 1, 2014 at 23:30

3 Answers


What you can do is select all of the nodes and loop through them while checking whether the current node is a div or a table.

Using this as my test case,

<div class="asdf">
  <div class="button-left" style="margin-bottom: 4px">04.09.2013</div>
  <table width="100%" class="record generic schedule margin-4">1</table>
  <table width="100%" class="record generic schedule margin-4">2</table>
  <div class="button-left" style="margin-bottom: 4px">05.10.2013</div>
  <table width="100%" class="record generic schedule margin-4">3</table>
  <table width="100%" class="record generic schedule margin-4">4</table>
  <table width="100%" class="record generic schedule margin-4">5</table>
  <table width="100%" class="record generic schedule margin-4">6</table>
</div>

I use the following to loop through the nodes, updating which div the current node is currently "under":

currdiv = None
mydict = {}
for e in sel.xpath('//div[@class="asdf"]/*'):
    if bool(int(e.xpath('@class="button-left"').extract()[0])):
        currdiv = e.xpath('text()').extract()[0]
        mydict[currdiv] = []
    elif currdiv is not None:
        mydict[currdiv] += e.xpath('text()').extract()

This results in:

{u'04.09.2013': [u'1', u'2'], u'05.10.2013': [u'3', u'4', u'5', u'6']}

1 Comment

Your answer helped a lot. Thank you. I reached my goal of grouping the tables by their corresponding dates.

With that particular format you could do this:

Get the parent: t = s.xpath('//div[contains(@class, "button-left")]/..')

Get the first div: t.xpath('div[1]') (note: a leading / would search from the document root, not relative to t)

Get the first two rows: t.xpath('table[position() < 3]')

Get the second div: t.xpath('div[2]')

Get the rest of the tables: t.xpath('table[position() > 2]')

This is very brittle; if this HTML changes, this code won't work. It was hard answering this with the simplified HTML that you supplied and without knowing whether this structure is static or will change in the future. I would've asked these things in a comment but I don't have enough rep :P
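The positional split above can be sketched with the standard library's ElementTree, assuming the simplified HTML from the question (plain list slicing stands in for the position() predicates, and the 'rN' contents are placeholders):

```python
import xml.etree.ElementTree as ET

html = """
<td class="pad10">
  <div class="button-left">04.09.2013</div>
  <table>r1</table>
  <table>r2</table>
  <div class="button-left">05.10.2013</div>
  <table>r3</table>
  <table>r4</table>
  <table>r5</table>
  <table>r6</table>
</td>
"""

td = ET.fromstring(html)
divs = td.findall('div')          # the two date headers
tables = td.findall('table')      # all rows, in document order

first_date = divs[0].text         # '04.09.2013'
first_rows = tables[:2]           # stands in for table[position() < 3]
second_date = divs[1].text        # '05.10.2013'
rest_rows = tables[2:]            # stands in for table[position() > 2]

result = {first_date: [t.text for t in first_rows],
          second_date: [t.text for t in rest_rows]}
print(result)
```

As the answer says, this hard-codes the number of groups and rows, so it only works while the page keeps exactly this shape.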

sources:

How to read attribute of a parent node from a child node in XSLT

What is the xpath to select a range of nodes?

https://stackoverflow.com/a/2407881/2368836

2 Comments

It won't change. I will use it only once, just to parse and save the parsed info to a database. The easiest way for me would be to just copy this HTML part and parse it with a simple iteration over the nodes, like an XML file. But I want to use Scrapy and also to find an elegant solution for this; I have in mind something like grouping all the siblings... but for now I need to get more information and practice with XPath...
I added a link to my question. You can see the table structure there.
0

See if this approach is applicable to your case: XPATH get all nodes between text_1 and text_2

Using the same approach as in the linked question above, we can basically filter the <table> elements to only those that have the specific <div> as both a preceding sibling and a following sibling. For example (using the XPath criteria you've posted for getting the <table>s and the <div>s):

//table
    [contains(@class, "record generic schedule margin-4")]
    [
        preceding-sibling::div[contains(@class, "button-left")] 
            and 
        following-sibling::div[contains(@class, "button-left")]
    ]

1 Comment

Good example, but it didn't work. There is an unequal number of <table>s between two <div class="button-left"> elements, and after the last <div class="button-left"> there are a few more <table> nodes but no <div class="button-left"> node at the end. So collecting <table>s between siblings will not help. I think what I need to do is first get all <div class="button-left"> elements and compute something like a coordinate for each, then in a second pass get all <table> nodes between two coordinates. Maybe the coordinate could be a specific XPath...
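One way to sketch the "coordinate" idea from this comment in XPath 1.0 is to key each <table> on the number of header <div>s that precede it. For example, for the second date group (the literal 2 is a per-group assumption, substituted for each date in turn):

```
//table[contains(@class, "record generic schedule margin-4")]
    [count(preceding-sibling::div[contains(@class, "button-left")]) = 2]
```

Because this only counts preceding headers and does not require a following one, it also picks up the trailing tables after the last <div>.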
