I'm trying to scrape a data table from the web using a Scrapy selector, but I get an empty array. The odd thing is that when I save the page to a file and scrape that, I get the expected (non-empty) array. The Scrapy version, selector command, and expected response are below.

Scrapy Version

Scrapy  : 0.18.2
lxml    : 3.2.3.0
libxml2 : 2.9.0
Twisted : 13.1.0
Python  : 2.7.5 (default, May 15 2013, 22:44:16) [MSC v.1500 64 bit (AMD64)]
Platform: Windows-8-6.2.9200

selector

hxs.select('//table[contains(@class,"mnytbl")]//tbody//td[contains(@headers,"tbl\x34\x37a")]//span/text()').extract()

Expected Response

[u'\n1.26 Bil\n        \n', u'\n893.90 Mil\n        \n',
 u'\n924.87 Mil\n        \n', u'\n1.18 Bil\n        \n',
 u'\n1.55 Bil\n        \n', u'\n2.91 Bil\n        \n',
 u'\n3.96 Bil\n        \n', u'\n4.01 Bil\n        \n',
 u'\n3.35 Bil\n        \n', u'\n2.36 Bil\n        \n']

<url>: http://investing.money.msn.com/investments/financial-statements?symbol=SPF

Shell Command to connect to the web

$ scrapy shell <url>

Running the selector on the live response returns an empty array ([]). If I save the HTML output to a file (e.g. C:\src.html) and run the selector against it, I get the expected response.

Thx!

1 Answer

I understand you want to get the cells from the second column, the one with the header "SALES".

I don't really know where your contains(@headers,"tbl\x34\x37a") predicate comes from. I think the problem may be due to dynamically generated "headers" attributes on the td cells.
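As a side note, the \x34 and \x37 escapes are interpreted by Python before the XPath engine ever sees the string, so the predicate is really looking for a "headers" value containing a fixed identifier. A quick check (the variable name is only for illustration):

```python
# Python resolves the hex escapes in a normal (non-raw) string literal:
# \x34 is the character "4" and \x37 is the character "7".
predicate_value = "tbl\x34\x37a"
print(predicate_value)
```

If the site generates that identifier differently from one response to the next, a selector that depends on it would match a saved copy of the page but not a fresh download. That is an assumption about the page, not something I have verified.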

I propose you try this rather scary XPath expression:

    //div[div[contains(span, "INCOME STATEMENT")]]
        //table[contains(@class,"mnytbl")]/tbody/tr
           /td[
               position() = (
                       count(../../../thead/tr/th[contains(., "SALES")]
                                        /preceding-sibling::th)
                       + 1
                   )
               ]

This borrows from "Find position of a node using xpath" to determine the position of an element.

Explanations:

  • first, find the first table: the one within a div that contains a div that in turn contains a span with "INCOME STATEMENT"
  • then find the td cells whose position() is the same as the position of their cousin th cell with value "SALES"
  • ../../.. goes from td back up to its great-grandparent table; this can be simplified to ancestor::table[1] (the nearest table ancestor)
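To make the position trick easier to follow, here is a minimal sketch of the same lookup done in two steps with the standard library's ElementTree. ElementTree does not support count() or preceding-sibling, so the header position is computed in Python instead. The SAMPLE markup and the column_values helper are hypothetical stand-ins, not the real page or any Scrapy API.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the page markup.
SAMPLE = """
<table class="mnytbl">
  <thead>
    <tr><th>PERIOD</th><th>SALES</th><th>EBIT</th></tr>
  </thead>
  <tbody>
    <tr><td><span>2012</span></td><td><span> 1.26 Bil </span></td><td><span>10 Mil</span></td></tr>
    <tr><td><span>2011</span></td><td><span> 893.90 Mil </span></td><td><span>-5 Mil</span></td></tr>
  </tbody>
</table>
""".strip()


def column_values(table, header_text):
    """Return the stripped span text of every cell under the given header."""
    headers = table.findall("./thead/tr/th")
    # Same idea as count(preceding-sibling::th) + 1: XPath positions are 1-based.
    # Raises StopIteration if no header contains header_text.
    position = next(
        i for i, th in enumerate(headers, start=1)
        if header_text in "".join(th.itertext())
    )
    # Select the td at that position in every body row, then its span.
    spans = table.findall("./tbody/tr/td[%d]/span" % position)
    return [(span.text or "").strip() for span in spans]


table = ET.fromstring(SAMPLE)
sales = column_values(table, "SALES")
print(sales)
```

The single XPath expression does the same thing in one pass: the count(...) + 1 part plays the role of the position variable here.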

So to get the text elements inside the span in each 2nd cell of every row of the first table, that would be:

hxs.select("""
    //div[div[contains(span, "INCOME STATEMENT")]]
        //table[contains(@class,"mnytbl")]/tbody/tr
           /td[
               position() = (
                       count(ancestor::table[1]
                                 /thead/tr/th[contains(., "SALES")]
                                          /preceding-sibling::th)
                       + 1
                   )
               ]/span/text()
""").extract()

1 Comment

Good to hear, @user1723988! You may accept the answer if you are happy with it, thanks.
