XPath Error When Using "\d" to Extract Data from Divs with Scrapy for Python 2

Question

I am attempting to extract data from divs with Scrapy for Python 2. I now realize I cannot use a regex token like \d inside the XPath for the div I am extracting. How can I work around this? With \d{,2} I am trying to tell Python, "Hey, there is supposed to be a number here with a value between 1 and 100." Thanks in advance.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem
import re

class MySpider(CrawlSpider):
    name = "craigs" #add the 's' to make functional = "craigs"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://philadelphia.craigslist.org/cta/"]

    rules = (
        Rule(
            SgmlLinkExtractor(
                allow=(r"index\d\d\d{,3}\.html",),
                restrict_xpaths=('//*[@id="toc_rows"]/div[3]/div/div/span/a',),
            ),
            callback="parse_items",
            follow=True,
        ),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//span[@class="pl"] | //span[@class="12"]')
        items = []

        for title in titles:
            item = CraigslistSampleItem()
            # Paths that start with // are evaluated against the whole page,
            # not relative to the current span.
            item["price"] = title.select('//*[@id="toc_rows"]/div[2]/p[position() <= 100]/span[3]/span[1]/text()').extract()
            item["date"] = title.select('//*[@id="toc_rows"]/div[2]/p[position() <= 100]/span[2]/span/text()').extract()
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            items.append(item)
        return items

The HTML snippet from the URLs is this:

item ["date"] = <span class="date">Jan 12</span>

item ["price"] = <span class="price">$1950</span>

Both exist under this ancestor node: <div id="toc_rows">

Tomalak · Accepted Answer · 2014-01-12 22:03:52Z


I assume p[\d{,2}] is meant to mean "the first two <p> elements".

This is done through position(): p[position() <= 2]. (Hint: position() counts from 1.)

Note that position() counts context-sensitively. If you select p elements, it counts only those p elements, not every element that comes before them.

<div>
  <p>First paragraph</p>     <!-- div/p[1]    or div/p[position() = 1] -->
  <div>Something else</div>  <!-- div/div[1]  or div/div[position() = 1] -->
  <p>Second paragraph</p>    <!-- div/p[2]    or div/p[position() = 2] -->

  <!-- div/p[position() <= 2] will select both <p> here -->
</div>
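
The counting behaviour described above can be tried out without Scrapy. The following is a minimal sketch that uses Python's standard-library ElementTree on a hypothetical copy of the snippet (a third paragraph is added for illustration). ElementTree only supports a subset of XPath: it understands p[2] but not position() <= 2, so the "first two" case is emulated here with a list slice. The helper names paragraph_at and first_paragraphs are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the snippet above, plus a third <p>.
SAMPLE = """
<div>
  <p>First paragraph</p>
  <div>Something else</div>
  <p>Second paragraph</p>
  <p>Third paragraph</p>
</div>
"""

outer = ET.fromstring(SAMPLE)


def paragraph_at(index):
    """Text of the index-th <p> child (1-based), like div/p[index]."""
    return outer.find("p[%d]" % index).text


def first_paragraphs(count):
    """Emulate div/p[position() <= count] by slicing the <p> children."""
    return [p.text for p in outer.findall("p")[:count]]


# The inner <div> sits between the paragraphs but is not counted as a <p>.
print(paragraph_at(2))
print(first_paragraphs(2))
```

In Scrapy itself, the full expression p[position() <= 2] can be passed to the selector directly, since its XPath engine supports position().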

EDIT (after the question was modified). Here is what I would do:

First, select all rows: "//div[@id = 'toc_rows']//div[@class = 'row']"
Then, for each row, select the...
- price: "./span[@class = 'price']/text()"
- date: "./span[@class = 'date']/text()"
- title: "./span[@class = 'pl']/a/text()"
- link: "./span[@class = 'pl']/a/@href"
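
The two steps above (select the rows, then use paths relative to each row) can be sketched as follows. This is only an illustration under stated assumptions: the markup in SAMPLE is a hypothetical, simplified stand-in for the real listing page, and the code uses the standard-library ElementTree instead of Scrapy so that it runs on its own. Because ElementTree does not support text() or @href steps, the sketch reads .text and .get("href") instead; parse_rows is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page may nest these spans differently.
SAMPLE = """
<html><body>
<div id="toc_rows">
  <div class="content">
    <div class="row">
      <span class="date">Jan 12</span>
      <span class="pl"><a href="/cto/1.html">First car</a></span>
      <span class="price">$1950</span>
    </div>
    <div class="row">
      <span class="date">Jan 11</span>
      <span class="pl"><a href="/cto/2.html">Second car</a></span>
      <span class="price">$800</span>
    </div>
  </div>
</div>
</body></html>
"""


def parse_rows(markup):
    """Select every row once, then read each field relative to that row."""
    root = ET.fromstring(markup)
    items = []
    for row in root.findall(".//div[@id='toc_rows']//div[@class='row']"):
        link = row.find("span[@class='pl']/a")
        items.append({
            "price": row.findtext("span[@class='price']"),
            "date": row.findtext("span[@class='date']"),
            "title": link.text,
            "link": link.get("href"),
        })
    return items


for item in parse_rows(SAMPLE):
    print(item)
```

The point of the design is that no position predicate is needed at all: iterating over the rows visits every entry, and the relative paths keep each price, date, title, and link tied to its own row.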

edited Jan 12, 2014 at 22:03

answered Jan 12, 2014 at 21:03


6 Comments

skellyboy Over a year ago

No, the "p" is part of the div string. There are 100 values I need to extract. All the div strings look like p[1] all the way to p[100]. I am trying to tell Python, "Hey, there's supposed to be a number here whose value is between 1 and 100," but the problem is that \d{,} is a regex token embedded in an XPath expression. When I run the whole code, XPath gives me an invalid path error.

Tomalak Over a year ago

The p is part of the div string? Include your HTML in the question (you should have done that from the start anyway).

skellyboy Over a year ago

Whoops! New to the community. Thanks for the replies <3. The entire code block has been pasted. It's nothing too fancy.

Tomalak Over a year ago

Well, welcome to SO. ;) - Including more of the Python code was good, but please also include the HTML snippet in question (for archival reasons). Who knows what the linked page will look like in a month.

skellyboy Over a year ago

Thanks, Tomalak. I've included what I think you might need to see. All I am trying to accomplish is letting Python know that as it makes its way down the webpage, there will be 100 "p[]" elements, and I want it to grab all of them. I've tried everything a noob can think of.
