
I am attempting to extract data from divs with Scrapy for Python 2. I now realize I cannot use a regex token like \d inside the XPath for my extracted divs. How can I work around this? With \d{,2} I am trying to tell Python "hey, there is supposed to be a number here with a value between 1-100". Thanks in advance.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem
import re

class MySpider(CrawlSpider):
    name = "craigs" #add the 's' to make functional = "craigs"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://philadelphia.craigslist.org/cta/"]

    rules = (Rule (SgmlLinkExtractor(allow=("index\d\d\d{,3}\.html", ),restrict_xpaths=    ('//*[@id="toc_rows"]/div[3]/div/div/span/a',))
, callback="parse_items", follow= True),
)

def parse_items(self, response):
    hxs = HtmlXPathSelector(response)
    titles = hxs.select('//span[@class="pl"] | //span[@class="12"]')
    items = []

    for titles in titles:
        item = CraigslistSampleItem()
        item ["price"] = titles.select('//*[@id="toc_rows"]/div[2]/p[position() <=100])/span[3]/span[1]/text()').extract()
        item ["date"] = titles.select('//*[@id="toc_rows"]/div[2]/p[position() <=100]]/span[2]/span/text()').extract()
        item ["title"] = titles.select("a/text()").extract()
        item ["link"] = titles.select("a/@href").extract()
        items.append(item)
    return(items)

And the HTML snippet from the URLs is this:

item ["date"] = span class="date">Jan 12/span>

item ["price"] = span class="price">$1950/span>

both exist under this parent ancestor node div id="toc_rows"


1 Answer


I assume p[\d{,2}] is meant to mean "the first two <p> elements".

This is done through position(): p[position() <= 2]. (Hint: position() counts from 1.)

Note that position() counts context-sensitively. If you select p elements, it counts them, not the number of elements in front of them.

<div>
  <p>First paragraph</p>     <!-- div/p[1]    or div/p[position() = 1] -->
  <div>Something else</div>  <!-- div/div[1]    or div/div[position() = 1] -->
  <p>Second paragraph</p>    <!-- div/p[2]    or div/p[position() = 2] -->

  <!-- div/p[position() <= 2] will select both <p> here -->
</div>
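
As a quick sanity check, here is a minimal sketch using lxml (an assumption on my part; Scrapy installs it as a dependency, and any XPath 1.0 engine behaves the same):

from lxml import html

doc = html.fromstring("""
<div>
  <p>First paragraph</p>
  <div>Something else</div>
  <p>Second paragraph</p>
</div>
""")

# position() counts only the <p> siblings, skipping the intervening <div>
print(doc.xpath('//div/p[position() <= 2]/text()'))
# -> ['First paragraph', 'Second paragraph']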

EDIT (after the question was modified). Here is what I would do (a sketch follows the list):

  • First, select all rows: "//div[@id = 'toc_rows']//div[@class = 'row']"
  • Then, for each row, select the...
    • price: "./span[@class = 'price']/text()"
    • date: "./span[@class = 'date']/text()"
    • title: "./span[@class = 'pl']/a/text()"
    • link: "./span[@class = 'pl']/a/@href"
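
Put together, parse_items could look like the sketch below. This is untested against the live page; the id and class values are taken from the list above and from the HTML snippet in the question:

def parse_items(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    # one iteration per row; the relative "./..." paths are evaluated
    # against the current row node, not the whole document
    for row in hxs.select("//div[@id = 'toc_rows']//div[@class = 'row']"):
        item = CraigslistSampleItem()
        item["price"] = row.select("./span[@class = 'price']/text()").extract()
        item["date"] = row.select("./span[@class = 'date']/text()").extract()
        item["title"] = row.select("./span[@class = 'pl']/a/text()").extract()
        item["link"] = row.select("./span[@class = 'pl']/a/@href").extract()
        items.append(item)
    return items

This sidesteps the \d problem entirely: instead of enumerating p[1] through p[100], every row is selected in one pass.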

6 Comments

no the "p" is part of the div string. there are 100 values i need to extract. all the div strings look like p[1] all the way to p[100]. i am tring to tell pythong "hey, theres supposed to be a number here whose value is between 1-100" but the problem is \d{,} is a regex command that is encoded within an Xpath block. when i run the whole code, xpath is giving me invalid path error
The p is part of the div string? Include your HTML in the question (you should have done that from the start anyway).
Whoops! New to the community. Thanks for the replies <3. The entire code block has been pasted; it's nothing too fancy.
Well, welcome to SO. ;) - Including more of the Python code was good, but please also include the HTML snippet in question (for archival reasons). Who knows what the linked page will look like in a month.
Thanks Tomalak. I've included what I think you might need to see? All I am trying to accomplish is letting Python know that, as it makes its way down the webpage, there will be 100 "p[]" elements and I want it to grab all of them. I've tried everything a noob can think of.
