I'm trying to scrape a catalog id number from this page:

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

url = 'http://www.enciclovida.mx/busquedas/resultados?utf8=%E2%9C%93&busqueda=basica&id=&nombre=astomiopsis+exserta&button='

response = HtmlResponse(url=url)

I am using this CSS selector (which works in R with rvest::html_nodes):

".result-nombre-container > h5:nth-child(2) > a:nth-child(1)"

I would like to retrieve the catalog id, which in this case should be:

6011038

An XPath solution is fine too, if that is easier.

  • Can you post the complete code you are using? Maybe I can help. Commented Aug 12, 2018 at 4:55

3 Answers

I don't have Scrapy installed here, but I tested this XPath against the page and it returns the href:

//div[contains(@class, 'result-nombre-container')]/h5[2]/a/@href
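The shape of that path can be tried offline on a small hand-written fragment. The sketch below uses the standard library's xml.etree.ElementTree rather than Scrapy, so it is a swapped-in technique: ElementTree supports only a subset of XPath (no contains() and no trailing /@href step), so the class is matched exactly and the attribute is read with .get(). The markup and the href value are made up for illustration and are not copied from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment that imitates the structure the
# XPath expects: a container div whose second h5 holds the link.
sample = """
<body>
  <div class="result-nombre-container">
    <h5>Common name</h5>
    <h5><a href="/especies/6011038">Astomiopsis exserta</a></h5>
  </div>
</body>
"""

root = ET.fromstring(sample)

# ElementTree equivalent of the path: exact class match, second h5, its link.
link = root.find(".//div[@class='result-nombre-container']/h5[2]/a")

# The attribute is read from the element instead of with an /@href step.
href = link.get("href") if link is not None else None
print(href)
```

On the live page the container may carry several classes, which is why the Scrapy version uses contains(@class, ...) instead of an exact match.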

If Scrapy and its CSS selector syntax are giving you too much trouble, I would also suggest trying the BeautifulSoup Python package. With BeautifulSoup, once you have located the link element, you can read its href like this:

link.get('href')

If you need to parse id from href:

catalog_id = response.xpath("//div[contains(@class, 'result-nombre-container')]/h5[2]/a/@href").re_first(r'(\d+)$')
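
To see what the `(\d+)$` pattern does independently of Scrapy, here is a minimal sketch using only Python's standard `re` module. The href value is a hypothetical example (the real link on the page may look different), and the pattern only matches when the href ends in digits.

```python
import re

# Hypothetical href, assumed for illustration; the real page may differ.
href = "/especies/6011038"

# (\d+)$ captures the run of digits at the very end of the string.
match = re.search(r"(\d+)$", href)

# Like re_first, fall back to None when nothing matches.
catalog_id = match.group(1) if match else None
print(catalog_id)
```

Note that the result is a string, so wrap it in int() if a numeric id is needed.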


There seems to be only one link in the h5 element. So in short:

response.css('h5 > a::attr(href)').re(r'(\d+)$')
