
I suspect I am using Scrapy incorrectly. I am trying to use XPath to select only the text of each h2 on a page, with the inner tags stripped out.

For example, given these headings:

<h2>Welcome to my <a href="#">page</a></h2>
<h2>Welcome to my Page</h2>

I have tried using //h2//text(), but it produces a separate list entry for each text node, like this:

item["h2s"] = response.xpath('//h2//text()').extract()

['Welcome to my',
'page',
'Welcome to my Page']

I have tried a number of combinations but can't seem to get a list like the one I want below:

['Welcome to my page',
'Welcome to my Page']

1 Answer


You can select each h2 first and then join the text nodes within it, so every heading yields a single string:

In [1]: [''.join(h2.xpath(".//text()").extract()) for h2 in response.xpath("//h2")]
Out[1]: [u'Welcome to my page', u'Welcome to my Page']
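
To see why joining per element works, here is a minimal sketch of the same idea outside Scrapy, using only the standard library's xml.etree.ElementTree as a stand-in for the response. The SAMPLE markup and the heading_texts helper are hypothetical names made up for this illustration, and the whitespace normalization is an extra step that the one-liner above does not do.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, wrapped in a root element so it parses as XML.
SAMPLE = (
    "<body>"
    '<h2>Welcome to my <a href="#">page</a></h2>'
    "<h2>Welcome to my Page</h2>"
    "</body>"
)


def heading_texts(markup, tag="h2"):
    """Return one whitespace-normalized string per <tag> element."""
    root = ET.fromstring(markup)
    texts = []
    for element in root.iter(tag):
        # itertext() yields every text fragment under the element,
        # including text inside nested tags, in document order.
        joined = "".join(element.itertext())
        # split() + join collapses runs of whitespace and trims the ends.
        texts.append(" ".join(joined.split()))
    return texts


h2s = heading_texts(SAMPLE)
print(h2s)
```

In Scrapy itself, the XPath string() function should give a similar per-element result, e.g. h2.xpath("string(.)").get() inside the same loop over response.xpath("//h2"), since string() returns the concatenated text of a node and its descendants.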

This topic is also quite relevant:


1 Comment

Fantastic, just tried it and worked perfectly :) thanks. It seems like quite a complex thing to do in Scrapy for something relatively simple.
