Scrapy xpath for nested elements

Question

I suspect I am using Scrapy incorrectly. I am trying to use XPath to select only the text of each h2 on a page, stripping out any inner tags.

For example, given this markup:

<h2>Welcome to my <a href="#">page</a></h2>
<h2>Welcome to my Page</h2>

I have tried //h2//text(), but it returns a separate string for every text node, like this:

item["h2s"] = response.xpath('//h2//text()').extract()

['Welcome to my',
'page',
'Welcome to my Page']

I have tried a number of combinations and can't seem to get the array I want, shown below:

['Welcome to my page',
'Welcome to my Page']

Community · Accepted Answer · 2017-05-23 12:08:49Z

You can select each h2 first and then join the text nodes found inside it:

In [1]: [''.join(h2.xpath(".//text()").extract()) for h2 in response.xpath("//h2")]
Out[1]: [u'Welcome to my page', u'Welcome to my Page']
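
If the headings contain line breaks or stray spaces around the inner tags, the joined strings may need whitespace cleanup as well. Below is a minimal sketch of that step, not part of the original answer: `join_text_nodes` is a hypothetical helper name, and `per_h2_fragments` is a hand-written stand-in for what `h2.xpath(".//text()").extract()` is assumed to return for each h2 in the question's markup, so the snippet runs without Scrapy installed.

```python
def join_text_nodes(fragments):
    """Join the text nodes of one element and collapse whitespace runs.

    `fragments` is a list of strings, such as the list returned by
    h2.xpath(".//text()").extract() for a single <h2> (assumed shape).
    """
    joined = "".join(fragments)
    # str.split() with no arguments splits on any whitespace run and drops
    # leading/trailing whitespace, so re-joining with one space normalizes it.
    return " ".join(joined.split())


# Stand-in data (assumption): one list of text nodes per <h2> in the question.
per_h2_fragments = [
    ["Welcome to my ", "page"],
    ["Welcome to my Page"],
]

h2s = [join_text_nodes(fragments) for fragments in per_h2_fragments]
print(h2s)

# Inside a spider callback, the same helper would be applied per element:
#   item["h2s"] = [join_text_nodes(h2.xpath(".//text()").extract())
#                  for h2 in response.xpath("//h2")]
```

Selecting the h2 elements first is what keeps the fragments grouped per heading; a single `//h2//text()` query flattens them into one list. The XPath function `string(.)`, evaluated on each h2, should also return the concatenated text of that element and may be worth trying as an alternative to joining in Python.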

This topic is also quite relevant:

answered Dec 27, 2016 at 2:24

alecxe

1 Comment

Fantastic, just tried it and worked perfectly :) thanks. It seems like quite a complex thing to do in Scrapy for something relatively simple.