XPath: select text nodes and child nodes

Question

I am using Python and Scrapy to scrape some data from a website.

The page content looks something like this:

 <html>
  <div class="details">
  <div class="a"> not needed</div>
  content 1
  <p>content 2</p>
  <div>content 2</div>
  <p>content 2</p>
  <div>content 2</div>
  <p>content 2</p>
  <div class="b"> this is also not needed</div>
  </div>
 </html>

I need to get the full HTML of the div with class "details", excluding the child divs with classes a and b.

So my output should look like this:

<div class="details">   
content 1
<p>content 2</p>
<div>content 2</div>
<p>content 2</p>
<div>content 2</div>
<p>content 2</p>
</div>

How can I write the correct XPath for that? Or should I select the divs with classes 'details', 'a' and 'b' separately and use string operations to remove the divs with classes 'a' and 'b'?

Note that "content 1" is a text node directly inside the div with class 'details', not wrapped in a child element.

alecxe · Accepted Answer · 2014-11-24 05:14:14Z

4

You can select all child nodes except the divs with class a or b using node() and the self:: axis. node() matches text nodes as well as elements, so the bare "content 1" text is kept:

//div[@class="details"]/node()[not(self::div[@class="a" or @class="b"])]

Demo using scrapy shell:

$ scrapy shell index.html
>>> nodes = response.xpath('//div[@class="details"]/node()[not(self::div[@class="a" or @class="b"])]').extract()
>>> print(''.join(nodes))
  content 1
  <p>content 2</p>
  <div>content 2</div>
  <p>content 2</p>
  <div>content 2</div>
  <p>content 2</p>
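
For readers who want to try the same filtering without a Scrapy project, here is a minimal sketch using only Python's standard library. It is not the XPath approach above: xml.etree.ElementTree does not support node() or the self:: axis, so the sketch walks the children by hand and treats text nodes through the .text and .tail attributes. It assumes the markup is well-formed enough to parse as XML (the sample from the question is), and the names kept_nodes and EXCLUDED are hypothetical helpers introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question (well-formed, so an XML parser can read it).
HTML = """<html>
  <div class="details">
  <div class="a"> not needed</div>
  content 1
  <p>content 2</p>
  <div>content 2</div>
  <p>content 2</p>
  <div>content 2</div>
  <p>content 2</p>
  <div class="b"> this is also not needed</div>
  </div>
</html>"""

EXCLUDED = {"a", "b"}  # hypothetical name: class values of the divs to drop


def kept_nodes(container):
    """Return the container's child nodes as strings, skipping excluded divs."""
    parts = []
    if container.text:
        parts.append(container.text)  # text before the first child element
    for child in container:
        if child.tag == "div" and child.get("class") in EXCLUDED:
            # Drop the element itself but keep the text node that follows it.
            parts.append(child.tail or "")
        else:
            # tostring() serializes the element together with its tail text.
            parts.append(ET.tostring(child, encoding="unicode"))
    return parts


root = ET.fromstring(HTML)
details = root.find(".//div[@class='details']")
result = "".join(kept_nodes(details))
print(result)
```

Like the XPath expression, this compares the class attribute as an exact string, so a div with class="a extra" would not be excluded by either version.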

edited Nov 24, 2014 at 5:14

answered Nov 24, 2014 at 5:09
