I am using Scrapy (Python) to scrape data from a website.

The page content looks something like this:

 <html>
  <div class="details">
  <div class="a"> not needed</div>
  content 1
  <p>content 2</p>
  <div>content 2</div>
  <p>content 2</p>
  <div>content 2</div>
  <p>content 2</p>
  <div class="b"> this is also not needed</div>
  </div>
 </html>

I need the full HTML of the div with class "details", excluding the child divs with class "a" and "b".

So my output should look like this:

<div class="details">   
content 1
<p>content 2</p>
<div>content 2</div>
<p>content 2</p>
<div>content 2</div>
<p>content 2</p>
</div>

How can I write a correct XPath for that? Or should I select the div with class "details", select the divs with class "a" and "b" separately, and then use string operations to strip them out?

Note that "content 1" is a bare text node directly inside the div with class "details"; it is not wrapped in a child element.

1 Answer

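You can select every child node of the "details" div except the divs with class a or b by combining node() with a self:: test. node() matches text nodes as well as elements, so the bare "content 1" text is kept. Note that this returns the children only, not the enclosing div itself: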

//div[@class="details"]/node()[not(self::div[@class="a" or @class="b"])]

Demo using scrapy shell:

$ scrapy shell ./index.html
>>> nodes = response.xpath('//div[@class="details"]/node()[not(self::div[@class="a" or @class="b"])]').extract()
>>> print(''.join(nodes))
  content 1
  <p>content 2</p>
  <div>content 2</div>
  <p>content 2</p>
  <div>content 2</div>
  <p>content 2</p>