I'm scraping the Google Play Store. I have an HTML snippet (a user's comment) as follows:

<div class="quoted-review">
    <div class="review-text"> <span class="review-title">Awesome :)</span> Trying to learn some basic Lithuanian and pictures are very helpful. I'd love to learn more from who created this app.. &lt;3
        <div class="paragraph-end details-light"></div>
    </div>
</div>

I want to extract the complete text inside the quoted-review class using XPath, i.e. Awesome :) Trying to learn some basic Lithuanian and pictures are very helpful. I'd love to learn more from who created this app.. <3

These are my XPath attempts:

1) //div[@class='quoted-review review-text']/span[@class='review-title']/text()|//div[@class='quoted-review review-text']/text()

yields a list

[
'Awesome :)' , 
'Trying to learn some basic Lithuanian and pictures are very helpful. I'd love to learn more from who created this app..'
]

I want both of them as one item. PS: Please do not advise me to concatenate index 0 and 1 using a for loop. I want to extract them as one item directly using XPath.

2) //div[@class='review-text']/text() yields only

[
'Trying to learn some basic Lithuanian and pictures are very helpful. I'd love to learn more from who created this app..'
]

Awesome :) is missed.

I'm able to get it as a single string through BeautifulSoup using soup.select('.quoted-review')[1].getText(), but not using XPath.

What am I doing wrong?

  • What do you use to execute the XPath, lxml? Commented Apr 2, 2016 at 10:16

1 Answer

In XPath 1.0 (the version that lxml implements), you can call the XPath string() function to return the string value of an element, that is, the concatenated text of the element and all its descendants, like so:

string(//div[@class='review-text'])
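
As a minimal sketch of how this looks when run through lxml (assuming lxml is the library in use, which the question does not confirm), the snippet below parses the review markup from the question and evaluates the expression. The variable names are illustrative. It also shows normalize-space(), a standard XPath 1.0 function that takes the same string value and collapses the surrounding whitespace.

```python
from lxml import html

# Markup copied from the question; SNIPPET is just an illustrative name.
SNIPPET = """
<div class="quoted-review">
    <div class="review-text"> <span class="review-title">Awesome :)</span> Trying to learn some basic Lithuanian and pictures are very helpful. I'd love to learn more from who created this app.. &lt;3
        <div class="paragraph-end details-light"></div>
    </div>
</div>
"""

root = html.fromstring(SNIPPET.strip())

# string() concatenates every text node under the first matching element,
# so the title and the body come back as one string (whitespace included).
raw = root.xpath("string(//div[@class='review-text'])")

# normalize-space() takes the same string value, trims it, and collapses
# internal runs of whitespace to single spaces.
clean = root.xpath("normalize-space(//div[@class='review-text'])")

print(repr(raw))
print(clean)
```

Both calls return a single string rather than a list, so no joining step is needed on the Python side.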

Note that if the inner XPath matches multiple elements, only the first one is used. To handle multiple elements correctly you'll need some Python code, for example:

# Double quotes around the expression avoid clashing with the single quotes inside it.
result = [div.xpath('string()')
          for div in root.xpath("//div[@class='review-text']")]
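
To make the difference concrete, here is a small self-contained sketch with two reviews. The markup and the names PAGE, first_only and per_review are made up for illustration and are not taken from the real Play Store page. It assumes lxml and compares the single string() call against the per-element approach.

```python
from lxml import html

# Hypothetical page with two reviews, invented for this example.
PAGE = """
<div>
  <div class="review-text"><span class="review-title">Awesome :)</span> Very helpful pictures.</div>
  <div class="review-text"><span class="review-title">Nice</span> Works well.</div>
</div>
"""

root = html.fromstring(PAGE.strip())

# A single string() call over a multi-element node-set uses only the first node.
first_only = root.xpath("string(//div[@class='review-text'])")

# Evaluating string() relative to each matched div gives one string per review.
per_review = [div.xpath("string()")
              for div in root.xpath("//div[@class='review-text']")]

print(first_only)
print(per_review)
```

The list comprehension iterates over reviews, not over text fragments inside a review, so each review still comes back as one item.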

For reference, XPath 2.0 allows calling string() as the last step of a path, so there you can do this in pure XPath:

//div[@class='review-text']/string()