Python - Webscraping using XPath

Question

I'm scraping the Google Play Store. I have some HTML (a user's comment) as follows:

<div class="quoted-review">
    <div class="review-text"> <span class="review-title">Awesome :)</span> Trying to learn some basic Lithuanian and pictures are very helpful. I'd love to learn more from who created this app.. &lt;3
        <div class="paragraph-end details-light"></div>
    </div>
</div>

I want to extract the complete text inside the quoted-review element using XPath, i.e. Awesome :) Trying to learn some basic Lithuanian and pictures are very helpful. I'd love to learn more from who created this app.. <3

These are the XPath expressions I have tried:

1) //div[@class='quoted-review review-text']/span[@class='review-title']/text()|//div[@class='quoted-review review-text']/text()

yields a list

[
'Awesome :)' , 
'Trying to learn some basic Lithuanian and pictures are very helpful. I'd love to learn more from who created this app..'
]

I want both of them as one item. PS: Please do not advise me to concatenate index 0 and 1 using a for loop. I want to extract them as one item directly using XPath.

2) //div[@class='review-text']/text() yields only

[
'Trying to learn some basic Lithuanian and pictures are very helpful. I'd love to learn more from who created this app..'
]

Awesome :) is missing.

I can get the text as a single string with BeautifulSoup using soup.select('.quoted-review')[1].getText(), but not with XPath.

What am I doing wrong?

What do you use to execute the XPath, lxml?

– har07, Commented Apr 2, 2016 at 10:16

har07 · Accepted Answer · 2016-04-02 10:28:06Z


In XPath 1.0 (the version that lxml implements), you can call the XPath string() function to return the string value of an element, that is, the concatenation of all of its descendant text nodes, like so:

string(//div[@class='review-text'])
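
For intuition about why this works where /text() does not: /text() selects only the text nodes that are direct children of the div, while string() concatenates every descendant text node in document order. The sketch below mimics both behaviours with the standard library's xml.etree.ElementTree as a stand-in (ElementTree has no string() function, so itertext() plays that role here). It assumes the snippet is well-formed XML, and the review text is shortened for brevity.

```python
import xml.etree.ElementTree as ET

# Shortened version of the markup from the question.
SNIPPET = """<div class="quoted-review">
    <div class="review-text"> <span class="review-title">Awesome :)</span> Trying to learn some basic Lithuanian. &lt;3
        <div class="paragraph-end details-light"></div>
    </div>
</div>"""

root = ET.fromstring(SNIPPET)
review = root.find(".//div[@class='review-text']")

# Roughly what //div[@class='review-text']/text() sees: only text that sits
# directly inside the div (its own text plus the tails of its children).
direct_text = (review.text or "") + "".join(child.tail or "" for child in review)

# Roughly what string(...) returns: all descendant text, in document order.
string_value = "".join(review.itertext())

# Collapse whitespace for display, similar in spirit to normalize-space().
cleaned = " ".join(string_value.split())
print(cleaned)
```

Here direct_text lacks the title because "Awesome :)" lives inside the span, not directly inside the div.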

Note that if the inner XPath matches multiple elements, string() only considers the first one. To handle multiple elements correctly you'll need a little Python code, for example:

# root is the parsed lxml tree, e.g. lxml.html.fromstring(page_source)
result = [div.xpath('string()') for div in
          root.xpath("//div[@class='review-text']")]
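
As a fuller, self-contained sketch of the same idea, assuming lxml is installed: the markup below is the snippet from the question plus a second, made-up review so that the multi-element case has something to show. The wrapping with normalize-space() is an optional extra (it is also an XPath 1.0 function) that trims and collapses the whitespace left over from the page layout.

```python
from lxml import html

# First review is from the question; the second one is invented for illustration.
PAGE = """
<div>
  <div class="quoted-review">
    <div class="review-text"> <span class="review-title">Awesome :)</span> Trying to learn some basic Lithuanian and pictures are very helpful. I'd love to learn more from who created this app.. &lt;3
      <div class="paragraph-end details-light"></div>
    </div>
  </div>
  <div class="quoted-review">
    <div class="review-text"> <span class="review-title">Nice</span> Works well.
      <div class="paragraph-end details-light"></div>
    </div>
  </div>
</div>
"""

root = html.fromstring(PAGE)

# string() applied to a node-set uses only the first node in document order,
# so this returns the text of the first review only.
first_only = root.xpath("string(//div[@class='review-text'])")

# One string per review: evaluate the function relative to each matched div.
reviews = [div.xpath("normalize-space(string())")
           for div in root.xpath("//div[@class='review-text']")]

for text in reviews:
    print(text)
```

Each entry in reviews is a single string containing both the title and the body, so no index-by-index concatenation is needed.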

For reference, XPath 2.0 allows calling string() as the last step of a path, so with an XPath 2.0 processor (which lxml is not) you could do this in pure XPath:

//div[@class='review-text']/string()
