<span class='python'>
  <a>google</a>
  <a>chrome</a>
</span>

I want to get the text chrome, and I already have it working like this:

q = item.findall('.//span[@class="python"]//a')
t = q[1].text # first element = 0

I'd like to combine it into a single XPath expression and just get one item instead of a list.
I tried this but it doesn't work.

t = item.findtext('.//span[@class="python"]//a[2]') # first element = 1

And the actual, not simplified, HTML is like this.

<span class='python'>
  <span>
    <span>
      <img></img>
      <a>google</a>
    </span>
    <a>chrome</a>
  </span>
</span>
  • Your expression .//span[@class="python"]//a[2] works for me. Commented Nov 7, 2010 at 13:42
  • Hmmm it seems I have a mistake somewhere, or the simplification of the actual HTML I posted is too simple. I'll try and then modify the question. Commented Nov 7, 2010 at 13:47
  • @pdnsk: Good question, +1. See my answer for an explanation and for a simple solution. :) Commented Nov 7, 2010 at 15:37
  • so glad you posted this question. Been trying to figure out a similar problem for about a day. Commented Jun 19, 2019 at 14:58

3 Answers

I tried this but it doesn't work.

t = item.findtext('.//span[@class="python"]//a[2]')

This is a FAQ about the // abbreviation.

.//a[2] means: Select all a descendants of the current node that are the second a child of their parent. So this may select more than one element or no element at all, depending on the concrete XML document.

To put it more simply, the [] operator has higher precedence than //.
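
The per-parent behaviour can be seen with a small sketch. It assumes Python's standard-library xml.etree.ElementTree (whose findtext accepts this particular path) and wraps both HTML snippets from the question in a made-up <div> so that the span is a descendant of the context node.

```python
import xml.etree.ElementTree as ET

# The simplified markup from the question, wrapped in an illustrative <div>.
SIMPLE = """<div>
<span class='python'>
  <a>google</a>
  <a>chrome</a>
</span>
</div>"""

# The actual markup from the question, wrapped the same way.
NESTED = """<div>
<span class='python'>
  <span>
    <span>
      <img></img>
      <a>google</a>
    </span>
    <a>chrome</a>
  </span>
</span>
</div>"""

PATH = './/span[@class="python"]//a[2]'

# Both <a> elements share one parent, so "chrome" is that parent's second a child.
simple_hit = ET.fromstring(SIMPLE).findtext(PATH)
# Each <a> is the first a child of its own parent, so nothing matches.
nested_hit = ET.fromstring(NESTED).findtext(PATH)

print(simple_hit)  # chrome
print(nested_hit)  # None
```
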

If you want just one node (the second) out of all the nodes returned, you have to use parentheses to force the precedence you want:

(.//a)[2]

This really selects the second a descendant of the current node.

For the actual expression used in the question, change it to:

(.//span[@class="python"]//a)[2]

or, to select its text node directly, change it to:

(.//span[@class="python"]//a)[2]/text()
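
For Python readers, here is a minimal sketch of how the parenthesized form might be used. It assumes lxml and its xpath() method, as in another answer on this page; find/findall/findtext only accept a limited path subset that has no parentheses, which is probably why findtext rejects it. The helper first_or_none is a hypothetical name introduced for illustration.

```python
from lxml import etree

# The actual (nested) markup from the question.
NESTED = """<span class='python'>
  <span>
    <span>
      <img></img>
      <a>google</a>
    </span>
    <a>chrome</a>
  </span>
</span>"""

doc = etree.HTML(NESTED)

# Predicate applies per parent: no <a> is the second a child of its parent.
per_parent = doc.xpath('.//span[@class="python"]//a[2]/text()')
# Predicate applies to the whole node-set: second <a> in document order.
whole_set = doc.xpath('(.//span[@class="python"]//a)[2]/text()')
# An out-of-range position gives an empty list, not an exception.
missing = doc.xpath('(.//span[@class="python"]//a)[5]/text()')


def first_or_none(nodes):
    """Hypothetical helper: return the single result, or None if there is none."""
    return nodes[0] if nodes else None


print(per_parent)                # []
print(first_or_none(whole_set))  # chrome
print(first_or_none(missing))    # None
```

xpath() still returns a list here, so a small helper like this is one way to end up with a single item or None.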

4 Comments

Thank you for the explanation, but I have one question, or actually two. If there is only one matching element, will [2] throw an exception or return None? And do you know why this works with xpath but not findtext?
@pdnsk: My answer is pure XPath. I don't know Python.
I tried and it just returns no element, which is good because one reason why I wanted to avoid lists and have it in a single expression is to not have an additional check.
Been trying to figure out a similar answer for a full day. Thanks a ton for the help!

I'm not sure what the problem is...

>>> d = """<span class='python'>
...   <a>google</a>
...   <a>chrome</a>
... </span>"""
>>> from lxml import etree
>>> d = etree.HTML(d)
>>> d.xpath('.//span[@class="python"]/a[2]/text()')
['chrome']
>>>


From Comments:

or the simplification of the actual HTML I posted is too simple

You are right. What is the meaning of .//span[@class="python"]//a[2]? This will be expanded to:

self::node()
 /descendant-or-self::node()
  /child::span[attribute::class="python"]
   /descendant-or-self::node()
    /child::a[position()=2]

It will finally select every a element that is the second a child of its parent (position() is counted along the child axis). So nothing will be selected if your document looks like this:

<span class='python'> 
  <span> 
    <span> 
      <img></img> 
      <a>google</a><!-- This is the first "a" child of its parent --> 
    </span> 
    <a>chrome</a><!-- This is also the first "a" child of its parent --> 
  </span> 
</span> 

If you want the second of all descendants, use:

descendant::span[@class="python"]/descendant::a[2]
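
If only Python's standard-library xml.etree.ElementTree is available, the descendant:: axis is not part of the path subset its find methods accept. A rough equivalent in plain Python might look like the sketch below; nth_descendant_text is a hypothetical helper, and the wrapping <div> is made up so that the span is a descendant of the root.

```python
from itertools import islice
import xml.etree.ElementTree as ET

# The actual markup from the question, wrapped in an illustrative <div>.
NESTED = """<div>
<span class='python'>
  <span>
    <span>
      <img></img>
      <a>google</a>
    </span>
    <a>chrome</a>
  </span>
</span>
</div>"""


def nth_descendant_text(root, span_class, tag, n):
    """Text of the n-th (1-based) `tag` element, in document order, inside the
    first span with the given class; None if either is missing."""
    span = root.find(f".//span[@class='{span_class}']")
    if span is None:
        return None
    # iter() walks the subtree in document order; islice skips the first n - 1 hits.
    elem = next(islice(span.iter(tag), n - 1, None), None)
    return None if elem is None else elem.text


root = ET.fromstring(NESTED)
print(nth_descendant_text(root, "python", "a", 2))  # chrome
print(nth_descendant_text(root, "python", "a", 3))  # None
```
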

4 Comments

It works with xpath but not with findtext, and returns a list with one item.
@pdknsk: That's because this XPath expression returns a node-set: it could be empty, it could be a singleton, or it could contain several nodes, one for each span with a "python" class that has a second a descendant. If you want the string value of the first of these results, pass the expression as the argument to the string() function. I don't know what data types your xpath method can return.
It works. I used a combination of the previous answer, with /text(), and this answer, but I'll accept this answer because it details the problem. I only have one question. What is the short equivalent to /descendant::?
@pdknsk: First, text() selects all the text node children, whereas string() (or the DOM method for the string value) returns the concatenation of all descendant text nodes. They are not the same. Second, there is no abbreviated form for the descendant axis. When only one span matches, my last expression is equivalent to (.//span[@class="python"]//a)[2], so the position predicate is applied to the whole expression, not just to the last step.
