Python2 Scrape html with xpath

Question

Consider a html page with 3 tables in it.

I want to loop through each table and at the same time to print something along if the content coresponds to something I want.

I need to keep track of the table I'm at.

As you see in the code below I have the page variable which is a html string.

I can return the content in all the tables at once(in an array).

I'd like to loop through them.

import __future__
from lxml import html
import requests
from bs4 import BeautifulSoup

page = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>cv</title>
</head>
<body>

    <table>
        <tr>
            <td>table1 td1</td>
            <td>table1 td2</td>
        </tr>
    </table>

    <table>
        <tr>
            <td>table2 td1</td>
            <td>table2 td2</td>
        </tr>
    </table>

    <table>
        <tr>
            <td>table3 td1</td>
            <td>table3 td2</td>
        </tr>
    </table>

</body>
</html>
"""

soup = str(BeautifulSoup(page, 'html.parser'))

tree = html.fromstring(soup)

tds = tree.xpath('//table/tr/td/text()')

for td in tds:
    print(td + '\n')

print('Ready !!')

mata · Accepted Answer · 2016-05-01 19:36:23Z

1

You mean you need to process each table on its own?

for table in tree.xpath(".//table"):
    print("---  new table: ---")
    for td in table.xpath(".//td"):
        print(td)

answered May 1, 2016 at 19:36

mata

69.4k10 gold badges168 silver badges162 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Catalin Over a year ago

It works, but what does that dot in front of the slashes mean? .// If I don't use it, the output is not desired.

mata Over a year ago

it means that the xpath expression is evaluated relative to the current context node, .//td means to search for all descendants of the current node (it's a short form of descendant::td). An expression starting with a / always starts at the root node of the document, so //td selects all td nodes in the document, regardless of the context node.

Catalin Over a year ago

Now I have another html document which has a lot of children and sub-chidlren elements. I want to write the abosulte path to it, but I can't write all the sub-children because they are too many.. Consider the example above, and imagine the <td> tags would have had a lot of parents/grand-parents... What would be the correct xpath to each td: xpath("/table.//td") - is not working. How can I retrieve the td from the context

mata Over a year ago

You should ask a new question with the exact problem description and example input / expected output, comments are not the right place to ask new questions.

Collectives™ on Stack Overflow

Python2 Scrape html with xpath

1 Answer 1

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related