Trying to extract 'text' from a tag using Python

Question

I'm trying to extract the proxy IP address in the first column of this page (https://www.proxynova.com/proxy-server-list/country-fr/), just the number, for example "178.33.62.155". But when I extract all the text content of the relevant tag, I don't get the IP text.

The HTML tag on the website is:

<td align="left"><script>document.write('23178.3'.substr(2) + '3.62.155');</script>178.33.62.155</td>

I expected the IP address above (after the script tag, inside the td tag) to appear when I print the text content, but it doesn't. With the code below, which is what I have so far, the only information that doesn't appear is the IP address.

Any idea how to extract this specific IP, and why it doesn't appear when I extract all the text content of this tag?

from lxml import html
import requests

page = requests.get('https://www.proxynova.com/proxy-server-list/country-fr/')
tree = html.fromstring(page.content.decode('utf-8'))

# Print the text content of every left-aligned cell in the proxy table.
for elem in tree.xpath('//table[@class="table"]//tbody//td[@align="left"]'):
    print(elem.text_content())

Have you tried printing the tags in your loop, to make sure it's getting the right ones?
– John Gordon, Commented May 14, 2017 at 15:03
You mention a URL with country-fr in your question text but a URL with country-br in your code. Does it make a difference?
– Bill Bell, Commented May 14, 2017 at 15:16
@BillBell No, it doesn't make a difference, sorry. fr means IPs from France and br means IPs from Brazil. I have fixed my comments.
– Pablo, Commented May 14, 2017 at 15:28
@JohnGordon Yes, I did. It always returns the text "document.write('23178.3'.substr(2) + '3.62.155');", but not "178.33.62.155".
– Pablo, Commented May 14, 2017 at 15:38
Perhaps the problem is that 178.33.62.155 is not itself an element; it is the text content of the <td> element.
– John Gordon, Commented May 14, 2017 at 17:59

tell k · Accepted Answer · 2017-05-14 16:03:26Z

1

I recommend using BeautifulSoup, like this:

import requests
import re
from bs4 import BeautifulSoup

res = requests.get('https://www.proxynova.com/proxy-server-list/country-fr/')
soup = BeautifulSoup(res.content, "lxml")

# Raw string, so the regex backslashes are not treated as string escapes.
REGEX_JS = re.compile(r"^document\.write\('([^']+)'\.substr\(2\) \+ '([^']+)'\);$")

proxy_ip_list = []
for table in soup.find_all("table", id="tbl_proxy_list"):
    for script in table.find_all("script"):
        m = REGEX_JS.search(script.text)
        if m:
            proxy_ip_list.append(m.group(1)[2:] + m.group(2))

for ip in proxy_ip_list:
    print(ip)
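
If you want to check the pattern without hitting the site, here is a minimal offline sketch run against the script text quoted in the question. The helper name `decode_ip` is made up for this example, and it assumes the script always has the same `document.write('...'.substr(2) + '...');` shape as that one sample.

```python
import re

# Same pattern as in the answer above, written as a raw string.
REGEX_JS = re.compile(r"^document\.write\('([^']+)'\.substr\(2\) \+ '([^']+)'\);$")


def decode_ip(script_text):
    """Return the IP hidden in a document.write(...) call, or None if it does not match."""
    m = REGEX_JS.search(script_text.strip())
    if m is None:
        return None
    # JavaScript's '23178.3'.substr(2) drops the first two characters,
    # which is the same as the slice [2:] in Python.
    return m.group(1)[2:] + m.group(2)


# Script text copied from the <td> shown in the question.
sample = "document.write('23178.3'.substr(2) + '3.62.155');"
print(decode_ip(sample))  # 178.33.62.155
```

The trailing "178.33.62.155" you see in the browser inspector is what the script writes into the page at load time. requests does not run JavaScript, so the downloaded HTML only contains the script itself, which is why the IP has to be rebuilt from the two string pieces.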

answered May 14, 2017 at 16:03

tell k

2 Comments

Bill Bell Over a year ago

Your answer helped me get my answer.

Pablo Over a year ago

Perfect, that is what I was trying to do. Thank you very much!!

Bill Bell · Accepted Answer · 2017-05-14 17:26:51Z

1

I admit that I wouldn't have got this without tell's answer because I missed how the IP addresses were coded in the scripts.

import re
import requests
from lxml import etree

page = requests.get('https://www.proxynova.com/proxy-server-list/country-fr/').text
parser = etree.HTMLParser()
tree = etree.fromstring(page, parser=parser)
table = tree.xpath('.//table[@id="tbl_proxy_list"]//script/text()')

for item in table:
    # substr(2) drops the first two characters, so skip any two-character prefix.
    m = re.match(r"document\.write\('..([0-9.]+)'[^']+'([0-9.]+)'", item)
    if m:
        print(''.join(m.groups()))
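
For anyone without lxml or BeautifulSoup installed, the same idea can be roughly sketched with only the standard library. This swaps in `html.parser` for the XPath query, so it is a different technique from the code above. `ScriptCollector` and `extract_ips` are names invented for this example, and the HTML fragment is modeled on the tag in the question rather than fetched from the site. The sketch also reads the `substr` offset from the script instead of assuming it is always 2; whether the site ever varies it is an assumption I have not verified.

```python
import re
from html.parser import HTMLParser

# Matches document.write('<padded>'.substr(<n>) + '<rest>') and captures all three parts.
JS_RE = re.compile(r"document\.write\('([^']*)'\.substr\((\d+)\)\s*\+\s*'([^']*)'\)")


class ScriptCollector(HTMLParser):
    """Collect the text of every <script> element in the parsed HTML."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the current entry.
        if self.in_script:
            self.scripts[-1] += data


def extract_ips(html_text):
    """Decode every document.write(...) script found in html_text into an IP string."""
    collector = ScriptCollector()
    collector.feed(html_text)
    collector.close()
    ips = []
    for js in collector.scripts:
        m = JS_RE.search(js)
        if m:
            padded, offset, rest = m.group(1), int(m.group(2)), m.group(3)
            ips.append(padded[offset:] + rest)
    return ips


# Static HTML as a client that does not run JavaScript sees it:
# the cell holds only the script, with no text after </script>.
raw = ('<table id="tbl_proxy_list"><tr><td align="left">'
       "<script>document.write('23178.3'.substr(2) + '3.62.155');</script>"
       "</td></tr></table>")
print(extract_ips(raw))  # ['178.33.62.155']
```

Unlike the XPath version, this collects every script on the page rather than only those inside the proxy table, so the regex is doing the filtering.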

answered May 14, 2017 at 17:26

Bill Bell

1 Comment

Pablo Over a year ago

Thank you very much, it also helped me to understand how to do it using a different approach.
