I'm trying to extract the proxy IP address from the first column of this page (https://www.proxynova.com/proxy-server-list/country-fr/), just the address itself, for example "178.33.62.155". But when I extract all the text content of the relevant tag, the IP text is missing.

The HTML tag on the website is:

<td align="left"><script>document.write('23178.3'.substr(2) + '3.62.155');</script>178.33.62.155</td>

I expected the IP address that follows the script tag (inside the td tag) to appear when I print the text content, but it doesn't. With the code below, which is what I have so far, the IP address is the only information that doesn't appear.

Any idea how to extract this IP address, and why it doesn't appear when I extract all the text content of this tag?

from lxml import html
import requests
import re

page = requests.get('https://www.proxynova.com/proxy-server-list/country-fr/')
tree = html.fromstring(page.content.decode('utf-8'))

for elem in tree.xpath('//table[@class="table"]//tbody//td[@align="left"]'):
    print(elem.text_content())
5 Comments
  • Have you tried printing the tags in your loop, to make sure it's getting the right ones? Commented May 14, 2017 at 15:03
  • You mention a URL with country-fr in your question text but a URL with country-br in your code. Does it make a difference? Commented May 14, 2017 at 15:16
  • @BillBell No, it doesn't make a difference, sorry. fr means IPs from France and br means IPs from Brazil. I have fixed my comments. Commented May 14, 2017 at 15:28
  • @JohnGordon - Yes, I did. It always returns the text "document.write('23178.3'.substr(2) + '3.62.155');", but not "178.33.62.155". Commented May 14, 2017 at 15:38
  • Perhaps the problem is that 178.33.62.155 is not itself an element; it is the text content of the <td> element. Commented May 14, 2017 at 17:59

2 Answers

I recommend using BeautifulSoup, like this:

import requests
import re
from bs4 import BeautifulSoup

res = requests.get('https://www.proxynova.com/proxy-server-list/country-fr/')
soup = BeautifulSoup(res.content, "lxml")

# Matches document.write('<padded>'.substr(2) + '<rest>'); and captures both string literals
REGEX_JS = re.compile(r"^document\.write\('([^']+)'\.substr\(2\) \+ '([^']+)'\);$")

proxy_ip_list = []
for table in soup.find_all("table", id="tbl_proxy_list"):
    for script in table.find_all("script"):
        m = REGEX_JS.search(script.text)
        if m:
            proxy_ip_list.append(m.group(1)[2:] + m.group(2))

for ip in proxy_ip_list:
    print(ip)
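
The decoding step can also be checked on its own, without fetching the page. This is only a sketch: it assumes every script has the shape shown in the question, `document.write('<padded>'.substr(N) + '<rest>');`. The helper name `decode_ip` and the handling of an arbitrary offset N are illustrative additions; the question only shows `substr(2)`.

```python
import re

# Hypothetical helper, not part of the original answer.
# Captures the padded literal, the substr offset, and the trailing literal.
JS_WRITE = re.compile(
    r"document\.write\('([^']*)'\.substr\((\d+)\)\s*\+\s*'([^']*)'\)"
)


def decode_ip(script_text):
    """Rebuild the string that the inline script would write, or return None."""
    m = JS_WRITE.search(script_text)
    if m is None:
        return None
    padded, offset, rest = m.group(1), int(m.group(2)), m.group(3)
    # JavaScript's 'abc'.substr(n) with a non-negative n is the slice [n:] in Python.
    return padded[offset:] + rest


sample = "document.write('23178.3'.substr(2) + '3.62.155');"
print(decode_ip(sample))  # 178.33.62.155
```

Reading the offset from the script rather than hard-coding 2 keeps the helper working if the padding length differs between rows, which is a guess about the site and not something confirmed here.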

2 Comments

Your answer helped me get my answer.
Perfect, that is what I was trying to do. Thank you very much!!

I admit that I wouldn't have got this without tell's answer because I missed how the IP addresses were coded in the scripts.

import re
import requests
from lxml import etree

page = requests.get('https://www.proxynova.com/proxy-server-list/country-fr/').text
parser = etree.HTMLParser()
tree = etree.fromstring(page, parser=parser)
table = tree.xpath('.//table[@id="tbl_proxy_list"]//script/text()')

for item in table:
    # Skip the two padding characters that substr(2) drops, then join both pieces
    m = re.match(r"document\.write\('..([0-9.]+)'[^']+'([0-9.]+)'", item)
    if m:
        print(''.join(m.groups()))
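
As for why the IP never shows up in the text content: judging by the asker's comment, the HTML that `requests` receives seems to contain only the script, and the visible address is added by the browser when it runs `document.write`. The sketch below illustrates this with the standard library's `html.parser` instead of lxml. The `raw_cell` string is an assumed reconstruction of the cell as served, and `TextCollector` is a name made up for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every text node, including the source of <script> elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Assumption: the cell as served contains the script but no IP text after it.
raw_cell = (
    '<td align="left"><script>'
    "document.write('23178.3'.substr(2) + '3.62.155');"
    "</script></td>"
)

collector = TextCollector()
collector.feed(raw_cell)
collector.close()
text = "".join(collector.chunks)

print(text)                     # the script source only
print("178.33.62.155" in text)  # False: no JavaScript has been executed
```

A static parser never executes the script, so the address has to be rebuilt from the two string literals, which is what both answers do.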

1 Comment

Thank you very much, it also helped me to understand how to do it using a different approach.
