I'm trying to get the proxy and port values from this HTML page: http://jsbin.com/noxuqusoga/edit?html,output

Here is a sample of the table structure from that page. It includes only one tr, but the actual HTML has many tr elements with the same structure:

<table class="table" id="tbl_proxy_list" width="950">
    <tbody>
        <tr data-proxy-id="1355950">
            <td align="left"><abbr title="103.227.175.125">103.227.175.125 </abbr></td>
            <td align="left"><a href="/proxy-server-list/port-8080/" title="Port 8080 proxies">8080</a></td>
            <td align="left"><time class="icon icon-check timeago" datetime="2018-08-18 04:56:47Z">9 min ago</time></td>
            <td align="left">
            <div class="progress-bar" data-value="22" title="1089">
            <div class="progress-bar-inner" style="width:22%; background-color: hsl(26.4,100%,50%);">&nbsp;</div>
            </div>
            <small>1089 ms</small></td>
            <td style="text-align:center !important;"><span style="color:#009900;">95%</span> <span> (94)</span></td>
            <td align="left"><img alt="sg" class="flag flag-sg" src="/assets/images/blank.gif" style="vertical-align: middle;" /> <a href="/proxy-server-list/country-sg/" title="Proxies from Singapore">Singapore <span class="proxy-city"> - Bukit Timah </span> </a></td>
            <td align="left"><span class="proxy_transparent" style="font-weight:bold; font-size:10px;">Transparent</span></td>
            <td><span>-</span></td>
        </tr>
  </tbody>
</table>

I'm able to scrape the proxy address, but I'm having difficulty with the port: its <td> has no id or class, and in some rows the value is wrapped in a hyperlink while in others it is not.

How can I output every scraped result in the form ip:port?

Here's my code:

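$html = file_get_html('http://jsbin.com/noxuqusoga/');

// Print every IP address (the title attribute of each <abbr>)
foreach ($html->find('abbr') as $element) {
    echo $element->title . '<br>';
}

// Print the text of every link inside a <td> (meant to be the ports,
// but this selector also matches the country links)
foreach ($html->find('td a') as $element) {
    echo $element->plaintext . '<br>';
}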

Please help,
Thanks

  • Use DomNode->next_sibling to get next td and strip_tags to strip <a> tags. Commented Aug 18, 2018 at 5:47
  • That does not look like DOM but a PHP library called SimpleHTMLDOM Commented Aug 20, 2018 at 9:27

1 Answer

Instead of writing a selector for the td elements (or for elements inside them, such as abbr or a), write a selector for their parent tr. Then loop over the rows and, for each row, fetch the descendants you need:

// Select all tr elements inside tbody
foreach ($html->find('tbody tr') as $row) {
    // The second argument (0) returns only the first element matching the selector

    // The IP is in the first <abbr> element inside a td
    $ip = trim($row->find('td abbr', 0)->plaintext);
    // The port is in the first <a> element inside a td
    $port = trim($row->find('td a', 0)->plaintext);
    print "$ip:$port\n";
}

Alternatively, besides CSS selectors you can also select elements by their index. In your case, the values you want are in the first and second td of each tr, so you can take those two cells from each row and read their text. Because this does not depend on the <a> element, it also covers rows where the port is not wrapped in a hyperlink.
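The index-based idea can be sketched as follows. To keep the example runnable without extra files, it uses PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM; with Simple HTML DOM the equivalent lookups would be $row->find('td', 0) and $row->find('td', 1). The function name extractProxies and the second sample row (a port with no link) are made up for illustration.

```php
<?php
// Sample markup reduced from the question. The second row is a hypothetical
// case in which the port is plain text instead of a link.
$markup = <<<HTML
<table id="tbl_proxy_list">
  <tbody>
    <tr><td><abbr title="103.227.175.125">103.227.175.125 </abbr></td><td><a href="/proxy-server-list/port-8080/">8080</a></td></tr>
    <tr><td><abbr title="10.0.0.1">10.0.0.1 </abbr></td><td>3128</td></tr>
  </tbody>
</table>
HTML;

// Hypothetical helper: returns a list of "ip:port" strings.
function extractProxies(string $markup): array
{
    $doc = new DOMDocument();
    // Suppress the warnings libxml raises for imperfect real-world HTML.
    @$doc->loadHTML($markup);
    $xpath = new DOMXPath($doc);

    $proxies = [];
    foreach ($xpath->query('//table[@id="tbl_proxy_list"]//tr') as $row) {
        // Pick cells by position: the first td holds the IP, the second the port.
        $cells = $xpath->query('./td', $row);
        if ($cells->length < 2) {
            continue; // skip header or malformed rows
        }
        // textContent returns the text whether or not it is wrapped in <abbr> or <a>.
        $ip = trim($cells->item(0)->textContent);
        $port = trim($cells->item(1)->textContent);
        $proxies[] = "$ip:$port";
    }
    return $proxies;
}

foreach (extractProxies($markup) as $proxy) {
    echo $proxy, "\n";
}
```

Reading the whole cell text instead of a specific child element is what makes this tolerant of rows with and without a link around the port.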
