0

From a cURL request I get an HTML table with the structure below. I want to extract only the table rows that contain a span element with an empty class, not the ones with class="subcomponent". I managed to find the span elements with the empty class using XPath, but how do I get the entire <tr>, or even better, the specific <td> nodes that contain Version and Partnumber? Thanks in advance.

<table>
...
<tbody>
    <tr>
        <td></td>
        <td></td>
        <td>
            <span class="">Product</span>
        </td>
        <td>Version</td>
        <td>Partnumber</td>
    </tr>
    <tr>
        <td></td>
        <td></td>
        <td>
            <span class="subcomponent">Component</span>
        </td>
        <td>Version</td>
        <td>Partnumber</td>
    </tr>
</tbody>
</table>

My PHP code

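$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them for malformed HTML.
libxml_use_internal_errors(true);
$doc->loadHTML($page);
$xpath = new DOMXPath($doc);
// Matches only the <span> elements whose class attribute is empty.
$query = '//span[@class=""]';
$entries = $xpath->query($query);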

foreach ($entries as $entry) {
    echo $entry->C14N();
}

2 Answers

2

To access the table rows themselves using SimpleXML, you can use the following:

$sxml = simplexml_load_string('<table>...</table>');

$rows = $sxml->xpath('//tr[td/span[@class=""]]');

foreach ($rows as $row) {
  echo "Version: ", $row->td[3], ", Partnumber: ", $row->td[4];
}

The XPath selects every <tr> element that has a child <td> which in turn has a child <span> whose class attribute is present but empty. A <span> with no class attribute at all does not match.

In the loop, you access the cells of each row by zero-based index (td[3] is the fourth cell, td[4] the fifth), since your sample doesn't label them any other way. This breaks if the column order changes, but I'm assuming the table structure won't change often, so that should be fine.

See https://eval.in/860169 for an example.
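
As a self-contained sketch of the SimpleXML approach, the snippet below runs the same XPath against a table modelled on the one in the question. The cell values ("1.0", "PN-001", and so on), the third row, and the function name extractRows are placeholders invented for illustration; it also assumes the markup is well-formed XML.

```php
<?php
// Sample markup modelled on the question's table. All cell values are
// made-up placeholders, and the third row (a <span> with no class
// attribute) is added only to show that it is not matched.
$tableXml = <<<'XML'
<table>
<tbody>
    <tr>
        <td></td><td></td>
        <td><span class="">Product</span></td>
        <td>1.0</td><td>PN-001</td>
    </tr>
    <tr>
        <td></td><td></td>
        <td><span class="subcomponent">Component</span></td>
        <td>2.0</td><td>PN-002</td>
    </tr>
    <tr>
        <td></td><td></td>
        <td><span>No class attribute</span></td>
        <td>3.0</td><td>PN-003</td>
    </tr>
</tbody>
</table>
XML;

// Hypothetical helper: returns version and part number for each row
// whose <span> has an empty class attribute.
function extractRows(string $xml): array
{
    $sxml = simplexml_load_string($xml);
    if ($sxml === false) {
        throw new RuntimeException('Input is not well-formed XML');
    }

    $result = [];
    foreach ($sxml->xpath('//tr[td/span[@class=""]]') as $row) {
        // SimpleXML child indexes are zero-based: td[3] is the fourth cell.
        $result[] = [
            'version'    => (string) $row->td[3],
            'partnumber' => (string) $row->td[4],
        ];
    }
    return $result;
}

$productRows = extractRows($tableXml);
print_r($productRows);
```

Only the first row is returned here: the second has a non-empty class, and the third has no class attribute, so neither satisfies @class="".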

Alternative DOMDocument Version

If you're fetching a full web page, which won't necessarily be well-formed XML, you might need to use DOMDocument as in your own example. Accessing the child elements is a bit less clean, but something like the following will work:

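$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them for malformed HTML.
libxml_use_internal_errors(true);
$doc->loadHTML($page);
$xpath = new DOMXPath($doc);
// Select each <tr> that has a <td> child containing a <span> with an empty class.
$rows = $xpath->query('//tr[td/span[@class=""]]');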

foreach ($rows as $row) {
    $cells = $row->getElementsByTagName('td');

    $version = $cells->item(3)->nodeValue;
    $partNumber = $cells->item(4)->nodeValue;

    echo "Version: {$version}, Part Number: {$partNumber}", PHP_EOL;
}

See https://eval.in/860217
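
If you would rather pick the two cells with XPath as well, one possible variation is to evaluate a relative expression against each matched row. This is a minimal sketch, not the only way to do it: the function name extractVersionAndPartNumber, the sample page, and its cell values are assumptions made for illustration. Note that XPath positions are one-based, so td[4] and td[5] correspond to item(3) and item(4) above.

```php
<?php
// Hypothetical helper: parses a (possibly malformed) HTML page and returns
// version and part number for each row whose <span> has an empty class.
function extractVersionAndPartNumber(string $page): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($page);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $result = [];
    foreach ($xpath->query('//tr[td/span[@class=""]]') as $row) {
        // Relative expressions are evaluated with $row as the context node,
        // so only direct <td> children of this row are considered.
        $result[] = [
            'version'    => trim($xpath->evaluate('string(td[4])', $row)),
            'partnumber' => trim($xpath->evaluate('string(td[5])', $row)),
        ];
    }
    return $result;
}

// Placeholder page standing in for the cURL response; values are made up.
$samplePage = '<html><body><br><table><tbody>'
    . '<tr><td></td><td></td><td><span class="">Product</span></td>'
    . '<td>1.0</td><td>PN-001</td></tr>'
    . '<tr><td></td><td></td><td><span class="subcomponent">Component</span></td>'
    . '<td>2.0</td><td>PN-002</td></tr>'
    . '</tbody></table></body></html>';

$matches = extractVersionAndPartNumber($samplePage);
print_r($matches);
```

Unlike getElementsByTagName('td'), which also returns cells of any nested tables, the relative td[4] step only looks at the row's own cells.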

3 Comments

I get the table through a curl command and have stored it in $page. How would I make that work with your code?
If the page is well-formed, you should just be able to use $sxml = simplexml_load_string($page); instead of the first line. I've also edited the answer with a DOMDocument, in case that doesn't work.
Thank you - the alternative DOMDocument approach works great!
-1

I would use the following XPath expression:

//td[text()="Version"] | //td[text()="Partnumber"]

Which gives me:

Element='<td>Version</td>'
Element='<td>Partnumber</td>'  
Element='<td>Version</td>'
Element='<td>Partnumber</td>'
