String length problems when using PHP, DOMDocument and XPATH

Question

I fetch data with cURL and parse it with DOMDocument and XPath. strlen() is reporting lengths that don't match the text I see on screen.

Some intro code:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($ch);
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);

I fetch the data I need and it works well, but now I need to compare two strings. Original is taken straight from a <li>-tag. Parsed is four or five <span>s joined together.

$original = $i[$n]['full'];
$parsed = $i[$n]['value'].$i[$n]['type'].$i[$n]['name'].$i[$n]['extra'];

echo $original."<br>";
echo $parsed."<br><br>";
echo strlen($original)."<br>";
echo strlen($parsed)."<br><br>";

gives:

4 -5 boneless chicken breasts
4-5Boneless chicken breasts

70
27

I started messing around by replacing all spaces, trying mb_strlen with different encodings, typecasting to string, but all to no avail:

$replace = array(' ',',');
$mod_original = str_replace($replace,'',$original);
$mod_parsed = str_replace($replace,'',$parsed);

var_dump($mod_original);
echo "<br>";
var_dump($mod_parsed);
echo "<br><br>";

echo mb_strlen($mod_original,'UTF-8')."<br>";
echo mb_strlen($mod_parsed,'UTF-8')."<br>";

Results:

string(62) "4-5 bonelesschickenbreasts" 
string(25) "4-5Bonelesschickenbreasts" 

62
25

Something is strange. str_replace won't even remove that last whitespace.

Any help is appreciated.

Jeffrey Blake · Accepted Answer · 2013-08-01 13:05:07Z


I can tell that you are viewing this in your browser from the fact that the echo "<br>" statements make a new line. Any other HTML elements in the string will be rendered by the browser as well. If they occur at the end of the string, they may have no effect on the displayed text but still count toward the length. They could even occur in the middle of the string without affecting formatting, if the tags are of a type that does not change the appearance of the output.

Another possibility is that you have other whitespace chars or non-printable chars.

To confirm which, view the source of the document in your browser, instead of looking at the rendered output. If you don't see anything at that point, try downloading the document and looking at it in a good text editor (like Notepad++) where you can adjust what characters are shown to include chars that are typically not printed.
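If switching to view-source is inconvenient, the same check can be done from PHP itself. This is only a sketch: the sample string is made up (the <span> tags and the non-breaking space are assumptions about what the real data might contain), and the "\u{00A0}" escape needs PHP 7 or later.

```php
<?php
// Hypothetical sample; the markup and the non-breaking space are assumptions
// chosen to imitate a string that looks short once the browser renders it.
$original = "<span>4</span> <span>-5</span>\u{00A0}boneless chicken breasts";

// htmlspecialchars() escapes the tags, so the browser prints them
// instead of rendering them.
$visible = htmlspecialchars($original, ENT_QUOTES, 'UTF-8');
echo $visible, "<br>";

// bin2hex() exposes every byte, including invisible ones
// (c2a0 is the UTF-8 encoding of a non-breaking space).
$hex = bin2hex($original);
echo $hex, "<br>";

// Compare the byte length with and without markup.
$withTags = strlen($original);
$withoutTags = strlen(strip_tags($original));
echo $withTags, " vs ", $withoutTags, "<br>";
```

If the two lengths on the last line differ, the extra bytes are markup; if the hex dump shows bytes you can't see in the text, they are hidden characters.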

Once you figure out which characters/tags are causing the issue, then you can create a str_replace() or preg_replace() call to deal with it appropriately.
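As a rough sketch of what that cleanup could look like: normalize_for_compare() below is a hypothetical helper, not something from the question, and its rules (drop tags, decode entities, remove all whitespace including non-breaking spaces, ignore case) are assumptions about what should count as "equal" here. It relies on the mbstring extension, which the question already uses.

```php
<?php
// Hypothetical helper; the name and the normalization rules are assumptions.
function normalize_for_compare(string $s): string
{
    // Drop any markup that came along with the node.
    $s = strip_tags($s);
    // Turn entities such as &nbsp; into real characters (U+00A0 here).
    $s = html_entity_decode($s, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Remove all whitespace, including non-breaking spaces.
    // preg_replace() returns null on invalid UTF-8, so fall back to the input.
    $s = preg_replace('/[\s\x{00A0}]+/u', '', $s) ?? $s;
    // Ignore case differences such as "boneless" vs "Boneless".
    return mb_strtolower($s, 'UTF-8');
}

$original = '<span>4</span> -5&nbsp;boneless chicken breasts';
$parsed   = '4-5Boneless chicken breasts';

var_dump(normalize_for_compare($original) === normalize_for_compare($parsed));
```

Adjust the steps to whatever the inspection actually turns up; there is no point stripping tags if the culprit is a stray control character.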


1 Comment

Mattis Over a year ago

I am, and you are most likely correct. Do you know enough XPath to give a query('//li[@class="i"]/') that parses it as plain text, without keeping the tag information?
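One possible approach, offered as a sketch rather than a confirmed fix: select the <li> elements and read each node's textContent, which concatenates the descendant text nodes and leaves the tags behind. The markup below is invented for illustration; only the class name "i" comes from the comment.

```php
<?php
// Hypothetical markup; only the class name "i" is taken from the comment.
$html = '<ul><li class="i"><span>4</span> <span>-5</span> '
      . '<span>boneless chicken breasts</span></li></ul>';

// Collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$texts = [];
foreach ($xpath->query('//li[@class="i"]') as $li) {
    // textContent returns the text of all descendants without any tags.
    $texts[] = trim($li->textContent);
}

// evaluate() with the XPath string() function returns the text
// of the first matching node directly, without a loop.
$first = trim($xpath->evaluate('string(//li[@class="i"])'));

var_dump($texts, $first);
```

If the stored value came from saveHTML() or saveXML() on the node, that would explain the extra length, since those methods serialize the tags as well.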
