I fetch data with cURL and parse it with DOMDocument and XPath. strlen() is returning counts that don't match the visible text.

Some intro code:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$data = curl_exec($ch);

$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);

I fetch the data I need and it works well, but now I need to compare two strings. Original is taken straight from a <li>-tag. Parsed is four or five <span>s joined together.

$original = $i[$n]['full'];
$parsed = $i[$n]['value'].$i[$n]['type'].$i[$n]['name'].$i[$n]['extra'];

echo $original."<br>";
echo $parsed."<br><br>";
echo strlen($original)."<br>";
echo strlen($parsed)."<br><br>";

give:

4 -5 boneless chicken breasts
4-5Boneless chicken breasts

70
27

I started messing around by replacing all spaces, trying mb_strlen with different encodings, typecasting to string, but all to no avail:

$replace = array(' ',',');
$mod_original = str_replace($replace,'',$original);
$mod_parsed = str_replace($replace,'',$parsed);

var_dump($mod_original);
echo "<br>";
var_dump($mod_parsed);
echo "<br><br>";

echo mb_strlen($mod_original,'UTF-8')."<br>";
echo mb_strlen($mod_parsed,'UTF-8')."<br>";

Results:

string(62) "4-5 bonelesschickenbreasts" 
string(25) "4-5Bonelesschickenbreasts" 

62
25

Something is strange. str_replace won't even remove that last whitespace.

Any help is appreciated.

1 Answer

I can tell that you are viewing this in your browser because the echo "<br>" statements produce line breaks. Any other HTML elements in the strings will be rendered by the browser as well. If they occur at the end of the string, they may have no effect on the displayed text but will still count toward the length. Heck, they could even occur in the middle of the string without affecting formatting, if the tags are of a type that doesn't change the appearance of the string's output.

Another possibility is that the strings contain other whitespace characters or non-printable characters.

To confirm which, view the source of the document in your browser, instead of looking at the rendered output. If you don't see anything at that point, try downloading the document and looking at it in a good text editor (like Notepad++) where you can adjust what characters are shown to include chars that are typically not printed.
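If switching to view-source is inconvenient, a rough alternative is to make the hidden content visible in the rendered page itself. The sketch below is only an illustration: revealHidden() is a hypothetical helper, and $sample is made-up data standing in for $original (here assumed to contain a non-breaking space and a <span>).

```php
<?php
// Hypothetical helper: show every byte outside printable ASCII as a \xNN escape.
function revealHidden(string $s): string
{
    return preg_replace_callback(
        '/[^\x20-\x7E]/',
        function (array $m): string {
            return sprintf('\\x%02X', ord($m[0]));
        },
        $s
    );
}

// Made-up sample; in the question this would be $original.
$sample = "4\xC2\xA0-5 <span>boneless chicken breasts</span>";

// htmlspecialchars() makes the browser print tags instead of rendering them.
echo htmlspecialchars(revealHidden($sample), ENT_QUOTES, 'UTF-8'), "<br>";
echo strlen($sample), "<br>";
```

Anything that shows up as a tag or a \xNN escape is counted by strlen() even though the browser never displays it.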

Once you figure out which characters/tags are causing the issue, then you can create a str_replace() or preg_replace() call to deal with it appropriately.
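As a rough sketch of what such a cleanup could look like, assuming the culprits turn out to be leftover tags, HTML entities, or non-breaking spaces (the question does not confirm which): normalizeText() and comparableKey() are hypothetical names, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper: reduce a scraped string to plain, single-spaced text.
function normalizeText(string $s): string
{
    $s = strip_tags($s);                                          // drop any markup
    $s = html_entity_decode($s, ENT_QUOTES | ENT_HTML5, 'UTF-8'); // e.g. &nbsp; becomes U+00A0
    // Collapse runs of whitespace and non-breaking spaces into one space.
    // preg_replace() returns null on invalid UTF-8, so fall back to the input.
    $s = preg_replace('/[\s\x{00A0}]+/u', ' ', $s) ?? $s;
    return trim($s);
}

// Hypothetical helper: ignore spacing and letter case when comparing.
function comparableKey(string $s): string
{
    return mb_strtolower(str_replace(' ', '', normalizeText($s)), 'UTF-8');
}

// Made-up inputs shaped like the strings in the question.
$original = "4&nbsp;-5 <span>boneless  chicken</span> breasts\n";
$parsed   = "4-5Boneless chicken breasts";

var_dump(comparableKey($original) === comparableKey($parsed));
```

Stripping tags before decoding entities keeps an escaped &lt; in the text from being mistaken for markup.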

1 Comment

I am, and you are most likely correct. Do you know enough XPath to give a query('//li[@class="i"]/'); that parses it as plain text, without saving tag information?
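
One possible way to get plain text for that query (without the trailing slash, which is not valid XPath) is to read each node's textContent, or to let XPath's normalize-space() build the string. This is a minimal sketch, not a tested fix for the page in question: liTexts() is a hypothetical helper and the markup is made up.

```php
<?php
// Hypothetical helper: return the plain text of every <li class="i">.
function liTexts(string $html): array
{
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $texts = [];
    foreach ($xpath->query('//li[@class="i"]') as $li) {
        // textContent concatenates all descendant text nodes, with no tags.
        $texts[] = $li->textContent;
    }
    return $texts;
}

// Made-up markup standing in for the fetched page.
$html = '<ul><li class="i"><span>4-5 </span><span>boneless chicken breasts</span></li></ul>';
print_r(liTexts($html));

// Alternative for the first match only: let XPath trim and collapse whitespace.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
echo $xpath->evaluate('normalize-space(//li[@class="i"])'), "\n";
```

If the stored strings come from saveHTML() or similar, switching to textContent should keep the tags out of them in the first place.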