I'm hitting an encoding issue when retrieving a third-party feed: json_decode() fails, and json_last_error() reports JSON_ERROR_CTRL_CHAR ("Unexpected control character found").

From what I've read, this can be caused by a non-UTF-8 character appearing in the data.

I've run the copied JSON through a linter, and it is valid. Copying and pasting the JSON from the remote feed into a string and decoding that works fine; it only fails when the feed is read directly via file_get_contents.

{
    "numberOfResults": 124,
    "queryTime": 0,
    "products": [
        {
            "productId": "9130047$0290f955-ce36-46c9-9771-184f05985c62",
            "status": null,
            "serviceId": null,
            "productName": null,
            "serviceName": null,
            "productDescription": null,
            "serviceDescription": null,
            "productCategoryId": null,
            "nearestLocation": null,
            "boundary": null,
            "distanceToLocation": null,
            "startDate": null,
            "endDate": null,
            "productImage": null,
            "serviceImage": null,
            "tqual": null,
            "trip_advisor": null,
            "freeEntry": null,
            "booster": null,
            "starRating": null,
            "rateFrom": null,
            "rateTo": null,
            "productClassifications": null,
            "internet_service_ssid": null,
            "internet_service_type": null,
            "linked_productid": null,
            "states": null,
            "suburbs": null,
            "addresses": null,
            "cities": null,
            "comms_em": null,
            "comms_mb": null,
            "comms_burl": null,
            "comms_url": null,
            "comms_ph": null,
            "comms_fx": null,
            "comms_wap": null,
            "internet_points": null
        }
    ],
    "facetGroups": []
}

And just a simple decode...

$raw = file_get_contents($url);
$result = json_decode($raw, false);

// json_last_error() shows JSON_ERROR_CTRL_CHAR
  • possible duplicate of Problem with json_decode PHP Commented Nov 11, 2014 at 21:28
  • Reported as a bug in PHP 5.32. What are you using? grokbase.com/t/php/php-bugs/1076k3pade/… Commented Nov 11, 2014 at 21:29
  • Using 5.4.34, also tried using stripslashes and htmlentities... Commented Nov 11, 2014 at 21:48
  • Did you check, visually, what $raw contains? Commented Nov 11, 2014 at 21:49
  • Run the data through hd, chances are that there are invisible chars that still violate the JSON spec. Alternatively, regex-search for anything that is not inside the expected character set and see what you find. Commented Nov 11, 2014 at 21:51
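
For anyone without hd to hand, the byte-level check suggested in the last comment can be roughly approximated in PHP itself. This is only a sketch: find_control_bytes() is a hypothetical helper name, not something from the thread, and it reports only raw control bytes other than tab, LF and CR.

```php
<?php
// Hypothetical helper (an assumption, not from the thread): list every raw
// control byte in $raw, keyed by its byte offset, as a two-digit hex string.
// Tab (09), LF (0a) and CR (0d) are skipped because JSON allows them as
// whitespace between tokens.
function find_control_bytes($raw) {
    $found = array();
    if (preg_match_all('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $raw, $m, PREG_OFFSET_CAPTURE)) {
        foreach ($m[0] as $hit) {
            // $hit[0] is the matched byte, $hit[1] is its offset in $raw
            $found[$hit[1]] = bin2hex($hit[0]);
        }
    }
    return $found;
}
```

Printing bin2hex(substr($raw, 0, 4)) alongside this is also informative: a result of fffe7b00, for instance, would be a UTF-16LE byte order mark followed by a two-byte "{", and a NUL byte at every other offset points the same way.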

1 Answer

Thanks to @UlrichEckhardt's suggestion, I found the link below, which provides a useful regex in case anyone else comes across this issue.

// Modified from http://magp.ie/2011/01/06/remove-non-utf8-characters-from-string-with-php/
// Simply strip out incompatible bytes (they are removed, not replaced)
function lint_json($string) {
    // Strip control characters (keeping tab, LF and CR), stray continuation bytes
    // (together with the ASCII byte before them), overlong 2-byte sequences, truncated
    // sequences, and every 4-byte sequence (i.e. all characters at or above U+10000).
    // The control range is \x0E-\x1F here; the linked version stopped at \x19.
    $string = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]|[\x00-\x7F][\x80-\xBF]+|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S', '', $string );

    // Strip overlong 3-byte sequences and UTF-16 surrogates
    $string = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]|\xED[\xA0-\xBF][\x80-\xBF]/S','', $string );

    return $string;
}

EDIT:

After further investigation, the real cause turned out to be that the supplied JSON is encoded as UTF-16, whereas json_decode only works with UTF-8 input. The code below converts it first.

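function lint_json2($string) {
    // Convert from UTF-16LE to UTF-8. The //IGNORE suffix belongs on the
    // target encoding, where it drops characters that cannot be represented.
    $string = iconv('UTF-16LE', 'UTF-8//IGNORE', $string);

    // Dirty, but strip anything (e.g. a byte order mark) before the first opening brace.
    // Note that strstr() returns false if there is no '{' at all.
    $string = strstr($string, '{');

    return $string;
}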
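
A slightly more defensive variant of the same idea is to look at the leading bytes before converting, rather than assuming UTF-16LE for every response. The following is a minimal sketch under stated assumptions: decode_feed_json() is a hypothetical name, only UTF-8 and UTF-16 are considered (UTF-32 is not handled), and the BOM-less branch is a heuristic that relies on JSON text starting with an ASCII character.

```php
<?php
// Hypothetical helper (not from the original answer): convert a feed body to
// UTF-8 based on its leading bytes, then decode it.
function decode_feed_json($raw) {
    if (strncmp($raw, "\xEF\xBB\xBF", 3) === 0) {
        // UTF-8 with a BOM: drop the three BOM bytes.
        $raw = substr($raw, 3);
    } elseif (strncmp($raw, "\xFF\xFE", 2) === 0) {
        // UTF-16 little-endian with a BOM.
        $raw = iconv('UTF-16LE', 'UTF-8', substr($raw, 2));
    } elseif (strncmp($raw, "\xFE\xFF", 2) === 0) {
        // UTF-16 big-endian with a BOM.
        $raw = iconv('UTF-16BE', 'UTF-8', substr($raw, 2));
    } elseif (strlen($raw) >= 2 && $raw[0] !== "\x00" && $raw[1] === "\x00") {
        // Heuristic (assumption): no BOM, but an ASCII byte followed by a
        // NUL byte looks like UTF-16LE.
        $raw = iconv('UTF-16LE', 'UTF-8', $raw);
    }

    if ($raw === false) {
        throw new RuntimeException('Could not convert the feed to UTF-8');
    }

    $result = json_decode($raw, false);
    if (json_last_error() !== JSON_ERROR_NONE) {
        // Pass the json_last_error() code along as the exception code.
        throw new RuntimeException('json_decode failed', json_last_error());
    }
    return $result;
}
```

Usage would be along the lines of $result = decode_feed_json(file_get_contents($url)); wrapped in a try/catch. Throwing on failure, rather than returning null, keeps a failed decode distinguishable from a feed whose body is literally null.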

2 Comments

One of the points of JSON is that it is Unicode-capable. Filtering anything outside of the basic multilingual plane means that it doesn't work for several cases. Fix the code that generates the broken JSON output instead of trying to work around it.
That's a fair call - however I don't have control over what's generating the feed. All I can do is suggest to them to fix it. However, further investigation in edit above.
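
On the first comment's point about characters outside the basic multilingual plane: if the input is already valid UTF-8, a narrower filter can remove only the raw control characters and leave every other character alone. This is a sketch only; strip_control_chars() is a hypothetical name, and it deliberately leaves tab, LF and CR in place because JSON allows them as whitespace between tokens.

```php
<?php
// Hypothetical helper (an assumption, not from the thread): remove raw
// control characters from a string that is already valid UTF-8.
function strip_control_chars($string) {
    // With the /u modifier, preg_match() returns false on malformed UTF-8,
    // so this acts as a validity check before anything is stripped.
    if (preg_match('//u', $string) !== 1) {
        return false;
    }
    // Bytes below 0x80 never occur inside a multi-byte UTF-8 sequence, so a
    // byte-wise replace cannot damage characters such as those above U+FFFF.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $string);
}
```

A false return signals that the data is not UTF-8 at all, in which case converting the encoding is the step to look at rather than filtering.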
