
I have invalid external JSON data, without double quotes around the names.

Example:

{
  data: [
    {
      idx: 0,
      id: "0",
      url: "http://247wallst.com/",
      a: [
        {
          t: "Title",
          u: "http://247wallst.com/2012/07/30/",
          sp: "About"
        }
      ],
      doc_id: "9386093612452939480"
    },
    {
      idx: 1,
      id: "-1"
    }
  ],
  results_per_page: 10,
  total_number_of_news: 76,
  news_per_month: [20, 0, 8, 1, 1, 2, 0, 2, 1, 0, 0, 1, 1, 0, 5, 1, 1, 1, 0, 2, 5, 16, 7, 1],
  result_start_num: 2,
  result_end_num: 2,
  result_total_articles: 76
}

As you can see, many names such as data, idx, id, url and others are not double-quoted, which makes this JSON invalid. How can I make this external JSON valid? I already tried str_replace, replacing '{' with '{"' and ':' with '":' to add double quotes around unquoted names, but this also corrupts some values that are already double-quoted.

How can I make this JSON valid so that I can read the data with PHP's json_decode? I'm not very familiar with preg_replace.

Valid JSON would look like:

{
  "data": [
    {
      "idx": 0,
      "id": "0",
      "url": "http://247wallst.com/",
      "a": [
        {
          "t": "Title",
          "u": "http://247wallst.com/2012/07/30/",
          "sp": "About"
        }
      ],
      "doc_id": "9386093612452939480"
    },
    {
      "idx": 1,
      "id": "-1"
    }
  ],
  "results_per_page": 10,
  "total_number_of_news": 76,
  "news_per_month": [20, 0, 8, 1, 1, 2, 0, 2, 1, 0, 0, 1, 1, 0, 5, 1, 1, 1, 0, 2, 5, 16, 7, 1],
  "result_start_num": 2,
  "result_end_num": 2,
  "result_total_articles": 76
}

Please suggest a suitable PHP preg_replace call.

Data source: http://www.google.com/finance/company_news?q=aapl&output=json&start=1&num=1

  • Where is this data being constructed? Is it a script that you control? Commented Jul 30, 2012 at 20:29
  • This probably isn't particularly helpful, but while what you show is not valid JSON it is valid JavaScript - so if you can fire it through JavaScript and then JSON encode it again, it will make your life pretty easy. If you've got Node.js handy this can be done with a simple exec() call. Although this is not exactly a great long-term solution. Commented Jul 30, 2012 at 20:31
  • I do not control construction of this script. Commented Jul 30, 2012 at 20:32

2 Answers

With preg_replace you can do:

json_decode(preg_replace('#(?<pre>\{|\[|,)\s*(?<key>(?:\w|_)+)\s*:#im', '$1"$2":', $in));
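To see where this one-liner holds and where it breaks, here is a small check. The first input is a made-up minimal sample; the second reuses the title quoted in the comments further down. The variable names are only for illustration.

```php
<?php
// The pattern from the one-liner above, unchanged.
$pattern = '#(?<pre>\{|\[|,)\s*(?<key>(?:\w|_)+)\s*:#im';

// Illustrative inputs, not fetched from the live feed.
$simple = '{idx: 0, id: "0"}';
$tricky = '{t: "Apple, Samsung, Kodak, Imation: Intellectual Property"}';

$simpleFixed = preg_replace($pattern, '$1"$2":', $simple);
$trickyFixed = preg_replace($pattern, '$1"$2":', $tricky);

// The simple input becomes valid JSON and decodes to an array.
var_dump(json_decode($simpleFixed, true));

// The tricky input does not: ", Imation:" sits inside a string value,
// but the pattern cannot know that and quotes "Imation" as if it were a key,
// so json_decode() returns NULL.
var_dump(json_decode($trickyFixed, true));
```

The regex only looks at the characters right before a candidate key, so any comma-word-colon sequence inside a string value is treated as a key.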

Since the one-liner above won't work with real data (battle plans seldom survive first contact with the enemy), here's my second take:

$infile = 'http://www.google.com/finance/company_news?q=aapl&output=json&start=1&num=1';

// first, get rid of the \x26 and other encoded bytes.
$in = preg_replace_callback('/\\\x([0-9A-F]{2})/i',
    function($match){
        return chr(intval($match[1], 16));
    }, file_get_contents($infile));

$out = $in;

// find key candidates
preg_match_all('#(?<=\{|\[|,)\s*(?<key>(?:\w|_)+?)\s*:#im', $in, $m, PREG_OFFSET_CAPTURE);

$replaces_so_far = 0;
// check whether each candidate is inside a quoted string or not
foreach ($m['key'] as $match) {
    // every time a key is expanded, later offsets shift by 2 (for the two " chars)
    $position = $match[1] + ($replaces_so_far * 2);
    $key = $match[0];
    $quotes_before = preg_match_all('/(?<!\\\)"/', substr($out, 0, $position), $m2);
    if ($quotes_before % 2) { // odd number of unescaped quotes before it: we are inside a string, ignore the candidate
        continue;
    }
    $out = substr_replace($out, '"'.$key.'"', $position, strlen($key));
    ++$replaces_so_far;
}

var_export(json_decode($out, true));

But since Google also offers this data as an RSS feed, I would recommend using that if it works for your use case; this is just for fun (-:
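For comparison, a string-aware variant of the same idea can be sketched as a single preg_replace_callback pass: the pattern matches either a complete double-quoted string (returned unchanged) or a bare identifier followed by a colon (wrapped in quotes). The function names quoteBareKeys and decodeLooseJson and the sample input are hypothetical. The sketch assumes the input is JSON apart from the unquoted keys and the \xNN escapes, and it is not a full parser.

```php
<?php
/**
 * Quote bare object keys while leaving double-quoted strings untouched.
 * Hypothetical helper, not part of any library.
 */
function quoteBareKeys(string $loose): string
{
    // Alternative 1: a whole double-quoted string, including escaped characters.
    // Alternative 2: a bare identifier that is directly followed by a colon.
    $pattern = '~"(?:[^"\\\\]|\\\\.)*"|([A-Za-z_]\w*)(?=\s*:)~';

    return preg_replace_callback($pattern, function (array $m): string {
        // Group 1 is only set when the bare-key alternative matched.
        return isset($m[1]) ? '"' . $m[1] . '"' : $m[0];
    }, $loose);
}

/**
 * Decode "almost JSON" into a PHP array, or throw if it is still invalid.
 */
function decodeLooseJson(string $loose)
{
    // JSON has no \xNN escape, so rewrite it as the equivalent \u00NN.
    $json = preg_replace_callback('~\\\\x([0-9A-Fa-f]{2})~', function (array $m): string {
        return '\\u00' . $m[1];
    }, $loose);

    $data = json_decode(quoteBareKeys($json), true);
    if (json_last_error() !== JSON_ERROR_NONE) {
        throw new RuntimeException('Still not valid JSON: ' . json_last_error_msg());
    }
    return $data;
}

// Made-up sample combining a colon inside a string and a \x26 escape.
$sample = '{t: "Apple, Samsung, Kodak, Imation: Intellectual Property \x26 more", idx: 0}';
$result = decodeLooseJson($sample);
var_export($result);
```

Because strings are consumed whole by the first alternative, commas and colons inside values never reach the key-matching branch. Rewriting \xNN as \u00NN rather than decoding it to a raw byte also keeps an escaped quote such as \x22 from breaking the string it sits in.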

7 Comments

That's what I was looking for! It works great with one element, but somehow fails to decode multiple elements (more than one). It fails to read google.com/finance/…
It looks to me like it's failing on the t:"Apple, Samsung, Kodak, Imation: Intellectual Property" part of the input. I guess you could try to hack the regexp, but writing a proper parser might be a better idea.
I figured it out! I replaced a couple of escaped characters and now it works: str_replace(array("\\x26","#39;"),array("","'"),$string); Thank you!
I've added a supposedly more robust way of doing this. It's not a one-liner, but it works correctly on your example data.
As you mentioned above, values like t:"Apple, Samsung, Kodak, Imation: Intellectual Property" really do cause errors. The second take you suggested fails at line 5 on my server, but I understand your idea pretty well.

The JSON feeds from Google always seem to be plagued with problems, formatted incorrectly in some way, shape or form. If you switch the feed to RSS, you can easily convert it to an array, or to JSON from there.

<?php

$contents = file_get_contents('http://www.google.com/finance/company_news?q=aapl&output=rss&start=1&num=1');

// Parse the RSS into a SimpleXMLElement (probably just use this)
$xml = simplexml_load_string($contents);

// Or if you specifically want JSON
$json = json_encode($xml);

// And back to an array (pass true to get associative arrays instead of objects)
print_r(json_decode($json, true));
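If some fields come through oddly after the json_encode round trip, one thing that may help is to build the array by hand and cast each field to a string, instead of relying on how json_encode serializes a SimpleXMLElement. The following is a minimal sketch that uses a made-up inline RSS snippet rather than the live feed, so the element names and the CDATA description are assumptions about the feed's shape.

```php
<?php
// Hypothetical RSS sample; the real feed's structure may differ.
$rss = <<<'XML'
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>Title</title>
      <link>http://247wallst.com/2012/07/30/</link>
      <description><![CDATA[<b>About</b> the article]]></description>
    </item>
  </channel>
</rss>
XML;

// LIBXML_NOCDATA merges CDATA sections into ordinary text nodes.
$xml = simplexml_load_string($rss, 'SimpleXMLElement', LIBXML_NOCDATA);
if ($xml === false) {
    throw new RuntimeException('Could not parse the RSS document');
}

// Cast every field explicitly so each value is a plain PHP string.
$items = [];
foreach ($xml->channel->item as $item) {
    $items[] = [
        'title'       => (string) $item->title,
        'link'        => (string) $item->link,
        'description' => (string) $item->description,
    ];
}

$json = json_encode($items);
print_r(json_decode($json, true));
```

This keeps only the fields that are named explicitly, which is usually what is wanted from a feed anyway.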

1 Comment

Interesting solution, but the [description] value of each element doesn't come through properly.
