
I have a JSON file which has multiple objects such as:

 {"reviewerID": "bc19970fff3383b2fe947cf9a3a5d7b13b6e57ef2cd53abc52bb2dfedf5fb1cd", "asin": "a6ed402934e3c1138111dce09256538afb04c566edf37c16b9ba099d23afb764", "overall": 2.0, "helpful": {"nHelpful": 1, "outOf": 1}, "reviewText": "This remote, for whatever reason, was chosen by Time Warner to replace their previous silver remote, the Time Warner Synergy V RC-U62CP-1.12S.  The actual function of this CLIKR-5 is OK, but the ergonomic design sets back remotes by 20 years.  The buttons are all the same, there's no separation of the number buttons, the volume and channel buttons are the same shape as the other buttons on the remote, and it all adds up to a crappy user experience.  Why would TWC accept this as a replacement?    I'm skipping this and paying double for a refurbished Synergy V.", "summary": "Ergonomic nightmare", "unixReviewTime": 1397433600}

{"reviewerID": "3689286c8658f54a2ff7aa68ce589c81f6cae4c4d9de76fa0f66d5c114f79837", "asin": "8939d791e9dd035aa58da024ace69b20d651cea4adf6159d984872b44f663301", "overall": 4.0, "helpful": {"nHelpful": 21, "outOf": 22}, "reviewText": "This is a great truck GPS. I've tried others and nothing seems to come close to the Rand McNally TND-700.Excellent screen size and resolution. The audio is loud enough to be heard over road noise and the purr of my Kenworth/Cat engine. I've used it for the last 8,000 miles or so and it has only glitched once. Just restarted it and it picked up on my route right where it should have.Clean up the minor issues and this unit rates a solid 5.Rand McNally 528881469 7-inch Intelliroute TND 700 Truck GPS", "summary": "Great Unit!", "unixReviewTime": 1280016000}

I am trying to convert it to a Pandas DataFrame using the following code:

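import json
import pandas as pd

train_df = pd.DataFrame()
count = 0
for line in open('train.json'):
    try:
        count += 1
        if count == 20001:  # stop after the first 20,000 lines
            break
        obj1 = json.loads(line)
        df1 = pd.DataFrame(obj1, index=[0])
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        train_df = pd.concat([train_df, df1], ignore_index=True)
    except ValueError:
        # some lines contain a stray backslash; strip it and retry
        line = line.replace('\\', '')
        obj = json.loads(line)
        df1 = pd.DataFrame(obj, index=[0])
        train_df = pd.concat([train_df, df1], ignore_index=True)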

However, this gives me NaN for nested values, i.e. the 'helpful' attribute. I want the output to have a separate column for each key of the nested attribute.

EDIT:

P.S.: I am using try/except because some objects contain a '\' character, which causes a JSON decode error.

Can anyone help? Is there any other approach I can use?

Thank You.

  • Have you tried pandas.read_json? pandas.pydata.org/pandas-docs/stable/generated/… Commented Nov 29, 2016 at 9:17
  • @DeepSpace Yes, I have. It gives me an error: ValueError: 'trailing data'. Commented Nov 29, 2016 at 9:21
  • Trailing data means there is extra data in your file that is not part of the json object. Have a look in your file and make sure it is all valid json. Commented Nov 29, 2016 at 9:24
  • @RichSmith I tried taking a look at the file, but it is too large to open in an editor. Also, when I ran the code above, I got a DataFrame, but it has NaN for the nested attribute 'helpful'. Commented Nov 29, 2016 at 9:26
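On the 'trailing data' error discussed in these comments: a likely cause, given that the file holds one JSON object per line, is that `pandas.read_json` tries to parse the whole file as a single JSON document. The following is a minimal sketch, assuming pandas 0.19 or later (where `read_json` accepts `lines=True`). It uses a shortened two-record in-memory sample rather than the real `train.json`, and the field values are abbreviated for illustration.

```python
import io
import pandas as pd

# Two shortened records in the same one-object-per-line layout as the question
# (IDs abbreviated, most fields omitted; these are illustrative stand-ins).
sample = (
    '{"reviewerID": "bc19970fff", "overall": 2.0, "helpful": {"nHelpful": 1, "outOf": 1}}\n'
    '{"reviewerID": "3689286c86", "overall": 4.0, "helpful": {"nHelpful": 21, "outOf": 22}}\n'
)

# lines=True tells pandas to parse each line as its own JSON object.
df = pd.read_json(io.StringIO(sample), lines=True)

# The nested object is not flattened: it stays as a dict in one "helpful" column.
print(df.shape)
print(type(df.loc[0, "helpful"]))
```

Reading the file this way avoids the parse error but still leaves `helpful` as a single column of dicts, so a flattening step is needed afterwards.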

2 Answers


Use json_normalize on the list of dictionaries. It flattens nested keys into separate columns, and on a large number of JSON objects it is noticeably faster than appending to a DataFrame one row at a time.

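import json
import pandas as pd  # pandas < 1.0: from pandas.io.json import json_normalize

my_list = []
with open('train.json') as f:
    for line in f:
        # strip backslashes that would otherwise cause a JSON decode error
        line = line.replace('\\', '')
        my_list.append(json.loads(line))

# .T transposes the result so that the keys become rows;
# drop it if you want to keep the keys as columns of the DataFrame
result_df = pd.json_normalize(my_list).T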
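To make the effect on the nested `helpful` attribute concrete, here is a minimal sketch, assuming pandas 1.0 or later, where the function is available as `pd.json_normalize`. The two records are shortened stand-ins for lines of `train.json`, not the real data.

```python
import json
import pandas as pd

# Shortened, illustrative records in the same shape as the question's data.
lines = [
    '{"reviewerID": "bc19970fff", "overall": 2.0, "helpful": {"nHelpful": 1, "outOf": 1}}',
    '{"reviewerID": "3689286c86", "overall": 4.0, "helpful": {"nHelpful": 21, "outOf": 22}}',
]
records = [json.loads(line) for line in lines]

# No transpose here: each key, including each nested key, becomes a column.
flat_df = pd.json_normalize(records)

# Nested keys are joined to their parent with a dot, e.g. "helpful.nHelpful".
print(sorted(flat_df.columns))
print(flat_df["helpful.nHelpful"].tolist())
```

Leaving out `.T` gives one row per review and one column per key, which is the layout the question asks for.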



Try:

pd.concat([pd.Series(json.loads(line)) for line in open('train.json')], axis=1)


1 Comment

This seems to work. Is there a way to apply this solution to only the first 100 objects and store them in a separate DataFrame? The file is very big, so I cannot run it on the entire file. Also, is there a way to use try/except with this? Some objects contain a '\' character, which gives me a JSONDecodeError.
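One possible way to handle both points in this comment is sketched below. It is not taken from either answer: the helper name `load_first_n` is hypothetical, and it assumes pandas 1.0 or later for `pd.json_normalize`. The fallback of stripping backslashes is borrowed from the question. Note that it also removes legitimate escapes such as `\"`, so a line can still fail after stripping, in which case this sketch skips it.

```python
import io
import json
import pandas as pd

def load_first_n(lines, n=100):
    """Parse lines until n objects are collected; return a flattened DataFrame."""
    records = []
    for line in lines:
        if len(records) >= n:
            break  # stop early so the rest of a large file is never read
        line = line.strip()
        if not line:
            continue  # ignore blank lines between objects
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            try:
                # retry with backslashes removed, as in the question
                records.append(json.loads(line.replace("\\", "")))
            except json.JSONDecodeError:
                continue  # still invalid: skip this line
    return pd.json_normalize(records)

# Illustrative in-memory input; the second line has an invalid "\s" escape.
sample = io.StringIO(
    '{"summary": "ok", "helpful": {"nHelpful": 1, "outOf": 1}}\n'
    + r'{"summary": "back\slash", "helpful": {"nHelpful": 21, "outOf": 22}}' + "\n"
    + '{"summary": "third", "helpful": {"nHelpful": 0, "outOf": 0}}\n'
)
first_two = load_first_n(sample, n=2)
print(first_two)
```

With the real file, the same helper could be called as `load_first_n(f, n=100)` inside a `with open('train.json') as f:` block, since a file object yields its lines one at a time.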
