
I'm reading data from a database (50k+ rows) where one column is stored as JSON. I want to extract that into a pandas DataFrame. The snippet below works, but it is fairly inefficient and takes forever when run against the whole database. Note that not all items have the same attributes, and that the JSON has some nested attributes.

How could I make this faster?

import pandas as pd
import json

df = pd.read_csv('http://pastebin.com/raw/7L86m9R2',
                 header=None, index_col=0, names=['data'])

df.data.apply(json.loads) \
       .apply(pd.io.json.json_normalize) \
       .pipe(lambda x: pd.concat(x.values))
# this returns a DataFrame where each JSON key is a column
  • Would df.data.apply(lambda x: pd.Series(json.loads(x))) do? Commented Dec 18, 2016 at 15:28
  • Can you store your pasted data in a different (any kind of a standard) format? Commented Dec 18, 2016 at 15:38
  • @JohnGalt: works but that doesn't flatten the dict Commented Dec 18, 2016 at 16:27
  • @MaxU: if possible, I'd prefer not to change the scraping script Commented Dec 18, 2016 at 16:40

3 Answers


json_normalize takes already-deserialized JSON — a dict, or a list (or Series) of dicts, such as the output of json.loads — rather than raw JSON strings.

pd.io.json.json_normalize(df.data.apply(json.loads))

setup

import pandas as pd
import json

df = pd.read_csv('http://pastebin.com/raw/7L86m9R2',
                 header=None, index_col=0, names=['data'])
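
A note for newer pandas: since pandas 1.0, json_normalize is exposed at the top level as pd.json_normalize. A minimal self-contained sketch of the same approach, using inline JSON strings as a stand-in since the pastebin data may no longer be available:

```python
import json
import pandas as pd

# Inline stand-in for the pastebin column of JSON strings,
# including a nested attribute and a key missing from one row.
raw = pd.Series([
    '{"name": "a", "info": {"id": 1}}',
    '{"name": "b", "info": {"id": 2}, "extra": 5}',
])

# Deserialize first, then normalize; nested keys become dotted columns.
flat = pd.json_normalize(raw.apply(json.loads).tolist())
print(sorted(flat.columns))  # ['extra', 'info.id', 'name']
```

Rows that lack a key simply get NaN in that column, which matches the question's "not all items have the same attributes" case.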

4 Comments

Thanks. That's faster than your first solution ;)
I get this error : 'DataFrame' object has no attribute 'data'
@AliMirzaei Replace it with your own column name.
It would be great if your answer and Madhur Yadav's were combined so that it included an example.

I think you can first convert the string column data to dicts with json.loads, then build a list of records from the values, and finally use DataFrame.from_records:

df = pd.read_csv('http://pastebin.com/raw/7L86m9R2',
                 header=None, index_col=0, names=['data'])

a = df.data.apply(json.loads).values.tolist()
print(pd.DataFrame.from_records(a))

Another idea (since df['data'] holds JSON strings, deserialize them first):

df = pd.json_normalize(df['data'].apply(json.loads))
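
For illustration, a small sketch of the from_records route on inline sample data (a hypothetical stand-in, since the pastebin URL may be dead). Note that, as the comments point out, this does not flatten nested dicts:

```python
import json
import pandas as pd

# Hypothetical stand-in for the database column of JSON strings.
s = pd.Series(['{"a": 1, "b": 2}', '{"a": 3, "c": 4}'])

records = s.apply(json.loads).tolist()   # list of dicts
out = pd.DataFrame.from_records(records)
print(out.shape)  # (2, 3) -- keys missing from a row become NaN
```

The speedup over row-by-row json_normalize comes from deserializing everything first and letting from_records build the frame in one pass.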

4 Comments

Thanks- That's about 100x faster than my initial approach. The only issue is that this doesn't expand the nested dicts. Would that be possible ?
Check another answer ;)
Quick question @jezrael: the order of the csv and the df you are making from variable 'a' is the same, right? The first record stays first, the second second, and so on. Will they ever shuffle?
@skybunk - Yes, exactly. There is no reason for them to shuffle.

from pandas.io.json import json_normalize

data = {
    "events": [
        {
            "timemillis": 1563467463580,
            "date": "18.7.2019",
            "time": "18:31:03,580",
            "name": "Player is loading",
            "data": ""
        },
        {
            "timemillis": 1563467463668,
            "date": "18.7.2019",
            "time": "18:31:03,668",
            "name": "Player is loaded",
            "data": "5"
        }
    ]
}

result = json_normalize(data, 'events')
print(result)
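
If top-level fields need to be carried along with each expanded event, json_normalize's meta parameter handles that. A small sketch using the top-level pd.json_normalize (the "player" field is invented for illustration and is not part of the original data):

```python
import pandas as pd

# "player" is a made-up top-level field, added for illustration.
data = {
    "player": "p1",
    "events": [
        {"timemillis": 1563467463580, "name": "Player is loading"},
        {"timemillis": 1563467463668, "name": "Player is loaded"},
    ],
}

# record_path selects the list to expand; meta repeats top-level fields per row.
result = pd.json_normalize(data, record_path="events", meta=["player"])
print(result.shape)  # (2, 3)
```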

