
I'm reading data from a database (50k+ rows) where one column is stored as JSON. I want to extract that into a pandas DataFrame. The snippet below works, but it is fairly inefficient and takes forever when run against the whole database. Note that not all items have the same attributes, and that the JSON has some nested attributes.

How could I make this faster?

import pandas as pd
import json

df = pd.read_csv('http://pastebin.com/raw/7L86m9R2',
                 header=None, index_col=0, names=['data'])

df.data.apply(json.loads) \
       .apply(pd.io.json.json_normalize) \
       .pipe(lambda x: pd.concat(x.values))
# this returns a DataFrame where each JSON key is a column
  • Would df.data.apply(lambda x: pd.Series(json.loads(x))) do? Commented Dec 18, 2016 at 15:28
  • Can you store your pasted data in a different (any kind of a standard) format? Commented Dec 18, 2016 at 15:38
  • @JohnGalt: works but that doesn't flatten the dict Commented Dec 18, 2016 at 16:27
  • @MaxU: if possible, I'd prefer not to change the scraping script Commented Dec 18, 2016 at 16:40

3 Answers


json_normalize takes already-deserialized JSON — a dict, or a list (or Series) of dicts, such as the output of json.loads — rather than raw JSON strings.

pd.io.json.json_normalize(df.data.apply(json.loads))

setup

import pandas as pd
import json

df = pd.read_csv('http://pastebin.com/raw/7L86m9R2',
                 header=None, index_col=0, names=['data'])
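
A note for newer pandas: since pandas 1.0, json_normalize is exposed at the top level as pd.json_normalize. A minimal self-contained sketch of the same approach, using inline JSON strings as a stand-in since the pastebin data may no longer be available:

```python
import json
import pandas as pd

# Inline stand-in for the pastebin column of JSON strings,
# including a nested attribute and a key missing from one row.
raw = pd.Series([
    '{"name": "a", "info": {"id": 1}}',
    '{"name": "b", "info": {"id": 2}, "extra": 5}',
])

# Deserialize first, then normalize; nested keys become dotted columns.
flat = pd.json_normalize(raw.apply(json.loads).tolist())
print(sorted(flat.columns))  # ['extra', 'info.id', 'name']
```

Rows that lack a key simply get NaN in that column, which matches the question's "not all items have the same attributes" case.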

4 Comments

Thanks. That's faster than your first solution ;)
I get this error : 'DataFrame' object has no attribute 'data'
@AliMirzaei Replace it with your own column name.
It would be great if your answer and Madhur Yadav's were combined so that it included an example.

I think you can first convert the string column data to dicts with json.loads, then build a list of records from the values, and finally use DataFrame.from_records:

df = pd.read_csv('http://pastebin.com/raw/7L86m9R2',
                 header=None, index_col=0, names=['data'])

a = df.data.apply(json.loads).values.tolist()
print(pd.DataFrame.from_records(a))

Another idea (since df['data'] holds JSON strings, deserialize them first):

df = pd.json_normalize(df['data'].apply(json.loads))
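
For illustration, a small sketch of the from_records route on inline sample data (a hypothetical stand-in, since the pastebin URL may be dead). Note that, as the comments point out, this does not flatten nested dicts:

```python
import json
import pandas as pd

# Hypothetical stand-in for the database column of JSON strings.
s = pd.Series(['{"a": 1, "b": 2}', '{"a": 3, "c": 4}'])

records = s.apply(json.loads).tolist()   # list of dicts
out = pd.DataFrame.from_records(records)
print(out.shape)  # (2, 3) -- keys missing from a row become NaN
```

The speedup over row-by-row json_normalize comes from deserializing everything first and letting from_records build the frame in one pass.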

4 Comments

Thanks- That's about 100x faster than my initial approach. The only issue is that this doesn't expand the nested dicts. Would that be possible ?
Check another answer ;)
Quick question @jezrael: the order of the csv and the df you are making from variable 'a' is the same, right? The first record stays first, the second second, and so on. Will they ever shuffle?
@skybunk - Yes, exactly. There is no reason for them to shuffle.

from pandas.io.json import json_normalize

data = {
    "events": [
        {
            "timemillis": 1563467463580,
            "date": "18.7.2019",
            "time": "18:31:03,580",
            "name": "Player is loading",
            "data": ""
        },
        {
            "timemillis": 1563467463668,
            "date": "18.7.2019",
            "time": "18:31:03,668",
            "name": "Player is loaded",
            "data": "5"
        }
    ]
}

result = json_normalize(data, 'events')
print(result)
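
If top-level fields need to be carried along with each expanded event, json_normalize's meta parameter handles that. A small sketch using the top-level pd.json_normalize (the "player" field is invented for illustration and is not part of the original data):

```python
import pandas as pd

# "player" is a made-up top-level field, added for illustration.
data = {
    "player": "p1",
    "events": [
        {"timemillis": 1563467463580, "name": "Player is loading"},
        {"timemillis": 1563467463668, "name": "Player is loaded"},
    ],
}

# record_path selects the list to expand; meta repeats top-level fields per row.
result = pd.json_normalize(data, record_path="events", meta=["player"])
print(result.shape)  # (2, 3)
```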

