
I have a dataframe that contains JSON columns. It is quite large and not very efficient to work with, so I would like to store it as a nested dataframe instead.

A sample dataframe looks like:

id                       date                                                                                                                                              ag                                                                                         marks
0  I2213 2022-01-01 13:28:05.448054  [{'type': 'A', 'values': {'X': {'F1': 0.1, 'F2': 0.2}, 'U': {'F1': 0.3, 'F2': 0.4}}}, {'type': 'B', 'results': {'Y': {'F1': 0.3, 'F2': 0.2}}}]            [{'type': 'A', 'marks': {'X': 0.5, 'U': 0.7}}, {'type': 'B', 'marks': {'Y': 0.4}}]
1  I2213 2022-01-01 14:28:05.448054                                                                                        [{'type': 'B', 'values': {'Z': {'F1': 0.4, 'F2': 0.2}}}]  [{'type': 'A', 'marks': {'X': 0.4, 'U': 0.6}}, {'type': 'B', 'marks': {'Y': 0.3, 'Z': 0.4}}]
2  I2213 2022-01-03 15:28:05.448054                                                                                        [{'type': 'A', 'values': {'X': {'F1': 0.2, 'F2': 0.1}}}]            [{'type': 'A', 'marks': {'X': 0.2, 'U': 0.9}}, {'type': 'B', 'marks': {'Y': 0.2}}]

Expected output: a nested dataframe grouped by date (see the attached image).

Sample code for generating the sample dataframe:

import pandas as pd
from datetime import datetime, timedelta

def sample_data():
    ag_data = [
        "[{'type': 'A', 'values': {'X': {'F1': 0.1, 'F2': 0.2}, 'U': {'F1': 0.3, 'F2': 0.4}}}, {'type': 'B', 'results': {'Y': {'F1': 0.3, 'F2': 0.2}}}]",
        "[{'type': 'B', 'values': {'Z': {'F1': 0.4, 'F2': 0.2}}}]",
        "[{'type': 'A', 'values': {'X': {'F1': 0.2, 'F2': 0.1}}}]",
    ]
    marks_data = [
         "[{'type': 'A', 'marks': {'X': 0.5, 'U': 0.7}}, {'type': 'B', 'marks': {'Y': 0.4}}]",
         "[{'type': 'A', 'marks': {'X': 0.4, 'U': 0.6}}, {'type': 'B', 'marks': {'Y': 0.3, 'Z': 0.4}}]",
         "[{'type': 'A', 'marks': {'X': 0.2, 'U': 0.9}}, {'type': 'B', 'marks': {'Y': 0.2}}]",
    ]
    date_data = [
        datetime.now() - timedelta(3, seconds=7200),
        datetime.now() - timedelta(3, seconds=3600),
        datetime.now() - timedelta(1),
    ]
    df = pd.DataFrame()
    df['date'] = date_data
    df['ag'] = ag_data
    df['marks'] = marks_data
    df['id'] = 'I2213'
    return df
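
For completeness, the frame can be built and inspected like this (a small usage sketch; the dtypes call is only to confirm the date column is datetime-like):

df = sample_data()
print(df.dtypes)   # 'date' should be datetime64[ns]; 'ag', 'marks' and 'id' are object (strings)
print(df.head())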

I tried JSON normalization, but it creates the dataframe in a columnar fashion. For example:

import json

d = df['ag'].apply(lambda x: pd.json_normalize(json.loads(x.replace("'", '"'))))

This gives a dataframe with the columns type, values.X.F1, values.X.F2, values.U.F1, values.U.F2, results.Y.F1, results.Y.F2. The issue is how to put the dict keys such as X, Y, F1, F2 as rows instead of columns.
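
For reference, this is a minimal sketch of what that normalization produces for a single row (assuming df is the frame returned by sample_data above):

import json
import pandas as pd

# normalize only the first row's ag string to inspect the flat, dotted columns
flat = pd.json_normalize(json.loads(df['ag'].iloc[0].replace("'", '"')))
print(flat.columns.tolist())
# expected something like:
# ['type', 'values.X.F1', 'values.X.F2', 'values.U.F1', 'values.U.F2',
#  'results.Y.F1', 'results.Y.F2']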

Is it possible to achieve the desired format shown in the image?


2 Answers


I tried this by creating helper functions.

def ag_col_helper(ag_df):
    # parse the JSON-like string and flatten the nested dicts into dotted columns
    s = pd.json_normalize(json.loads(ag_df.replace("'", '"')))
    s.set_index('type', inplace=True)
    # melt the dotted columns (e.g. values.X.F1) into rows
    s1 = s.melt(ignore_index=False, var_name='feature')
    # split 'values.X.F1' into its parts: container, name, feature
    split_vals = s1['feature'].str.split(".", n=2, expand=True)
    s1['name'] = split_vals[1]
    s1['feature'] = split_vals[2]
    return s1.groupby(['type', 'name', 'feature']).first().dropna()


def marks_col_helper(marks_df):
    # parse the JSON-like string and flatten to columns like marks.X, marks.U
    s = pd.json_normalize(json.loads(marks_df.replace("'", '"')))
    s.set_index('type', inplace=True)
    # melt the dotted columns into rows: one row per (type, marks.<name>)
    s1 = s.melt(ignore_index=False, var_name='name', value_name='marks')
    # split 'marks.X' and keep only the name part
    split_vals = s1['name'].str.split(".", n=2, expand=True)
    s1['name'] = split_vals[1]
    return s1.groupby(['type', 'name']).first().dropna()

These helpers can then be applied to the ag and marks columns:

df['ag'] = df['ag'].apply(ag_col_helper)
df['marks'] = df['marks'].apply(marks_col_helper)

Then we would get the following for

df.iloc[0]['ag']

                   value
type name feature       
A    U    F1         0.3
          F2         0.4
     X    F1         0.1
          F2         0.2
B    Y    F1         0.3
          F2         0.2

df.iloc[0]['marks']

           marks
type name       
A    U       0.7
     X       0.5
B    Y       0.4

I think this one is what you are expecting.

To group by the date column, you can create another column with df['Date'] = df['date'].dt.date and then perform a groupby, as sketched below.
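
A minimal sketch of that grouping step, assuming df is the frame from sample_data with the nested ag/marks columns already applied:

# create a calendar-date column (drops the time component) and group on it
df['Date'] = df['date'].dt.date
grouped = df.groupby('Date')

for day, chunk in grouped:
    print(day, len(chunk), 'rows')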


It appears that you can store DataFrames as values within a DataFrame. This:

import pandas as pd

#creating outer df
df = pd.DataFrame([{'a':1, 'b':2, 'inner':None},{'a':3, 'b':4, 'inner':None}])

#creating inner dfs
inner_1 = pd.DataFrame([{'time': 0, 'e': 1}, {'time': 1, 'e': 2}])
inner_2 = pd.DataFrame([{'time': 0, 'e': 6}, {'time': 1, 'e': 7}])
inners = [inner_1, inner_2]

df['inner'] = inners
print(df)

results in this:

   a  b       inner
0  1  2        time  e
           0     0  1
           1     1  2
1  3  4        time  e
           0     0  6
           1     1  7

The printout quickly gets confusing, but it seems to be what you want.

For your data specifically, take your lists of dicts and convert them to a DataFrame with pd.DataFrame. If you want to turn all of your lists into DataFrames, you can use something like this:

import pandas as pd

#creating outer df
df = pd.DataFrame([{'a':1, 'b':2, 'inner':None},{'a':3, 'b':4, 'inner':None}])

#creating inner dfs
inner_1 = [{'time': 0, 'e': 1}, {'time': 1, 'e': 2}]
inner_2 = [{'time': 0, 'e': 6}, {'time': 1, 'e': 7}]
inners = [inner_1, inner_2]

df['inner'] = inners
print('un-transformed')
print(df)

#transforming all lists into DFs
for i in range(df.shape[0]): #iterate over rows
    for j in range(df.shape[1]): #iterate over columns
        if type(df.iat[i,j]) == list: #filtering cells that are lists
            df.iat[i, j] = pd.DataFrame(df.iat[i, j]) #convert to df

print("transformed")
print(df)

which returns

un-transformed
   a  b                                       inner
0  1  2  [{'time': 0, 'e': 1}, {'time': 1, 'e': 2}]
1  3  4  [{'time': 0, 'e': 6}, {'time': 1, 'e': 7}]
transformed
   a  b       inner
0  1  2        time  e
           0     0  1
           1     1  2
1  3  4        time  e
           0     0  6
           1     1  7
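
Applied to the question's data, the same idea might look like this (a sketch, assuming df is the frame returned by sample_data, and that the strings parse once the single quotes are swapped for double quotes, as in the question):

import json
import pandas as pd

# parse each JSON-like string and store the resulting list of dicts
# as an inner DataFrame; nested dicts stay as dict cells unless normalized further
for col in ['ag', 'marks']:
    df[col] = df[col].apply(
        lambda s: pd.DataFrame(json.loads(s.replace("'", '"')))
    )

print(df.iloc[0]['ag'])   # inner DataFrame with a 'type' column plus the nested dict columns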
