
I have a dataframe that contains JSON columns. It is quite large and not very efficient to work with, so I would like to store it as a nested dataframe instead.

A sample dataframe looks like:

id                       date                                                                                                                                              ag                                                                                         marks
0  I2213 2022-01-01 13:28:05.448054  [{'type': 'A', 'values': {'X': {'F1': 0.1, 'F2': 0.2}, 'U': {'F1': 0.3, 'F2': 0.4}}}, {'type': 'B', 'results': {'Y': {'F1': 0.3, 'F2': 0.2}}}]            [{'type': 'A', 'marks': {'X': 0.5, 'U': 0.7}}, {'type': 'B', 'marks': {'Y': 0.4}}]
1  I2213 2022-01-01 14:28:05.448054                                                                                        [{'type': 'B', 'values': {'Z': {'F1': 0.4, 'F2': 0.2}}}]  [{'type': 'A', 'marks': {'X': 0.4, 'U': 0.6}}, {'type': 'B', 'marks': {'Y': 0.3, 'Z': 0.4}}]
2  I2213 2022-01-03 15:28:05.448054                                                                                        [{'type': 'A', 'values': {'X': {'F1': 0.2, 'F2': 0.1}}}]            [{'type': 'A', 'marks': {'X': 0.2, 'U': 0.9}}, {'type': 'B', 'marks': {'Y': 0.2}}]

Expected output: a nested dataframe grouped by date (see the attached image).

Sample code for generating the sample dataframe:

import pandas as pd
from datetime import datetime, timedelta

def sample_data():
    ag_data = [
        "[{'type': 'A', 'values': {'X': {'F1': 0.1, 'F2': 0.2}, 'U': {'F1': 0.3, 'F2': 0.4}}}, {'type': 'B', 'results': {'Y': {'F1': 0.3, 'F2': 0.2}}}]",
        "[{'type': 'B', 'values': {'Z': {'F1': 0.4, 'F2': 0.2}}}]",
        "[{'type': 'A', 'values': {'X': {'F1': 0.2, 'F2': 0.1}}}]",
    ]
    marks_data = [
         "[{'type': 'A', 'marks': {'X': 0.5, 'U': 0.7}}, {'type': 'B', 'marks': {'Y': 0.4}}]",
         "[{'type': 'A', 'marks': {'X': 0.4, 'U': 0.6}}, {'type': 'B', 'marks': {'Y': 0.3, 'Z': 0.4}}]",
         "[{'type': 'A', 'marks': {'X': 0.2, 'U': 0.9}}, {'type': 'B', 'marks': {'Y': 0.2}}]",
    ]
    date_data = [
        datetime.now() - timedelta(3, seconds=7200),
        datetime.now() - timedelta(3, seconds=3600),
        datetime.now() - timedelta(1),
    ]
    df = pd.DataFrame()
    df['date'] = date_data
    df['ag'] = ag_data
    df['marks'] = marks_data
    df['id'] = 'I2213'
    return df
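
For completeness, the frame can be built and inspected like this (a small usage sketch; the dtypes call is only to confirm the date column is datetime-like):

df = sample_data()
print(df.dtypes)   # 'date' should be datetime64[ns]; 'ag', 'marks' and 'id' are object (strings)
print(df.head())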

I tried JSON normalization, but it creates the dataframe in a columnar fashion. For example:

import json

d = df['ag'].apply(lambda x: pd.json_normalize(json.loads(x.replace("'", '"'))))

This gives a dataframe with the columns type, values.X.F1, values.X.F2, values.U.F1, values.U.F2, results.Y.F1, results.Y.F2. The issue is how to put the dict keys such as X, Y, F1, F2 as rows instead of columns.
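
For reference, this is a minimal sketch of what that normalization produces for a single row (assuming df is the frame returned by sample_data above):

import json
import pandas as pd

# normalize only the first row's ag string to inspect the flat, dotted columns
flat = pd.json_normalize(json.loads(df['ag'].iloc[0].replace("'", '"')))
print(flat.columns.tolist())
# expected something like:
# ['type', 'values.X.F1', 'values.X.F2', 'values.U.F1', 'values.U.F2',
#  'results.Y.F1', 'results.Y.F2']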

Is it possible to achieve the desired format shown in the image?


2 Answers


I tried this by creating helper functions.

def ag_col_helper(ag_df):
    # parse the JSON-like string and flatten the nested dicts into dotted columns
    s = pd.json_normalize(json.loads(ag_df.replace("'", '"')))
    s.set_index('type', inplace=True)
    # melt the dotted columns (e.g. values.X.F1) into rows
    s1 = s.melt(ignore_index=False, var_name='feature')
    # split 'values.X.F1' into its parts: container, name, feature
    split_vals = s1['feature'].str.split(".", n=2, expand=True)
    s1['name'] = split_vals[1]
    s1['feature'] = split_vals[2]
    return s1.groupby(['type', 'name', 'feature']).first().dropna()


def marks_col_helper(marks_df):
    # parse the JSON-like string and flatten to columns like marks.X, marks.U
    s = pd.json_normalize(json.loads(marks_df.replace("'", '"')))
    s.set_index('type', inplace=True)
    # melt the dotted columns into rows: one row per (type, marks.<name>)
    s1 = s.melt(ignore_index=False, var_name='name', value_name='marks')
    # split 'marks.X' and keep only the name part
    split_vals = s1['name'].str.split(".", n=2, expand=True)
    s1['name'] = split_vals[1]
    return s1.groupby(['type', 'name']).first().dropna()

These helpers can then be applied to the ag and marks columns:

df['ag'] = df['ag'].apply(ag_col_helper)
df['marks'] = df['marks'].apply(marks_col_helper)

Then we would get the following for

df.iloc[0]['ag']

                   value
type name feature       
A    U    F1         0.3
          F2         0.4
     X    F1         0.1
          F2         0.2
B    Y    F1         0.3
          F2         0.2

df.iloc[0]['marks']

           marks
type name       
A    U       0.7
     X       0.5
B    Y       0.4

I think this one is what you are expecting.

To group by the date column, you can create another column with df['Date'] = df['date'].dt.date and then perform a groupby, as sketched below.
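
A minimal sketch of that grouping step, assuming df is the frame from sample_data with the nested ag/marks columns already applied:

# create a calendar-date column (drops the time component) and group on it
df['Date'] = df['date'].dt.date
grouped = df.groupby('Date')

for day, chunk in grouped:
    print(day, len(chunk), 'rows')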


It appears that you can store DataFrames as values within a DataFrame. This:

import pandas as pd

#creating outer df
df = pd.DataFrame([{'a':1, 'b':2, 'inner':None},{'a':3, 'b':4, 'inner':None}])

#creating inner dfs
inner_1 = pd.DataFrame([{'time': 0, 'e': 1}, {'time': 1, 'e': 2}])
inner_2 = pd.DataFrame([{'time': 0, 'e': 6}, {'time': 1, 'e': 7}])
inners = [inner_1, inner_2]

df['inner'] = inners
print(df)

results in this:

   a  b       inner
0  1  2        time  e
           0     0  1
           1     1  2
1  3  4        time  e
           0     0  6
           1     1  7

The printout quickly gets confusing, but it seems to be what you want.

For your data specifically, take your lists of dicts and convert them to a DataFrame with pd.DataFrame. If you want to turn all of your lists into DataFrames, you can use something like this:

import pandas as pd

#creating outer df
df = pd.DataFrame([{'a':1, 'b':2, 'inner':None},{'a':3, 'b':4, 'inner':None}])

#creating inner dfs
inner_1 = [{'time': 0, 'e': 1}, {'time': 1, 'e': 2}]
inner_2 = [{'time': 0, 'e': 6}, {'time': 1, 'e': 7}]
inners = [inner_1, inner_2]

df['inner'] = inners
print('un-transformed')
print(df)

#transforming all lists into DFs
for i in range(df.shape[0]): #iterate over rows
    for j in range(df.shape[1]): #iterate over columns
        if type(df.iat[i,j]) == list: #filtering cells that are lists
            df.iat[i, j] = pd.DataFrame(df.iat[i, j]) #convert to df

print("transformed")
print(df)

which returns

un-transformed
   a  b                                       inner
0  1  2  [{'time': 0, 'e': 1}, {'time': 1, 'e': 2}]
1  3  4  [{'time': 0, 'e': 6}, {'time': 1, 'e': 7}]
transformed
   a  b       inner
0  1  2        time  e
           0     0  1
           1     1  2
1  3  4        time  e
           0     0  6
           1     1  7
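
Applied to the question's data, the same idea might look like this (a sketch, assuming df is the frame returned by sample_data, and that the strings parse once the single quotes are swapped for double quotes, as in the question):

import json
import pandas as pd

# parse each JSON-like string and store the resulting list of dicts
# as an inner DataFrame; nested dicts stay as dict cells unless normalized further
for col in ['ag', 'marks']:
    df[col] = df[col].apply(
        lambda s: pd.DataFrame(json.loads(s.replace("'", '"')))
    )

print(df.iloc[0]['ag'])   # inner DataFrame with a 'type' column plus the nested dict columns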
