As mentioned in the title, I have data from two different levels, a higher level and a lower level, where each record on the higher level has multiple records from the lower level attached to it (i.e. a 1:n relationship). I have joined the two dataframes that hold this data to create a single dataframe with the data from both levels, which looks something like this:
id col1 col2 col3 col4 col5 col6
1 5 10 5 15 9 6
1 5 10 5 14 15 8
2 8 2 5 6 13 2
2 8 2 5 9 18 6
3 1 4 10 4 5 17
3 1 4 10 9 16 17
The first four columns (id, col1, col2, col3) relate to the higher level data, while the last three columns (col4, col5, and col6) are from the lower level data. Here you can see the 1:n relationship as well: the first four columns have the same values over multiple rows, whereas the last three columns hold the values specific to each row. I would like to convert this dataframe to a Python dictionary in the following format:
{
"id": 1,
"data": {
"col1": 5,
"col2": 10,
"col3": 5,
"low_level_data": [
{"col4": 15, "col5": 9, "col6": 6},
{"col4": 14, "col5": 15, "col6": 8}
]
},
"id": 2,
"data": {
"col1": 8,
"col2": 2,
"col3": 5,
"low_level_data": [
{"col4": 6, "col5": 13, "col6": 2},
{"col4": 9, "col5": 18, "col6": 6}
]
},
"id": 3,
"data": {
"col1": 1,
"col2": 4,
"col3": 10,
"low_level_data": [
{"col4": 4, "col5": 5, "col6": 17},
{"col4": 9, "col5": 16, "col6": 17}
]
}
}
I know that I will need to use the to_dict() method, but I am not exactly sure how to make sure the output has the non-unique columns as keys in the dictionary while also having the lower level columns in a list in the level below. Other answers I've found do not seem to have the same data structure, and I wasn't able to get the output I want myself. I tried the following, which unfortunately does not give the wanted output:
df.groupby(["id", "col1", "col2", "col3"]).agg(lambda x: x.tolist()).to_dict("index")
# output
{('1', '5', '10', '5'): {'col4': ['15', '14'],
'col5': ['9', '15'],
'col6': ['6', '8']},
('2', '8', '2', '5'): {'col4': ['6', '9'],
'col5': ['13', '18'],
'col6': ['2', '6']},
('3', '1', '4', '10'): {'col4': ['4', '9'],
'col5': ['5', '16'],
'col6': ['17', '17']}}
The example dataframe can be created as follows:
import pandas as pd

data = """id col1 col2 col3 col4 col5 col6
1 5 10 5 15 9 6
1 5 10 5 14 15 8
2 8 2 5 6 13 2
2 8 2 5 9 18 6
3 1 4 10 4 5 17
3 1 4 10 9 16 17"""
# split() with no argument handles both tab- and space-separated values
data = [x.split() for x in data.split("\n")]
df = pd.DataFrame(data[1:], columns=data[0])
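To make the target structure concrete, here is a minimal sketch of one way it could be built. It assumes the desired result is really a list of records, since a single dictionary cannot repeat the "id" and "data" keys, and it assumes integer values rather than the strings produced by the snippet above. The names `high_cols`, `low_cols`, and `records` are only illustrative.

```python
import pandas as pd

# Same example data as above, but with integer values (an assumption).
rows = [
    [1, 5, 10, 5, 15, 9, 6],
    [1, 5, 10, 5, 14, 15, 8],
    [2, 8, 2, 5, 6, 13, 2],
    [2, 8, 2, 5, 9, 18, 6],
    [3, 1, 4, 10, 4, 5, 17],
    [3, 1, 4, 10, 9, 16, 17],
]
columns = ["id", "col1", "col2", "col3", "col4", "col5", "col6"]
df = pd.DataFrame(rows, columns=columns)

high_cols = ["col1", "col2", "col3"]  # repeated for every row of an id
low_cols = ["col4", "col5", "col6"]   # differ from row to row

records = []
for id_value, group in df.groupby("id", sort=False):
    # The higher level values are identical within a group, so take the first row.
    first = group.iloc[0]
    data = {col: int(first[col]) for col in high_cols}
    # One dictionary per lower level row.
    data["low_level_data"] = group[low_cols].to_dict("records")
    records.append({"id": int(id_value), "data": data})
```

Grouping only on "id" relies on col1 to col3 being constant within each id, which holds for the example data but is not checked here.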