
As mentioned in the title, I have data from two different levels, a higher level and a lower level, where each record on the higher level has multiple records from the lower level attached to it (i.e. a 1:n relationship). I have joined the two dataframes holding this data into a single dataframe with the data from both levels, which looks something like this:

id  col1    col2    col3    col4    col5    col6
1   5       10      5       15      9       6
1   5       10      5       14      15      8
2   8       2       5       6       13      2
2   8       2       5       9       18      6
3   1       4       10      4       5       17
3   1       4       10      9       16      17

The first four columns (id, col1, col2, col3) relate to the higher level data while the last three columns (col4, col5, and col6) come from the lower level data. Here you can see the 1:n relationship as well: the first four columns repeat the same values over multiple rows, whereas the last three columns hold unique values. I would like to convert this dataframe to a Python dictionary in the following format:

{
    "id": 1,
    "data": {
        "col1": 5,
        "col2": 10,
        "col3": 5,
        "low_level_data": [
            {"col4": 15, "col5": 9, "col6": 6},
            {"col4": 14, "col5": 15, "col6": 8}
        ]
    },
    "id": 2,
    "data": {
        "col1": 8,
        "col2": 2,
        "col3": 5,
        "low_level_data": [
            {"col4": 6, "col5": 13, "col6": 2},
            {"col4": 9, "col5": 18, "col6": 6}
        ]
    },
    "id": 3,
    "data": {
        "col1": 1,
        "col2": 4,
        "col3": 10,
        "lower_level_data": [
            {"col4": 4, "col5": 5, "col6": 17},
            {"col4": 9, "col5": 16, "col6": 17}
        ]
    }
}

I know that I will need to use the to_dict() method, but I am not exactly sure how to make the non-unique columns appear as keys in the dictionary while also collecting the lower level columns in a list one level below. Other answers I've found do not seem to deal with the same data structure, and I wasn't able to get the output I want myself. I tried the following, which unfortunately does not give the wanted output:

df.groupby(["id", "col1", "col2", "col3"]).agg(lambda x: x.tolist()).to_dict("index")

# output
{('1', '5', '10', '5'): {'col4': ['15', '14'],
  'col5': ['9', '15'],
  'col6': ['6', '8']},
 ('2', '8', '2', '5'): {'col4': ['6', '9'],
  'col5': ['13', '18'],
  'col6': ['2', '6']},
 ('3', '1', '4', '10'): {'col4': ['4', '9'],
  'col5': ['5', '16'],
  'col6': ['17', '17']}}

The example dataframe can be created as follows:

data = """id    col1    col2    col3    col4    col5    col6
1   5   10  5   15  9   6
1   5   10  5   14  15  8
2   8   2   5   6   13  2
2   8   2   5   9   18  6
3   1   4   10  4   5   17
3   1   4   10  9   16  17"""
data = [x.split("\t") for x in data.split("\n")]
df = pd.DataFrame(data[1:], columns=data[0])
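
Note that, built this way, every column holds strings (object dtype), which is why the outputs in this question show quoted values. If numeric values are wanted in the resulting dictionaries, the frame can be cast first; a minimal, optional step:

# Optional: cast the string columns to integers so the nested dictionaries
# hold ints instead of strings. The rest of the examples work with the
# string columns as-is.
df = df.astype(int)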
  • Your expected output is not a valid Python dictionary: it has duplicate keys (id, data, etc.). Commented Nov 29, 2020 at 21:01

2 Answers


I think standard pandas methods are not really suited for this. I would simply iterate over the DataFrame to build your desired output.

I guess your output should be a list of dicts and not a dict of dicts. Here is what a solution could look like:

result = []
for id in df.id.unique():
    tmp_df = df.loc[df.id == id]  # all rows for this higher-level id
    tmp_res = {
        "id": tmp_df["id"].iloc[0],
        "data": {
            # the higher-level columns repeat within the group, so the first row suffices
            "col1": tmp_df["col1"].iloc[0],
            "col2": tmp_df["col2"].iloc[0],
            "col3": tmp_df["col3"].iloc[0],
            # the lower-level columns become one dict per row
            "low_level_data": tmp_df.loc[:, ["col4", "col5", "col6"]].to_dict("records")
        }
    }
    result.append(tmp_res)
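
As a quick sanity check, the resulting list can be pretty-printed as JSON; with the string-valued example frame it serializes directly (numpy integer dtypes would need an extra conversion step):

import json

# Pretty-print the nested structure built in the loop above.
print(json.dumps(result, indent=2))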

As mentioned by both Quang and Alex, the expected output I provided was not a valid Python dictionary; instead it should be a list of dictionaries. While the answer from Alex works and gives the expected output, I managed to find an answer myself as well, using multiple groupby and apply calls, which I feel is a bit more flexible.

(
    df
    .groupby(["id", "col1", "col2", "col3"])[["col4", "col5", "col6"]]
    .apply(lambda x: x.to_dict("records"))      # one list of row-dicts per group
    .rename("low_level_data")
    .reset_index()
    .groupby("id")[["col1", "col2", "col3", "low_level_data"]]
    .apply(lambda x: x.to_dict("records")[0])   # collapse each id to a single dict
    .rename("data")
    .reset_index()
    .to_dict("records")
)

# output
[{'id': '1',
  'data': {'col1': '5',
   'col2': '10',
   'col3': '5',
   'low_level_data': [{'col4': '15', 'col5': '9', 'col6': '6'},
    {'col4': '14', 'col5': '15', 'col6': '8'}]}},
 {'id': '2',
  'data': {'col1': '8',
   'col2': '2',
   'col3': '5',
   'low_level_data': [{'col4': '6', 'col5': '13', 'col6': '2'},
    {'col4': '9', 'col5': '18', 'col6': '6'}]}},
 {'id': '3',
  'data': {'col1': '1',
   'col2': '4',
   'col3': '10',
   'low_level_data': [{'col4': '4', 'col5': '5', 'col6': '17'},
    {'col4': '9', 'col5': '16', 'col6': '17'}]}}]
