This hierarchical json has parent->child relationship. Therefore, I suggest you use networkx package to construct a graph g from the dateframe. From graph g create the desired json.
You need to pip install networkx following this instruction
After that, do the following steps:
- import networkx as nx
- create the directed graph
g using from_pandas_edgelist
- create a dictionary
d with unique key on parent_id and child_id
with values being as level
- assign
d to nodes attributes level of graph g
- finally, call
nx.tree_data on g
The code as follows:
import networkx as nx
#create the directed graph `g` using `from_pandas_edgelist`
g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', ['level'],
create_using=nx.DiGraph())
#create a dictionary with unique key on `parent_id` and `child_id`
df1 = df.melt('level')
d = dict(zip(df1.value, df1.level))
root_id = df.loc[0, 'parent_id']
d[root_id] = 0 #set level of root to `0`
#assign `d` to nodes attributes `level` of graph `g`
nx.set_node_attributes(g, d, 'level') #add `level` values to nodes of `g`
out = [nx.tree_data(g, root_id, {'id': 'id', 'children': '_children'})]
Out[119]:
[{'level': 0,
'id': 125582,
'_children': [{'level': 1, 'id': 214659},
{'level': 1, 'id': 214633},
{'level': 1,
'id': 214263,
'_children': [{'level': 2, 'id': 131673},
{'level': 2, 'id': 125579},
{'level': 2, 'id': 125578},
{'level': 2,
'id': 172670,
'_children': [{'level': 3, 'id': 172669},
{'level': 3, 'id': 174777},
{'level': 3,
'id': 207661,
'_children': [{'level': 4, 'id': 216529},
{'level': 4,
'id': 223884,
'_children': [{'level': 5,
'id': 223885,
'_children': [{'level': 6,
'id': 229186,
'_children': [{'level': 7, 'id': 219062},
{'level': 7, 'id': 222243}]}]}]}]}]},
{'level': 2,
'id': 214266,
'_children': [{'level': 3, 'id': 216675},
{'level': 3, 'id': 216671}]}]}]}]
Note: the output above is the dictionary inside of list as your desired output. My solution output puts the key level in front of key id, but the hierarchical/structural of sub-dictionaries is the same as your desired output
On the other hand, if you don't need the level values in the desired output, just directly use the built-in nx.tree_data function to return the the output as follows
import networkx as nx
g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', ['level'],
create_using=nx.DiGraph())
root_id = df.loc[0, 'parent_id']
out = [nx.tree_data(g, root_id, {'id': 'id', 'children': '_children'})]
Out[168]:
[{'id': 125582,
'_children': [{'id': 214659},
{'id': 214633},
{'id': 214263,
'_children': [{'id': 131673},
{'id': 125579},
{'id': 125578},
{'id': 172670,
'_children': [{'id': 172669},
{'id': 174777},
{'id': 207661,
'_children': [{'id': 216529},
{'id': 223884,
'_children': [{'id': 223885,
'_children': [{'id': 229186,
'_children': [{'id': 219062}, {'id': 222243}]}]}]}]}]},
{'id': 214266, '_children': [{'id': 216675}, {'id': 216671}]}]}]}]
Update:
Use this modified function of nx.tree_data to overcome the restriction on single parent per child. Add this customized function to your code and call it instead of nx.tree_data
from itertools import chain
def tree_data_custom(G, root):
id_ = 'id'
children = '_children'
def add_children(n, G):
nbrs = G[n]
if len(nbrs) == 0:
return []
children_ = []
for child in nbrs:
d = dict(chain(G.nodes[child].items(), [(id_, child)]))
c = add_children(child, G)
if c:
d[children] = c
children_.append(d)
return children_
data = dict(chain(G.nodes[root].items(), [(id_, root)]))
data[children] = add_children(root, G)
return data
Instead of call out = [nx.tree_data(g, root_id, {'id': 'id', 'children': '_children'})], just call tree_data_custom as follows
out = [tree_data_custom(g, root_id)]
Update 2: Adding name column to dataframe
sample df where children has multiple parents
Out[258]:
parent_id child_id level name
0 125582 214659 1 a1
1 125582 214633 1 a1
2 125582 214263 1 a1
3 214263 131673 2 a2
4 214263 125579 2 a2
5 214263 125578 2 a2
6 214263 172670 2 a2
7 214263 214266 2 a2
8 214266 216675 3 a3
9 214266 216671 3 a3
10 172670 172669 3 a3
11 172670 174777 3 a3
12 172670 207661 3 a3
13 207661 216529 4 a4
14 207661 223884 4 a4
15 223884 223885 5 a5
16 223885 229186 6 a6
17 229186 219062 7 a7
18 229186 222243 7 a7
19 222243 219187 8 a8
20 222243 245985 8 a8
21 222243 232393 8 a8
22 222243 247138 8 a8
23 222243 228848 8 a8
24 222243 228848 8 a8
25 222243 233920 8 a8
26 222243 233920 8 a8
27 222243 228113 8 a8
28 222243 233767 8 a8
29 222243 235407 8 a8
30 222243 237757 8 a8
31 222243 159091 8 a8
32 222243 159091 8 a8
33 222243 214832 8 a8
34 222243 253990 8 a8
35 222243 231610 8 a8
36 222243 231610 8 a8
37 222243 182323 8 a8
38 222243 242190 8 a8
39 222243 143580 8 a8
40 222243 242188 8 a8
41 222243 143581 8 a8
42 222243 242187 8 a8
43 222243 143582 8 a8
44 222243 242189 8 a8
45 222243 205877 8 a8
46 222243 242823 8 a8
47 222243 140979 8 a8
48 222243 237824 8 a8
49 222243 149933 8 a8
50 222243 149933 8 a8
51 222243 153625 8 a8
52 222243 8392 8 a8
53 222243 162085 8 a8
54 222243 162085 8 a8
55 222243 150691 8 a8
56 222243 147773 8 a8
57 222243 147773 8 a8
58 222243 61070 8 a8
59 222243 61070 8 a8
60 222243 204850 8 a8
61 222243 204850 8 a8
62 61070 46276 9 a9
63 61070 46276 9 a9
64 61070 46276 9 a9
65 61070 46276 9 a9
66 143580 159911 9 a9
67 143580 38958 9 a9
68 182323 159911 9 a9
The change is minimal, you only need to modify the step create dictionaries for nodes attributes of g. Currently, we only add attribute level to g. Now, we need to create another dictionary for name and add it to the new attribute name of nodes of g
#create the directed graph `g` using `from_pandas_edgelist`.
#You don't need `[level]` in this step
g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', create_using=nx.DiGraph())
#create a dictionary with unique key on `parent_id` and `child_id`
#`melt` keep 2 columns 'level', 'name' instead of one column 'level'
#dictionary `d_level` for attribute `level` of `g's` nodes
#dictionary `d_name` for attribute `level` of `g's` nodes
df1 = df.melt(['level', 'name'])
d_level = dict(zip(df1.value, df1.level))
d_name = dict(zip(df1.value, df1.name))
root_id = df.loc[0, 'parent_id']
d_level[root_id] = 0 #set `level` of root to `0`
#assign `d_level` to nodes attributes `level` of graph `g` and `d_name` for `name`
nx.set_node_attributes(g, d_level, 'level') #add `level` values to nodes of `g`
nx.set_node_attributes(g, d_name, 'name') #add `name` values to nodes of `g`
#use customize `tree_data_custom` defined previously
out = [tree_data_custom(g, root_id)]
If you adding many more columns to each row, it is better to create a single dictionary for all columns and apply one time to nodes of g as follows
g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', create_using=nx.DiGraph())
df1 = df.melt(['level', 'name'])
#this single dictionary to create both `level` and `name` attributes of nodes of `g`
d = {v: {'level': l, 'name': n} for v,l,n in zip(df1.value, df1.level, df1.name)}
root_id = df.loc[0, 'parent_id']
d[root_id]['level'] = 0 #set level of root to `0`
nx.set_node_attributes(g, d) #a single add for both attributes `level`, `name`
out = [tree_data_custom(g, root_id)]
Output
Out[260]:
[{'level': 0,
'name': 'a1',
'id': 125582,
'_children': [{'level': 1, 'name': 'a1', 'id': 214659},
{'level': 1, 'name': 'a1', 'id': 214633},
{'level': 1,
'name': 'a1',
'id': 214263,
'_children': [{'level': 2, 'name': 'a2', 'id': 131673},
{'level': 2, 'name': 'a2', 'id': 125579},
{'level': 2, 'name': 'a2', 'id': 125578},
{'level': 2,
'name': 'a2',
'id': 172670,
'_children': [{'level': 3, 'name': 'a3', 'id': 172669},
{'level': 3, 'name': 'a3', 'id': 174777},
{'level': 3,
'name': 'a3',
'id': 207661,
'_children': [{'level': 4, 'name': 'a4', 'id': 216529},
{'level': 4,
'name': 'a4',
'id': 223884,
'_children': [{'level': 5,
'name': 'a5',
'id': 223885,
'_children': [{'level': 6,
'name': 'a6',
'id': 229186,
'_children': [{'level': 7, 'name': 'a7', 'id': 219062},
{'level': 7,
'name': 'a7',
'id': 222243,
'_children': [{'level': 8, 'name': 'a8', 'id': 219187},
{'level': 8, 'name': 'a8', 'id': 245985},
{'level': 8, 'name': 'a8', 'id': 232393},
{'level': 8, 'name': 'a8', 'id': 247138},
{'level': 8, 'name': 'a8', 'id': 228848},
{'level': 8, 'name': 'a8', 'id': 233920},
{'level': 8, 'name': 'a8', 'id': 228113},
{'level': 8, 'name': 'a8', 'id': 233767},
{'level': 8, 'name': 'a8', 'id': 235407},
{'level': 8, 'name': 'a8', 'id': 237757},
{'level': 8, 'name': 'a8', 'id': 159091},
{'level': 8, 'name': 'a8', 'id': 214832},
{'level': 8, 'name': 'a8', 'id': 253990},
{'level': 8, 'name': 'a8', 'id': 231610},
{'level': 8,
'name': 'a8',
'id': 182323,
'_children': [{'level': 9, 'name': 'a9', 'id': 159911}]},
{'level': 8, 'name': 'a8', 'id': 242190},
{'level': 8,
'name': 'a8',
'id': 143580,
'_children': [{'level': 9, 'name': 'a9', 'id': 159911},
{'level': 9, 'name': 'a9', 'id': 38958}]},
{'level': 8, 'name': 'a8', 'id': 242188},
{'level': 8, 'name': 'a8', 'id': 143581},
{'level': 8, 'name': 'a8', 'id': 242187},
{'level': 8, 'name': 'a8', 'id': 143582},
{'level': 8, 'name': 'a8', 'id': 242189},
{'level': 8, 'name': 'a8', 'id': 205877},
{'level': 8, 'name': 'a8', 'id': 242823},
{'level': 8, 'name': 'a8', 'id': 140979},
{'level': 8, 'name': 'a8', 'id': 237824},
{'level': 8, 'name': 'a8', 'id': 149933},
{'level': 8, 'name': 'a8', 'id': 153625},
{'level': 8, 'name': 'a8', 'id': 8392},
{'level': 8, 'name': 'a8', 'id': 162085},
{'level': 8, 'name': 'a8', 'id': 150691},
{'level': 8, 'name': 'a8', 'id': 147773},
{'level': 8,
'name': 'a8',
'id': 61070,
'_children': [{'level': 9, 'name': 'a9', 'id': 46276}]},
{'level': 8, 'name': 'a8', 'id': 204850}]}]}]}]}]}]},
{'level': 2,
'name': 'a2',
'id': 214266,
'_children': [{'level': 3, 'name': 'a3', 'id': 216675},
{'level': 3, 'name': 'a3', 'id': 216671}]}]}]}]
Update 3: handle columns parent_name and child_name
You need to melt at once 4 columns parent_name, child_name, parent_id, parent_id. It is functionality of wide_to_long
g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', create_using=nx.DiGraph())
df1 = df.rename(lambda x: '_'.join(x.split('_')[::-1]), axis=1)
df1 = pd.wide_to_long(df1.reset_index(), stubnames=['id', 'name'],
i='index', j='type', suffix='\w+', sep='_')
d = {v: {'level': l, 'name': n} for v,l,n in zip(df1.id, df1.level, df1.name)}
root_id = df.loc[0, 'parent_id']
d[root_id]['level'] = 0
nx.set_node_attributes(g, d)
out = [tree_data_custom(g, root_id)]
Out[91]:
[{'level': 0,
'name': 'word1',
'id': 125582,
'_children': [{'level': 1, 'name': 'word6', 'id': 214659},
{'level': 1, 'name': 'word7', 'id': 214633},
{'level': 1,
'name': 'word2',
'id': 214263,
'_children': [{'level': 2, 'name': 'word8', 'id': 131673},
{'level': 2, 'name': 'word9', 'id': 125579},
{'level': 2, 'name': 'word10', 'id': 125578},
{'level': 2,
'name': 'word4',
'id': 172670,
'_children': [{'level': 3, 'name': 'word13', 'id': 172669},
{'level': 3, 'name': 'word14', 'id': 174777},
{'level': 3,
'name': 'word5',
'id': 207661,
'_children': [{'level': 4, 'name': 'word15', 'id': 216529},
{'level': 4,
'name': 'word16',
'id': 223884,
'_children': [{'level': 5,
'name': 'word17',
'id': 223885,
'_children': [{'level': 6,
'name': 'word18',
'id': 229186,
'_children': [{'level': 7, 'name': 'word19', 'id': 219062},
{'level': 7, 'name': 'word20', 'id': 222243}]}]}]}]}]},
{'level': 2,
'name': 'word3',
'id': 214266,
'_children': [{'level': 3, 'name': 'word11', 'id': 216675},
{'level': 3, 'name': 'word12', 'id': 216671}]}]}]}]