1

I have a pandas DataFrame that contains hierarchical elements from my database. parent_id columns tells me to what parent each child belongs. I would like to generate a nested JSON from this DataFrame (see section 3 for needed format). I tried many things, but I am mainly stuck on the _children list field in each parent, which should contain all its children. Any help is more than welcome, Thanks!

1. pandas DataFrame (WHAT I HAVE)

parent_id    child_id   level
125582       214659     1
125582       214633     1
125582       214263     1
214263       131673     2
214263       125579     2
214263       125578     2
214263       172670     2
214263       214266     2
214266       216675     3
214266       216671     3
172670       172669     3
172670       174777     3
172670       207661     3
207661       216529     4
207661       223884     4
223884       223885     5
223885       229186     6
229186       219062     7
229186       222243     7
and so on...

2. DataFrame logic

125582 (level 0 - root)
--214659 (level 1)
--214633 (level 1)
--214263 (level 1)
----131673 (level 2)
----125579 (level 2)
----125578 (level 2)
----172670 (level 2)
------172669 (level 3)
------174777 (level 3)
------207661 (level 3)
--------216529 (level 4)
--------223884 (level 4)
----------223885 (level 5)
------------229186 (level 6)
--------------219062 (level 7)    
--------------222243 (level 7)
----214266 (level 2)
------216675 (level 3)
------216671 (level 3)
and so on...

3. JSON (WHAT I NEED)

[
    {id: 125582, level: 0, "_children": [
        {id: 214659, level: 1},
        {id: 214633, level: 1},
        {id: 214263, level: 1, "_children": [
            {id: 131673, level: 2},
            {id: 125579, level: 2},
            {id: 125578, level: 2},
            {id: 172670, level: 2, "_children": [
                {id: 172669, level: 3},
                {id: 174777, level: 3},
                {id: 207661, level: 3, "_children": [
                    {id: 216529, level: 4},
                    {id: 223884, level: 4, "_children": [
                        {id: 223885, level: 5, "_children": [
                            {id: 229186, level: 6, "_children": [
                                {id: 219062, level: 7},
                                {id: 222243, level: 7}
                            ]}
                        ]}
                    ]}
                ]}
            ]},
            {id: 214266, level: 2, "_children": [
                {id: 216675, level: 3},
                {id: 216671, level: 3}
            ]}
        ]},
        and so on...
    ]}
]
3
  • Anyone any idea? :( Commented Mar 10, 2020 at 22:25
  • 1
    This is a very vague question... are you trying to do this one time? Repetitively? Where are you getting the data from? This is a simple programming problem where you iterate through the list and build the output structure you need. Commented Mar 11, 2020 at 2:41
  • 1
    @MikeSchmidt - I've added section 2 in an attempt to explain my problem in more detail. For clarity, section 1 is what I have, section 3 is what I need. Could you please take a look one more time? Much appreciated! Commented Mar 11, 2020 at 4:13

3 Answers 3

4
+200

This hierarchical json has parent->child relationship. Therefore, I suggest you use networkx package to construct a graph g from the dateframe. From graph g create the desired json.

You need to pip install networkx following this instruction

After that, do the following steps:

  • import networkx as nx
  • create the directed graph g using from_pandas_edgelist
  • create a dictionary d with unique key on parent_id and child_id with values being as level
  • assign d to nodes attributes level of graph g
  • finally, call nx.tree_data on g

The code as follows:

import networkx as nx

#create the directed graph `g` using `from_pandas_edgelist`
g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', ['level'], 
                            create_using=nx.DiGraph())

#create a dictionary with unique key on `parent_id` and `child_id`
df1 = df.melt('level')
d = dict(zip(df1.value, df1.level))
root_id = df.loc[0, 'parent_id']
d[root_id] = 0  #set level of root to `0`

#assign `d` to nodes attributes `level` of graph `g`
nx.set_node_attributes(g, d, 'level')    #add `level` values to nodes of `g`

out = [nx.tree_data(g, root_id, {'id': 'id', 'children': '_children'})]

Out[119]:
[{'level': 0,
  'id': 125582,
  '_children': [{'level': 1, 'id': 214659},
   {'level': 1, 'id': 214633},
   {'level': 1,
    'id': 214263,
    '_children': [{'level': 2, 'id': 131673},
     {'level': 2, 'id': 125579},
     {'level': 2, 'id': 125578},
     {'level': 2,
      'id': 172670,
      '_children': [{'level': 3, 'id': 172669},
       {'level': 3, 'id': 174777},
       {'level': 3,
        'id': 207661,
        '_children': [{'level': 4, 'id': 216529},
         {'level': 4,
          'id': 223884,
          '_children': [{'level': 5,
            'id': 223885,
            '_children': [{'level': 6,
              'id': 229186,
              '_children': [{'level': 7, 'id': 219062},
               {'level': 7, 'id': 222243}]}]}]}]}]},
     {'level': 2,
      'id': 214266,
      '_children': [{'level': 3, 'id': 216675},
       {'level': 3, 'id': 216671}]}]}]}]

Note: the output above is the dictionary inside of list as your desired output. My solution output puts the key level in front of key id, but the hierarchical/structural of sub-dictionaries is the same as your desired output


On the other hand, if you don't need the level values in the desired output, just directly use the built-in nx.tree_data function to return the the output as follows

import networkx as nx

g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', ['level'], 
                            create_using=nx.DiGraph())

root_id = df.loc[0, 'parent_id']    
out = [nx.tree_data(g, root_id, {'id': 'id', 'children': '_children'})]

Out[168]:
[{'id': 125582,
  '_children': [{'id': 214659},
   {'id': 214633},
   {'id': 214263,
    '_children': [{'id': 131673},
     {'id': 125579},
     {'id': 125578},
     {'id': 172670,
      '_children': [{'id': 172669},
       {'id': 174777},
       {'id': 207661,
        '_children': [{'id': 216529},
         {'id': 223884,
          '_children': [{'id': 223885,
            '_children': [{'id': 229186,
              '_children': [{'id': 219062}, {'id': 222243}]}]}]}]}]},
     {'id': 214266, '_children': [{'id': 216675}, {'id': 216671}]}]}]}]

Update:

Use this modified function of nx.tree_data to overcome the restriction on single parent per child. Add this customized function to your code and call it instead of nx.tree_data

from itertools import chain

def tree_data_custom(G, root):
    id_ = 'id'
    children = '_children'
    def add_children(n, G):
        nbrs = G[n]
        if len(nbrs) == 0:
            return []
        children_ = []
        for child in nbrs:
            d = dict(chain(G.nodes[child].items(), [(id_, child)]))
            c = add_children(child, G)
            if c:
                d[children] = c
            children_.append(d)
        return children_

    data = dict(chain(G.nodes[root].items(), [(id_, root)]))    
    data[children] = add_children(root, G)
    return data

Instead of call out = [nx.tree_data(g, root_id, {'id': 'id', 'children': '_children'})], just call tree_data_custom as follows

out =  [tree_data_custom(g, root_id)]

Update 2: Adding name column to dataframe

sample df where children has multiple parents

Out[258]:
    parent_id  child_id  level name
0      125582    214659      1   a1
1      125582    214633      1   a1
2      125582    214263      1   a1
3      214263    131673      2   a2
4      214263    125579      2   a2
5      214263    125578      2   a2
6      214263    172670      2   a2
7      214263    214266      2   a2
8      214266    216675      3   a3
9      214266    216671      3   a3
10     172670    172669      3   a3
11     172670    174777      3   a3
12     172670    207661      3   a3
13     207661    216529      4   a4
14     207661    223884      4   a4
15     223884    223885      5   a5
16     223885    229186      6   a6
17     229186    219062      7   a7
18     229186    222243      7   a7
19     222243    219187      8   a8
20     222243    245985      8   a8
21     222243    232393      8   a8
22     222243    247138      8   a8
23     222243    228848      8   a8
24     222243    228848      8   a8
25     222243    233920      8   a8
26     222243    233920      8   a8
27     222243    228113      8   a8
28     222243    233767      8   a8
29     222243    235407      8   a8
30     222243    237757      8   a8
31     222243    159091      8   a8
32     222243    159091      8   a8
33     222243    214832      8   a8
34     222243    253990      8   a8
35     222243    231610      8   a8
36     222243    231610      8   a8
37     222243    182323      8   a8
38     222243    242190      8   a8
39     222243    143580      8   a8
40     222243    242188      8   a8
41     222243    143581      8   a8
42     222243    242187      8   a8
43     222243    143582      8   a8
44     222243    242189      8   a8
45     222243    205877      8   a8
46     222243    242823      8   a8
47     222243    140979      8   a8
48     222243    237824      8   a8
49     222243    149933      8   a8
50     222243    149933      8   a8
51     222243    153625      8   a8
52     222243      8392      8   a8
53     222243    162085      8   a8
54     222243    162085      8   a8
55     222243    150691      8   a8
56     222243    147773      8   a8
57     222243    147773      8   a8
58     222243     61070      8   a8
59     222243     61070      8   a8
60     222243    204850      8   a8
61     222243    204850      8   a8
62      61070     46276      9   a9
63      61070     46276      9   a9
64      61070     46276      9   a9
65      61070     46276      9   a9
66     143580    159911      9   a9
67     143580     38958      9   a9
68     182323    159911      9   a9

The change is minimal, you only need to modify the step create dictionaries for nodes attributes of g. Currently, we only add attribute level to g. Now, we need to create another dictionary for name and add it to the new attribute name of nodes of g

#create the directed graph `g` using `from_pandas_edgelist`. 
#You don't need `[level]` in this step
g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', create_using=nx.DiGraph())

#create a dictionary with unique key on `parent_id` and `child_id`
#`melt` keep 2 columns 'level', 'name' instead of one column 'level'
#dictionary `d_level` for attribute `level` of `g's` nodes
#dictionary `d_name` for attribute `level` of `g's` nodes
df1 = df.melt(['level', 'name'])
d_level = dict(zip(df1.value, df1.level))
d_name = dict(zip(df1.value, df1.name))
root_id = df.loc[0, 'parent_id']
d_level[root_id] = 0  #set `level` of root to `0`

#assign `d_level` to nodes attributes `level` of graph `g` and `d_name` for `name`
nx.set_node_attributes(g, d_level, 'level')    #add `level` values to nodes of `g`
nx.set_node_attributes(g, d_name, 'name')    #add `name` values to nodes of `g`

#use customize `tree_data_custom` defined previously
out = [tree_data_custom(g, root_id)] 

If you adding many more columns to each row, it is better to create a single dictionary for all columns and apply one time to nodes of g as follows

g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', create_using=nx.DiGraph())

df1 = df.melt(['level', 'name'])
#this single dictionary to create both `level` and `name` attributes of nodes of `g`
d = {v: {'level': l, 'name': n} for v,l,n in zip(df1.value, df1.level, df1.name)}
root_id = df.loc[0, 'parent_id']
d[root_id]['level'] = 0  #set level of root to `0`

nx.set_node_attributes(g, d)    #a single add for both attributes `level`, `name`

out = [tree_data_custom(g, root_id)] 

Output

Out[260]:
[{'level': 0,
  'name': 'a1',
  'id': 125582,
  '_children': [{'level': 1, 'name': 'a1', 'id': 214659},
   {'level': 1, 'name': 'a1', 'id': 214633},
   {'level': 1,
    'name': 'a1',
    'id': 214263,
    '_children': [{'level': 2, 'name': 'a2', 'id': 131673},
     {'level': 2, 'name': 'a2', 'id': 125579},
     {'level': 2, 'name': 'a2', 'id': 125578},
     {'level': 2,
      'name': 'a2',
      'id': 172670,
      '_children': [{'level': 3, 'name': 'a3', 'id': 172669},
       {'level': 3, 'name': 'a3', 'id': 174777},
       {'level': 3,
        'name': 'a3',
        'id': 207661,
        '_children': [{'level': 4, 'name': 'a4', 'id': 216529},
         {'level': 4,
          'name': 'a4',
          'id': 223884,
          '_children': [{'level': 5,
            'name': 'a5',
            'id': 223885,
            '_children': [{'level': 6,
              'name': 'a6',
              'id': 229186,
              '_children': [{'level': 7, 'name': 'a7', 'id': 219062},
               {'level': 7,
                'name': 'a7',
                'id': 222243,
                '_children': [{'level': 8, 'name': 'a8', 'id': 219187},
                 {'level': 8, 'name': 'a8', 'id': 245985},
                 {'level': 8, 'name': 'a8', 'id': 232393},
                 {'level': 8, 'name': 'a8', 'id': 247138},
                 {'level': 8, 'name': 'a8', 'id': 228848},
                 {'level': 8, 'name': 'a8', 'id': 233920},
                 {'level': 8, 'name': 'a8', 'id': 228113},
                 {'level': 8, 'name': 'a8', 'id': 233767},
                 {'level': 8, 'name': 'a8', 'id': 235407},
                 {'level': 8, 'name': 'a8', 'id': 237757},
                 {'level': 8, 'name': 'a8', 'id': 159091},
                 {'level': 8, 'name': 'a8', 'id': 214832},
                 {'level': 8, 'name': 'a8', 'id': 253990},
                 {'level': 8, 'name': 'a8', 'id': 231610},
                 {'level': 8,
                  'name': 'a8',
                  'id': 182323,
                  '_children': [{'level': 9, 'name': 'a9', 'id': 159911}]},
                 {'level': 8, 'name': 'a8', 'id': 242190},
                 {'level': 8,
                  'name': 'a8',
                  'id': 143580,
                  '_children': [{'level': 9, 'name': 'a9', 'id': 159911},
                   {'level': 9, 'name': 'a9', 'id': 38958}]},
                 {'level': 8, 'name': 'a8', 'id': 242188},
                 {'level': 8, 'name': 'a8', 'id': 143581},
                 {'level': 8, 'name': 'a8', 'id': 242187},
                 {'level': 8, 'name': 'a8', 'id': 143582},
                 {'level': 8, 'name': 'a8', 'id': 242189},
                 {'level': 8, 'name': 'a8', 'id': 205877},
                 {'level': 8, 'name': 'a8', 'id': 242823},
                 {'level': 8, 'name': 'a8', 'id': 140979},
                 {'level': 8, 'name': 'a8', 'id': 237824},
                 {'level': 8, 'name': 'a8', 'id': 149933},
                 {'level': 8, 'name': 'a8', 'id': 153625},
                 {'level': 8, 'name': 'a8', 'id': 8392},
                 {'level': 8, 'name': 'a8', 'id': 162085},
                 {'level': 8, 'name': 'a8', 'id': 150691},
                 {'level': 8, 'name': 'a8', 'id': 147773},
                 {'level': 8,
                  'name': 'a8',
                  'id': 61070,
                  '_children': [{'level': 9, 'name': 'a9', 'id': 46276}]},
                 {'level': 8, 'name': 'a8', 'id': 204850}]}]}]}]}]}]},
     {'level': 2,
      'name': 'a2',
      'id': 214266,
      '_children': [{'level': 3, 'name': 'a3', 'id': 216675},
       {'level': 3, 'name': 'a3', 'id': 216671}]}]}]}]

Update 3: handle columns parent_name and child_name

You need to melt at once 4 columns parent_name, child_name, parent_id, parent_id. It is functionality of wide_to_long

g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', create_using=nx.DiGraph())

df1 = df.rename(lambda x: '_'.join(x.split('_')[::-1]), axis=1)
df1 = pd.wide_to_long(df1.reset_index(), stubnames=['id', 'name'],
                i='index', j='type', suffix='\w+', sep='_')
d = {v: {'level': l, 'name': n} for v,l,n in zip(df1.id, df1.level, df1.name)}
root_id = df.loc[0, 'parent_id']
d[root_id]['level'] = 0

nx.set_node_attributes(g, d)
out = [tree_data_custom(g, root_id)] 

Out[91]:
[{'level': 0,
  'name': 'word1',
  'id': 125582,
  '_children': [{'level': 1, 'name': 'word6', 'id': 214659},
   {'level': 1, 'name': 'word7', 'id': 214633},
   {'level': 1,
    'name': 'word2',
    'id': 214263,
    '_children': [{'level': 2, 'name': 'word8', 'id': 131673},
     {'level': 2, 'name': 'word9', 'id': 125579},
     {'level': 2, 'name': 'word10', 'id': 125578},
     {'level': 2,
      'name': 'word4',
      'id': 172670,
      '_children': [{'level': 3, 'name': 'word13', 'id': 172669},
       {'level': 3, 'name': 'word14', 'id': 174777},
       {'level': 3,
        'name': 'word5',
        'id': 207661,
        '_children': [{'level': 4, 'name': 'word15', 'id': 216529},
         {'level': 4,
          'name': 'word16',
          'id': 223884,
          '_children': [{'level': 5,
            'name': 'word17',
            'id': 223885,
            '_children': [{'level': 6,
              'name': 'word18',
              'id': 229186,
              '_children': [{'level': 7, 'name': 'word19', 'id': 219062},
               {'level': 7, 'name': 'word20', 'id': 222243}]}]}]}]}]},
     {'level': 2,
      'name': 'word3',
      'id': 214266,
      '_children': [{'level': 3, 'name': 'word11', 'id': 216675},
       {'level': 3, 'name': 'word12', 'id': 216671}]}]}]}]
Sign up to request clarification or add additional context in comments.

22 Comments

Would your first solution performance wise be more efficient than James’s recursive solution when I am dealing with 1 million rows?
@dunkubok: Honestly, I don't know. I have never tried networkx on that big of dataset. It would be great if you try both and share the timing on your real dataset. Perhaps, the benchmark on several different ops from this link timlrx.com/2019/05/05/… , the result of pokec, 1.6m nodes, 30.6m edges dataset would give you a glimpse on networkx performance.
This is awesome!
Performance-wise it appears to be ok. 10 seconds for 800k records
Thank you very much for this extensive answer. This helped me a lot!
|
2
id_to_obj = {}
roots = []
def parse(row):
    pid, cid, lvl = row.parent_id, row.child_id, row.level
    if pid not in id_to_obj:
        # parent not created yet: create parent as root
        id_to_obj[pid] = {
            'id': pid, 'level': 0
        }
        roots.append(pid)

    # Get parent object
    parent = id_to_obj[pid]
    if not '_children' in parent:
        parent['_children'] = []

    # Create child object and add to parent's children list
    child = {
        'id': cid, 'level': lvl
    }
    parent['_children'].append(child)
    id_to_obj[cid] = child

df.sort_values(by='level').apply(parse, axis=1)

# Result
[id_to_obj[pid] for pid in roots]

2 Comments

Thanks a lot for sharing! This approach takes approx. 42 seconds for 740k rows.
@phi thank you so much for this approach I was struggling with a similar issue, I have only done doubt that I have a huge data frame in which column name contains the date,how to produce that such JSON
2

It's not a pandas one-liner, but here's a recursive function that should do the trick.

import json

import pandas as pd

df = pd.DataFrame(
    [{"parent_id": 125582, "child_id": 214659, "level": 1}, {"parent_id": 125582, "child_id": 214633, "level": 1},
     {"parent_id": 125582, "child_id": 214263, "level": 1}, {"parent_id": 214263, "child_id": 131673, "level": 2},
     {"parent_id": 214263, "child_id": 125579, "level": 2}, {"parent_id": 214263, "child_id": 125578, "level": 2},
     {"parent_id": 214263, "child_id": 172670, "level": 2}, {"parent_id": 214263, "child_id": 214266, "level": 2},
     {"parent_id": 214266, "child_id": 216675, "level": 3}, {"parent_id": 214266, "child_id": 216671, "level": 3},
     {"parent_id": 172670, "child_id": 172669, "level": 3}, {"parent_id": 172670, "child_id": 174777, "level": 3},
     {"parent_id": 172670, "child_id": 207661, "level": 3}, {"parent_id": 207661, "child_id": 216529, "level": 4},
     {"parent_id": 207661, "child_id": 223884, "level": 4}, {"parent_id": 223884, "child_id": 223885, "level": 5},
     {"parent_id": 223885, "child_id": 229186, "level": 6}, {"parent_id": 229186, "child_id": 219062, "level": 7},
     {"parent_id": 229186, "child_id": 222243, "level": 7}]
)


def recurse(parent_id, level):
    # create the base result
    result = {"id": int(parent_id), "level": int(level)}

    # get all of the children of this parent, one level below this one
    children = df[(df.parent_id == parent_id) & (df["level"] == level + 1)]

    # if there are no such children, then return without a _children key
    if children.empty:
        return result

    # otherwise, recurse on each child_id
    result["_children"] = [recurse(child_id, level + 1) for child_id in sorted(children.child_id.unique())]

    return result


# start at an implicit level 0
tree = [recurse(parent_id, 0) for parent_id in sorted(df[df["level"] == 1].parent_id.unique())]

# convert to JSON
print(json.dumps(tree, indent=3, sort_keys=True))

2 Comments

This code returns 3 root elements with the same id: i.imgur.com/aZPsmSu.png Any idea how to solve that?
I added a fix, but performance testing suggests it's much too slow for the scale you're talking about.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.