0

I have the following dataframe created by

import pandas as pd    
df = pd.DataFrame({'parent': ['AC1', 'AC2', 'AC3', 'AC1', 'AC11', 'AC5', 'AC5', 'AC6', 'AC8', 'AC9'],
                   'child': ['AC2', 'AC3', 'AC4', 'AC11', 'AC12', 'AC2', 'AC6', 'AC7', 'AC9', 'AC10']})

Which outputs the following:

    parent  child
0   AC1     AC2
1   AC2     AC3
2   AC3     AC4
3   AC1     AC11
4   AC11    AC12
5   AC5     AC2
6   AC5     AC6
7   AC6     AC7
8   AC8     AC9
9   AC9     AC10

I'd like to create a result dataframe where each of the parent (meaning it does not exist in the child column) is listed with the most final child(s).

df_result = pd.DataFrame({'parent': ['AC1', 'AC1', 'AC5', 'AC5', 'AC8', 'AC2'],
                     'child': ['AC4', 'AC12', 'AC4', 'AC7', 'AC10', 'AC4']})
    parent  child
0   AC1     AC4
1   AC1     AC12
2   AC5     AC4
3   AC5     AC7
4   AC8     AC10
5   AC2     AC4

I've started the following function but I'm not sure how to complete it.

def get_child(df):
result = {}
if df['parent'] not in df['child']:
    return result[df['parent']]
1
  • Your expected output seems a bit inconsistent with your description. You define parents as "not existing in the child column", yet AC2 is considered as a parent for the purposes of the output. Commented Aug 13, 2020 at 1:33

2 Answers 2

1

This is a tree structure, a particular type of graph. A data frame is not a particularly convenient way to represent a tree; I recommend that you switch to networkx or some other graph-based package. Then look up how to do a simple path traversal; you'll find direct support in the graph package documentation.

If you insist on doing this yourself -- which is a reasonable programming exercise -- you just need something like this pseudo-code

for each parent not in "child" column:
    here = parent
    while here in parent column:
        here = here["child"]

    record (parent, here) pair
Sign up to request clarification or add additional context in comments.

2 Comments

Thank you. But I need the result in a dataframe as I'm join back to other dataframes and possibly writing to Redshift. I will take a look at networkx.
Thanks for the clarification. That's why I gave you the traversal algorithm as well as some search terms.
0

While your expected output seems somewhat incongruent with your description (AC2 seems like it shouldn't be considered a parent since it isn't a source node), I'm pretty confident that you want to run a traversal from each source node to locate all of its leaves. Doing this in the dataframe isn't convenient, so we can use df.values and create an adjacency list dictionary to represent the graph. I assume there are no cycles in the graph.

import pandas as pd
from collections import defaultdict

def find_leaves(graph, src):
    if src in graph:
        for neighbor in graph[src]:
            yield from find_leaves(graph, neighbor)
    else:
        yield src

def pair_sources_to_leaves(df):
    graph = defaultdict(list)
    children = set()

    for parent, child in df.values:
        graph[parent].append(child)
        children.add(child)

    leaves = [[x, list(find_leaves(graph, x))] 
               for x in graph if x not in children]
    return (pd.DataFrame(leaves, columns=df.columns)
              .explode(df.columns[-1])
              .reset_index(drop=True))

if __name__ == "__main__":
    df = pd.DataFrame({
        "parent": ["AC1", "AC2", "AC3", "AC1", "AC11", 
                   "AC5", "AC5", "AC6", "AC8", "AC9"],
        "child": ["AC2", "AC3", "AC4", "AC11", "AC12", 
                  "AC2", "AC6", "AC7", "AC9", "AC10"]
    })
    print(pair_sources_to_leaves(df))

Output:

  parent child
0    AC1   AC4
1    AC1  AC12
2    AC5   AC4
3    AC5   AC7
4    AC8  AC10

2 Comments

Thank you. You are correct, it appears I did make a mistake with AC2. One thing I'm confused about is how to call the function. Do I build my df within it or was that just to show the example.
I'm not sure what you mean, exactly. Your input is the hardcoded df in main as you show in your question and the result is the final df after building the graph and running the traversals. I've updated the code to make it a little clearer that pair_sources_to_leaves makes this transformation as a black box--you may need to add parameters if you expect your columns to be dynamic.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.