2

I have 2 dataframes that captured the hierarchy of the same dataset. Df1 is more complete compared to Df2, so I want to use Df1 as the standard to analyze if the hierarchy in Df2 is correct. However, both dataframes show the hierarchy in a bad way so it's hard to know the complete structure row by row.

Eg. Company A may have subsidiary: B, C, D, E and the relationship is A owns B owns C owns D owns E. In Df1, it may show:

| Ultimate Parent | Parent | Child |
| --------------- | ------ |-------|
| A               | B      | C     |
| B               | C      | D     | --> new
| C               | D      | E     |

So if you break down to analyze row by row, the same entity can be shown as "Ultimate Parent" or "Child" at the same time, which makes it complicated.

On the other hand, as Df2 is incomplete, so it won't have all the data (A, B, C, D, E). It will only contain partial data, eg. A, D, E in this case, so the dataframe will look like this

| Ultimate Parent | Parent | Child |
| --------------- | ------ |-------|
| A               | D      | E     |

Now I want to (1) use Df1 to get the correct/complete hierarchy (2) compare and identify the gap between Df1 and Df2. The logic is as following:

If A owns B owns C owns D owns E and Df1 looks like this

| Ultimate Parent | Parent | Child |
| --------------- | ------ |-------|
| A               | B      | C     |
| C               | D      | E     |

I want to add 1 column to put all the related entities together and in order from ultimate parent to child

| Ultimate Parent | Parent | Child | Hierarchy   |
| --------------- | ------ |-------|-------------|
| A               | B      | C     |A, B, C, D, E|
| C               | D      | E     |A, B, C, D, E|

And then compare this Df1 with Df2 and add a column to Df2 to identify the gap. The most ideal (but optional) situation is to have another column stating the reason if it's wrong.

| Ultimate Parent | Parent | Child | Right/Wrong| Reason          | 
| --------------- | ------ |-------|------------|-----------------|
| A               | D      | E     | Right      |                 |
| C               | B      | A     | Wrong      | wrong hierarchy |
| C               | A      | B     | Wrong      | wrong hierarchy | --> new 
| G               | A      | B     | Wrong      | wrong entities  | --> new
| A               | F      | G     | Wrong      | wrong entities  |

I have tried multiple string matching methods, but I'm stuck in the step and idea where I think order matters but I don't know how to compare strings in order when they're related but scattered in different rows.

1
  • 1
    Can you please post what you have tried so far? Commented Apr 20, 2024 at 18:54

3 Answers 3

1

Basically, you'll need to build a network graph of df1 to get a comprejhension map of the hierarchies. Once this is done, you need to compare the hierarchies of df2 with those of df1 and finally validate. To do so, you can define function. You'll create a new column hierarchies to df1 and Right/Wrong, Reason to df2. .

import pandas as pd
import networkx as nx

data1 = {
    'Ultimate Parent': ['A', 'C'],
    'Parent': ['B', 'D'],
    'Child': ['C', 'E']
}
df1 = pd.DataFrame(data1)

data2 = {
    'Ultimate Parent': ['A', 'C', 'A'],
    'Parent': ['D', 'B', 'F'],
    'Child': ['E', 'A', 'G']
}
df2 = pd.DataFrame(data2)

G = nx.DiGraph()
for _, row in df1.iterrows():
    G.add_edge(row['Parent'], row['Child'])
    if row['Ultimate Parent'] != row['Parent']:  
        G.add_edge(row['Ultimate Parent'], row['Parent'])

def complete_hierarchy(node, graph):
    descendants = nx.descendants(graph, node)  
    descendants.add(node)  
    return ', '.join(sorted(descendants))

df1['Hierarchy'] = df1['Ultimate Parent'].apply(lambda x: complete_hierarchy(x, G))

def validate_row(row, hierarchy_df, graph):
    filtered_hierarchy = hierarchy_df[hierarchy_df['Ultimate Parent'] == row['Ultimate Parent']]
    if filtered_hierarchy.empty:
        return pd.Series(["Wrong", "wrong entities"])  
    
    full_hierarchy = filtered_hierarchy.iloc[0]['Hierarchy']
    hierarchy_elements = set(full_hierarchy.split(', '))
    
    if set([row['Parent'], row['Child']]).issubset(graph.nodes()):
        if row['Parent'] not in hierarchy_elements or row['Child'] not in hierarchy_elements:
            return pd.Series(["Wrong", "wrong hierarchy"])
        elif f"{row['Parent']}, {row['Child']}" not in full_hierarchy:
            return pd.Series(["Wrong", "wrong hierarchy"])
        else:
            return pd.Series(["Right", ""])
    else:
        return pd.Series(["Wrong", "wrong entities"])

df2[['Right/Wrong', 'Reason']] = df2.apply(lambda row: validate_row(row, df1, G), axis=1)

print("Df1 - Complete Hierarchy:")
print(df1)
print("\nDf2 - Validation Results:")
print(df2)


Which gives you

Df1 - Complete Hierarchy:
  Ultimate Parent Parent Child      Hierarchy
0               A      B     C  A, B, C, D, E
1               C      D     E        C, D, E

Df2 - Validation Results:
  Ultimate Parent Parent Child Right/Wrong           Reason
0               A      D     E       Right                 
1               C      B     A       Wrong  wrong hierarchy
2               A      F     G       Wrong   wrong entities
Sign up to request clarification or add additional context in comments.

4 Comments

If this is the answer you were looking for, do mark it as accepted so it disappears from the lsit of unanswered questions.
I would also use networkx with with a simpler logic. I also think the "wrong entities" reason is correct (see my answer).
@SergedeGossondeVarennes I am trying your method now, but I keep getting a memory issue, which I guess it's due to the data size. I will mark the final answer when done. Thanks!
@SergedeGossondeVarennes Hi, I checked your function will a smaller dataset, which works 100% as I want. Do you know if there's any way for me to efficiently analyze a much larger dataset? Says around 30k rows in both dataframes?
1

I would use networkx to form a directed graph from df1, and then a simple check of presence of the edges or nodes:

import networkx as nx

# create a graph from df1
pairs = (['Ultimate Parent', 'Parent'], ['Parent', 'Child'])
G1 = nx.compose(*(nx.from_pandas_edgelist(df1, source=x[0], target=x[1],
                                          create_using=nx.DiGraph)
                for x in pairs))

# if any of the 3 nodes is missing: wrong entities
# else,
# for Ultimate Parent -> Parent, check if a path exists
# for Parent -> Child, check if a direct path (edge) exists
# is both are True: valid
# else: wrong hierarchy
def check(G, up, p, c):
    if {up, p, c}.issubset(G):
        if nx.has_path(G, up, p) and (p, c) in G.edges:
            return # check is valid
        else:
            return 'wrong hierarchy'
    return 'wrong entities'

# initialize "Right/Wrong" to be before "Reason" (optional)
df2['Right/Wrong'] = None
# check if the row is valid
df2['Reason'] = [check(G1, *x) for x in
                 zip(df2['Ultimate Parent'], df2['Parent'], df2['Child'])]
# "Right" if the check is valid (None) else "Wrong"
df2['Right/Wrong'] = np.where(df2['Reason'].isna(), 'Right', 'Wrong')

Output:

  Ultimate Parent Parent Child Right/Wrong           Reason
0               A      D     E       Right             None
1               C      B     A       Wrong  wrong hierarchy
2               C      A     B       Wrong  wrong hierarchy
3               G      A     B       Wrong   wrong entities
4               A      F     G       Wrong   wrong entities

Adding the hierarchy to df1:

cc = {frozenset(c): ','.join(nx.topological_sort(H))
      if nx.is_directed_acyclic_graph(H:=G1.subgraph(c))
      else 'cyclic hierarchy'
      for c in nx.weakly_connected_components(G1)}

df1['Hierarchy'] = df1['Child'].map({n: x for c,x in cc.items() for n in c})

Output:

  Ultimate Parent Parent Child         Hierarchy
0               A      B     C         A,B,C,D,E
1               B      C     D         A,B,C,D,E
2               C      D     E         A,B,C,D,E
3               X      Y     Z  cyclic hierarchy
4               Z      X     Y  cyclic hierarchy

Reproducible input:

df1 = pd.DataFrame({'Ultimate Parent': ['A', 'B', 'C', 'X', 'Z'],
                    'Parent': ['B', 'C', 'D', 'Y', 'X'],
                    'Child': ['C', 'D', 'E', 'Z', 'Y']})

df2 = pd.DataFrame({'Ultimate Parent': ['A', 'C', 'C', 'G', 'A'],
                    'Parent': ['D', 'B', 'A', 'A', 'F'],
                    'Child': ['E', 'A', 'B', 'B', 'G']})

Graph (df1):

enter image description here

11 Comments

awesome outcome. Exactly what I need for df2. But do you have any idea how to transform the Graph (df1) into df1's column showing their hierarchy (just like my original post)? Thanks!
@LH sure, see update. Note that adding the hierarchy to all rows might be uselessely duplicating information. Also be aware that the hierarchy here suppose that the subgraphs are not branched. If they are pretty me know and I'll update, but this would require more computations.
you're right. Some info will be duplicated but I can't figure a better way to do it so far. If I simply assign a group ID to each family, it can't show their relationship, although I guess I can group by ultimate parent->parent and get an idea. I've got another error from your codes above saying "networkXUnfeasible: Graph contains a cycle or graph changed during iteration". Will this error come from data cleaning issue? Eg. I have some rows having same value in all 3 columns if they're ultimate parent. I can remove them if needed but I'm not sure whether that's what this error means.
also, I just realized your method only check the parent and child pair. Therefore, for C, A, B --> this should be wrong hierarchy, but will be considered correct in your function
Can you update your example to reflect these two cases (cycle and different hierarchy)?
|
0

This is just a simplification of Mozway's answer. Hierarchy code can also be written as follows :

# Define weakly connected components
weakly_connected = nx.weakly_connected_components(G)

hierarchy_dict = {
node : hierarchy    
for wc in weakly_connected    
for subgraph in [G.subgraph(wc)]
for hierarchy in [','.join(nx.topological_sort(subgraph)) 
                  if nx.is_directed_acyclic_graph(subgraph) else 'Cyclic_Hierarchy']    
for node  in subgraph    
}
df1['Hierarchy'] = df1['Child'].map(hierarchy_dict)
print(df1)
"""
  Dadu Parent Child         Hierarchy
0    A      B     C         A,B,C,D,E
1    B      C     D         A,B,C,D,E
2    C      D     E         A,B,C,D,E
3    X      Y     Z  Cyclic_Hierarchy
4    Z      X     Y  Cyclic_Hierarchy
"""

# Vectorized hierarchy check using list comprehension and ternary operator
def check_row(row):
        up, p, c = row['Dadu'], row['Parent'], row['Child']
        if  {up,p,c}.issubset(G): #all(node in G for node in [up, p, c]):
            if nx.has_path(G, up, p) and (p, c) in G.edges:
                return 'Right'  # Valid hierarchy
            else:
                return 'wrong hierarchy'
        else:
            return 'wrong entities'

#OR
def check_row2(row):
    up, p, c = row['Dadu'], row['Parent'], row['Child']
    #if up not in G or p not in G or c not in G:
    if not {up, p, c}.issubset(G):    
        return 'wrong entities'
    if not nx.has_path(G, up, p) or (p, c) not in G.edges:
        return 'wrong hierarchy'
    return 'Right'

df2['Reason'] = df2.apply(check_row, axis=1)
df2['Right/Wrong'] = np.where(df2['Reason'] == 'Right', 'Right', 'Wrong')
print(df2)
"""
 Dadu Parent Child           Reason   Right/Wrong
0    A      D     E            Right       Right
1    C      B     A  wrong hierarchy       Wrong
2    C      A     B  wrong hierarchy       Wrong
3    G      A     B   wrong entities       Wrong
4    A      F     G   wrong entities       Wrong
"""
     

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.