I have two dataframes that capture the hierarchy of the same dataset. Df1 is more complete than Df2, so I want to use Df1 as the reference to check whether the hierarchy in Df2 is correct. However, both dataframes represent the hierarchy in a fragmented way, so it's hard to see the complete structure from any single row.
E.g. company A may have subsidiaries B, C, D and E, where A owns B, B owns C, C owns D, and D owns E. Df1 may then show:
| Ultimate Parent | Parent | Child |
| --------------- | ------ |-------|
| A | B | C |
| B | C | D |
| C | D | E |
So if you analyze it row by row, the same entity can appear as "Ultimate Parent" in one row and as "Child" in another, which makes it complicated.
Df2, on the other hand, is incomplete, so it won't contain all the entities (A, B, C, D, E). It will only contain some of them, e.g. A, D and E in this case, so the dataframe will look like this:
| Ultimate Parent | Parent | Child |
| --------------- | ------ |-------|
| A | D | E |
Now I want to (1) use Df1 to get the correct, complete hierarchy and (2) compare Df1 with Df2 to identify the gaps. The logic is as follows:
If A owns B owns C owns D owns E and Df1 looks like this:
| Ultimate Parent | Parent | Child |
| --------------- | ------ |-------|
| A | B | C |
| C | D | E |
I want to add one column that lists all the related entities together, in order from ultimate parent to child:
| Ultimate Parent | Parent | Child | Hierarchy |
| --------------- | ------ |-------|-------------|
| A | B | C |A, B, C, D, E|
| C | D | E |A, B, C, D, E|
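
To make step (1) concrete, here is a minimal sketch of how the `Hierarchy` column could be built for this toy case. It rests on assumptions that are not confirmed for the real data: within a row, each column directly owns the next one (Ultimate Parent owns Parent, Parent owns Child), and every entity has at most one direct owner and at most one direct subsidiary (a single linear chain, no branching). The helper name `add_hierarchy` is hypothetical.

```python
import pandas as pd

COLS = ["Ultimate Parent", "Parent", "Child"]

def add_hierarchy(df1):
    """Return a copy of df1 with a 'Hierarchy' column (hypothetical helper)."""
    sub_of = {}    # owner -> direct subsidiary (assumes at most one)
    owner_of = {}  # subsidiary -> direct owner (assumes at most one)
    for up, parent, child in df1[COLS].itertuples(index=False, name=None):
        # Assumption: adjacent columns in a row are direct ownership links.
        for owner, sub in ((up, parent), (parent, child)):
            sub_of[owner] = sub
            owner_of[sub] = owner

    def chain(entity):
        # Walk up to the top of the chain, guarding against cycles.
        seen = {entity}
        while entity in owner_of and owner_of[entity] not in seen:
            entity = owner_of[entity]
            seen.add(entity)
        # Walk back down from the top to the last subsidiary.
        path = [entity]
        while path[-1] in sub_of and sub_of[path[-1]] not in path:
            path.append(sub_of[path[-1]])
        return path

    out = df1.copy()
    out["Hierarchy"] = [", ".join(chain(e)) for e in out["Ultimate Parent"]]
    return out

df1 = pd.DataFrame([("A", "B", "C"), ("C", "D", "E")], columns=COLS)
df1_with_hierarchy = add_hierarchy(df1)
print(df1_with_hierarchy)
```

If an entity can own several subsidiaries, the two dictionaries would overwrite entries and this sketch would no longer be valid; a proper graph structure would be needed in that case.
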
Then I want to compare this Df1 with Df2 and add a column to Df2 that flags the gaps. Ideally (but optionally), another column would state the reason when a row is wrong.
| Ultimate Parent | Parent | Child | Right/Wrong| Reason |
| --------------- | ------ |-------|------------|-----------------|
| A | D | E | Right | |
| C | B | A | Wrong | wrong hierarchy |
| C | A | B | Wrong | wrong hierarchy |
| G | A | B | Wrong | wrong entities |
| A | F | G | Wrong | wrong entities |
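
The rule behind this table can be sketched as follows. This is only an illustration under stated assumptions: the ordered chain is taken as already known (here hard-coded as the list `hierarchy`), a row counts as "Right" when its three entities appear in the same relative order as in the chain (they do not need to be adjacent, as with A, D, E), and a row with any unknown entity is reported as "wrong entities" before the order is checked. The helper name `check_row` is hypothetical.

```python
import pandas as pd

COLS = ["Ultimate Parent", "Parent", "Child"]

# Assumed to be already derived from Df1 (top owner first).
hierarchy = ["A", "B", "C", "D", "E"]
position = {name: i for i, name in enumerate(hierarchy)}

def check_row(ultimate_parent, parent, child):
    """Return (verdict, reason) for one Df2 row (hypothetical helper)."""
    names = (ultimate_parent, parent, child)
    # Any entity missing from the reference chain -> wrong entities.
    if any(name not in position for name in names):
        return "Wrong", "wrong entities"
    ranks = [position[name] for name in names]
    # Entities must appear in strictly increasing chain order.
    if ranks[0] < ranks[1] < ranks[2]:
        return "Right", ""
    return "Wrong", "wrong hierarchy"

df2 = pd.DataFrame(
    [("A", "D", "E"), ("C", "B", "A"), ("C", "A", "B"),
     ("G", "A", "B"), ("A", "F", "G")],
    columns=COLS,
)
results = [check_row(*row) for row in df2[COLS].itertuples(index=False, name=None)]
df2["Right/Wrong"] = [verdict for verdict, _ in results]
df2["Reason"] = [reason for _, reason in results]
print(df2)
```

Comparing positions in the chain, rather than comparing strings, is what lets the order be checked even when the related entities are spread across different rows of Df1.
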
I have tried several string-matching methods, but I'm stuck on the core idea: I think order matters, but I don't know how to compare entities in order when they are related but scattered across different rows.
