Consider the following code:
question = spark.createDataFrame([{'A': 1, 'B': 5}, {'A': 2, 'B': 5},
                                  {'A': 3, 'B': 5}, {'A': 3, 'B': 6}])
#+---+---+
#| A| B|
#+---+---+
#| 1| 5|
#| 2| 5|
#| 3| 5|
#| 3| 6|
#+---+---+
How can I create a Spark DataFrame that looks as follows:
solution = spark.createDataFrame([{'C': 1, 'D': 2}, {'C': 1, 'D': 3},
                                  {'C': 2, 'D': 3}, {'C': 5, 'D': 6}])
#+---+---+
#| C| D|
#+---+---+
#| 1| 2|
#| 1| 3|
#| 2| 3|
#| 5| 6|
#+---+---+
This is the notion of triadic closure: I want to add the third edge of a triangle whenever the other two edges are already present.
I must have (1,2) since (1,5) and (2,5) are present, (1,3) since (1,5) and (3,5) are present, and (2,3) since (2,5) and (3,5) are present. I must also have (5,6) since (3,5) and (3,6) are present (an edge counts in both directions). There should NOT be an additional entry for (5,6): 3 is the only value in A that maps to 6, so no pair of values from A is connected through 6, and (5,6) does not get added a second time.
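To make the rule concrete, here is a minimal sketch of one way to read it, not a definitive answer. It assumes edges are undirected and that a pair (C, D) with C < D is emitted once whenever the two values share at least one neighbour. Pairs that already exist as edges are not filtered out (none occur in the example). The names `triadic_closure` and `triadic_closure_df` are hypothetical helpers. The second one expresses the same rule as a self-join and assumes PySpark is installed and that the input has columns A and B.

```python
from collections import defaultdict
from itertools import combinations


def triadic_closure(edges):
    """Return sorted pairs (c, d), c < d, that share at least one neighbour."""
    # Treat every edge as undirected: each endpoint is a neighbour of the other.
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    closed = set()
    for shared in neighbours.values():
        # Any two distinct neighbours of one node form the third edge of a triangle.
        closed.update(combinations(sorted(shared), 2))
    return sorted(closed)


def triadic_closure_df(df):
    """Same rule as a self-join on a DataFrame with columns A and B (assumption)."""
    # Imported lazily so the pure-Python function above runs without Spark.
    from pyspark.sql import functions as F

    # Symmetrise the edge list so that either endpoint can act as the shared node.
    sym = df.select(F.col("A").alias("src"), F.col("B").alias("dst")).union(
        df.select(F.col("B").alias("src"), F.col("A").alias("dst"))
    )
    left, right = sym.alias("l"), sym.alias("r")
    return (
        left.join(right, F.col("l.src") == F.col("r.src"))
        # Keep each unordered pair once and drop pairs of a node with itself.
        .where(F.col("l.dst") < F.col("r.dst"))
        .select(F.col("l.dst").alias("C"), F.col("r.dst").alias("D"))
        .distinct()
    )


print(triadic_closure([(1, 5), (2, 5), (3, 5), (3, 6)]))
# prints [(1, 2), (1, 3), (2, 3), (5, 6)]
```

With the DataFrame from the question, `triadic_closure_df(question)` should yield the same four rows as `solution`, although Spark does not guarantee the row order unless you add an `orderBy`.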