Consider a tree and its DataFrame representation (left table):
0            ┌───────┬───────┐      ┌───────┬───────┐
├──1         │ id    │ parent│      │ id    │ path  │
│  ├──2      ├───────┼───────┤      ├───────┼───────┤
│  └──3      │ 5     │ 0     │      │ 5     │0/5    │
│     └──4   ├───────┼───────┤      ├───────┼───────┤
└──5         │ 4     │ 3     │      │ 4     │0/1/3/4│
             ├───────┼───────┤  =>  ├───────┼───────┤
             │ 3     │ 1     │      │ 3     │0/1/3  │
             ├───────┼───────┤      ├───────┼───────┤
             │ 2     │ 1     │      │ 2     │0/1/2  │
             ├───────┼───────┤      ├───────┼───────┤
             │ 1     │ 0     │      │ 1     │0/1    │
             ├───────┼───────┤      ├───────┼───────┤
             │ 0     │ null  │      │ 0     │0      │
             └───────┴───────┘      └───────┴───────┘
What is the most efficient way to get a tree path (starting from the root) for each node of the tree (right table)?
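To make the expected result precise, here is a minimal single-machine sketch in plain Python. It is not Spark code and not a scalable solution: it assumes the whole table fits in memory, and the names `rows` and `build_paths` are purely illustrative. It only pins down what the right-hand table should contain.

```python
# Plain-Python reference (NOT Spark): defines the expected id -> path mapping.
# Assumption: the whole (id, parent) table fits in memory on one machine.
rows = [(5, 0), (4, 3), (3, 1), (2, 1), (1, 0), (0, None)]  # (id, parent)


def build_paths(rows):
    """Return {id: 'root/.../id'} for every node in the (id, parent) rows."""
    parent = dict(rows)
    paths = {}

    def path_of(node):
        # Memoise so each node's path is computed only once.
        if node not in paths:
            p = parent[node]
            # The root has no parent, so its path is just its own id.
            paths[node] = str(node) if p is None else f"{path_of(p)}/{node}"
        return paths[node]

    for node in parent:
        path_of(node)
    return paths


paths = build_paths(rows)
```

What I am looking for is a way to compute the same mapping on a distributed DataFrame.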
Any method is allowed: SQL queries, DataFrame methods, GraphX, etc.
Note: the classic SQL solution with recursive joins will not work for Spark DataFrames.
Could mapPartitions be used for this?