
Consider a tree and its DataFrame representation (left table):

0             ┌───────┬───────┐           ┌───────┬───────┐
├──1          │   id  │ parent│           │   id  │ path  │
│  ├──2       ├───────┼───────┤           ├───────┼───────┤
│  └──3       │   5   │   0   │           │   5   │0/5    │
│     └──4    ├───────┼───────┤           ├───────┼───────┤
└──5          │   4   │   3   │           │   4   │0/1/3/4│
              ├───────┼───────┤     =>    ├───────┼───────┤
              │   3   │   1   │           │   3   │0/1/3  │
              ├───────┼───────┤           ├───────┼───────┤
              │   2   │   1   │           │   2   │0/1/2  │
              ├───────┼───────┤           ├───────┼───────┤
              │   1   │   0   │           │   1   │0/1    │
              ├───────┼───────┤           ├───────┼───────┤
              │   0   │ null  │           │   0   │0      │
              └───────┴───────┘           └───────┴───────┘

What is the most efficient way to get a tree path (starting from the root) for each node of the tree (right table)?

Any approach is allowed: SQL queries, DataFrame methods, GraphX, etc.

Note: the classic SQL solution with recursive joins will not work for Spark DataFrames.

  • I suspect GraphX would be the way to go, but I doubt it would be very efficient. Commented Apr 11, 2019 at 9:06
  • Yes, it seems this task could be solved without initializing a Graph. Commented Apr 11, 2019 at 9:13
  • @OlegMikhailov, how about RDD's mapPartitions? Commented Apr 17, 2019 at 2:13
  • @Sai, all methods are good as long as they are effective. Commented Apr 18, 2019 at 8:27
  • @OlegMikhailov, why do you say "classic SQL solution with recursive joins will not work for Spark DataFrames"? I thought large table joins (in this case with themselves) are fast in Spark. Commented May 5, 2023 at 21:12

1 Answer

This looks like a task for the Spark graph APIs. Take a look at the GraphFrames Spark package: it provides high-level APIs on top of GraphX, much as DataFrames provide a high-level API over RDDs, and it lets you build a graph directly from your DataFrames.

Look at this link: https://mapr.com/blog/analyzing-flight-delays-with-apache-spark-graphframes-and-mapr-db/

It shows a use case with flight data. If you look at the Breadth First Search Graph Algorithm section, you will see an algorithm that does exactly what you want: finding a path between two vertices (given a maxPathLength parameter).

Run pyspark with the graphframes dependency (matching your Spark version):

pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11

Building your DataFrame:

df = sc.parallelize([
    {"id": 5, "parent": 0},
    {"id": 4, "parent": 3},
    {"id": 3, "parent": 1},
    {"id": 2, "parent": 1},
    {"id": 1, "parent": 0},
    {"id": 0, "parent": None},
]).toDF()
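
Schema inference from plain dicts is deprecated in recent PySpark versions; an equivalent construction that avoids the warning (a minor variant, not from the original answer, assuming the SparkSession is available as spark):

# same data, built from tuples with an explicit column list
df = spark.createDataFrame(
    [(5, 0), (4, 3), (3, 1), (2, 1), (1, 0), (0, None)],
    ["id", "parent"],
)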

Creating a graph:

# vertices: one row per node, exposed through an "id" column
df_vertices = df.selectExpr("id")

# edges point parent -> child; drop the root's null parent so that
# no edge with a null src is created
df_edges = (df.where("parent is not null")
              .withColumnRenamed("id", "dst")
              .withColumnRenamed("parent", "src"))

from graphframes import GraphFrame
graph = GraphFrame(df_vertices, df_edges)

Search for a path (from 0 to 4, for example):

graph.bfs(fromExpr="id = 0", toExpr="id = 4", maxPathLength=10).show(2)

Result:

+----+------+---+------+---+------+---+
|from|    e0| v1|    e1| v2|    e2| to|
+----+------+---+------+---+------+---+
| [0]|[1, 0]|[1]|[3, 1]|[3]|[4, 3]|[4]|
+----+------+---+------+---+------+---+
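
The bfs call recovers the path for a single pair of vertices. To produce the path column for every node at once (the right table in the question), one plain-DataFrame option is to prepend ancestors iteratively, one self-join per tree level. This is a minimal sketch, not part of the original answer, assuming a tree of modest depth:

from pyspark.sql import functions as F

# start: each node's path is its own id; its parent is the next
# ancestor left to resolve
paths = df.select(
    F.col("id"),
    F.col("parent").alias("ancestor"),
    F.col("id").cast("string").alias("path"),
)

# prepend one ancestor per pass; the loop runs as many times as the
# tree is deep (count() triggers a job each pass, so cache() or
# checkpoint() between passes helps on deep trees)
while paths.where(F.col("ancestor").isNotNull()).count() > 0:
    paths = (paths.alias("p")
        .join(df.alias("d"), F.col("p.ancestor") == F.col("d.id"), "left")
        .select(
            F.col("p.id"),
            F.col("d.parent").alias("ancestor"),
            F.when(F.col("p.ancestor").isNotNull(),
                   F.concat_ws("/", F.col("p.ancestor").cast("string"),
                               F.col("p.path")))
             .otherwise(F.col("p.path"))
             .alias("path"),
        ))

paths.orderBy(F.desc("id")).show()

Each pass costs one join, so the total work grows with the tree depth rather than the node count, which is what makes this viable where a recursive CTE is unavailable.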