
Consider a tree and its DataFrame representation (left table):

0             ┌───────┬───────┐           ┌───────┬───────┐
├──1          │   id  │ parent│           │   id  │ path  │
│  ├──2       ├───────┼───────┤           ├───────┼───────┤
│  └──3       │   5   │   0   │           │   5   │0/5    │
│     └──4    ├───────┼───────┤           ├───────┼───────┤
└──5          │   4   │   3   │           │   4   │0/1/3/4│
              ├───────┼───────┤     =>    ├───────┼───────┤
              │   3   │   1   │           │   3   │0/1/3  │
              ├───────┼───────┤           ├───────┼───────┤
              │   2   │   1   │           │   2   │0/1/2  │
              ├───────┼───────┤           ├───────┼───────┤
              │   1   │   0   │           │   1   │0/1    │
              ├───────┼───────┤           ├───────┼───────┤
              │   0   │ null  │           │   0   │0      │
              └───────┴───────┘           └───────┴───────┘

What is the most efficient way to get a tree path (starting from the root) for each node of the tree (right table)?

Any approach is allowed: SQL queries, DataFrame methods, GraphX, etc.

Note: the classic SQL solution with recursive joins will not work for Spark DataFrames.

  • I suspect GraphX would be the way to go, but I doubt it would be very efficient. Commented Apr 11, 2019 at 9:06
  • Yes, it seems this task could be solved without initializing a Graph. Commented Apr 11, 2019 at 9:13
  • @OlegMikhailov, how about RDD's mapPartitions? Commented Apr 17, 2019 at 2:13
  • @Sai, all methods are good as long as they are effective. Commented Apr 18, 2019 at 8:27
  • @OlegMikhailov, why do you say "classic SQL solution with recursive joins will not work for Spark DataFrames"? I thought large table joins (in this case with themselves) are fast in Spark. Commented May 5, 2023 at 21:12

1 Answer

This looks like a task for the Spark graph APIs. Take a look at the GraphFrames Spark package: it provides high-level APIs on top of GraphX, much as DataFrames provide a high-level API over RDDs, and it lets you build a graph directly from your DataFrames.

Look at this link: https://mapr.com/blog/analyzing-flight-delays-with-apache-spark-graphframes-and-mapr-db/

It shows a use case with flight data. If you look at the Breadth First Search Graph Algorithm section, you will see an algorithm that does exactly what you want: finding a path between two vertices (given a maxPathLength parameter).

Run pyspark with the graphframes dependency (matching your Spark version):

pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11

Building your DataFrame:

df = sc.parallelize([
    {"id": 5, "parent": 0},
    {"id": 4, "parent": 3},
    {"id": 3, "parent": 1},
    {"id": 2, "parent": 1},
    {"id": 1, "parent": 0},
    {"id": 0, "parent": None},
]).toDF()
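
Schema inference from plain dicts is deprecated in recent PySpark versions; an equivalent construction that avoids the warning (a minor variant, not from the original answer, assuming the SparkSession is available as spark):

# same data, built from tuples with an explicit column list
df = spark.createDataFrame(
    [(5, 0), (4, 3), (3, 1), (2, 1), (1, 0), (0, None)],
    ["id", "parent"],
)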

Creating a graph:

# vertices: one row per node, exposed through an "id" column
df_vertices = df.selectExpr("id")

# edges point parent -> child; drop the root's null parent so that
# no edge with a null src is created
df_edges = (df.where("parent is not null")
              .withColumnRenamed("id", "dst")
              .withColumnRenamed("parent", "src"))

from graphframes import GraphFrame
graph = GraphFrame(df_vertices, df_edges)

Search for a path (from 0 to 4, for example):

graph.bfs(fromExpr="id = 0", toExpr="id = 4", maxPathLength=10).show(2)

Result:

+----+------+---+------+---+------+---+
|from|    e0| v1|    e1| v2|    e2| to|
+----+------+---+------+---+------+---+
| [0]|[1, 0]|[1]|[3, 1]|[3]|[4, 3]|[4]|
+----+------+---+------+---+------+---+
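
The bfs call recovers the path for a single pair of vertices. To produce the path column for every node at once (the right table in the question), one plain-DataFrame option is to prepend ancestors iteratively, one self-join per tree level. This is a minimal sketch, not part of the original answer, assuming a tree of modest depth:

from pyspark.sql import functions as F

# start: each node's path is its own id; its parent is the next
# ancestor left to resolve
paths = df.select(
    F.col("id"),
    F.col("parent").alias("ancestor"),
    F.col("id").cast("string").alias("path"),
)

# prepend one ancestor per pass; the loop runs as many times as the
# tree is deep (count() triggers a job each pass, so cache() or
# checkpoint() between passes helps on deep trees)
while paths.where(F.col("ancestor").isNotNull()).count() > 0:
    paths = (paths.alias("p")
        .join(df.alias("d"), F.col("p.ancestor") == F.col("d.id"), "left")
        .select(
            F.col("p.id"),
            F.col("d.parent").alias("ancestor"),
            F.when(F.col("p.ancestor").isNotNull(),
                   F.concat_ws("/", F.col("p.ancestor").cast("string"),
                               F.col("p.path")))
             .otherwise(F.col("p.path"))
             .alias("path"),
        ))

paths.orderBy(F.desc("id")).show()

Each pass costs one join, so the total work grows with the tree depth rather than the node count, which is what makes this viable where a recursive CTE is unavailable.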