
I have a data table with a hierarchical data model (a tree structure). Here are some sample rows:

-------------------------------------------
Id | name    |parentId | path       | depth
-------------------------------------------
55 | Canada  | null    | null       | 0
77 | Ontario |  55     | /55        | 1
100| Toronto |  77     | /55/77     | 2
104| Brampton| 100     | /55/77/100 | 3

I would like to flatten those rows by adding a column that lists the names of each row's ancestors. Sample output:

-------------------------------------------------------------------------
Id | name    |parentId | path       | depth | pathNames
-------------------------------------------------------------------------
55 | Canada  | null    | null       | 0     | None
77 | Ontario |  55     | /55        | 1     | Canada
100| Toronto |  77     | /55/77     | 2     | Canada, Ontario
104| Brampton| 100     | /55/77/100 | 3     | Canada, Ontario, Toronto

To clarify how pathNames is generated: each Id in the path is looked up in the same table and replaced by the matching name. So in the above example, /55/77/100 corresponds to /Canada/Ontario/Toronto.

Hope that makes sense.
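
For illustration, the lookup rule can be sketched in plain Scala, without Spark. The `Node` case class, the `PathNamesSketch` object, and the `pathNames` helper are hypothetical names introduced only for this sketch, and it assumes every Id in a path exists in the table:

```scala
object PathNamesSketch {
  // Hypothetical in-memory model of one table row.
  final case class Node(id: Int, name: String, parentId: Option[Int],
                        path: Option[String], depth: Int)

  // The sample rows from the table above.
  val rows: Seq[Node] = Seq(
    Node(55, "Canada", None, None, 0),
    Node(77, "Ontario", Some(55), Some("/55"), 1),
    Node(100, "Toronto", Some(77), Some("/55/77"), 2),
    Node(104, "Brampton", Some(100), Some("/55/77/100"), 3)
  )

  // Lookup table built from the same rows: Id -> name.
  val nameById: Map[Int, String] = rows.map(n => n.id -> n.name).toMap

  // Split the path on "/", drop the empty leading segment,
  // and replace each Id with its name. A missing path becomes "None".
  def pathNames(path: Option[String]): String = path match {
    case None => "None"
    case Some(p) =>
      p.split("/").filter(_.nonEmpty).map(s => nameById(s.toInt)).mkString(", ")
  }

  // Each row paired with its derived pathNames value.
  val flattened: Seq[(Node, String)] = rows.map(n => n -> pathNames(n.path))
}
```

Here `PathNamesSketch.flattened` pairs every row with the value expected in the pathNames column.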

  • Possible duplicate of Scala spark - Dealing with Hierarchy data tables Commented Mar 22, 2018 at 21:02
  • Almost the same, but the output is different. Commented Mar 22, 2018 at 21:17
  • It would make your question clearer if you explained where the pathNames come from (i.e. looked up by Id) rather than making the reader figure this out for themselves. Commented Mar 22, 2018 at 21:24
  • Oh okay, sure, I can make it clearer. I thought the pathNames column was obvious from looking at the path column. Commented Mar 22, 2018 at 21:26
  • Graphframes could be useful in this situation. Commented Mar 22, 2018 at 21:29

1 Answer

Maybe this will help with your specific problem.

You can build a map from the Id and name columns:

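// Assumes a SparkSession named `spark` and that the Id column is an integer
import org.apache.spark.sql.functions._
import spark.implicits._

// Build a lookup map Id -> name; collectAsMap brings it to the driver, so it must fit in memory
val idMap = test.distinct.select($"Id", $"name").rdd.map(r => (r.getInt(0), r.getString(1))).collectAsMap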

Then define a UDF (user-defined function) that maps the string

/55/77

to the string

Canada,Ontario

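// Split on "/", drop the empty leading segment, and look up each Id (throws if an Id is not in idMap)
val pathMap = udf((p: String) => p.split("/").filter(_.nonEmpty).map(id => idMap(id.toInt)).mkString(","))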

Finally, add a new column using this UDF and the path column, with "None" for rows whose path is null:

test.select(col("*"), when($"path".isNull, "None").otherwise(pathMap($"path")).as("pathNames")).show(false)

This gives you the DataFrame you want:

+---+--------+--------+----------+-----+----------------------+
|Id |name    |parentId|path      |depth|pathNames             |
+---+--------+--------+----------+-----+----------------------+
|55 |Canada  |null    |null      |0    |None                  |
|77 |Ontario |55      |/55       |1    |Canada                |
|100|Toronto |77      |/55/77    |2    |Canada,Ontario        |
|104|Brampton|100     |/55/77/100|3    |Canada,Ontario,Toronto|
+---+--------+--------+----------+-----+----------------------+
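
The UDF above throws if the path is null or contains an Id that is missing from the map. As a hedged sketch, the lookup could be pulled out into a plain function that tolerates both cases and can be tested without Spark. `SafePathResolver` and `resolve` are hypothetical names, and silently skipping unknown or non-numeric segments is an assumption made for this sketch, not something the question asks for:

```scala
import scala.util.Try

object SafePathResolver {
  // Resolve "/55/77" to "Canada,Ontario".
  // A null path, or a path with no resolvable segment, yields "None".
  // Segments that are not integers or are missing from idMap are skipped.
  def resolve(idMap: collection.Map[Int, String])(path: String): String = {
    val names = Option(path).toSeq            // null becomes an empty sequence
      .flatMap(_.split("/").toSeq)            // "/55/77" -> "", "55", "77"
      .filter(_.nonEmpty)                     // drop the empty leading segment
      .flatMap(seg => Try(seg.toInt).toOption.toSeq) // drop non-numeric segments
      .flatMap(id => idMap.get(id).toSeq)     // drop Ids with no matching name
    if (names.isEmpty) "None" else names.mkString(",")
  }
}
```

Since `collectAsMap` returns a `scala.collection.Map`, the `idMap` built earlier should fit the first parameter, so the function could be wrapped as `udf((p: String) => SafePathResolver.resolve(idMap)(p))`. Because the function handles null itself, it would not depend on the `when(...isNull...)` guard.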

Hope this helps!

P.S. Sorry for my English.
