Selecting a row from array<struct> based on given condition

Question

I have a DataFrame with the following schema:

|-- ID: string (nullable = true)
|-- VALUES: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- _v1: string (nullable = true)
|    |    |-- _v2: string (nullable = true)

Sample VALUES (one line per row):

[["ABC","a"],["PQR","c"],["XYZ","b"],["TUV","d"]]
[["PQR","g"]]
[["TUV","f"],["ABC","e"]]

I have to select a single struct from this array based on the value of _v1. These values have a priority order, from lowest to highest:

"ABC" -> "XYZ" -> "PQR" -> "TUV"

Now, if "TUV" is present, we will select the row with "TUV" in its _v1. Else we will check for "PQR". If "PQR" is present, take its row. Else check for "XYZ" and so on.

The result df should look like this (the column will now be a StructType, not Array[Struct]):

["TUV","d"]
["PQR","g"]
["TUV","f"]

Can someone please guide me on how to approach this problem by creating a UDF? Thanks in advance.

This is not in a list. I just have to follow this order. Check for "TUV", if it exists - take it. Else check for "PQR", if it exists - take it. Else check for "XYZ" and so on. But it can be placed in a list.
– Ishan, Commented Oct 28, 2017 at 7:58

Anahcolus · Accepted Answer · 2017-10-28 09:31:10Z

1

You can do something like this:

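import org.apache.spark.sql.functions._
import scala.collection.mutable
// spark.implicits._ must also be in scope for the $"..." column syntax

// Returns the index of the highest-priority _v1 value present in the array
val getIndex = udf((array: mutable.WrappedArray[String]) => {
  if (array.contains("TUV")) array.indexOf("TUV")
  else if (array.contains("PQR")) array.indexOf("PQR")
  else if (array.contains("XYZ")) array.indexOf("XYZ")
  else if (array.contains("ABC")) array.indexOf("ABC")
  else 0 // none of the known values found: fall back to the first element
})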

df.select($"VALUES"(getIndex($"VALUES._v1")).as("selected"))

You should get the following output:

+--------+
|selected|
+--------+
|[TUV,d] |
|[PQR,g] |
|[TUV,f] |
+--------+

I hope the answer is helpful

Updated

You can select a field of a struct column by using dot notation. Here $"VALUES._v1" selects the _v1 of every struct in the array and passes them to the UDF as an array, in the same order.

For example, for [["ABC","a"],["PQR","c"],["XYZ","b"],["TUV","d"]], $"VALUES._v1" returns ["ABC","PQR","XYZ","TUV"], which is passed to the UDF.

Inside the UDF, the index of the highest-priority string found in the array is returned. For example, for ["ABC","PQR","XYZ","TUV"], "TUV" matches, so it returns 3.

For the first row, getIndex($"VALUES._v1") returns 3, so $"VALUES"(getIndex($"VALUES._v1")) is equivalent to $"VALUES"(3), which is the fourth element of [["ABC","a"],["PQR","c"],["XYZ","b"],["TUV","d"]], i.e. ["TUV","d"].
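The walk-through above can be traced in plain Scala without a Spark session. This is only a sketch: `IndexTrace` and `getIndexPlain` are hypothetical names introduced here, and the tuples stand in for the (_v1, _v2) structs of the first sample row.

```scala
// Plain-Scala model of the index lookup; no Spark is needed to follow the logic.
object IndexTrace {
  // Mirrors the if/else chain in getIndex: the highest-priority value present wins.
  def getIndexPlain(v1s: Seq[String]): Int =
    if (v1s.contains("TUV")) v1s.indexOf("TUV")
    else if (v1s.contains("PQR")) v1s.indexOf("PQR")
    else if (v1s.contains("XYZ")) v1s.indexOf("XYZ")
    else if (v1s.contains("ABC")) v1s.indexOf("ABC")
    else 0 // same fallback as the UDF: first element

  def main(args: Array[String]): Unit = {
    // First sample row from the question, as (_v1, _v2) pairs.
    val values = Seq(("ABC", "a"), ("PQR", "c"), ("XYZ", "b"), ("TUV", "d"))
    val v1s = values.map(_._1)   // plays the role of $"VALUES._v1"
    val idx = getIndexPlain(v1s) // 3
    println(values(idx))         // prints (TUV,d)
  }
}
```

Note that the `else 0` branch means a row containing none of the four values still yields its first element, which may or may not be what you want.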

I hope the explanation is clear.

edited Oct 28, 2017 at 9:31

answered Oct 28, 2017 at 8:11

Anahcolus

2 Comments

Ishan Over a year ago

Thanks. It worked like a charm. Can you please explain your solution a little bit more, especially how you are passing the parameters to the UDF and getting the results?

Anahcolus Over a year ago

explanation is added. :) thanks for the upvote and acceptance

Shaido · Accepted Answer · 2017-10-29 05:29:24Z

1

This should work as long as each row contains each _v1 value at most once. The UDF returns the index of the best value according to the hierarchy list. The struct containing this value in _v1 is then selected and put into the "select" column.

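import org.apache.spark.sql.functions.udf

// Highest priority first
val hierarchy = List("TUV", "PQR", "XYZ", "ABC")

val findIndex = udf((rows: Seq[String]) => {
  val s = rows.toSet
  // First hierarchy entry present in this row; .head throws if none matches
  val best = hierarchy.filter(h => s contains h).head
  rows.indexOf(best)
})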

df.withColumn("select", $"VALUES"(findIndex($"VALUES._v1")))

A list is used for the order to make it easy to extend to more than 4 values.
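One thing to watch with the list-based lookup: calling `.head` on an empty filter result throws a `NoSuchElementException`, so a row with none of the listed values would fail. A minimal plain-Scala sketch of a variant that reports "no match" explicitly is shown below. `HierarchyPick` and `findIndexOpt` are hypothetical names, and how a `None` should be handled on the Spark side is left as an assumption for you to decide.

```scala
// Plain-Scala sketch of a list-driven lookup that does not throw on "no match".
object HierarchyPick {
  // Highest priority first, as in the answer's list; extend it as needed.
  val hierarchy: List[String] = List("TUV", "PQR", "XYZ", "ABC")

  // Index of the best-ranked value present in v1s, or None when nothing matches.
  def findIndexOpt(v1s: Seq[String], order: List[String] = hierarchy): Option[Int] =
    order.find(h => v1s.contains(h)).map(h => v1s.indexOf(h))

  def main(args: Array[String]): Unit = {
    println(findIndexOpt(Seq("ABC", "PQR", "XYZ", "TUV"))) // Some(3)
    println(findIndexOpt(Seq("TUV", "ABC")))               // Some(0)
    println(findIndexOpt(Seq("LMN")))                      // None
  }
}
```

Because the priority order is a parameter, supporting more than four values only requires passing a longer list.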

edited Oct 29, 2017 at 5:29

answered Oct 28, 2017 at 8:12

Shaido

2 Comments

Ishan Over a year ago

Thanks. Your method worked as expected. But what if I've n number of nested columns in VALUES, not just _v1 and _v2 ?

Shaido Over a year ago

@Ishan Didn't realize that was a constraint. Made it possible for there to be more nested values, the hierarchy is extendable to more than 4 values as well.
