I have a DataFrame with the following schema:

|-- ID: string (nullable = true)
|-- VALUES: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- _v1: string (nullable = true)
|    |    |-- _v2: string (nullable = true)

The VALUES column contains rows like:

[["ABC","a"],["PQR","c"],["XYZ","b"],["TUV","d"]]
[["PQR","g"]]
[["TUV","f"],["ABC","e"]]

I need to select a single struct from this array based on the value of _v1. The values follow a priority order, from lowest to highest:

"ABC" -> "XYZ" -> "PQR" -> "TUV"

If "TUV" is present, select the struct with "TUV" in its _v1. Otherwise check for "PQR" and take its struct if present. Otherwise check for "XYZ", and so on.

The resulting column should be a StructType rather than an Array[Struct], and should look like this:

["TUV","d"]
["PQR","g"]
["TUV","f"]

Can someone please guide me on how to approach this problem by creating a UDF? Thanks in advance.

  • Is the order "ABC" -> "XYZ" -> "PQR" -> "TUV" in a list? Commented Oct 28, 2017 at 7:55
  • This is not in a list. I just have to follow this order. Check for "TUV", if it exists - take it. Else check for "PQR", if it exists - take it. Else check for "XYZ" and so on. But it can be placed in a list. Commented Oct 28, 2017 at 7:58

2 Answers

You can do something like the following:

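import org.apache.spark.sql.functions._
import scala.collection.mutable

// Returns the index of the highest-priority name present in the array (TUV first).
val getIndex = udf((array: mutable.WrappedArray[String]) => {
  if (array.contains("TUV")) array.indexOf("TUV")
  else if (array.contains("PQR")) array.indexOf("PQR")
  else if (array.contains("XYZ")) array.indexOf("XYZ")
  else if (array.contains("ABC")) array.indexOf("ABC")
  else 0 // no known name found: fall back to the first element
})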

df.select($"VALUES"(getIndex($"VALUES._v1")).as("selected"))

You should get the following output:

+--------+
|selected|
+--------+
|[TUV,d] |
|[PQR,g] |
|[TUV,f] |
+--------+

I hope the answer is helpful

Updated

You can select the fields of a struct column using dot notation. Here, $"VALUES._v1" selects the _v1 field of every struct in the array and passes them to the UDF as an array, in the same order.

For example, for [["ABC","a"],["PQR","c"],["XYZ","b"],["TUV","d"]], $"VALUES._v1" returns ["ABC","PQR","XYZ","TUV"], which is passed to the UDF.

Inside the UDF, the index of the matching string is returned. For example, for ["ABC","PQR","XYZ","TUV"], "TUV" matches, so it returns 3.

For the first row, getIndex($"VALUES._v1") returns 3, so $"VALUES"(getIndex($"VALUES._v1")) is equivalent to $"VALUES"(3), which is the fourth element of [["ABC","a"],["PQR","c"],["XYZ","b"],["TUV","d"]], i.e. ["TUV","d"].
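The walk-through above can be reproduced in plain Scala, without Spark. This is only a sketch: the case class Value and the object IndexWalkthrough are hypothetical stand-ins for the struct and the UDF, and values.map(_.v1) plays the role of $"VALUES._v1".

```scala
// Hypothetical stand-in for one struct of the VALUES array.
final case class Value(v1: String, v2: String)

object IndexWalkthrough {
  // Same priority order as the UDF: TUV is checked first, ABC last.
  val priority: Seq[String] = Seq("TUV", "PQR", "XYZ", "ABC")

  // Mirrors getIndex: position of the highest-priority name, or 0 if none match.
  def getIndex(names: Seq[String]): Int =
    priority.find(p => names.contains(p)).map(p => names.indexOf(p)).getOrElse(0)

  def main(args: Array[String]): Unit = {
    val values = Seq(Value("ABC", "a"), Value("PQR", "c"), Value("XYZ", "b"), Value("TUV", "d"))
    val names = values.map(_.v1) // analogous to $"VALUES._v1"
    val idx = getIndex(names)    // 3 for this row
    println(values(idx))         // prints Value(TUV,d)
  }
}
```

The same two steps happen per row in Spark: project the names, compute one index, then use that index to pick the struct.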

I hope the explanation is clear.


2 Comments

Thanks. It worked like a charm. Can you please explain your solution a little bit more, especially how you are passing the parameters to the UDF and getting the results?
Explanation is added. :) Thanks for the upvote and acceptance.

This should work as long as each row contains each _v1 value at most once. The UDF returns the index of the best value according to the hierarchy list. The struct containing this value in _v1 is then selected and put into the "select" column.

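import org.apache.spark.sql.functions.udf

// Highest priority first.
val hierarchy = List("TUV", "PQR", "XYZ", "ABC")

val findIndex = udf((rows: Seq[String]) => {
  val s = rows.toSet
  // Assumes at least one hierarchy value is present; .head throws otherwise.
  val best = hierarchy.filter(h => s contains h).head
  rows.indexOf(best)
})

df.withColumn("select", $"VALUES"(findIndex($"VALUES._v1")))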

A list is used for the order to make it easy to extend to more than 4 values.
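One case the list-based lookup leaves open is a row in which none of the hierarchy values occurs, because calling .head on an empty list throws. A minimal sketch of the same lookup in plain Scala, outside Spark, that returns an Option instead (the object name SafeHierarchy is hypothetical, and treating a null input as "no match" is an assumption):

```scala
object SafeHierarchy {
  // Highest priority first; extend this list to support more values.
  val hierarchy: List[String] = List("TUV", "PQR", "XYZ", "ABC")

  // Index of the highest-priority name, or None if the input is null
  // or contains no name from the hierarchy.
  def findIndex(names: Seq[String]): Option[Int] =
    Option(names).flatMap { ns =>
      hierarchy.find(h => ns.contains(h)).map(h => ns.indexOf(h))
    }

  def main(args: Array[String]): Unit = {
    println(findIndex(Seq("ABC", "PQR", "XYZ", "TUV"))) // prints Some(3)
    println(findIndex(Seq("LMN")))                      // prints None
  }
}
```

Whether a missing match should become a null result or fall back to a default index depends on the data, so the sketch leaves that decision to the caller.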

2 Comments

Thanks. Your method worked as expected. But what if I've n number of nested columns in VALUES, not just _v1 and _v2 ?
@Ishan Didn't realize that was a constraint. Made it possible for there to be more nested values, the hierarchy is extendable to more than 4 values as well.
