
Let's assume that we have a Spark DataFrame

df.getClass
Class[_ <: org.apache.spark.sql.DataFrame] = class org.apache.spark.sql.DataFrame

with the following schema

df.printSchema
root
|-- rawFV: string (nullable = true)
|-- tk: array (nullable = true)
|    |-- element: string (containsNull = true)

Given that each row of the tk column is an array of strings, how can I write a Scala function that returns the number of elements in each row?

2 Answers


You don't have to write a custom function because there is one:

import org.apache.spark.sql.functions.size

df.select(size($"tk"))
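A quick usage sketch, assuming the schema above (the tk_size alias is just illustrative):

// Select the element count per row, or attach it as a new column.
df.select(size($"tk").alias("tk_size")).show()
df.withColumn("tk_size", size($"tk")).show()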

If you really want to, you can write a UDF:

import org.apache.spark.sql.functions.udf

// Wrap a plain Scala function as a UDF that returns the array length.
val size_ = udf((xs: Seq[String]) => xs.size)

or even create a custom expression, but there is really no point in doing that.
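A minimal sketch of applying that UDF to the DataFrame, using the same column name as above:

// Apply the UDF exactly like a built-in function.
df.select(size_($"tk")).show()
df.withColumn("tk_size", size_($"tk")).show()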


7 Comments

Perfect! For generality, I would like to know how to apply a UDF to a dataframe. Could you point me to a simple example?
There are dozens of examples on SO (a couple of examples), and as always the source (especially the tests) is a good place to start.
How would you use this size_ function?
The same way as the built-in size: size_($"tk").
What if I want to define size_ with a def? I understand it may look like complete overkill, but this way it would be very easy to change it to something else. (See the sketch below.)
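A hedged sketch of the def variant asked about in the last comment (the method name arrayLength is illustrative, not from the original answer):

import org.apache.spark.sql.functions.udf

// A plain method that can be redefined without touching the UDF wrapper.
def arrayLength(xs: Seq[String]): Int = xs.size

// Eta-expand the method into a function value and wrap it as a UDF.
val size_ = udf(arrayLength _)

df.select(size_($"tk")).show()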

One way is to access the array elements using SQL, as below.

// Register the DataFrame as a temporary table so it can be queried with SQL.
df.registerTempTable("tab1")

// Individual array elements are accessed by index.
val df2 = sqlContext.sql("select tk[0], tk[1] from tab1")
df2.show()

To get the size of the array column:

val df3 = sqlContext.sql("select size(tk) from tab1")
df3.show()

If your Spark version is older, you can use HiveContext instead of Spark's SQLContext.
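A legacy-API sketch for Spark 1.x, assuming an existing SparkContext named sc:

import org.apache.spark.sql.hive.HiveContext

// HiveContext exposes Hive functions such as size() on older Spark versions.
val hiveContext = new HiveContext(sc)
hiveContext.sql("select size(tk) from tab1").show()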

I would also try something that traverses the array elements themselves, for example with explode, as sketched below.
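A hedged sketch of one such traversal, exploding the array into one row per element (this assumes rawFV uniquely identifies a row, which may not hold in general):

import org.apache.spark.sql.functions.{explode, count}

// Produce one output row per array element.
val exploded = df.select($"rawFV", explode($"tk").alias("token"))

// Counting the exploded rows per original row recovers the array size.
exploded.groupBy($"rawFV").agg(count("token").alias("tk_size")).show()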

