
I'm new to Scala. I have a DataFrame named df as follows:

+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [null]|  [0.0]|  [0.0]| [null]|
| [IND1]|  [5.0]|  [6.0]|    [A]|
| [IND2]|  [7.0]|  [8.0]|    [B]|
|     []|     []|     []|     []|
+-------+-------+-------+-------+

I'd like to delete rows in which every column is an empty array (the 4th row here).

The expected result would be:

+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [null]|  [0.0]|  [0.0]| [null]|
| [IND1]|  [5.0]|  [6.0]|    [A]|
| [IND2]|  [7.0]|  [8.0]|    [B]|
+-------+-------+-------+-------+

I tried isNotNull (e.g. val temp = df.filter(col("Column1").isNotNull && col("Column2").isNotNull && col("Column3").isNotNull && col("Column4").isNotNull).show()), but it still shows all rows.

I found a Python solution using a Hive UDF (via a link), but I had a hard time converting it to valid Scala. I'd like to end up with Scala code along these lines:

val columns = Seq("Column1", "Column2", "Column3", "Column4")
val query = "SELECT * FROM targetDf WHERE " + columns.map(c => s"SIZE($c) > 0").mkString(" AND ")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.sql(query)

Any help would be appreciated. Thank you.

2 Answers

Using isNotNull or isNull will not work because they look for null values in the DataFrame. Your example DF does not contain nulls but empty arrays, and there is a difference.
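
To see the difference on the question's df (a quick sketch; the count is what I'd expect given the four rows shown, not verified output):

import org.apache.spark.sql.functions.col

// An empty array [] is a value, not NULL, so the all-empty row
// still passes an isNotNull filter.
df.filter(col("Column1").isNotNull).count() // 4 rows, not 3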

One option: create a new column holding the length of each array, then filter out rows where that length is zero.

  import org.apache.spark.sql.functions.size
  import spark.implicits._ // for the $"colName" syntax (assumes a SparkSession named spark, as in spark-shell)

  val dfFil = df
    .withColumn("arrayLengthColOne", size($"Column1"))
    .withColumn("arrayLengthColTwo", size($"Column2"))
    .withColumn("arrayLengthColThree", size($"Column3"))
    .withColumn("arrayLengthColFour", size($"Column4"))
    .filter($"arrayLengthColOne" =!= 0 && $"arrayLengthColTwo" =!= 0
    && $"arrayLengthColThree" =!= 0 && $"arrayLengthColFour" =!= 0)
    .drop("arrayLengthColOne", "arrayLengthColTwo", "arrayLengthColThree", "arrayLengthColFour")

Original DF:

+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
|    [A]|    [B]|    [C]|    [d]|
|     []|     []|     []|     []|
+-------+-------+-------+-------+

New DF:

+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
|    [A]|    [B]|    [C]|    [d]|
+-------+-------+-------+-------+

You could also write a function that maps across all the columns and builds the filter in one step, as sketched below.
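
A minimal sketch of that idea (assuming the same four array columns; the helper name allNonEmpty is mine): map size over the column names and reduce the predicates with &&.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, size}

// Hypothetical helper: build one predicate requiring every listed
// array column to be non-empty.
def allNonEmpty(columns: Seq[String]): Column =
  columns.map(c => size(col(c)) =!= 0).reduce(_ && _)

val dfFil = df.filter(allNonEmpty(Seq("Column1", "Column2", "Column3", "Column4")))

If you instead want to drop only the rows where every column is empty (the question's exact case), reduce with _ || _ instead of _ && _.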


4 Comments

Hi fletchr, thanks for the reply. It's a very good direction. My only concern is that this withColumn filter condition only checks size($"Column1"), so it may delete | []| [0.0]| [0.0]| [A]|. Is there any way to modify the condition to filter on the sum of size($"Column1"), size($"Column2"), size($"Column3"), and size($"Column4")?
Let me know if this clarifies things. It's a bit iterative and cumbersome, but I think it will solve your problem. I just edited my answer; I'll let you find a more elegant solution.
Thank you for the update, I think this should be a correct answer. However, on my side it shows "value && is not a member of Int" on every $ symbol. Have you seen that error before?
@HavenShi try import spark.implicits._, or instead of using $ you can use col("columnOfInterest").

Another approach (in addition to the accepted answer) is to use Datasets.
For example, with a case class:

case class MyClass(col1: Seq[String],
                   col2: Seq[Double],
                   col3: Seq[Double],
                   col4: Seq[String]) {
  // A row counts as empty when every column's array is empty.
  def isEmpty: Boolean =
    col1.isEmpty && col2.isEmpty && col3.isEmpty && col4.isEmpty
}

You can represent your source as a typed structure:

import spark.implicits._ // needed to provide an implicit encoder/data mapper 

val originalSource: DataFrame = ... // provide your source
val source: Dataset[MyClass] = originalSource.as[MyClass] // convert/map it to Dataset

Then you can filter like this:

source.filter(element => !element.isEmpty) // call the case class's instance method
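
For completeness, a hedged end-to-end sketch (assumes a SparkSession named spark; the sample data is mine, shaped like the question's):

import org.apache.spark.sql.Dataset
import spark.implicits._

// Sample rows shaped like the question's data, with field names
// matching MyClass so the encoder lines up.
val source: Dataset[MyClass] = Seq(
  MyClass(Seq("IND1"), Seq(5.0), Seq(6.0), Seq("A")),
  MyClass(Seq.empty, Seq.empty, Seq.empty, Seq.empty)
).toDS()

source.filter(m => !m.isEmpty).show() // the all-empty row is dropped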

