
I'm new to Scala. I have a DataFrame named df as follows:

+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [null]|  [0.0]|  [0.0]| [null]|
| [IND1]|  [5.0]|  [6.0]|    [A]|
| [IND2]|  [7.0]|  [8.0]|    [B]|
|     []|     []|     []|     []|
+-------+-------+-------+-------+

I'd like to delete rows in which every column is an empty array (the 4th row here).

The expected result would be:

+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [null]|  [0.0]|  [0.0]| [null]|
| [IND1]|  [5.0]|  [6.0]|    [A]|
| [IND2]|  [7.0]|  [8.0]|    [B]|
+-------+-------+-------+-------+

I tried isNotNull (e.g. val temp = df.filter(col("Column1").isNotNull && col("Column2").isNotNull && col("Column3").isNotNull && col("Column4").isNotNull).show()), but it still shows all rows.

I found a Python solution using a Hive UDF (via a link), but I had a hard time converting it to valid Scala. I'd like to end up with Scala code along these lines:

val columns = Seq("Column1", "Column2", "Column3", "Column4")
val query = "SELECT * FROM targetDf WHERE " + columns.map(c => s"SIZE($c) > 0").mkString(" AND ")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.sql(query)

Any help would be appreciated. Thank you.

2 Answers

Using isNotNull or isNull will not work because they look for null values in the DataFrame. Your example DF does not contain nulls but empty arrays, and there is a difference.
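
To see the difference on the question's df (a quick sketch; the count is what I'd expect given the four rows shown, not verified output):

import org.apache.spark.sql.functions.col

// An empty array [] is a value, not NULL, so the all-empty row
// still passes an isNotNull filter.
df.filter(col("Column1").isNotNull).count() // 4 rows, not 3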

One option: create a new column holding the length of each array, then filter out rows where that length is zero.

  import org.apache.spark.sql.functions.size
  import spark.implicits._ // for the $"colName" syntax (assumes a SparkSession named spark, as in spark-shell)

  val dfFil = df
    .withColumn("arrayLengthColOne", size($"Column1"))
    .withColumn("arrayLengthColTwo", size($"Column2"))
    .withColumn("arrayLengthColThree", size($"Column3"))
    .withColumn("arrayLengthColFour", size($"Column4"))
    .filter($"arrayLengthColOne" =!= 0 && $"arrayLengthColTwo" =!= 0
    && $"arrayLengthColThree" =!= 0 && $"arrayLengthColFour" =!= 0)
    .drop("arrayLengthColOne", "arrayLengthColTwo", "arrayLengthColThree", "arrayLengthColFour")

Original DF:

+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
|    [A]|    [B]|    [C]|    [d]|
|     []|     []|     []|     []|
+-------+-------+-------+-------+

New DF:

+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
|    [A]|    [B]|    [C]|    [d]|
+-------+-------+-------+-------+

You could also write a function that maps across all the columns and builds the filter in one step, as sketched below.
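
A minimal sketch of that idea (assuming the same four array columns; the helper name allNonEmpty is mine): map size over the column names and reduce the predicates with &&.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, size}

// Hypothetical helper: build one predicate requiring every listed
// array column to be non-empty.
def allNonEmpty(columns: Seq[String]): Column =
  columns.map(c => size(col(c)) =!= 0).reduce(_ && _)

val dfFil = df.filter(allNonEmpty(Seq("Column1", "Column2", "Column3", "Column4")))

If you instead want to drop only the rows where every column is empty (the question's exact case), reduce with _ || _ instead of _ && _.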


4 Comments

Hi fletchr, thanks for the reply. It's a very good direction. My only concern is that this withColumn filter condition only checks size($"Column1"), so it may delete | []| [0.0]| [0.0]| [A]|. Is there any way to modify the condition to filter on the sum of size($"Column1"), size($"Column2"), size($"Column3"), and size($"Column4")?
Let me know if this clarifies things. It's a bit iterative and cumbersome, but I think it will solve your problem. I just edited my answer; I'll let you find a more elegant solution.
Thank you for the update, I think this should be a correct answer. However, on my side it shows "value && is not a member of Int" on every $ symbol. Have you seen that error before?
@HavenShi try import spark.implicits._, or instead of using $ you can use col("columnOfInterest").

Another approach (in addition to the accepted answer) is to use Datasets.
For example, with a case class:

case class MyClass(col1: Seq[String],
                   col2: Seq[Double],
                   col3: Seq[Double],
                   col4: Seq[String]) {
  // A row counts as empty when every column's array is empty.
  def isEmpty: Boolean =
    col1.isEmpty && col2.isEmpty && col3.isEmpty && col4.isEmpty
}

You can represent your source as a typed structure:

import spark.implicits._ // needed to provide an implicit encoder/data mapper 

val originalSource: DataFrame = ... // provide your source
val source: Dataset[MyClass] = originalSource.as[MyClass] // convert/map it to Dataset

Then you can filter like this:

source.filter(element => !element.isEmpty) // call the case class's instance method
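
For completeness, a hedged end-to-end sketch (assumes a SparkSession named spark; the sample data is mine, shaped like the question's):

import org.apache.spark.sql.Dataset
import spark.implicits._

// Sample rows shaped like the question's data, with field names
// matching MyClass so the encoder lines up.
val source: Dataset[MyClass] = Seq(
  MyClass(Seq("IND1"), Seq(5.0), Seq(6.0), Seq("A")),
  MyClass(Seq.empty, Seq.empty, Seq.empty, Seq.empty)
).toDS()

source.filter(m => !m.isEmpty).show() // the all-empty row is dropped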

