
I am trying to flatten a schema of existing dataframe with nested fields. Structure of my dataframe is something like that:

root  
|-- Id: long (nullable = true)  
|-- Type: string (nullable = true)  
|-- Uri: string (nullable = true)    
|-- Type: array (nullable = true)  
|    |-- element: string (containsNull = true)  
|-- Gender: array (nullable = true)  
|    |-- element: string (containsNull = true)

Type and Gender can contain an array of elements, a single element, or a null value. I tried to use the following code:

var resDf = df.withColumn("FlatType", explode(df("Type")))

But as a result I lose the rows that have null values in the Type column. For example, if I have 10 rows and Type is null in 7 of them and non-null in 3, then after I use explode the resulting data frame has only three rows.

How can I keep rows with null values but explode array of values?
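The row-dropping behavior described above is easy to reproduce; here is a minimal sketch (the sample data, the local SparkSession, and the variable names are purely illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Made-up sample: one row with a null array, two rows with values
val df = Seq(
  (1L, Seq("a", "b")),
  (2L, null.asInstanceOf[Seq[String]]),
  (3L, Seq("c"))
).toDF("Id", "Type")

// explode() silently drops the row with Id = 2, where Type is null
df.withColumn("FlatType", explode($"Type")).show()
```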

I found some kind of workaround but still stuck in one place. For standard types we can do the following:

def customExplode(df: DataFrame, field: String, colType: String): org.apache.spark.sql.Column = {
  colType.toLowerCase() match {
    case "string" =>
      // Replace a null array with Seq(null) so explode emits one row with a null value
      val avoidNull = udf((xs: Seq[String]) =>
        if (xs == null) Seq[String](null)
        else xs)
      explode(avoidNull(df(field)))
    case "boolean" =>
      // Boolean is a primitive, so we cannot put null inside the Seq;
      // note that an empty Seq still produces no rows from explode
      val avoidNull = udf((xs: Seq[Boolean]) =>
        if (xs == null) Seq[Boolean]()
        else xs)
      explode(avoidNull(df(field)))
    case _ => explode(df(field))
  }
}

And after that just use it like this:

val explodedField = customExplode(resultDf, fieldName, fieldTypeMap(field))
resultDf = resultDf.withColumn(newName, explodedField)
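As a side note, a UDF-free variant of the same null-substitution idea could use `coalesce`, `array`, and `lit` from `org.apache.spark.sql.functions`; this is only a sketch (the `df` and `resDf2` names are illustrative), not tested against the schema above:

```scala
import org.apache.spark.sql.functions.{array, coalesce, explode, lit}

// Replace a null array with array(null) so explode keeps the row;
// the cast type ("string" here) must match the array's element type.
val nullSafe = explode(coalesce(df("Type"), array(lit(null).cast("string"))))
val resDf2 = df.withColumn("FlatType", nullSafe)
```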

However, I have a problem with struct types, for the following kind of structure:

 |-- Address: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- AddressType: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true) 
 |    |    |-- DEA: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- Number: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |-- ExpirationDate: array (nullable = true)
 |    |    |    |    |    |-- element: timestamp (containsNull = true)
 |    |    |    |    |-- Status: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)

How can we process that kind of schema when DEA is null?

Thank you in advance.

P.S. I tried to use lateral views, but the result is the same.

2 Answers


Maybe you can try using when:

val resDf = df.withColumn("FlatType", when(df("Type").isNotNull, explode(df("Type"))))

As shown in the when function's documentation, null is inserted for the values that do not match the condition.


3 Comments

Sorry, but when I try this solution I get the following exception: java.lang.UnsupportedOperationException. If I replace explode(df("Type")) with just some literal value, it works fine. I suppose the when function does not support an exploded column as a value.
@Artem, you are correct, I'm sorry. Is a union an option for you? You could do df.where($"Type".isNull).withColumn("FlatType", lit(null)).unionAll(df.withColumn("FlatType", explode($"Type")))
Yes, thank you, I thought about this option, but I am building a generic algorithm for flattening schemas and I am afraid that a union may be really slow. I hope to find a better solution, but union is a backup option for me.

I think what you want is to use explode_outer instead of explode.

See the Apache Spark docs: explode and explode_outer.
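A minimal sketch of the difference (explode_outer is available since Spark 2.2; the sample data, session, and names below are purely illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, explode_outer}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  (1L, Seq("a", "b")),
  (2L, null.asInstanceOf[Seq[String]])
).toDF("Id", "Type")

df.withColumn("FlatType", explode($"Type")).count()        // 2: the null row is dropped
df.withColumn("FlatType", explode_outer($"Type")).count()  // 3: Id 2 is kept with FlatType = null
```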

