I want to delete one field from each struct in an array column (array<struct>), as follows:

case class myObj(id: String, item_value: String, delete: String)
case class myObj2(id: String, item_value: String)

val df2 = Seq(
  ("1", "2", "..100values", Seq(myObj("A", "1a", "1"), myObj("B", "4r", "2"))),
  ("1", "2", "..100values", Seq(myObj("X", "1p", "11"), myObj("V", "7w", "8")))
).toDF("1", "2", "100fields", "myArr")

val deleteColumn: mutable.WrappedArray[myObj] => mutable.WrappedArray[myObj2] = {
  (array: mutable.WrappedArray[myObj]) => array.map(o => myObj2(o.id, o.item_value))
}
val myUDF3 = functions.udf(deleteColumn)
df2.withColumn("newArr", myUDF3($"myArr")).show(false)

The error is clear:

Exception in thread "main" org.apache.spark.SparkException: Failed to execute user defined function(anonfun$1: (array<struct<id:string,item_value:string,delete:string>>) => array<struct< id:string,item_value:string>>)

The types do not match, but that is exactly what I want to do: convert from one structure to another. How can I do this?

I am using a UDF because df.map() is not well suited to transforming a single column: it forces me to specify all columns. I have not found a better way to apply this mapping to just one column.

2 Answers

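You can rewrite your UDF so that it takes a Seq[Row] instead of the custom object, as below. Spark passes struct values to a UDF as Row objects, not as instances of your case class, which is why the original UDF fails.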

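import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Each struct in the array arrives as a Row; build the smaller case class from its first two fields.
val deleteColumn = udf((value: Seq[Row]) => {
  value.map(row => myObj2(row.getString(0), row.getString(1)))
})

df2.withColumn("newArr", deleteColumn($"myArr")).show(false)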

Output:

+---+---+-----------+---------------------+----------------+
|1  |2  |100fields  |myArr                |newArr          |
+---+---+-----------+---------------------+----------------+
|1  |2  |..100values|[[A,1a,1], [B,4r,2]] |[[A,1a], [B,4r]]|
|1  |2  |..100values|[[X,1p,11], [V,7w,8]]|[[X,1p], [V,7w]]|
+---+---+-----------+---------------------+----------------+
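
A variant of the same idea, as a minimal sketch rather than a definitive implementation: reading the struct fields by name with Row.getAs instead of by position means the UDF keeps working if the field order of the struct changes. The names MyObj, MyObj2, DropFieldWithUdf, withoutDelete and the "key" column are made up for this illustration, and the sketch assumes myArr is never null.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical names for this sketch; top-level case classes let Spark derive their schemas.
case class MyObj(id: String, item_value: String, delete: String)
case class MyObj2(id: String, item_value: String)

object DropFieldWithUdf {
  // Reads struct fields by name, so the UDF does not depend on their order.
  // Assumption: the array column is never null.
  private val dropDelete = udf { (values: Seq[Row]) =>
    values.map(r => MyObj2(r.getAs[String]("id"), r.getAs[String]("item_value")))
  }

  // Adds a "newArr" column holding the structs without the "delete" field.
  def withoutDelete(df: DataFrame): DataFrame =
    df.withColumn("newArr", dropDelete(col("myArr")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("drop-field-udf").getOrCreate()
    import spark.implicits._

    val df = Seq(("1", Seq(MyObj("A", "1a", "1"), MyObj("B", "4r", "2")))).toDF("key", "myArr")
    withoutDelete(df).show(false)
    spark.stop()
  }
}
```

All other columns of the DataFrame pass through untouched, which avoids the df.map() problem of having to list every column.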

Without a UDF, you can remove fields from an array of structs by using dropFields together with transform (this requires Spark 3.1 or later, where Column.dropFields is available).

Test input:

val df = spark.createDataFrame(Seq(("v1", "v2", "v3", "v4"))).toDF("f1", "f2", "f3", "f4")
    .select(
        array(
            struct("f1", "f2"),
            struct(col("f3").as("f1"), col("f4").as("f2")),
        ).as("myArr")
    )
df.printSchema()
// root
//  |-- myArr: array (nullable = false)
//  |    |-- element: struct (containsNull = false)
//  |    |    |-- f1: string (nullable = true)
//  |    |    |-- f2: string (nullable = true)

Script:

val df2 = df.withColumn(
    "myArr",
    transform(
        $"myArr",
        x => x.dropFields("f2")
    )
)
df2.printSchema()
// root
//  |-- myArr: array (nullable = false)
//  |    |-- element: struct (containsNull = false)
//  |    |    |-- f1: string (nullable = true)
