
I have two DataFrames, and I want to add to the first one every column that exists in the second but not in the first. I have an array of StructField columns that I want to add to the DataFrame and fill with nulls.

This is the best I've come up with:

private def addColumns(df: DataFrame, columnsToAdd: Array[StructField]): DataFrame = {
    val spark = df.sparkSession
    val schema = new StructType(df.schema.toArray ++ columnsToAdd)
    spark.createDataFrame(df.rdd, schema)
}

Is there any better way?

  • As it turned out, my method does not work: when I call any action, Spark crashes with java.lang.ArrayIndexOutOfBoundsException. Commented Aug 12, 2022 at 14:53

1 Answer

The solution I gave in the question unfortunately does not work: it crashes with java.lang.ArrayIndexOutOfBoundsException. As I understand it, adding columns to the schema does not add them to the data. Each Row in df.rdd still holds only the original fields, so when Spark tries to read a field that exists in the schema but not in the row, it indexes past the end of the row.
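For illustration, here is a minimal sketch of how the createDataFrame approach could be made consistent, assuming the rows are padded with nulls so that they match the widened schema. The names PadRows and addColumnsViaRdd are hypothetical, and the added fields are forced to be nullable because they only ever hold null. Going through the RDD is probably slower than a column-based approach, so treat this as an explanation of the error rather than a recommendation.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{StructField, StructType}

object PadRows {
  // Hypothetical helper: widens every Row with nulls so it has as many
  // values as the extended schema has fields.
  def addColumnsViaRdd(df: DataFrame, columnsToAdd: Array[StructField]): DataFrame = {
    val spark = df.sparkSession
    // The new columns contain only nulls, so they must be declared nullable.
    val nullableFields = columnsToAdd.map(_.copy(nullable = true))
    val schema = StructType(df.schema.fields ++ nullableFields)
    // One null per added column, appended to every existing row.
    val padding = Seq.fill[Any](nullableFields.length)(null)
    val paddedRows = df.rdd.map(row => Row.fromSeq(row.toSeq ++ padding))
    spark.createDataFrame(paddedRows, schema)
  }
}
```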

I wrote the variant below. It uses recursion and does what I want, although I would still like to avoid null and somehow use None instead.

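import scala.annotation.tailrec
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StructField

// Adds the columns one at a time, starting at index indx; each new column is a typed null.
@tailrec
private def addColumns(df: DataFrame, columnsToAdd: Array[StructField], indx: Int): DataFrame = {
    // Stop once every column has been added (this also covers an empty array).
    if (indx >= columnsToAdd.length) df
    else {
        val field = columnsToAdd(indx)
        val dfWithColumn = df.withColumn(field.name, lit(null).cast(field.dataType))
        addColumns(dfWithColumn, columnsToAdd, indx + 1)
    }
}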
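The same idea can be sketched without explicit recursion. The two variants below are assumptions on my part rather than the only way to do it, and the names AddNullColumns, withFoldLeft and withSelect are hypothetical. The first folds over the fields with withColumn; the second builds all the null columns up front and adds them in a single select, which keeps the query plan smaller when there are many columns.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StructField

object AddNullColumns {
  // Variant 1: fold over the fields, adding one typed null column per step.
  def withFoldLeft(df: DataFrame, columnsToAdd: Seq[StructField]): DataFrame =
    columnsToAdd.foldLeft(df) { (acc, field) =>
      acc.withColumn(field.name, lit(null).cast(field.dataType))
    }

  // Variant 2: keep all existing columns and append the null columns in one select.
  // Assumes none of the new names already exist in df, as in the question.
  def withSelect(df: DataFrame, columnsToAdd: Seq[StructField]): DataFrame = {
    val added = columnsToAdd.map(f => lit(null).cast(f.dataType).as(f.name))
    df.select(col("*") +: added: _*)
  }
}
```

Note that withColumn replaces a column whose name already exists, while select would produce a duplicate column name, so the two variants only agree when the added names are new.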

Also this answer helped a lot.
