
I am trying to create a DataFrame to write to a BigQuery table. One column in the output table is a REQUIRED ID that I need to generate in my pipeline. I am doing this with a UDF, but no matter what I try, the column is created as nullable.

How I've created the UDF:

UserDefinedFunction genID = functions.udf(
                (UDF1<String, String>) this::generateEmailID, DataTypes.StringType);

The method the UDF calls:

private String generateEmailID(String srcId) {
    // Deterministic name-based (type 3) UUID derived from the source ID
    return UUID.nameUUIDFromBytes(("1_" + srcId).getBytes()).toString();
}
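Outside Spark, the generator is plain Java. A minimal standalone sketch (class name assumed for illustration) shows that the same source ID always maps to the same type-3 UUID, so for a non-null input the UDF itself never returns null:

```java
import java.util.UUID;

public class EmailIdGenerator {

    // Deterministic name-based (type 3) UUID: the same srcId always
    // produces the same ID, and the result is never null.
    static String generateEmailID(String srcId) {
        return UUID.nameUUIDFromBytes(("1_" + srcId).getBytes()).toString();
    }

    public static void main(String[] args) {
        String first = generateEmailID("abc123");
        String second = generateEmailID("abc123");
        System.out.println(first);
        System.out.println(first.equals(second)); // deterministic, prints true
    }
}
```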

I then apply it to my temp view transformedData like this:

spark.sql("SELECT message_ID AS src_id FROM transformedData")
          .withColumn("email_id", genID.apply(functions.col("src_id")))

This column needs to be REQUIRED to match the output table, and column "src_id" is nullable=false. So why is "email_id" created with nullable=true, and how can I stop that from happening so I can write to the table?

root
 |-- email_id: string (nullable = true)
 |-- src_id: string (nullable = false)

1 Answer
That's how `udf` works: Spark can't know whether the function may return null, so to be safe it marks the result column as nullable. (Since Spark 2.4, `UserDefinedFunction` also has an `asNonNullable()` method that declares the result non-nullable.)

If you are sure there are no nulls in the column, you can wrap it in `coalesce` with a non-null fallback such as `lit("")`. Depending on your imports, you can use either

.withColumn("email_id", coalesce(col("email_id"), lit("")))

or

.withColumn("email_id", functions.coalesce(functions.col("email_id"), functions.lit("")))
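`coalesce` simply returns the first non-null argument, which is why Spark can then prove the column non-null. The plain-Java equivalent of that fallback (class and method names assumed for illustration) is:

```java
public class CoalesceSketch {

    // Plain-Java analogue of SQL coalesce: the first non-null value wins.
    static String coalesce(String value, String fallback) {
        return value != null ? value : fallback;
    }

    public static void main(String[] args) {
        System.out.println(coalesce("some-uuid", ""));        // prints some-uuid
        System.out.println(coalesce(null, "").isEmpty());     // prints true
    }
}
```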