
I have a DataFrame whose columns have nullable set to True, and I want to change it to False in PySpark.

I can do it the way shown below, but I don't want to go through an RDD: I'm reading with Structured Streaming, and converting to an RDD is not recommended there.

def set_df_columns_nullable(spark, df, column_list, nullable=True):
    # Grab the schema once and flip the nullable flag on the matching fields.
    schema = df.schema
    for struct_field in schema:
        if struct_field.name in column_list:
            struct_field.nullable = nullable
    # Rebuilding the DataFrame from its RDD is what forces the new schema on.
    return spark.createDataFrame(df.rdd, schema)
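
For reference, a hypothetical call (the column names here are placeholders):

    df_fixed = set_df_columns_nullable(spark, df, ["key", "value"], nullable=False)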

Thanks in advance.

  • Not possible, in fact. Why are you doing this? Commented Jun 27, 2020 at 17:49
  • Actually, I'm using Abris to convert plain data to Confluent Avro format before writing it to Kafka. When I use the to_confluent_avro function it throws a "Not a Union" exception, and it works if I change the column's nullability to False. Commented Jun 27, 2020 at 18:08
  • That's different, then, but I meant that RDDs are not supported. Commented Jun 27, 2020 at 18:11
  • Actually, I'm using Structured Streaming; converting to an RDD and back to a DataFrame is overhead, and because of it I may lose some features. Commented Jun 27, 2020 at 18:15
  • I always learnt that was not supported. Interesting. Commented Jun 27, 2020 at 18:16

1 Answer


You can actually update a column's nullability without converting to an RDD, by wrapping its Catalyst expression in AssertNotNull (this is Scala, using Spark's internal API):

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
import org.apache.spark.sql.functions.col

dataFrame
  .withColumn(columnName, new Column(AssertNotNull(col(columnName).expr)))

source

Note that the above will fail at execution time if the column actually contains null values, since AssertNotNull raises an error for any null it encounters.
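
If you are on PySpark and would rather avoid both the RDD round trip and Spark's internal Catalyst API, one alternative sketch is to coalesce the column with a non-null default: since the literal fallback can never be null, Spark marks the resulting column as nullable = False. Note the semantic difference from AssertNotNull: nulls are silently replaced by the default instead of failing the job. The helper name and default value below are illustrative, not part of any library:

    from pyspark.sql import functions as F

    def make_non_nullable(df, column_name, default):
        # coalesce(col, lit(default)) is non-nullable because the literal
        # fallback can never be null, so the schema gets nullable=False.
        return df.withColumn(column_name, F.coalesce(F.col(column_name), F.lit(default)))

    df = make_non_nullable(df, "value", "")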
