
I have written a script in AWS Glue that reads a CSV file from S3, applies a null check on a few fields, and writes the results back to S3 as a new file. The problem is that for fields of String type, null values are getting converted to empty strings, and I don't want this conversion to happen. For all other data types it works fine.

Here is the script that is written so far:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# S3 output directory
output_dir = "s3://aws-glue-scripts/..."

# Data Catalog: database and table name
db_name = "sampledb"
tbl_name = "mytable"

# Read the source table from the Glue Data Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(database=db_name, table_name=tbl_name)

# Convert to a Spark DataFrame and select the rows where name is null
datasource_df = datasource.toDF()
datasource_df.createOrReplaceTempView("myNewTable")
datasource_sql_df = spark.sql("SELECT * FROM myNewTable WHERE name IS NULL")
datasource_sql_df.show()

# Convert back to a DynamicFrame and write the result to S3 as JSON
datasource_sql_dyf = DynamicFrame.fromDF(datasource_sql_df, glueContext, "datasource_sql_dyf")
glueContext.write_dynamic_frame.from_options(
    frame=datasource_sql_dyf,
    connection_type="s3",
    connection_options={"path": output_dir},
    format="json",
)

Can anyone suggest how to avoid this problem?

Thanks.

1 Answer

I think it is not currently possible. Spark is configured to drop null fields when writing JSON, and the CSV reader explicitly reads null values as empty strings.

