I have written an AWS Glue script that reads a CSV file from S3, applies a null check on a few fields, and writes the results back to S3 as a new file. The problem is that when a field of String type is null, the value gets converted to an empty string, and I don't want that conversion to happen. All other data types work fine.
Here is the script so far:
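from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())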
spark = glueContext.spark_session
# s3 output directory
output_dir = "s3://aws-glue-scripts/..."
# Data Catalog: database and table name
db_name = "sampledb"
tbl_name = "mytable"
# Read the catalog table into a DynamicFrame and convert it to a Spark DataFrame
datasource = glueContext.create_dynamic_frame.from_catalog(database=db_name, table_name=tbl_name)
datasource_df = datasource.toDF()
# Register a temporary view so the null check can be written in SQL
datasource_df.createOrReplaceTempView("myNewTable")
datasource_sql_df = spark.sql("SELECT * FROM myNewTable WHERE name IS NULL")
datasource_sql_df.show()
# Convert back to a DynamicFrame and write the result to S3 as JSON
datasource_sql_dyf = DynamicFrame.fromDF(datasource_sql_df, glueContext, "datasource_sql_dyf")
glueContext.write_dynamic_frame.from_options(
    frame=datasource_sql_dyf,
    connection_type="s3",
    connection_options={"path": output_dir},
    format="json",
)
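To make the distinction concrete, here is a minimal sketch in plain Python, using only the standard `csv` module rather than Glue or Spark. The sample data and the `read_rows` helper are hypothetical and only for illustration. A CSV file has no native null, so an empty field is read as an empty string unless it is explicitly mapped to a null value, and a null check then matches nothing. Whether the same thing is happening inside Glue is an assumption on my part.

```python
import csv
import io

# Hypothetical sample data: the second row has no value for "name".
SAMPLE_CSV = "id,name\n1,alice\n2,\n"


def read_rows(text, empty_as_none=False):
    """Parse CSV text into dicts, optionally mapping empty fields to None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        if empty_as_none:
            # Treat an empty field as a missing value instead of "".
            row = {key: (value if value != "" else None) for key, value in row.items()}
        rows.append(row)
    return rows


raw_rows = read_rows(SAMPLE_CSV)
null_rows = read_rows(SAMPLE_CSV, empty_as_none=True)

# IDs of the rows that a "name IS NULL" style check would match in each case.
print([r["id"] for r in raw_rows if r["name"] is None])
print([r["id"] for r in null_rows if r["name"] is None])
```

Without the explicit mapping the null check finds no rows; with it, the row with the missing name is found. This is the behaviour I would like to keep for String fields.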
Can anyone please suggest how to get rid of this problem?
Thanks.