
I have written a script in AWS Glue that reads a CSV file from S3, applies a null check on a few fields, and writes the results back to S3 as a new file. The problem is that for fields of String type, null values are getting converted to empty strings, and I don't want this conversion to happen. For all other data types it works fine.

Here is the script that is written so far:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# S3 output directory
output_dir = "s3://aws-glue-scripts/..."

# Data Catalog: database and table name
db_name = "sampledb"
tbl_name = "mytable"

# Read the source table from the Glue Data Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(database=db_name, table_name=tbl_name)

# Convert to a Spark DataFrame and select the rows where name is null
datasource_df = datasource.toDF()
datasource_df.createOrReplaceTempView("myNewTable")
datasource_sql_df = spark.sql("SELECT * FROM myNewTable WHERE name IS NULL")
datasource_sql_df.show()

# Convert back to a DynamicFrame and write the result to S3 as JSON
datasource_sql_dyf = DynamicFrame.fromDF(datasource_sql_df, glueContext, "datasource_sql_dyf")
glueContext.write_dynamic_frame.from_options(
    frame=datasource_sql_dyf,
    connection_type="s3",
    connection_options={"path": output_dir},
    format="json",
)

Can anyone suggest how to avoid this problem?

Thanks.

1 Answer

I think it is not currently possible. Spark is configured to drop null fields when writing JSON, and the CSV reader explicitly reads null values as empty strings.

