
Currently I am using Spark to read data from BigQuery tables and write it to a storage bucket as CSV. One issue that I am facing is that null string values are not being handled properly by Spark when read from BQ. It reads the null string values, but in the CSV it writes each of them as an empty string in double quotes (i.e. like this: "").

# Load data from BigQuery.
bqdf = spark.read.format('bigquery') \
    .option('table', <bq_dataset> + <bq_table>) \
    .load()
bqdf.createOrReplaceTempView('bqdf')

# Select required data into another df
bqdf2 = spark.sql(
    'SELECT * FROM bqdf')

# write to GCS
bqdf2.write.csv(<gcs_data_path> + <bq_table> + '/', mode='overwrite', sep='|')

I have tried the emptyValue='' and nullValue options with df.write.csv() while writing to CSV, but it doesn't work.
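For example, one variant I tried looked roughly like this (a reconstruction from the write call above; the exact argument values varied across attempts):

# Illustrative attempt (reconstructed): both options set on the CSV
# writer, yet the nulls still came out as "" in the output files.
bqdf2.write.csv(<gcs_data_path> + <bq_table> + '/', mode='overwrite',
                sep='|', emptyValue='', nullValue='')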

I need a solution for this problem. If anyone else has faced this issue and can help, thanks!

  • What are you trying to achieve? Writing NULL into the CSV instead of "", or something else? Is it really a problem in your dataflow? Are there strings where it makes a difference between having a string "" of length zero and a null value? Writing "" as a null string makes it possible for a parser to automatically treat that column as a string. Commented May 12, 2020 at 10:29
  • Yes, I have scripts which need these string values to be NULL. The scripts have conditions checking whether a string IS NULL for joins (see the sketch after these comments). Commented May 12, 2020 at 12:50
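To make the comment's point concrete, here is a minimal sketch (not part of the original exchange) showing that Spark SQL does not treat a zero-length string as NULL, which is why "" values slip past IS NULL join conditions:

# Illustrative check: an empty string is not NULL in Spark SQL, so a
# join condition like "name IS NULL" will never match "" values.
spark.sql("SELECT '' IS NULL AS empty_is_null").show()  # prints false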

1 Answer


I was able to reproduce your case and I found a solution that worked with a sample table I created in BigQuery. The data is as follows:

(Screenshot in the original: a sample table with columns name and age, where one row has a null name.)

According to the PySpark documentation, the pyspark.sql.DataFrameWriter.csv() method has an option called nullValue:

nullValue – sets the string representation of a null value. If None is set, it uses the default value, empty string.

This is what you are looking for. I then applied the nullValue option as shown below.

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()
spark = SparkSession(sc)

# Read the data from BigQuery as a Spark DataFrame.
data = spark.read.format("bigquery").option(
    "table", "dataset.table").load()

# Create a view so that Spark SQL queries can be run against the data.
data.createOrReplaceTempView("data_view")

# Select the required data into another DataFrame.
data_view2 = spark.sql(
    'SELECT * FROM data_view')

# Write to GCS; nullValue='' writes nulls as empty fields instead of
# quoted empty strings. Note that write.csv() returns None, so there
# is nothing to assign.
data_view2.write.csv('gs://bucket/folder', header=True, nullValue='')

data_view2.show()

Notice that I used data_view2.show() to print out the DataFrame in order to check whether it was read correctly. The output was:

+------+---+
|name  |age|
+------+---+
|Robert| 25|
|null  | 23|
+------+---+

Therefore, the null value was interpreted correctly. In addition, I checked the .csv file:

name,age
Robert,25
,23

As you can see, the null value is written correctly and not represented as an empty string between double quotes. Finally, as a last check, I created a load job from this .csv file into BigQuery. The table was created and the null value was interpreted correctly.
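For reference, a minimal sketch of such a load job using the google-cloud-bigquery client (the bucket path and destination table are placeholders, not taken from the original answer):

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical destination; adjust project, dataset and table names.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row written by Spark
    autodetect=True,      # infer the schema from the CSV
)
load_job = client.load_table_from_uri(
    "gs://bucket/folder/*.csv",
    "project.dataset.loaded_table",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish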

Note: I ran the PySpark job from the jobs console of a previously created Dataproc cluster. Also, the cluster was in the same location as the dataset in BigQuery.
