
Currently I am using Spark to read data from BigQuery tables and write it to a storage bucket as CSV. One issue that I am facing is that null string values are not being handled properly by Spark when read from BQ. It reads the null string values, but in the CSV it writes each of them as an empty string in double quotes (i.e. like this: "").

# Load data from BigQuery.
bqdf = spark.read.format('bigquery') \
    .option('table', <bq_dataset> + <bq_table>) \
    .load()
bqdf.createOrReplaceTempView('bqdf')

# Select required data into another df
bqdf2 = spark.sql(
    'SELECT * FROM bqdf')

# write to GCS
bqdf2.write.csv(<gcs_data_path> + <bq_table> + '/', mode='overwrite', sep='|')

I have tried the emptyValue='' and nullValue options with df.write.csv() while writing to CSV, but it doesn't work.
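For example, one variant I tried looked roughly like this (a reconstruction from the write call above; the exact argument values varied across attempts):

# Illustrative attempt (reconstructed): both options set on the CSV
# writer, yet the nulls still came out as "" in the output files.
bqdf2.write.csv(<gcs_data_path> + <bq_table> + '/', mode='overwrite',
                sep='|', emptyValue='', nullValue='')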

I need a solution for this problem. If anyone else has faced this issue and can help, thanks!

  • What are you trying to achieve? Writing NULL into the CSV instead of "", or something else? Is it really a problem in your dataflow? Are there strings where it makes a difference between having a string "" of length zero and a null value? Writing "" as a null string makes it possible for a parser to automatically treat that column as a string. Commented May 12, 2020 at 10:29
  • Yes, I have scripts which need these string values to be NULL. The scripts have conditions checking whether a string IS NULL for joins (see the sketch after these comments). Commented May 12, 2020 at 12:50
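To make the comment's point concrete, here is a minimal sketch (not part of the original exchange) showing that Spark SQL does not treat a zero-length string as NULL, which is why "" values slip past IS NULL join conditions:

# Illustrative check: an empty string is not NULL in Spark SQL, so a
# join condition like "name IS NULL" will never match "" values.
spark.sql("SELECT '' IS NULL AS empty_is_null").show()  # prints false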

1 Answer


I was able to reproduce your case and I found a solution that worked with a sample table I created in BigQuery. The data is as follows:

(Screenshot in the original: a sample table with columns name and age, where one row has a null name.)

According to the PySpark documentation, the pyspark.sql.DataFrameWriter.csv() method has an option called nullValue:

nullValue – sets the string representation of a null value. If None is set, it uses the default value, empty string.

This is what you are looking for. I then applied the nullValue option as shown below.

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()
spark = SparkSession(sc)

# Read the data from BigQuery as a Spark DataFrame.
data = spark.read.format("bigquery").option(
    "table", "dataset.table").load()

# Create a view so that Spark SQL queries can be run against the data.
data.createOrReplaceTempView("data_view")

# Select the required data into another DataFrame.
data_view2 = spark.sql(
    'SELECT * FROM data_view')

# Write to GCS; nullValue='' writes nulls as empty fields instead of
# quoted empty strings. Note that write.csv() returns None, so there
# is nothing to assign.
data_view2.write.csv('gs://bucket/folder', header=True, nullValue='')

data_view2.show()

Notice that I used data_view2.show() to print out the DataFrame in order to check whether it was read correctly. The output was:

+------+---+
|name  |age|
+------+---+
|Robert| 25|
|null  | 23|
+------+---+

Therefore, the null value was interpreted correctly. In addition, I checked the .csv file:

name,age
Robert,25
,23

As you can see, the null value is written correctly and not represented as an empty string between double quotes. Finally, as a last check, I created a load job from this .csv file into BigQuery. The table was created and the null value was interpreted correctly.
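For reference, a minimal sketch of such a load job using the google-cloud-bigquery client (the bucket path and destination table are placeholders, not taken from the original answer):

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical destination; adjust project, dataset and table names.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row written by Spark
    autodetect=True,      # infer the schema from the CSV
)
load_job = client.load_table_from_uri(
    "gs://bucket/folder/*.csv",
    "project.dataset.loaded_table",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish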

Note: I ran the PySpark job from the jobs console of a previously created Dataproc cluster. Also, the cluster was in the same location as the dataset in BigQuery.
