
I have a pyspark dataframe similar to the following:

from pyspark.sql import Row

df = sql_context.createDataFrame([
    Row(a=3, b=[4, 5, 6], c=[10, 11, 12], d='bar', e='utf friendly'),
    Row(a=2, b=[1, 2, 3], c=[7, 8, 9], d='foo', e=u'ab\u0000the')
])

One of the values in column e contains the UTF null character \u0000. If I try to load this df into a PostgreSQL database, I get the following error:

ERROR: invalid byte sequence for encoding "UTF8": 0x00 

which makes sense. How can I efficiently remove the null character from the pyspark dataframe before loading the data into postgres?

I have tried using some of the pyspark.sql.functions to clean the data first, without success. encode, decode, and regexp_replace did not work:

from pyspark.sql.functions import col, decode, encode, regexp_replace

df.select(regexp_replace(col('e'), u'\u0000', ''))
df.select(encode(col('e'), 'UTF-8'))
df.select(decode(col('e'), 'UTF-8'))

Ideally, I would like to clean the entire dataframe without specifying exactly which columns or what the violating character is, since I don't necessarily know this information ahead of time.

I am using a postgres 9.4.9 database with UTF8 encoding.

2 Answers


Ah wait - I think I have it. If I do something like this, it seems to work:

null = u'\u0000'
new_df = df.withColumn('e', regexp_replace(df['e'], null, ''))

And then applying it to all of the string columns:

string_columns = ['d', 'e']
new_df = df.select(*(
    regexp_replace(col(c), null, '').alias(c) if c in string_columns else c
    for c in df.columns
))
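
If you don't know ahead of time which columns are strings, one option is to derive them from the schema instead of listing them by hand. This is a minimal sketch (not part of the original answer) that uses df.dtypes to find the string columns and applies the same regexp_replace to each:

    from pyspark.sql.functions import col, regexp_replace

    # Derive the string columns from the schema rather than hard-coding them.
    string_columns = [name for name, dtype in df.dtypes if dtype == 'string']

    null = u'\u0000'
    clean_df = df.select(*(
        regexp_replace(col(c), null, '').alias(c) if c in string_columns else c
        for c in df.columns
    ))

Since the pattern argument to regexp_replace is a regular expression, you could also widen it to a character class (for example, a class of control characters) if you don't know the exact offending character in advance.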



You can use DataFrame.fillna() to replace null values.

Replace null values, alias for na.fill(). DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other.

Parameters:

  • value – int, long, float, string, or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, or string.

  • subset – optional list of column names to consider. Columns specified in subset that do not have matching data type are ignored. For example, if value is a string, and subset contains a non-string column, then the non-string column is simply ignored.
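
For reference, a minimal sketch of how fillna() is used with these parameters (the replacement values here are illustrative):

    # Fill SQL NULLs in every compatible column with one value...
    df.fillna(0)

    # ...or per column, with a dict mapping column name to replacement value
    # ('a' and 'd' are columns from the example df in the question).
    df.fillna({'a': 0, 'd': 'unknown'})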

1 Comment

I don't think that works here, because the problem cell isn't actually null - it contains the UTF null character \u0000. If I run df.fillna() on my example df, it looks like it returns the same dataframe, since none of the cells are actually null. If I try to load the resulting df into a postgres table, I still get the same error message.
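
A quick way to confirm this, assuming the example df from the question (a sketch, not from the original comment):

    from pyspark.sql.functions import col, regexp_replace

    # fillna() only replaces SQL NULLs, so the \u0000 character survives it.
    df.fillna('').filter(col('e').contains(u'\u0000')).count()  # returns 1

    # regexp_replace() actually removes the character.
    df.select(regexp_replace(col('e'), u'\u0000', '').alias('e')) \
      .filter(col('e').contains(u'\u0000')).count()             # returns 0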
